CN103577391A - Chinese retrieval based bidirectional word-segmentation method and device - Google Patents
- Publication number
- CN103577391A (application CN201210241471.2A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a bidirectional word-segmentation method and device for Chinese retrieval. The method segments Chinese character strings accurately without losing their semantics, and its results remain accurate for long sentences in particular, which improves query accuracy. The method and device can also be applied to automatic summarization and automatic classification systems in the field of information retrieval.
Description
Technical field
The present invention relates to the field of web page text processing in computer networks, and in particular to a method and device for retrieving Chinese web pages in a search engine.
Background
With the development of the Internet, the number of Web pages has grown sharply, and the Web has become the largest and most widely used source of information. Much useful information is buried in this sea of pages, and people cannot process such a volume of information by hand. Text search is one of the key applied technologies in large-scale information processing and an important research direction in that field. As research on classified text search has deepened, text search has been applied more and more widely in information technology. Word segmentation is the "soul" of any search engine: whether the keywords in a query can be correctly understood, cut out of the sentence, and matched is critical to whether the engine returns correct results, so the accuracy of segmentation directly affects the accuracy of retrieval. The commonly used segmentation techniques are single-character segmentation, two-character (bigram) segmentation, and the maximum matching algorithm.
(1) Single-character segmentation
Single-character segmentation cuts the text one character at a time and builds the index from the resulting characters. For example, applied to 中华人民共和国 ("People's Republic of China"), it yields 中, 华, 人, 民, 共, 和, 国. The method is simple to implement, but the resulting tokens are meaningless on their own, so the original semantics of the text are lost.
(2) Two-character (bigram) segmentation
Bigram segmentation cuts the text into units of two adjacent characters and builds the index from these units. For example, applied to 中华人民共和国, it yields 中华, 华人, 人民, 民共, 共和, 和国. Compared with single-character segmentation, the bigram method handles word positions better, but it produces many useless entries and therefore a redundant index.
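The two baseline methods above can be sketched in a few lines of Python. This is a minimal illustration only; the function names `unigram_segment` and `bigram_segment` are hypothetical, and the sample string 中华人民共和国 is assumed to be the original of "People's Republic of China".

```python
def unigram_segment(text):
    """Single-character segmentation: every character becomes a token."""
    return list(text)


def bigram_segment(text):
    """Bigram segmentation: every pair of adjacent characters becomes a token."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


sentence = "中华人民共和国"
print(unigram_segment(sentence))
print(bigram_segment(sentence))
```

A string of n characters produces n unigrams and n-1 overlapping bigrams, which is where the index redundancy of the bigram method comes from.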
(3) Maximum matching algorithm
The most commonly used segmentation algorithm at present is the maximum matching method (also called the MM method). MM is widely used and needs only a single segmentation dictionary. Its basic idea is as follows. Let the longest word in the dictionary have K characters. Each time, take a string of length K from the start of the sentence and match it against the words in the dictionary. If the dictionary contains such a word of length K, the match succeeds and the string is cut from the sentence as one word. Then move to the start of the remaining part of the sentence, take another string of length K in the same way, and repeat until the whole sentence has been segmented. If no dictionary entry matches the current string, delete one character from its tail to form a string of length K-1 and look that up in the dictionary. If the match succeeds, the string of length K-1 is cut from the sentence as one word; if it fails, repeat the step, deleting one more character from the tail to form a string of length K-2, and so on until a match succeeds.
Suppose the string to be segmented is $S = C_1 C_2 C_3 C_4 \dots C_n$, where each $C_k$ ($k = 1, 2, 3, \dots, n$) is a single Chinese character, and the longest word in the dictionary has $i$ characters. The first cut is attempted with length $i$; if the match succeeds, the result is:
$$S = S_1 + S_2,\quad S_1 = \{C_1 \dots C_i\},\quad S_2 = \{C_{i+1} \dots C_n\}$$
If the match fails, matching is retried with length $i-1$; if that succeeds, the result is:
$$S = S_1 + S_2,\quad S_1 = \{C_1 \dots C_{i-1}\},\quad S_2 = \{C_i \dots C_n\}$$
And so on. Once $S_1$ has been cut off, $S_2$ is processed recursively in the same way.
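The classic forward MM procedure described above can be sketched as follows. This is a minimal sketch under stated assumptions: the name `forward_max_match`, the sample dictionary, and the sample sentence 大学生活动点 are all hypothetical, and a single character that matches nothing is emitted as its own token so that no text is dropped.

```python
def forward_max_match(sentence, dictionary, max_len=None):
    """Classic forward maximum matching (MM).

    At each position, try the longest candidate first and shorten it from
    the tail until it is a dictionary word (or a single character).
    """
    if max_len is None:
        # K: length of the longest word in the dictionary
        max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(sentence):
        longest = min(max_len, len(sentence) - i)
        for k in range(longest, 0, -1):
            candidate = sentence[i:i + k]
            # Assumption: an unmatched single character is kept as a token.
            if k == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += k
                break
    return tokens


words = {"大学", "大学生", "生活", "活动", "动点"}  # hypothetical dictionary
print(forward_max_match("大学生活动点", words))  # ['大学生', '活动', '点']
```

Using a `set` for the dictionary keeps each lookup at constant expected cost, so the work per sentence is bounded by its length times K.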
The defining feature of the maximum matching algorithm is "long words first", which improves retrieval accuracy compared with ordinary matching. The reverse maximum matching algorithm (RMM) takes words and proceeds in the same way as forward maximum matching, except that it starts from the last character of the string and takes candidate words toward the front. RMM is increasingly favored when designing segmentation schemes, because in some specialized retrieval tasks it segments more accurately.
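For comparison, classic RMM can be sketched as the mirror image of forward matching. As before, `reverse_max_match`, the dictionary, and the sample sentences are hypothetical, and unmatched single characters are kept as tokens by assumption.

```python
def reverse_max_match(sentence, dictionary, max_len=None):
    """Classic reverse maximum matching (RMM).

    Candidates end at the current right boundary; the longest one is tried
    first and shortened from the head until it is a dictionary word.
    """
    if max_len is None:
        max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    end = len(sentence)
    while end > 0:
        longest = min(max_len, end)
        for k in range(longest, 0, -1):
            candidate = sentence[end - k:end]
            # Assumption: an unmatched single character is kept as a token.
            if k == 1 or candidate in dictionary:
                tokens.append(candidate)
                end -= k
                break
    tokens.reverse()  # tokens were collected right to left
    return tokens


words = {"大学", "大学生", "生活", "活动", "动点"}  # hypothetical dictionary
print(reverse_max_match("大学生活动点", words))  # ['大学', '生活', '动点']
```

With this dictionary the same sentence segments differently in the two directions, which is the kind of disagreement a bidirectional method has to resolve.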
In the ordinary maximum matching algorithm, whether forward or reverse, matching is maximal only within a local range at the starting end: the first M or the last M characters of each sentence are matched. This works well for short sentences. When the sentence to be matched is much longer than the maximum matching word length, however, the drawback becomes apparent: local matching cannot cover the whole sentence, so words in the middle of the sentence may fail to match, which leads to missed words and incomplete retrieval.
Summary of the invention
A vertical search engine serves a particular industry or organization. Its goal is to build a network information repository that meets the professional needs of that industry or a particular business need of that organization. It intelligently collects Internet resources relevant to a given topic, and is a refinement and extension of the general-purpose search engine: it integrates a particular class of specialized information in the web page library, extracts the data users need field by field, processes it, and returns it to the user in a defined form. It can provide a complete network information resource development solution for users such as subject information gateways, specialized information institutions, specific industry sectors, corporate information centers, and industry experts. A vertical search engine is thus a more targeted search engine that searches only information on a particular topic.
The present invention addresses the problems of using forward or reverse maximum matching alone. In a search system, and especially in a vertical search system, it makes full use of the specialized domain by building a domain dictionary in the machine dictionary library. The value of MAX_Length is first determined from the maximum length of the proper nouns in the dictionary, which removes the problem of choosing the maximum length blindly. Forward and reverse maximum matching are then combined into a bidirectional matching algorithm, so that each direction compensates for the weaknesses of the other during matching, which greatly improves retrieval accuracy. In addition, to address segmentation ambiguity and incomplete matching in the forward and reverse passes, the maximum matching algorithm itself is improved: the idea of "window" matching is introduced to further improve segmentation results.
The bidirectional word-segmentation method for Chinese retrieval proposed by the invention segments text with a maximum matching algorithm that combines the forward and reverse directions. The forward segmentation method (improved MM) uses a device consisting of a central processing unit, an input-output device, a register, a machine dictionary, a window counter, and a memory to segment the Chinese character string $S = C_1 C_2 C_3 C_4 \dots C_n$, and comprises the following steps:
(1) The central processing unit reads the Chinese character string $S = C_1 C_2 C_3 C_4 \dots C_n$ from the input-output device and places it in the register. If n is less than or equal to 1, segmentation matching ends, the segmentation result is saved in the memory, and the method goes directly to step (7); if n > 1, go to step (2);
(2) The central processing unit looks up the maximum matching word length M = MAX_Length in the machine dictionary, and sets the start position of the window counter to the first character of the sentence to be matched, denoted j = 0;
(3) When n-j is greater than or equal to M and M is greater than or equal to 1, take from the register the substring of length M that starts at position j and extends toward the end of the sentence, and match it against the dictionary; if the match fails, go to step (4); if it succeeds, go to step (5). When n-j is less than M and M is greater than or equal to 1, go to step (6). When M is less than 1, go to step (7);
(4) Set j = j+1, that is, increment the window counter by 1; go to step (3);
(5) Take out the successfully matched word and store it in the memory; set j = j+M; go to step (3);
(6) Decrease M by 1 and set j = 0; go to step (3);
(7) Output the segmentation result saved in the memory through the input-output device.
The detailed flow of forward segmentation in the bidirectional word-segmentation method for Chinese retrieval is shown in Fig. 1.
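Steps (1) to (7) can be sketched in Python as below. This is an illustrative reading, not the patent's implementation. The steps do not say how characters that have already been taken out are excluded from later passes, how leftover characters are handled, or in what order results are output, so the sketch makes three labeled assumptions: a window overlapping already-taken characters is treated as a failed match, every leftover character is emitted as a token when M reaches 1, and tokens are returned in sentence order. The function name and the sample dictionary are hypothetical.

```python
def windowed_forward_match(sentence, dictionary):
    """Sketch of the window-based forward pass, steps (1)-(7)."""
    n = len(sentence)
    if n <= 1:                                   # step (1)
        return [sentence] if sentence else []
    m = max((len(w) for w in dictionary), default=1)  # step (2): MAX_Length
    taken = [False] * n      # assumption: marks characters already taken out
    found = []               # (start position, word) pairs kept in "memory"
    while m >= 1:            # step (3): stop once M < 1
        j = 0                # window restarts at the first character
        while n - j >= m:    # step (3): window still fits in the sentence
            window = sentence[j:j + m]
            free = not any(taken[j:j + m])
            # assumption: at M == 1 every remaining character is kept
            if free and (window in dictionary or m == 1):
                found.append((j, window))        # step (5)
                taken[j:j + m] = [True] * m
                j += m
            else:
                j += 1                           # step (4)
        m -= 1                                   # step (6)
    return [word for _, word in sorted(found)]   # step (7), sentence order


words = {"大学", "大学生", "生活", "活动", "动点"}  # hypothetical dictionary
print(windowed_forward_match("大学生活动点", words))  # ['大学生', '活动', '点']
```

Unlike classic MM, the window slides across the whole sentence at each length before M is reduced, so a long word in the middle of a long sentence is found before any shorter word can claim part of it.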
In the original RMM algorithm, matching works in essentially the same way as forward matching, only in the opposite direction; this single difference can produce substantially different results. In the improved algorithm, because a window has been added, one difficulty of reverse matching is positioning the window start point and controlling its direction of movement, which is also one of the important differences from forward matching.
The reverse segmentation method (improved RMM) uses a device consisting of a central processing unit, an input-output device, a register, a machine dictionary, a window counter, and a memory to segment the Chinese character string $S = C_1 C_2 C_3 C_4 \dots C_n$, and comprises the following steps:
(8) The central processing unit reads the Chinese character string $S = C_1 C_2 C_3 C_4 \dots C_n$ from the input-output device and places it in the register. If n is less than or equal to 1, segmentation matching ends, the segmentation result is saved in the memory, and the method goes directly to step (14); if n > 1, go to step (9);
(9) The central processing unit looks up the maximum matching word length M = MAX_Length in the machine dictionary, and sets the start position of the window counter to the last character of the sentence to be matched, denoted n-P, where P = 0;
(10) If n-P is less than 0, matching ends; go directly to step (14). When P is greater than or equal to M and M is greater than or equal to 1, take the substring of length M that starts at position n-P and extends toward the start of the sentence, and match it against the dictionary; if the match fails, go to step (11); if it succeeds, go to step (12). When P is less than M and M is greater than or equal to 1, go to step (13). When M is less than 1, go to step (14);
(11) Set P = P+1; go to step (10);
(12) Take out the successfully matched word and store it in the memory; set P = P+M; go to step (10);
(13) Decrease M by 1 and set P = 0; go to step (10);
(14) Output the segmentation result saved in the memory through the input-output device.
The detailed flow of reverse segmentation in the bidirectional word-segmentation method for Chinese retrieval is shown in Fig. 2.
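A mirror-image sketch of steps (8) to (14) follows. One deviation from the text is made deliberately and should be read as an assumption: step (10) as written compares P with M, whereas this sketch keeps the window moving while n-P is at least M, by symmetry with forward step (3), so that the window always fits inside the sentence. The other assumptions match the forward sketch (overlapping windows fail, leftover characters are kept at M = 1, output is in sentence order). All names and the sample dictionary are hypothetical.

```python
def windowed_reverse_match(sentence, dictionary):
    """Sketch of the window-based reverse pass, steps (8)-(14)."""
    n = len(sentence)
    if n <= 1:                                   # step (8)
        return [sentence] if sentence else []
    m = max((len(w) for w in dictionary), default=1)  # step (9): MAX_Length
    taken = [False] * n      # assumption: marks characters already taken out
    found = []               # (start position, word) pairs kept in "memory"
    while m >= 1:            # step (10): stop once M < 1
        p = 0                # window restarts at the last character
        # assumption: n - p >= m replaces the literal "P >= M" of step (10)
        while n - p >= m:
            start, end = n - p - m, n - p        # window ends at n - P
            window = sentence[start:end]
            free = not any(taken[start:end])
            # assumption: at M == 1 every remaining character is kept
            if free and (window in dictionary or m == 1):
                found.append((start, window))    # step (12)
                taken[start:end] = [True] * m
                p += m
            else:
                p += 1                           # step (11)
        m -= 1                                   # step (13)
    return [word for _, word in sorted(found)]   # step (14), sentence order


words = {"发展", "发展中", "中国", "国家", "人民"}  # hypothetical dictionary
print(windowed_reverse_match("发展中国家人民", words))  # ['发展中', '国家', '人民']
```

The only structural difference from the forward sketch is how the window indices are derived from P, which reflects the remark above that window positioning and direction are the delicate part of the reverse pass.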
The bidirectional word-segmentation device for Chinese retrieval proposed by the invention consists of a central processing unit, an input-output device, a register, a machine dictionary, a window counter, and a memory. It segments the Chinese character string $S = C_1 C_2 C_3 C_4 \dots C_n$ through the following steps:
(1) The central processing unit reads the Chinese character string $S = C_1 C_2 C_3 C_4 \dots C_n$ from the input-output device and places it in the register. If n is less than or equal to 1, segmentation matching ends, the segmentation result is saved in the memory, and the method goes directly to step (7); if n > 1, go to step (2);
(2) The central processing unit looks up the maximum matching word length M = MAX_Length in the machine dictionary, and sets the start position of the window counter to the first character of the sentence to be matched, denoted j = 0;
(3) When n-j is greater than or equal to M and M is greater than or equal to 1, take from the register the substring of length M that starts at position j and extends toward the end of the sentence, and match it against the dictionary; if the match fails, go to step (4); if it succeeds, go to step (5). When n-j is less than M and M is greater than or equal to 1, go to step (6). When M is less than 1, go to step (7);
(4) Set j = j+1, that is, increment the window counter by 1; go to step (3);
(5) Take out the successfully matched word and store it in the memory; set j = j+M; go to step (3);
(6) Decrease M by 1 and set j = 0; go to step (3);
(7) Output the segmentation result saved in the memory through the input-output device;
(8) The central processing unit reads the Chinese character string $S = C_1 C_2 C_3 C_4 \dots C_n$ from the input-output device and places it in the register. If n is less than or equal to 1, segmentation matching ends, the segmentation result is saved in the memory, and the method goes directly to step (14); if n > 1, go to step (9);
(9) The central processing unit looks up the maximum matching word length M = MAX_Length in the machine dictionary, and sets the start position of the window counter to the last character of the sentence to be matched, denoted n-P, where P = 0;
(10) If n-P is less than 0, matching ends; go directly to step (14). When P is greater than or equal to M and M is greater than or equal to 1, take the substring of length M that starts at position n-P and extends toward the start of the sentence, and match it against the dictionary; if the match fails, go to step (11); if it succeeds, go to step (12). When P is less than M and M is greater than or equal to 1, go to step (13). When M is less than 1, go to step (14);
(11) Set P = P+1; go to step (10);
(12) Take out the successfully matched word and store it in the memory; set P = P+M; go to step (10);
(13) Decrease M by 1 and set P = 0; go to step (10);
(14) Output the segmentation result saved in the memory through the input-output device.
Brief description of the drawings
Fig. 1 is the detailed flow chart of forward segmentation in the bidirectional word-segmentation method for Chinese retrieval of the present invention;
Fig. 2 is the detailed flow chart of reverse segmentation in the bidirectional word-segmentation method for Chinese retrieval of the present invention;
Fig. 3 is the framework model of the Web full-text search middleware built on the basis of the present invention;
Fig. 4 is the framework model of the document search system built on the Web full-text search middleware shown in Fig. 3.
Detailed description
The improvement achieved by the bidirectional Chinese segmentation method is illustrated below with the segmentation results for sentence 1, 发展中国家人民 ("the people of developing countries"), and sentence 2, 大学生活动点 ("university student activity point"). With the improved MM algorithm, sentence 1 yields 发展/中国/家/人民 ("develop / China / home / people") and sentence 2 yields 大学生/活动/点 ("university student / activity / point"). With the improved RMM algorithm, sentence 1 yields 发展中/国家/人民 ("developing / country / people") and sentence 2 yields 大学/生活/动点 ("university / life / moving point"). A human reader can readily tell that the improved RMM result is correct for sentence 1 and the improved MM result is correct for sentence 2.
It follows that if improved MM or improved RMM is used alone and both kinds of sentence occur in the same segmentation task, one kind will inevitably be segmented incorrectly. If the improved maximum matching algorithm combining the forward and reverse directions is used instead, the more suitable matching direction can be identified by comparison against the specialized dictionary, so that the correct result is obtained: during matching, improved RMM is selected for sentence 1 and improved MM for sentence 2.
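The text does not spell out how the better of the two results is chosen. One plausible rule, offered here purely as an assumption, follows the "long words first" principle: score each candidate segmentation by the sum of squared token lengths and keep the higher score. The function name `choose_segmentation`, the scoring rule, and the tie-break in favor of the reverse result are all hypothetical.

```python
def choose_segmentation(forward_tokens, reverse_tokens):
    """Pick one of two candidate segmentations of the same sentence.

    Hypothetical rule: prefer the candidate whose tokens are longer overall,
    measured by the sum of squared token lengths; ties go to the reverse
    result. The source does not specify its selection rule.
    """
    def score(tokens):
        return sum(len(token) ** 2 for token in tokens)

    if score(forward_tokens) > score(reverse_tokens):
        return forward_tokens
    return reverse_tokens


# Candidate results for the two example sentences discussed above.
sentence1 = choose_segmentation(["发展", "中国", "家", "人民"],
                                ["发展中", "国家", "人民"])
sentence2 = choose_segmentation(["大学生", "活动", "点"],
                                ["大学", "生活", "动点"])
print(sentence1)  # ['发展中', '国家', '人民']
print(sentence2)  # ['大学生', '活动', '点']
```

On these two inputs the rule selects the reverse result for sentence 1 and the forward result for sentence 2, which agrees with the judgments stated above; it is not claimed to be correct in general.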
In a concrete implementation, the present invention can also be applied in the field of information retrieval to build the following Web full-text search middleware and document search system.
Fig. 3 shows the framework model of the Web full-text search middleware. The middleware consists of an information acquisition module, an information processing module, and a full-text search module, each briefly described below.
1) Information acquisition module. This module is mainly responsible for multithreaded crawling of Web pages and for de-duplicating the URLs obtained by crawling. Its acquisition interface faces the Web site: given only an initial URL, it can crawl all pages of the entire site using a breadth-first search strategy.
2) Information processing module. This module has two main tasks. First, it extracts the body text from the collected web pages. It then segments the extracted text with the bidirectional word-segmentation method for Chinese retrieval described above and builds an index; the segmentation function is implemented by the bidirectional word-segmentation device for Chinese retrieval described above.
3) Full-text search module. This module provides the interface for user search. It encapsulates full-text search, parsing of user search conditions, and ranking of search results, together with personalized functions that improve the user experience, such as intelligent keyword suggestions, related-keyword search, and advanced search.
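The three modules can be pictured with a toy pipeline. This is a minimal single-threaded sketch, not the middleware itself: it crawls an in-memory link graph instead of the Web, uses whitespace splitting as a stand-in for the bidirectional segmenter, and omits ranking. Every name here (`crawl_breadth_first`, `build_inverted_index`, `search`, `LINKS`, `TEXT`) is hypothetical.

```python
from collections import defaultdict, deque


def crawl_breadth_first(start_url, fetch_links):
    """Acquisition: breadth-first crawl with URL de-duplication."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:      # de-duplicate before enqueueing
                seen.add(link)
                queue.append(link)
    return visited


def build_inverted_index(pages, segment):
    """Processing: segment each page's text and index token -> set of URLs."""
    index = defaultdict(set)
    for url, text in pages.items():
        for token in segment(text):
            index[token].add(url)
    return index


def search(index, query, segment):
    """Search: return the URLs that contain every token of the query."""
    results = None
    for token in segment(query):
        hits = index.get(token, set())
        results = hits if results is None else results & hits
    return sorted(results or [])


# Toy site: link graph and already-extracted page text (both made up).
LINKS = {"/a": ["/b", "/c", "/b"], "/b": ["/a", "/d"], "/c": ["/d"], "/d": []}
TEXT = {"/a": "大学生 活动", "/b": "大学 生活", "/c": "活动 点", "/d": "大学生 生活"}

order = crawl_breadth_first("/a", lambda url: LINKS[url])
index = build_inverted_index({url: TEXT[url] for url in order}, str.split)
print(order)                                 # ['/a', '/b', '/c', '/d']
print(search(index, "大学生", str.split))     # ['/a', '/d']
```

Passing the segmenter in as a parameter keeps the indexing and query paths on the same tokenization, which matters because a query segmented differently from the index would miss matches.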
Fig. 4 shows the framework model of the document search system. The system combines J2EE technology with an MVC architecture, uses the Web full-text search middleware, and is developed in Java.
1) Presentation layer. This layer generates the Web pages that users access, including the search interface of the document search engine, the results page, the advanced search page, and the pages used to initialize or adjust server functions. In short, the presentation layer is the human-machine interface between the system and its various users.
2) Logic layer. The logic layer sits on the server side and contains numerous functional modules; it is the core layer that implements the document search system and its search services. Every function exposed in the presentation layer is implemented by a corresponding code module in the logic layer. Its design covers two main tasks. The first is automatic acquisition of web page information from the Internet, implemented by a dedicated multithreaded crawler that stores the collected page information in the data storage layer below. The second is analyzing the user's conditions, performing the combined search, caching the results according to a specific cache policy, and ordering the results shown to the user chronologically or by relevance. The design of the logic layer is the key factor in achieving system robustness, reusability, extensibility, and maintainability.
3) Data storage layer. This layer is mainly responsible for de-duplicating the URLs of the HTML pages collected by the crawler, extracting the page body text, and packaging the extraction results as objects. The extracted text is segmented with the bidirectional word-segmentation method for Chinese retrieval described above and indexed, with the segmentation function implemented by the bidirectional word-segmentation device for Chinese retrieval described above, and the corresponding data is stored in the index file.
Claims (8)
1. A bidirectional word-segmentation method for Chinese retrieval, characterized in that a device consisting of a central processing unit, an input-output device, a register, a machine dictionary, a window counter, and a memory segments the Chinese character string $S = C_1 C_2 C_3 C_4 \dots C_n$ through the following steps:
(1) The central processing unit reads the Chinese character string $S = C_1 C_2 C_3 C_4 \dots C_n$ from the input-output device and places it in the register. If n is less than or equal to 1, segmentation matching ends, the segmentation result is saved in the memory, and the method goes directly to step (7); if n > 1, go to step (2);
(2) The central processing unit looks up the maximum matching word length M = MAX_Length in the machine dictionary, and sets the start position of the window counter to the first character of the sentence to be matched, denoted j = 0;
(3) When n-j is greater than or equal to M and M is greater than or equal to 1, take from the register the substring of length M that starts at position j and extends toward the end of the sentence, and match it against the dictionary; if the match fails, go to step (4); if it succeeds, go to step (5). When n-j is less than M and M is greater than or equal to 1, go to step (6). When M is less than 1, go to step (7);
(4) Set j = j+1, that is, increment the window counter by 1; go to step (3);
(5) Take out the successfully matched word and store it in the memory; set j = j+M; go to step (3);
(6) Decrease M by 1 and set j = 0; go to step (3);
(7) Output the segmentation result saved in the memory through the input-output device;
(8) The central processing unit reads the Chinese character string $S = C_1 C_2 C_3 C_4 \dots C_n$ from the input-output device and places it in the register. If n is less than or equal to 1, segmentation matching ends, the segmentation result is saved in the memory, and the method goes directly to step (14); if n > 1, go to step (9);
(9) The central processing unit looks up the maximum matching word length M = MAX_Length in the machine dictionary, and sets the start position of the window counter to the last character of the sentence to be matched, denoted n-P, where P = 0;
(10) If n-P is less than 0, matching ends; go directly to step (14). When P is greater than or equal to M and M is greater than or equal to 1, take the substring of length M that starts at position n-P and extends toward the start of the sentence, and match it against the dictionary; if the match fails, go to step (11); if it succeeds, go to step (12). When P is less than M and M is greater than or equal to 1, go to step (13). When M is less than 1, go to step (14);
(11) Set P = P+1; go to step (10);
(12) Take out the successfully matched word and store it in the memory; set P = P+M; go to step (10);
(13) Decrease M by 1 and set P = 0; go to step (10);
(14) Output the segmentation result saved in the memory through the input-output device.
2. A bidirectional word-segmentation device for Chinese retrieval, characterized in that it consists of a central processing unit, an input-output device, a register, a machine dictionary, a window counter, and a memory, and segments the Chinese character string $S = C_1 C_2 C_3 C_4 \dots C_n$ through the following steps:
(1) The central processing unit reads the Chinese character string $S = C_1 C_2 C_3 C_4 \dots C_n$ from the input-output device and places it in the register. If n is less than or equal to 1, segmentation matching ends, the segmentation result is saved in the memory, and the method goes directly to step (7); if n > 1, go to step (2);
(2) The central processing unit looks up the maximum matching word length M = MAX_Length in the machine dictionary, and sets the start position of the window counter to the first character of the sentence to be matched, denoted j = 0;
(3) When n-j is greater than or equal to M and M is greater than or equal to 1, take from the register the substring of length M that starts at position j and extends toward the end of the sentence, and match it against the dictionary; if the match fails, go to step (4); if it succeeds, go to step (5). When n-j is less than M and M is greater than or equal to 1, go to step (6). When M is less than 1, go to step (7);
(4) Set j = j+1, that is, increment the window counter by 1; go to step (3);
(5) Take out the successfully matched word and store it in the memory; set j = j+M; go to step (3);
(6) Decrease M by 1 and set j = 0; go to step (3);
(7) Output the segmentation result saved in the memory through the input-output device;
(8) The central processing unit reads the Chinese character string $S = C_1 C_2 C_3 C_4 \dots C_n$ from the input-output device and places it in the register. If n is less than or equal to 1, segmentation matching ends, the segmentation result is saved in the memory, and the method goes directly to step (14); if n > 1, go to step (9);
(9) The central processing unit looks up the maximum matching word length M = MAX_Length in the machine dictionary, and sets the start position of the window counter to the last character of the sentence to be matched, denoted n-P, where P = 0;
(10) If n-P is less than 0, matching ends; go directly to step (14). When P is greater than or equal to M and M is greater than or equal to 1, take the substring of length M that starts at position n-P and extends toward the start of the sentence, and match it against the dictionary; if the match fails, go to step (11); if it succeeds, go to step (12). When P is less than M and M is greater than or equal to 1, go to step (13). When M is less than 1, go to step (14);
(11) Set P = P+1; go to step (10);
(12) Take out the successfully matched word and store it in the memory; set P = P+M; go to step (10);
(13) Decrease M by 1 and set P = 0; go to step (10);
(14) Output the segmentation result saved in the memory through the input-output device.
3. A Web full-text search middleware, composed of an information acquisition interface, an information acquisition module, an information processing module, a full-text search module and a retrieval interface, wherein: the information acquisition module crawls the Web pages supplied through the information acquisition interface and de-duplicates the URLs obtained by crawling; the information processing module first extracts text from the web page content collected by the information acquisition module, then performs word segmentation on the extraction result and builds an index; the full-text search module internally encapsulates full-text retrieval, parsing of user search conditions, ranking of retrieval results, and personalized operation functions that improve the user experience, and exposes the retrieval interface externally; characterized in that the device in the information processing module that performs word segmentation on the text extraction result is the bidirectional word-segmentation device based on Chinese retrieval as claimed in claim 2.
4. The Web full-text search middleware as claimed in claim 3, characterized in that the information acquisition module uses multithreading when crawling the Web pages from the information acquisition interface.
5. A Web text search method of a middleware, the middleware being composed of an information acquisition interface, an information acquisition module, an information processing module, a full-text search module and a retrieval interface, the full-text search comprising the following steps:
(1) the information acquisition module crawls the Web pages supplied through the information acquisition interface and de-duplicates the URLs obtained by crawling;
(2) the information processing module first extracts text from the web page content collected by the information acquisition module, then performs word segmentation on the extraction result and builds an index;
(3) the full-text search module internally encapsulates full-text retrieval, parsing of user search conditions, ranking of retrieval results, and personalized operation functions that improve the user experience, and exposes the retrieval interface externally;
characterized in that, in step (2), the method by which the information processing module performs word segmentation on the text extraction result is the bidirectional word-segmentation method based on Chinese retrieval as claimed in claim 1.
6. The Web text search method of a middleware as claimed in claim 5, characterized in that the information acquisition module uses multithreading when crawling the Web pages from the information acquisition interface.
7. A document search system, characterized in that it consists of a man-machine interface and the Web full-text search middleware as claimed in claim 3, wherein the man-machine interface supplies an initial URL to the information acquisition interface of the middleware and displays the retrieval results that the middleware outputs through the retrieval interface.
8. A document search method, in which the search is performed by a man-machine interface and a Web full-text search middleware, comprising the following steps:
(1) the man-machine interface provides an initial URL;
(2) the Web full-text search middleware performs a full-text search;
(3) the middleware outputs the retrieval results, which are displayed on the man-machine interface;
characterized in that, in step (2), the method by which the Web full-text search middleware performs the full-text search is the Web text search method of a middleware as claimed in claim 5.
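For illustration only, the pipeline recited in claims 3 to 8 (multithreaded crawling with URL de-duplication, text extraction, word segmentation, indexing, then retrieval) might be sketched in Python as below. This is a minimal sketch under stated assumptions, not the claimed implementation: the segmenter of claims 1 and 2 is not reproduced in this section, so a generic textbook bidirectional maximum-matching segmenter stands in for it, and every function name, the lexicon, the tie-breaking rule and the URL-sorted result order are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser
from urllib.parse import urldefrag


def dedupe_urls(urls):
    """De-duplicate URLs, ignoring #fragments and keeping first-seen order."""
    seen, unique = set(), []
    for url in urls:
        key = urldefrag(url.strip()).url
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


def crawl(urls, fetch, workers=4):
    """Fetch de-duplicated URLs on a thread pool; `fetch` is caller-supplied."""
    targets = dedupe_urls(urls)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(targets, pool.map(fetch, targets)))


class _TextExtractor(HTMLParser):
    """Collects visible text and skips <script>/<style> bodies."""

    def __init__(self):
        super().__init__()
        self._skip = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def extract_text(html):
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def forward_max_match(text, lexicon, max_len):
    """Scan left to right, taking the longest lexicon word at each position."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:  # single characters always pass
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_max_match(text, lexicon, max_len):
    """Scan right to left, taking the longest lexicon word ending at each position."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()
    return tokens


def segment(text, lexicon):
    """Stand-in segmenter: keep the result with fewer tokens (ties: backward)."""
    max_len = max(map(len, lexicon), default=1)
    fwd = forward_max_match(text, lexicon, max_len)
    bwd = backward_max_match(text, lexicon, max_len)
    return bwd if len(bwd) <= len(fwd) else fwd


def build_index(pages, lexicon):
    """Inverted index: token -> set of URLs whose extracted text contains it."""
    index = defaultdict(set)
    for url, html in pages.items():
        for token in segment(extract_text(html), lexicon):
            if token.strip():
                index[token].add(url)
    return index


def search(index, query, lexicon):
    """AND query over segmented terms; URL order is a placeholder for ranking."""
    hits = None
    for token in segment(query, lexicon):
        if token.strip():
            urls = index.get(token, set())
            hits = urls if hits is None else hits & urls
    return sorted(hits or [])
```

In this sketch a caller would pass its own `fetch` callable (for example, one that downloads a page and returns its HTML) to `crawl`, feed the resulting pages to `build_index`, and then call `search` with the same lexicon so that queries and documents are segmented consistently.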
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210241471.2A CN103577391A (en) | 2012-07-28 | 2012-07-28 | Chinese retrieval based bidirectional word-segmentation method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210241471.2A CN103577391A (en) | 2012-07-28 | 2012-07-28 | Chinese retrieval based bidirectional word-segmentation method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN103577391A (en) | 2014-02-12 |
Family
Family ID: 50049200
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201210241471.2A Pending CN103577391A (en) | 2012-07-28 | 2012-07-28 | Chinese retrieval based bidirectional word-segmentation method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103577391A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106649251A (en) * | 2015-10-30 | 2017-05-10 | 北京国双科技有限公司 | Method and device for Chinese word segmentation |
CN106653005A (en) * | 2016-11-23 | 2017-05-10 | 河池学院 | A voice processing method for robots |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6879951B1 (en) * | 1999-07-29 | 2005-04-12 | Matsushita Electric Industrial Co., Ltd. | Chinese word segmentation apparatus |
CN101510221A (en) * | 2009-02-17 | 2009-08-19 | 北京大学 | Enquiry statement analytical method and system for information retrieval |
CN102063424A (en) * | 2010-12-24 | 2011-05-18 | 上海电机学院 | Method for Chinese word segmentation |
CN102402502A (en) * | 2011-11-24 | 2012-04-04 | 北京趣拿信息技术有限公司 | Word segmentation processing method and device for search engine |
- 2012-07-28: CN application CN201210241471.2A, published as CN103577391A (en); status: active, pending
Non-Patent Citations (3)
Title |
---|
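Zhang Weigang (张维刚) et al.: "Design and Application of Web Full-Text Search Middleware" (Web 全文检索中间件的设计与应用), 《计算机应用》 (Journal of Computer Applications) * |
Zou Song (邹嵩) et al.: "Algorithm Research on Word Segmentation Technology in Vertical Search Engines" (垂直搜索引擎中分词技术的算法研究), 《计算机技术与发展》 (Computer Technology and Development) * |
Zou Song (邹嵩) et al.: "Algorithm Research on Word Segmentation Technology in Vertical Search Engines" (垂直搜索引擎中分词技术的算法研究), 《计算机技术与发展》 (Computer Technology and Development), vol. 22, no. 2, 28 February 2012 (2012-02-28) * |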
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN100405371C (en) | | Method and system for abstracting new word |
US7984032B2 (en) | | Iterators for applying term occurrence-level constraints in natural language searching |
CN110399457A (en) | | An intelligent question answering method and system |
CN102929902A (en) | | Character splitting method and device based on Chinese retrieval |
US20150178273A1 (en) | | Unsupervised Relation Detection Model Training |
CN103544255A (en) | | Text semantic relativity based network public opinion information analysis method |
CN103678412B (en) | | A kind of method and device of file retrieval |
CN106776711A (en) | | A kind of Chinese medical knowledge mapping construction method based on deep learning |
CN103425640A (en) | | Multimedia questioning-answering system and method |
CN107690634B (en) | | Automatic query pattern generation method and system |
CN110909170B (en) | | Interest point knowledge graph construction method and device, electronic equipment and storage medium |
CN104239286A (en) | | Method and device for mining synonymous phrases and method and device for searching related contents |
CN102779135B (en) | | Method and device for obtaining cross-linguistic search resources and corresponding search method and device |
CN103886020B (en) | | A kind of real estate information method for fast searching |
CN113761890A (en) | | A Multi-level Semantic Information Retrieval Method Based on BERT Context Awareness |
CN103294820B (en) | | WEB page classifying method and system based on semantic extension |
CN119698608A (en) | | A terminal-based question-answering method and device |
CN109522396B (en) | | Knowledge processing method and system for national defense science and technology field |
CN103310013A (en) | | Subject-oriented web page collection system |
CN110209781A (en) | | A kind of text handling method, device and relevant device |
CN102646124A (en) | | Method for automatically identifying address information |
CN109948154B (en) | | A system and method for character acquisition and relationship recommendation based on mailbox name |
CN106649557A (en) | | Semantic association mining method for defect report and mail list |
CN103020083B (en) | | The automatic mining method of demand recognition template, demand recognition methods and corresponding device |
CN103544167A (en) | | Backward word segmentation method and device based on Chinese retrieval |
Legal Events
Code | Title | Description |
---|---|---|
C06 | Publication | |
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
SE01 | Entry into force of request for substantive examination | |
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | |
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20140212 |