CN103577391A

CN103577391A - Chinese retrieval based bidirectional word-segmentation method and device

Info

Publication number: CN103577391A
Application number: CN201210241471.2A
Authority: CN
Inventors: 刘迎春; 魏华峰; 方筠捷
Original assignee: JIANGSU XINRUIFENG INFORMATION TECHNOLOGY Co Ltd
Current assignee: JIANGSU XINRUIFENG INFORMATION TECHNOLOGY Co Ltd
Priority date: 2012-07-28
Filing date: 2012-07-28
Publication date: 2014-02-12

Abstract

The invention provides a bidirectional word-segmentation method and device based on Chinese retrieval. With it, Chinese character strings can be segmented accurately without loss of meaning, and the segmentation result remains accurate even when a sentence is long, which improves query accuracy. The method and device can be applied to automatic summarization and automatic classification systems in the field of information retrieval.

Description

A bidirectional word-segmentation method and device based on Chinese retrieval

Technical field

The present invention relates to the processing of web page text information in computer networks, and in particular to a method and device for retrieving Chinese web pages in a search engine.

Background art

With the development of the Internet, the number of Web pages has grown sharply, and web pages have become the largest and most wide-ranging source of information available to people. Much useful information is buried in this vast sea of Web pages, and faced with information on this scale, people cannot process all of it manually. Text search is one of the important applied technologies in large-scale information processing, and also an important research direction in that field. As research into text classification and search deepens, text search technology is being applied ever more widely in information technology. Word segmentation is the "soul" of any search engine: whether the keywords in a retrieval request can be correctly understood, cut out of the sentence, and matched is the critical step that determines whether a search engine returns correct results. The accuracy of word segmentation directly affects the accuracy of the retrieval results. Commonly used word-segmentation techniques include single-character segmentation, two-character segmentation, and the maximum-length matching algorithm.

(1) Single-character segmentation

Single-character segmentation cuts text using a single character as the unit: every character of the text is cut out separately, and the index is built on that basis. For example, when "中华人民共和国" (People's Republic of China) is segmented this way, the resulting words are "中", "华", "人", "民", "共", "和", "国". Single-character segmentation is simple to implement, but the units it produces carry no meaning of their own, so the original semantics of the text are lost.

(2) Two-character segmentation

Two-character segmentation cuts text using two characters as the unit: each pair of adjacent characters in the text is cut out, and the index is built on that basis. For example, segmenting "中华人民共和国" (People's Republic of China) this way yields "中华", "华人", "人民", "民共", "共和", "和国". Compared with single-character segmentation, this method handles word positions better, but it cuts out many useless entries and therefore produces a redundant index.
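The two fixed-width schemes above can be illustrated with a short sketch. It assumes Python, and the helper name `char_ngrams` is hypothetical; the sample string is the seven-character Chinese name of the People's Republic of China, 中华人民共和国.

```python
def char_ngrams(text: str, n: int) -> list[str]:
    """Return every run of n adjacent characters, in order of position."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [text[i:i + n] for i in range(len(text) - n + 1)]


sentence = "中华人民共和国"
unigrams = char_ngrams(sentence, 1)  # single-character segmentation
bigrams = char_ngrams(sentence, 2)   # two-character segmentation
print(unigrams)
print(bigrams)
```

Neither scheme consults a dictionary, which is why the single characters lose the meaning of the phrase and the pairs include entries such as "民共" that are not words.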

(3) Maximum-length matching algorithm

The most commonly used word-segmentation algorithm at present is the maximum matching method (also called the MM method). It is widely used, and its segmentation process relies on a single segmentation dictionary. Its basic idea is as follows. Let the longest word in the dictionary contain K characters. Starting from the beginning of the sentence, take a string of K characters and look it up in the dictionary. If the dictionary contains such a K-character word, the match succeeds and the string is cut from the sentence as a word. Then move to the start of the remainder of the sentence, take another string of K characters in the same way, and repeat until the whole sentence has been segmented. If no dictionary entry matches the current string, delete one character from the tail of the string to form a string of K-1 characters and look that up in the dictionary. If it matches, the string of K-1 characters is cut from the sentence as a word; if it fails, repeat the step, deleting one more character from the tail to form a string of K-2 characters, and so on until a match succeeds.

Suppose the character string to be segmented is $S = C_1 C_2 C_3 C_4 \dots C_n$, where each $C_i$ ($i = 1, 2, 3, \dots, n$) is a single Chinese character, and the longest word in the dictionary has length $i$. The first cut is attempted with length $i$; if the match succeeds, the result is:

$S = S_1 + S_2,\quad S_1 = \{C_1 \dots C_i\},\quad S_2 = \{C_{i+1} \dots C_n\}$

If the match fails, matching is retried with length $i-1$; if that succeeds, the result is:

$S = S_1 + S_2,\quad S_1 = \{C_1 \dots C_{i-1}\},\quad S_2 = \{C_i \dots C_n\}$

By analogy, once $S_1$ has been cut out, $S_2$ is processed recursively in the same way.
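The conventional MM procedure described above can be sketched as follows. This is a minimal illustration in Python, not the patent's improved method. The function name, the toy dictionary, and the sample sentence are hypothetical, and the sketch assumes that a single character with no dictionary entry is emitted as a word so that the loop always advances.

```python
def forward_max_match(sentence: str, dictionary: set[str]) -> list[str]:
    """Conventional forward maximum matching: longest dictionary word first."""
    max_len = max((len(w) for w in dictionary), default=1)  # K in the text
    words = []
    start = 0
    while start < len(sentence):
        # Try the longest candidate first, dropping one tail character per failure.
        for length in range(min(max_len, len(sentence) - start), 0, -1):
            candidate = sentence[start:start + length]
            # Assumption: an unmatched single character is emitted as a word.
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                start += length  # continue with the remainder S2
                break
    return words


toy_dictionary = {"大学", "大学生", "学生", "生活", "活动", "动点"}
mm_result = forward_max_match("大学生活动点", toy_dictionary)
print(mm_result)
```

With this toy dictionary the three-character entry "大学生" wins at the start of the sentence, which is the "long words first" behavior of the method.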

The defining feature of the maximum matching algorithm is "long words first"; compared with ordinary matching, it can improve the retrieval rate considerably. An algorithm that takes words in the same steps as forward maximum-length matching, but starts from the last character of the string and takes words toward the front, is called the reverse maximum-length matching algorithm, or RMM. RMM is increasingly favored in the design of segmentation schemes, because in certain specialized retrieval tasks it segments more accurately.
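For comparison, a conventional RMM pass can be sketched in the same way, again in Python with a hypothetical function name, toy dictionary, and sample sentence. As in any such sketch, emitting an unmatched single character as a word is an assumption made so that the scan always advances.

```python
def reverse_max_match(sentence: str, dictionary: set[str]) -> list[str]:
    """Conventional reverse maximum matching: scan from the last character."""
    max_len = max((len(w) for w in dictionary), default=1)
    words = []
    end = len(sentence)
    while end > 0:
        # Try the longest candidate ending at `end`, dropping one head character per failure.
        for length in range(min(max_len, end), 0, -1):
            candidate = sentence[end - length:end]
            # Assumption: an unmatched single character is emitted as a word.
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                end -= length
                break
    words.reverse()  # words were collected right to left
    return words


toy_dictionary = {"大学", "大学生", "学生", "生活", "活动", "动点"}
rmm_result = reverse_max_match("大学生活动点", toy_dictionary)
print(rmm_result)
```

On this sample the reverse pass cuts the sentence differently from a forward pass over the same dictionary, which is the kind of direction-dependent result the text refers to.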

In ordinary maximum-length matching, whether forward or reverse, the maximum match is taken only within a local range at the start: the first M characters or the last M characters of each sentence are matched. For short sentences this works well, but when the sentence to be matched is much longer than the maximum match length, the drawback becomes obvious: local matching cannot cover the whole sentence, so words in the middle of the sentence go unmatched, which causes missed words and incomplete retrieval.

Summary of the invention

A vertical search engine serves a particular industry or organization. Its goal is to build a network information resource base for that industry or organization that meets the industry's professional needs or a specific business need of the organization, intelligently collecting from the Internet the information resources that a given topic requires. It is a refinement and extension of the general-purpose search engine: it integrates a particular class of specialized information from the web page library, extracts the data users need field by field, processes it, and returns it to the user in a defined form. It can provide a complete network information resource development solution for information users including subject information gateways, specialized information institutions, specific industry sectors, company information centers, and industry experts. A vertical search engine is thus a more targeted search engine that searches only for information on a particular topic.

The present invention addresses the problems of using forward or reverse maximum matching alone. In a search system, especially a vertical search system, it makes full use of the specialized domain environment by building a specialized dictionary in the machine dictionary. The value of MAX_Length is first determined from the maximum length of the proper nouns in the dictionary, which solves the problem of choosing the maximum length blindly in the matching algorithm. Forward and reverse maximum matching are then combined into a bidirectional matching algorithm, so that the two can compensate for each other's shortcomings during matching, which greatly improves retrieval accuracy. In addition to proposing bidirectional matching, the invention improves the maximum-length matching algorithm to address segmentation ambiguity and incomplete matching in the forward and reverse passes, introducing the idea of "window" matching to further improve segmentation.

The bidirectional word-segmentation method based on Chinese retrieval proposed by the invention segments text with a maximum-length matching algorithm that combines the forward and reverse directions. The forward segmentation method (improved MM method) segments a Chinese character string $S = C_1 C_2 C_3 C_4 \dots C_n$ by means of a device composed of a central processing unit, an input-output device, a register, a machine dictionary, a window counter, and a memory, and comprises the following steps:

(1) The central processing unit reads the Chinese character string $S = C_1 C_2 C_3 C_4 \dots C_n$ from the input-output device and places it in the register. If n is less than or equal to 1, segmentation matching ends, the segmentation result is stored in the memory, and the method goes directly to step (7); if n > 1, go to step (2);

(2) The central processing unit looks up the maximum match length M = MAX_Length in the machine dictionary, and the starting position of the window counter is set to the first character of the sentence to be matched, denoted j = 0;

(3) When n-j is greater than or equal to M and M is greater than or equal to 1, a substring of length M is taken from the register, starting at position j and extending toward the end of the string, and is matched; if the match fails, go to step (4), and if it succeeds, go to step (5). When n-j is less than M and M is greater than or equal to 1, go to step (6). When M is less than 1, go to step (7);

(4) j = j+1 (the window counter is incremented by 1); go to step (3);

(5) The successfully matched word is taken out and stored in the memory, j = j+M; go to step (3);

(6) M is decremented by 1 and j = 0; go to step (3);

(7) The segmentation result stored in the memory is output through the input-output device.

The detailed flow of the forward segmentation of the bidirectional word-segmentation method based on Chinese retrieval is shown in Fig. 1.
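Forward steps (1) to (7) can be transcribed almost literally into code. The following Python sketch is one possible reading, not the patented implementation: the function name and the toy dictionary are hypothetical, the "memory" is modeled as a list, and returning a one-character input unchanged in step (1) is an assumption, since the steps do not say what is stored in that case.

```python
def windowed_forward_match(sentence: str, dictionary: set[str]) -> list[str]:
    """Literal reading of forward steps (1)-(7): slide a window of width M."""
    n = len(sentence)
    if n <= 1:
        # Step (1). Assumption: a one-character input is stored as it is.
        return [sentence] if n == 1 else []
    results = []  # plays the role of the memory
    m = max((len(w) for w in dictionary), default=0)  # step (2): M = MAX_Length
    j = 0  # window counter, starting at the first character
    while m >= 1:  # step (3): once M < 1 the loop ends, which is step (7)
        if n - j >= m:
            window = sentence[j:j + m]
            if window in dictionary:
                results.append(window)  # step (5)
                j += m
            else:
                j += 1  # step (4)
        else:
            m -= 1  # step (6): shrink the window and restart from the front
            j = 0
    return results


forward_words = windowed_forward_match("大学生活动点", {"大学生", "活动", "点"})
print(forward_words)
```

Because the window slides across the whole sentence before M is reduced, a word in the middle of a long sentence is still reached. Two points follow from reading the steps literally: words are collected in order of discovery (longest window first), and if the dictionary also contains a shorter word inside an already matched one, such as "大学" inside "大学生", that shorter word is collected as well, since the steps as written do not say how matched characters are excluded from later passes.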

In the original RMM algorithm, matching works essentially the same way as forward matching except that it proceeds in the opposite direction, and this single difference can produce quite different results. In the improved matching algorithm, because a window has been added, one difficulty of reverse matching is locating the starting point of the window and controlling its direction of movement; this is also one of its important differences from forward matching.

The reverse segmentation method (improved RMM method) segments a Chinese character string $S = C_1 C_2 C_3 C_4 \dots C_n$ by means of a device composed of a central processing unit, an input-output device, a register, a machine dictionary, a window counter, and a memory, and comprises the following steps:

(8) The central processing unit reads the Chinese character string $S = C_1 C_2 C_3 C_4 \dots C_n$ from the input-output device and places it in the register. If n is less than or equal to 1, segmentation matching ends, the segmentation result is stored in the memory, and the method goes directly to step (14); if n > 1, go to step (9);

(9) The central processing unit looks up the maximum match length M = MAX_Length in the machine dictionary, and the starting position of the window counter is set to the last character of the sentence to be matched, denoted n-P, where P = 0;

(10) If n-P is less than 0, matching ends and the method goes directly to step (14). When P is greater than or equal to M and M is greater than or equal to 1, a substring of length M is taken starting at position n-P and extending toward the beginning of the string, and is matched; if the match fails, go to step (11), and if it succeeds, go to step (12). When P is less than M and M is greater than or equal to 1, go to step (13). When M is less than 1, go to step (14);

(11) P = P+1; go to step (10);

(12) The successfully matched word is taken out and stored in the memory, P = P+M; go to step (10);

(13) M is decremented by 1 and P = 0; go to step (10);

(14) The segmentation result stored in the memory is output through the input-output device.

The detailed flow of the reverse segmentation of the bidirectional word-segmentation method based on Chinese retrieval is shown in Fig. 2.
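Reverse steps (8) to (14) can be sketched in the same style. The Python below is an interpretation rather than the patented implementation, and it rests on one labeled assumption: the test in step (10) is read as "n-P is greater than or equal to M", mirroring forward step (3), because a window of width M can only be taken when at least M characters remain in front of position n-P. The function name and toy dictionary are hypothetical.

```python
def windowed_reverse_match(sentence: str, dictionary: set[str]) -> list[str]:
    """One reading of reverse steps (8)-(14): slide a window from the end."""
    n = len(sentence)
    if n <= 1:
        # Step (8). Assumption: a one-character input is stored as it is.
        return [sentence] if n == 1 else []
    results = []  # plays the role of the memory
    m = max((len(w) for w in dictionary), default=0)  # step (9): M = MAX_Length
    p = 0  # offset from the end; the window ends at position n - p
    while m >= 1:  # step (10): once M < 1 the loop ends, which is step (14)
        # Assumption: compare the remaining length n - p with M, as in step (3).
        # Under this reading n - p never becomes negative.
        if n - p >= m:
            window = sentence[n - p - m:n - p]
            if window in dictionary:
                results.append(window)  # step (12)
                p += m
            else:
                p += 1  # step (11)
        else:
            m -= 1  # step (13): shrink the window and restart from the end
            p = 0
    return results


reverse_words = windowed_reverse_match("大学生活动点", {"大学", "生活", "动点"})
print(reverse_words)
```

The words are collected from right to left, so a caller that needs sentence order would reverse the list for a single pass, or sort by position if positions are recorded.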

The bidirectional word-segmentation device based on Chinese retrieval proposed by the invention is composed of a central processing unit, an input-output device, a register, a machine dictionary, a window counter, and a memory, and segments a Chinese character string $S = C_1 C_2 C_3 C_4 \dots C_n$ by the following steps:

(4) j = j+1 (the window counter is incremented by 1); go to step (3);

(6) M is decremented by 1 and j = 0; go to step (3);

(11) P = P+1; go to step (10);

(13) M is decremented by 1 and P = 0; go to step (10);

Brief description of the drawings

Fig. 1 is the detailed flow chart of the forward segmentation of the bidirectional word-segmentation method based on Chinese retrieval of the present invention;

Fig. 2 is the detailed flow chart of the reverse segmentation of the bidirectional word-segmentation method based on Chinese retrieval of the present invention;

Fig. 3 is the framework model of the Web full-text search middleware built on the basis of the present invention;

Fig. 4 is the framework model of the document search system built on the Web full-text search middleware shown in Fig. 3.

Embodiment

The improvement achieved by the Chinese bidirectional word-segmentation method is illustrated below using the segmentation results of two concrete examples, sentence 1 "people of developing countries" and sentence 2 "university student activity point". With the improved MM algorithm, sentence 1 yields "development / China / household / people" and sentence 2 yields "university student / activity / point". With the improved RMM algorithm, sentence 1 yields "developing / country / people" and sentence 2 yields "university / life / moving point". A human reader can readily tell that for sentence 1 the result of the improved RMM algorithm is correct, while for sentence 2 the result of the improved MM algorithm is correct.

It follows that when the improved MM algorithm or the improved RMM algorithm is used alone, if both kinds of sentence occur in the same segmentation task, one kind will inevitably be segmented incorrectly. If instead the improved maximum-length matching algorithm combining the forward and reverse directions is used, the more suitable matching direction can be identified by checking against the specialized dictionary, so that the correct segmentation is obtained: during matching, the improved RMM algorithm is selected for sentence 1 and the improved MM algorithm for sentence 2.

In a specific implementation, the present invention can also be applied in the field of information retrieval to build the following Web full-text search middleware and document search system.

Fig. 3 shows the framework model of the Web full-text search middleware. The middleware as a whole consists of an information acquisition module, an information processing module, and a full-text search module. Each module is briefly described below.

1) Information acquisition module. This module is mainly responsible for multithreaded crawling of Web pages and for deduplicating the URLs obtained by crawling. Its acquisition interface faces the Web site: given only the initial URL of the crawl, it can crawl all pages of the whole site using a breadth-first search strategy.

2) Information processing module. This module does two main things. It first extracts the text from the collected web page content; it then segments the extracted result with the bidirectional word-segmentation method based on Chinese retrieval described above and builds an index. The segmentation function is implemented by the bidirectional word-segmentation device based on Chinese retrieval described above.

3) Full-text search module. This module provides the interface for the user search function. Internally it encapsulates full-text search, parsing of user search conditions, sorting of search results, and a number of personalized functions that improve the user experience, such as intelligent search keyword suggestions, related keyword search, and advanced search.
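The first two modules can be pictured with a small sketch: a breadth-first crawl that visits each URL once, followed by an inverted index built from segmented page text. This is an illustration in Python under stated assumptions, not the middleware itself. The function names, the in-memory site, and the page texts are hypothetical; the crawl is single-threaded for clarity, whereas the module described above is multithreaded; and the segmenter passed in is a placeholder that cuts adjacent character pairs, not the bidirectional method.

```python
from collections import defaultdict, deque


def crawl_breadth_first(start_url, fetch_links):
    """Visit every page reachable from start_url breadth-first, each URL once."""
    seen = {start_url}
    queue = deque([start_url])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:  # URL deduplication
                seen.add(link)
                queue.append(link)
    return order


def build_inverted_index(pages, segment):
    """Map each word to the sorted list of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in segment(text):
            index[word].add(url)
    return {word: sorted(urls) for word, urls in index.items()}


# Hypothetical in-memory site standing in for HTTP fetching and text extraction.
site_links = {"/": ["/a", "/b"], "/a": ["/b", "/"], "/b": ["/c"], "/c": []}
site_text = {"/": "大学", "/a": "大学生活动", "/b": "活动", "/c": "生活"}

order = crawl_breadth_first("/", lambda url: site_links.get(url, []))
index = build_inverted_index(
    {url: site_text[url] for url in order},
    # Placeholder segmenter: adjacent character pairs, not the patent's method.
    lambda text: [text[i:i + 2] for i in range(len(text) - 1)],
)
print(order)
print(index["活动"])
```

Passing the segmenter in as a function keeps the index builder independent of the segmentation scheme, so a dictionary-based segmenter could be substituted without changing the crawl or the index code.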

Fig. 4 shows the framework model of the document search system. The system uses J2EE technology together with an MVC framework, is built on the Web full-text search middleware, and is implemented in Java.

1) Presentation layer. This layer generates the Web pages that users access. The search page of the document search engine, the results page, the advanced search page, and the pages for initial setup of the search engine or for adjusting server functions are all concentrated in the presentation layer. In short, the presentation layer is the man-machine interface between the system and its various users.

2) Logic layer. The logic layer resides on the server side of the system and contains numerous functional modules; it is the core layer that implements the document search system and its search service functions. Every function offered in the presentation layer is implemented by a corresponding code module in the logic layer. The design of the logic layer covers two main parts. The first is automatic collection of web page information from the Internet, implemented by a dedicated multithreaded crawler program, which stores the collected page information in the data storage layer below. The second analyzes the user's conditions, performs a combined search, caches the search results according to a specific caching policy, and sorts the results shown to the user by time or by relevance. The design of the logic layer is the key factor in achieving system robustness, reusability, extensibility, and maintainability.

3) Data storage layer. The data storage layer is mainly responsible for deduplicating the URLs of the HTML pages collected by the crawler, extracting the page body text, and packaging the extraction result as objects. The extracted result is segmented with the bidirectional word-segmentation method based on Chinese retrieval described above and indexed, the segmentation function being implemented by the bidirectional word-segmentation device based on Chinese retrieval described above, and the corresponding data is stored in index files.

Claims

1. A bidirectional word-segmentation method based on Chinese retrieval, characterized in that a Chinese character string $S = C_1 C_2 C_3 C_4 \dots C_n$ is segmented by a device composed of a central processing unit, an input-output device, a register, a machine dictionary, a window counter, and a memory, comprising the following steps:

(4) j = j+1 (the window counter is incremented by 1); go to step (3);

(6) M is decremented by 1 and j = 0; go to step (3);

(11) P = P+1; go to step (10);

(13) M is decremented by 1 and P = 0; go to step (10);

2. A bidirectional word-segmentation device based on Chinese retrieval, characterized in that it is composed of a central processing unit, an input-output device, a register, a machine dictionary, a window counter, and a memory, and segments a Chinese character string $S = C_1 C_2 C_3 C_4 \dots C_n$ by the following steps:

(4) j = j+1 (the window counter is incremented by 1); go to step (3);

(6) M is decremented by 1 and j = 0; go to step (3);

(11) P = P+1; go to step (10);

(13) M is decremented by 1 and P = 0; go to step (10);

3. A Web full-text search middleware, composed of an information acquisition interface, an information acquisition module, an information processing module, a full-text search module, and a retrieval interface, wherein the information acquisition module crawls Web pages from the information acquisition interface and deduplicates the URLs obtained by crawling; the information processing module first extracts text from the web page content collected by the information acquisition module, then segments the extracted result and builds an index; and the full-text search module internally encapsulates full-text search, parsing of user search conditions, sorting of search results, and personalized functions that improve the user experience, and provides the retrieval interface externally; characterized in that the device that segments the extracted text in the information processing module is a bidirectional word-segmentation device based on Chinese retrieval as claimed in claim 2.

4. The Web full-text search middleware as claimed in claim 3, characterized in that the information acquisition module uses multiple threads when crawling Web pages from the information acquisition interface.

5. A Web full-text search method of a middleware, the middleware being composed of an information acquisition interface, an information acquisition module, an information processing module, a full-text search module, and a retrieval interface, the full-text search comprising the following steps:

(1) the information acquisition module crawls Web pages from the information acquisition interface and deduplicates the URLs obtained by crawling;

(2) the information processing module first extracts text from the web page content collected by the information acquisition module, then segments the extracted result and builds an index;

(3) the full-text search module internally encapsulates full-text search, parsing of user search conditions, sorting of search results, and personalized functions that improve the user experience, and provides the retrieval interface externally;

characterized in that, in step (2), the method by which the information processing module segments the extracted text is a bidirectional word-segmentation method based on Chinese retrieval as claimed in claim 1.

6. The Web full-text search method of a middleware as claimed in claim 5, characterized in that the information acquisition module uses multiple threads when crawling Web pages from the information acquisition interface.

7. A document search system, characterized in that it is composed of a man-machine interface and a Web full-text search middleware as claimed in claim 3, wherein the man-machine interface supplies the initial URL to the information acquisition interface of the middleware and displays the search results output by the middleware through the retrieval interface.

8. A document search method, in which search is implemented by a man-machine interface and a Web full-text search middleware, comprising the following steps:

(1) the man-machine interface provides the initial URL;

(2) the Web full-text search middleware performs a full-text search;

(3) the middleware outputs the search results, which are displayed in the man-machine interface;

characterized in that, in step (2), the method by which the Web full-text search middleware performs the full-text search is the Web full-text search method of a middleware as claimed in claim 5.