CN113536807B

CN113536807B - Incomplete maximum matching word segmentation method based on semantics

Info

Publication number: CN113536807B
Application number: CN202110888301.2A
Authority: CN
Inventors: 苏航; 周汉清; 吕海熊; 张春雷; 丁新; 刘勇
Original assignee: China Aero Polytechnology Establishment
Current assignee: China Aero Polytechnology Establishment
Priority date: 2021-08-03
Filing date: 2021-08-03
Publication date: 2023-05-05
Anticipated expiration: 2041-08-03
Also published as: CN113536807A

Abstract

The invention provides a semantics-based incomplete maximum matching word segmentation method comprising the following steps: S1, constructing a forward semantic similarity dictionary $D^{Sim}$ from a training set corpus T and a synonym word forest; S2, segmenting to obtain an initial word; S3, automatically recognizing subsequent words: for the Chinese character string $S_n = w_1 w_2 \ldots w_n$ to be segmented, the initial word $S_h = w_1 w_2 \ldots w_h$ ($h \le n$) is obtained by the segmentation method of S2, and the set of all subsequent words of $S_h$ is read from the dictionary $D^{Sim}$; if the text following $S_h$ matches one of these subsequent words, the matched string $S_{h2}$ is automatically recognized as the subsequent word of $S_h$; S4, repeating steps S2 to S3 until the Chinese character string $S_n$ is completely segmented. The invention combines the rule-based and statistics-based approaches, and uses a three-feature weight calculation to overcome the word adhesion defect of the traditional maximum matching algorithm and improve word segmentation accuracy.

Description

Incomplete maximum matching word segmentation method based on semantics

Technical Field

The invention relates to a word segmentation method, in particular to a semantics-based incomplete maximum matching word segmentation method.

Background

Chinese word segmentation is the first stage of text processing and directly affects the accuracy of the whole data mining process. High-precision Chinese word segmentation provides a high-quality preprocessing basis for semantic disambiguation, keyword extraction, information retrieval and similar fields, and is important for the development of natural language processing. Current research on Chinese word segmentation advances mainly in two directions, accuracy and time efficiency. Time efficiency is improved mainly through the design of the dictionary and of high-performance data structures, for example loading the dictionary into a character tree or a grouped character tree. Accuracy is improved mainly through the word segmentation algorithm itself. The commonly used Chinese word segmentation methods fall into the following two categories:

Rule-based word segmentation algorithms:

The maximum matching algorithm is a typical rule-based word segmentation method. Because it segments according to a dictionary and does not need to consider the domain of the text, it is domain independent and fast. However, it handles ambiguous words poorly and is prone to word adhesion, where several words are merged into one segment. Many improvements to the maximum matching algorithm have been proposed, for example dynamically truncating the input string according to dictionary entries, or using hashing to speed up dictionary lookup.

Statistics-based word segmentation algorithms:

Statistical word segmentation focuses on stable combinations of characters, and commonly uses the co-occurrence rate of adjacent characters to estimate the likelihood that they form a word. Segmentation is then performed according to these occurrence statistics.

However, both existing methods are prone to word adhesion and cannot guarantee segmentation accuracy.
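The word adhesion of the rule-based approach can be illustrated with a minimal sketch of forward maximum matching. The function name, the `max_len` parameter and the toy dictionary below are assumptions made for illustration; they do not come from the patent.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy forward maximum matching: always take the longest dictionary word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink it until a dictionary hit.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is accepted even if it is not in the dictionary.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary, for illustration only.
toy_dictionary = {"研究", "研究生", "生命", "起源"}
print(forward_max_match("研究生命的起源", toy_dictionary))
```

With this toy dictionary the greedy rule takes "研究生" first, which leaves "命" stranded instead of producing "研究" and "生命". This is the kind of adhesion the method described here is meant to avoid.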

Disclosure of Invention

To overcome the defects of the prior art, the invention provides a semantics-based incomplete maximum matching word segmentation method. The method constructs a forward semantic similarity dictionary that records the association strength between words, so that the word adhesion defect of the maximum matching algorithm is overcome and segmentation accuracy is improved while the time cost stays under control. On the one hand, the dictionary is used to recognize subsequent words, which improves accuracy, reduces the number of matching loops and improves efficiency. On the other hand, the invention provides a three-feature weight formula that redefines the segmentation principle of the algorithm and resolves the word adhesion of the traditional algorithm.

Specifically, the invention provides a semantics-based incomplete maximum matching word segmentation method that redefines the segmentation principle of the matching algorithm using the semantic elements in a semantic dictionary, and comprises the following steps:

S1, constructing a forward semantic similarity dictionary: the forward semantic similarity dictionary $D^{Sim}$ is constructed from the training set corpus T and the synonym word forest, through the following substeps:

S11, for an entry $w_i$ in the training set corpus T that has $n_i$ subsequent entries, the set of its subsequent entries is $C_w = \{w_{ij}, 1 \le j \le n_i\}$, and the set of semantic similarities between $w_i$ and each $w_{ij}$ is $C_{Sim} = \{w_{ij} : Sim_{ij}, 1 \le j \le n_i\}$, where $w_{ij}$ denotes the j-th subsequent entry of $w_i$ and $Sim_{ij}$ denotes the semantic similarity of $w_i$ and $w_{ij}$ in the word forest. $\overline{Sim}_i$ denotes the average semantic similarity between the entry $w_i$ and all of its subsequent entries, namely:

$$\overline{Sim}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} Sim_{ij}$$

S12, with $w_i$ as the key, storing $C_{Sim}$ and the mean $\overline{Sim}_i$, thereby recording the semantic information of adjacent entries in T, to obtain $D^{Sim}$;

S2, segmenting to obtain the initial word, through the following substeps:

S21, suppose there is a Chinese character string $S_n = w_1 w_2 \ldots w_n$ of length n to be segmented, and let D denote the general dictionary containing all entries. In one round of the maximum matching algorithm, the set of all lengths h that match successfully is denoted $C_h = \{h \mid (1 \le h \le k) \cap w_1 w_2 \ldots w_h \in D\}$, where k is the matched word length of the maximum matching algorithm, i.e. $w_1 w_2 \ldots w_k$ is the first segmentation result of the maximum matching algorithm. The words determined by the elements of $C_h$ all belong to the general dictionary D and are kept as candidate results;

S22, calculating the tri-feature weight WE of each candidate word from its word length, average semantic similarity and word frequency, where $S_h$ denotes the word of length h determined by an element of $C_h$; $\overline{Sim}_h$ denotes the average semantic similarity between the word $S_h$ and its subsequent words; $\overline{Sim}_{max}$ denotes the maximum average semantic similarity; $p_h$ denotes the frequency of occurrence of words of length h in the general dictionary D; and $p_{max}$ denotes the maximum value of $p_h$ in the general dictionary D;

S23, taking the $S_h$ with the largest tri-feature weight as the segmentation result, denoted the initial word $S_h = w_1 w_2 \ldots w_h$;

S3, automatic recognition of subsequent words: for the Chinese character string $S_n = w_1 w_2 \ldots w_n$ to be segmented, the initial word $S_h = w_1 w_2 \ldots w_h$ ($h \le n$) is obtained by the segmentation method of S2, and the set of all subsequent words of $S_h$ is read from the dictionary $D^{Sim}$.

If the text following $S_h$ matches one of these subsequent words, the matched string $S_{h2}$ is automatically recognized as the subsequent word of $S_h$;

if there are several such $S_{h2}$, the one with the largest semantic similarity to $S_h$ is taken; if there is none, step S2 is executed again to segment the string following $S_h$, i.e. the input string becomes $S_{n-h} = w_{h+1} w_{h+2} \ldots w_n$;

S4, repeating steps S2 to S3 until the Chinese character string $S_n$ is completely segmented.
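The subsequent word recognition of step S3 can be sketched as a lookup followed by a prefix test. This is a minimal sketch under stated assumptions: the dictionary is represented as a plain mapping from a word to its `{subsequent word: similarity}` entries, and the function name and toy values are hypothetical.

```python
def recognize_successor(rest, initial_word, successor_sims):
    """Return the recorded successor of initial_word that starts `rest`, or None."""
    c_sim = successor_sims.get(initial_word, {})
    # Keep only the recorded successors that actually begin the remaining text.
    matches = [(sim, word) for word, sim in c_sim.items() if rest.startswith(word)]
    if not matches:
        return None  # no match: the caller falls back to step S2 on `rest`
    # Several matches: take the one with the largest similarity to initial_word.
    return max(matches)[1]


# Hypothetical similarity values, for illustration only.
toy_successors = {"研究": {"生命": 0.7, "生": 0.2}}
print(recognize_successor("生命的起源", "研究", toy_successors))
print(recognize_successor("的起源", "生命", toy_successors))
```

Returning `None` rather than raising keeps the fallback to S2 explicit in the calling loop.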

Preferably, in step S1, for an entry $w_i$ in T that has $n_i$ subsequent entries, the set of its subsequent entries is $C_w = \{w_{ij}, 1 \le j \le n_i\}$; the set of semantic similarities between $w_i$ and each $w_{ij}$ is $C_{Sim} = \{w_{ij} : Sim_{ij}, 1 \le j \le n_i\}$, where $w_{ij}$ denotes the j-th subsequent entry of $w_i$ and $Sim_{ij}$ denotes the semantic similarity of $w_i$ and $w_{ij}$ in the word forest. $\overline{Sim}_i$ denotes the average semantic similarity between the entry $w_i$ and all of its subsequent entries, namely:

$$\overline{Sim}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} Sim_{ij}$$

In $D^{Sim}$, $C_{Sim}$ and the mean $\overline{Sim}_i$ are stored with $w_i$ as the key, recording the semantic information of adjacent entries in T.
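The construction of the forward semantic similarity dictionary can be sketched as follows. This is a minimal sketch, assuming the training corpus is already segmented into word lists and that the mean is the arithmetic average of the recorded similarities. The `similarity` callable is a hypothetical stand-in for the word forest lookup, and all names and values are illustrative.

```python
from collections import defaultdict


def build_forward_similarity_dict(segmented_sentences, similarity):
    """Map each word w_i to its C_Sim entries and their mean similarity."""
    successors = defaultdict(dict)
    for sentence in segmented_sentences:
        # Each adjacent pair (w_i, w_ij) contributes one subsequent entry.
        for current, following in zip(sentence, sentence[1:]):
            successors[current][following] = similarity(current, following)
    d_sim = {}
    for word, c_sim in successors.items():
        mean = sum(c_sim.values()) / len(c_sim)  # (1 / n_i) * sum_j Sim_ij
        d_sim[word] = {"C_Sim": c_sim, "mean": mean}
    return d_sim


# Hypothetical similarity table standing in for a word forest lookup.
toy_scores = {("数据", "挖掘"): 0.6, ("数据", "结构"): 0.4, ("挖掘", "技术"): 0.5}


def toy_similarity(a, b):
    return toy_scores.get((a, b), 0.0)


corpus = [["数据", "挖掘", "技术"], ["数据", "结构"]]
d_sim = build_forward_similarity_dict(corpus, toy_similarity)
print(d_sim["数据"])
```

Words that never have a following word in the corpus, such as "技术" here, get no entry, so a lookup for them simply finds no subsequent words.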

Preferably, among the three features of the weight, the semantic similarity feature and the word frequency feature are less than 1, and the word length feature is greater than 1.

Preferably, the tri-feature weight is normalized by the maximum average semantic similarity and the maximum frequency, to eliminate chance effects in the data.
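As an illustration of how three such features could be combined, the sketch below enumerates the candidate prefix lengths and picks the candidate with the largest weight. The product form `h * (mean_sim / mean_sim_max) * (p_h / p_max)` is an assumption made for this sketch only and is not taken from the patent text; the function names and toy data are likewise hypothetical.

```python
def candidate_lengths(text, dictionary, max_len):
    """C_h: every prefix length h (1 <= h <= max_len) whose prefix is in D."""
    return [h for h in range(1, min(max_len, len(text)) + 1)
            if text[:h] in dictionary]


def tri_feature_weight(h, mean_sim, mean_sim_max, p_h, p_max):
    # ASSUMPTION: product of word length and the two normalized features.
    return h * (mean_sim / mean_sim_max) * (p_h / p_max)


def pick_initial_word(text, dictionary, mean_sims, length_freq, max_len=4):
    """Return the candidate prefix with the largest weight, or None."""
    mean_sim_max = max(mean_sims.values())
    p_max = max(length_freq.values())
    best, best_weight = None, -1.0
    for h in candidate_lengths(text, dictionary, max_len):
        word = text[:h]
        weight = tri_feature_weight(h, mean_sims.get(word, 0.0), mean_sim_max,
                                    length_freq.get(h, 0.0), p_max)
        if weight > best_weight:
            best, best_weight = word, weight
    return best


# Hypothetical toy data, for illustration only.
toy_dictionary = {"研究", "研究生", "生命"}
toy_mean_sims = {"研究": 0.8, "研究生": 0.2}
toy_length_freq = {2: 0.6, 3: 0.3}
print(pick_initial_word("研究生命", toy_dictionary, toy_mean_sims, toy_length_freq))
```

Under these toy values the shorter candidate "研究" outweighs the longer "研究生", which is the sense in which the matching is no longer strictly maximal.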

Preferably, the method further comprises step S5 of verifying the result: accuracy, recall, their harmonic mean F1 and time overhead are used as evaluation indexes. Suppose that, among the words output by the algorithm, the number of correct results is $r_1$ and the number of wrong results is $f$, and that the number of words given by the sample is $r_2$. The accuracy P, the recall R and F1 are then calculated as:

$$P = \frac{r_1}{r_1 + f}, \qquad R = \frac{r_1}{r_2}, \qquad F_1 = \frac{2PR}{P + R}$$

Compared with the prior art, the invention has the following beneficial effects:

(1) The invention combines the rule-based and statistics-based approaches into a semantics-based incomplete maximum matching word segmentation method. The tri-feature weight calculation overcomes the word adhesion of the traditional maximum matching algorithm. The forward semantic similarity dictionary is used to recognize subsequent words automatically, which improves performance and reduces time cost. Accuracy is thus improved while the time cost of the algorithm stays under control.

(2) By recognizing subsequent words through semantics, the invention attends to the semantic association between words and to some extent resolves word ambiguity. The invention therefore provides an accurate and efficient word segmentation method and also facilitates the later disambiguation step of text processing.

Drawings

FIG. 1 is a schematic block diagram of a flow of the present invention;

FIG. 2 is a schematic flow chart of the present invention;

FIG. 3 is a comparative line graph of F1 in an embodiment of the invention.

Detailed Description

Exemplary embodiments, features and aspects of the present invention will be described in detail below with reference to the attached drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Although various aspects of the embodiments are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated.

Specifically, the invention provides a semantic-based incomplete maximum matching word segmentation method, as shown in fig. 1 and 2, which comprises the following steps:

S12, with $w_i$ as the key, storing $C_{Sim}$ and the mean $\overline{Sim}_i$, thereby recording the semantic information of adjacent entries in T, to obtain $D^{Sim}$;

S21, suppose there is a Chinese character string $S_n = w_1 w_2 \ldots w_n$ of length n to be segmented, and let D denote the general dictionary containing all entries. In one round of the maximum matching algorithm, the set of all lengths h that match successfully is denoted $C_h = \{h \mid (1 \le h \le k) \cap w_1 w_2 \ldots w_h \in D\}$, where k is the matched word length of the maximum matching algorithm, i.e. $w_1 w_2 \ldots w_k$ is the first segmentation result of the maximum matching algorithm. The words determined by the elements of $C_h$ all belong to the general dictionary D and are kept as candidate results;

in the tri-feature weight of step S22, $\overline{Sim}_h$ denotes the average semantic similarity between the word $S_h$ and its subsequent words;

S3, automatic recognition of subsequent words: for the Chinese character string $S_n = w_1 w_2 \ldots w_n$ to be segmented, the initial word $S_h = w_1 w_2 \ldots w_h$ ($h \le n$) is obtained by the segmentation method of S2, and the set of all subsequent words of $S_h$ is read from the dictionary $D^{Sim}$;

$S_{h2}$ is automatically recognized as the subsequent word of $S_h$;

if there are several such $S_{h2}$, the one with the largest semantic similarity to $S_h$ is taken; if there is none, step S2 is executed again to segment the string following $S_h$, i.e. the input string becomes $S_{n-h} = w_{h+1} w_{h+2} \ldots w_n$;

Preferably, in step S1, for an entry $w_i$ in T that has $n_i$ subsequent entries, the set of its subsequent entries is $C_w = \{w_{ij}, 1 \le j \le n_i\}$. The set of semantic similarities between $w_i$ and each $w_{ij}$ is $C_{Sim} = \{w_{ij} : Sim_{ij}, 1 \le j \le n_i\}$, where $w_{ij}$ denotes the j-th subsequent entry of $w_i$ and $Sim_{ij}$ denotes the semantic similarity of $w_i$ and $w_{ij}$ in the word forest.

In $D^{Sim}$, $C_{Sim}$ and the mean $\overline{Sim}_i$ are stored with $w_i$ as the key, recording the semantic information of adjacent entries in T.

DETAILED DESCRIPTION OF EMBODIMENT (S) OF INVENTION

S1, segmentation and calculation method. Suppose there is a Chinese character string $S_n = w_1 w_2 \ldots w_n$ of length n to be segmented, and let D denote the general dictionary containing all entries. In one round of the maximum matching algorithm, the set of all lengths h that match successfully is denoted $C_h = \{h \mid (1 \le h \le k) \cap w_1 w_2 \ldots w_h \in D\}$, where k is the matched word length of the maximum matching algorithm, i.e. $w_1 w_2 \ldots w_k$ is the first segmentation result of the maximum matching algorithm. The words determined by the elements of $C_h$ all belong to the general dictionary D, and any of them may be the final result. The tri-feature weight WE of each candidate word is calculated from its word length, average semantic similarity and word frequency, where $\overline{Sim}_h$ denotes the average semantic similarity between the word $S_h$ and its subsequent words; $\overline{Sim}_{max}$ denotes the maximum average semantic similarity; $p_h$ denotes the frequency of occurrence of words of length h in the general dictionary D; and $p_{max}$ denotes the maximum value of $p_h$ in the general dictionary D.

Finally, the $S_h$ with the largest tri-feature weight is taken as the segmentation result and denoted the initial word $S_h = w_1 w_2 \ldots w_h$.

S2, automatic recognition of subsequent words. For the Chinese character string $S_n = w_1 w_2 \ldots w_n$ to be segmented, the initial word $S_h = w_1 w_2 \ldots w_h$ ($h \le n$) is obtained by the segmentation method of S1, and the set of all subsequent words of $S_h$ is read from the dictionary $D^{Sim}$.

$S_{h2}$ is automatically recognized as the subsequent word of $S_h$. If there are several such $S_{h2}$, the one with the largest semantic similarity to $S_h$ is taken; if there is none, segmentation continues on the string following $S_h$, i.e. the input string becomes $S_{n-h} = w_{h+1} w_{h+2} \ldots w_n$.

S1 and S2 are repeated until the Chinese character string $S_n$ is completely segmented.
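The two steps and their repetition can be put together in one loop. The sketch below is illustrative only and rests on several assumptions: the tri-feature weight uses an assumed product form, since the actual formula is not available in this text; an unmatched single character is emitted as its own word; and only one subsequent word is recognized before returning to the segmentation step. All names and toy data are hypothetical.

```python
def segment(text, dictionary, d_sim, length_freq, max_len=4):
    """d_sim maps a word to (successor -> similarity dict, mean similarity)."""
    sim_max = max(mean for _, mean in d_sim.values())
    p_max = max(length_freq.values())
    words, i = [], 0
    while i < len(text):
        rest = text[i:]
        # Step 1: score every dictionary prefix with the ASSUMED product weight.
        best, best_weight = rest[0], -1.0  # single character as fallback
        for h in range(1, min(max_len, len(rest)) + 1):
            word = rest[:h]
            if word not in dictionary:
                continue
            mean = d_sim.get(word, ({}, 0.0))[1]
            weight = h * (mean / sim_max) * (length_freq.get(h, 0.0) / p_max)
            if weight > best_weight:
                best, best_weight = word, weight
        words.append(best)
        i += len(best)
        # Step 2: look for a recorded successor that starts the remaining text.
        successors = d_sim.get(best, ({}, 0.0))[0]
        hits = [(s, w) for w, s in successors.items() if text.startswith(w, i)]
        if hits:
            follower = max(hits)[1]  # largest similarity wins
            words.append(follower)
            i += len(follower)
    return words


# Hypothetical toy data, for illustration only.
toy_dictionary = {"研究", "研究生", "生命", "起源"}
toy_d_sim = {
    "研究": ({"生命": 0.7}, 0.7),
    "研究生": ({"入学": 0.2}, 0.2),
    "的": ({"起源": 0.5}, 0.5),
}
toy_length_freq = {1: 0.1, 2: 0.6, 3: 0.3}
print(segment("研究生命的起源", toy_dictionary, toy_d_sim, toy_length_freq))
```

On this toy input the successor lookup recognizes "生命" and "起源" directly, so the candidate search runs only twice for four output words.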

After word segmentation is completed, the performance of the algorithm in accuracy and time overhead is verified through experiments.

The sources of experimental data mainly include three parts:

Dictionary: the general dictionary D is composed of the Baidu word segmentation lexicon, the Sogou lexicon and the Wubi lexicon, and contains 412,736 words after consolidation. The synonym word forest is also introduced for calculating semantic similarity.

Training corpus text: 7,243 paragraphs from different fields, 376,519 segmented words in total, mainly used to construct the forward semantic similarity dictionary.

Test corpus text: 3,147 paragraphs from different fields, 123,928 words in total. A subset of the paragraphs in the test corpus was selected for the experiments.

To verify the accuracy and efficiency of the algorithm, accuracy, recall, their harmonic mean F1 and time overhead are used as evaluation indexes. Suppose that, among the words output by the algorithm, the number of correct results is $r_1$ and the number of wrong results is $f$, and that the number of words given by the sample is $r_2$. The accuracy P, the recall R and F1 are calculated as:

$$P = \frac{r_1}{r_1 + f}, \qquad R = \frac{r_1}{r_2}, \qquad F_1 = \frac{2PR}{P + R}$$
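These indexes can be computed directly from the three counts. The sketch below assumes the usual definitions, with accuracy (precision in the usual terminology) as `r1 / (r1 + f)`, recall as `r1 / r2`, and F1 as their harmonic mean. The function name and the example counts are hypothetical and are not taken from the experiments.

```python
def evaluate(r1, f, r2):
    """Return (accuracy P, recall R, harmonic mean F1) from segmentation counts."""
    precision = r1 / (r1 + f)   # correct words among all words output
    recall = r1 / r2            # correct words among the reference words
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts, for illustration only.
p, r, f1 = evaluate(r1=42, f=8, r2=48)
print(f"P={p:.3%} R={r:.3%} F1={f1:.3%}")
```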

To examine the performance of the algorithm, three sets of comparative experiments were carried out on five word segmentation algorithms:

FMM: forward maximum matching word segmentation method;

BMM: backward maximum matching word segmentation method;

DSFMM: forward maximum matching algorithm with subsequent word recognition based on $D^{Sim}$;

DSBMM: backward maximum matching algorithm with subsequent word recognition based on $D^{Sim}$;

SIMM: the incomplete maximum matching word segmentation method improved with semantic features, as proposed here.

The basic information of the three sets of experiments is shown in Tables 2, 3 and 4:

Table 2: Basic information of the first set of experiments

Table 3: Basic information of the second set of experiments

Table 4: Basic information of the third set of experiments

Experimental results and analysis

The five algorithms were tested separately. For ease of comparison, the results of experiments E1 and E2 are presented in summary. The statistics of the accuracy, recall and the harmonic mean F1 of experiments E1 and E2 are shown in tables 5, 6 and 7, and the comparison line diagram of F1 is shown in FIG. 3:

Table 5: Accuracy comparison for E1 and E2

| Grouping | C1/C1′ | C2/C2′ | C3/C3′ | C4/C4′ | C5/C5′ | C6/C6′ | Average |
| --- | --- | --- | --- | --- | --- | --- | --- |
| FMM(E1) | 84% | 83.333% | 81.395% | 84% | 80.488% | 83.720% | 82.823% |
| FMM(E2) | 88.372% | 82.5% | 86.275% | 85.714% | 78.049% | 84.314% | 83.704% |
| BMM(E1) | 86.275% | 86.047% | 83.720% | 82.353% | 82.927% | 80.952% | 83.712% |
| BMM(E2) | 84.091% | 78.049% | 84.314% | 81.633% | 80.952% | 78.571% | 81.268% |
| DSFMM(E1) | 88.235% | 88.372% | 79.545% | 86.275% | 83.721% | 86.364% | 85.419% |
| DSFMM(E2) | 86.667% | 83.721% | 88.462% | 86.275% | 83.721% | 86.364% | 85.868% |
| DSBMM(E1) | 88.235% | 86.364% | 79.545% | 84.615% | 84.091% | 88.889% | 85.290% |
| DSBMM(E2) | 88.889% | 81.818% | 88.235% | 86.538% | 86.047% | 86.792% | 86.387% |
| SIMM(E1) | 85.455% | 87.037% | 92% | 86.885% | 91.489% | 85.714% | 88.097% |
| SIMM(E2) | 86.275% | 88.679% | 85.714% | 86.441% | 91.489% | 87.272% | 87.645% |

Table 6: Recall comparison for E1 and E2

Table 7: F1 comparison for E1 and E2

| Grouping | C1/C1′ | C2/C2′ | C3/C3′ | C4/C4′ | C5/C5′ | C6/C6′ | Average |
| --- | --- | --- | --- | --- | --- | --- | --- |
| FMM(E1) | 83.858% | 83.820% | 82.373% | 84.683% | 81.858% | 83.535% | 83.355% |
| FMM(E2) | 87.309% | 83.032% | 85.777% | 84.806% | 80.003% | 84.316% | 84.207% |
| BMM(E1) | 86.125% | 86.305% | 83.870% | 82.956% | 82.360% | 82.363% | 83.997% |
| BMM(E2) | 84.003% | 78.939% | 84.322% | 81.885% | 80.602% | 78.870% | 81.437% |
| DSFMM(E1) | 85.660% | 86.815% | 79.570% | 86.064% | 82.846% | 86.113% | 84.511% |
| DSFMM(E2) | 86.164% | 84.233% | 88.162% | 86.281% | 84.088% | 86.280% | 85.868% |
| DSBMM(E1) | 87.682% | 86.482% | 79.693% | 84.914% | 84.173% | 88.186% | 85.178% |
| DSBMM(E2) | 88.561% | 82.085% | 88.140% | 86.618% | 86.157% | 86.924% | 86.414% |
| SIMM(E1) | 85.402% | 87.154% | 90.418% | 86.494% | 89.833% | 86.124% | 87.571% |
| SIMM(E2) | 86.319% | 87.850% | 86.188% | 86.855% | 90.134% | 87.751% | 87.516% |

The five algorithms fall into three classes. The first class is the traditional maximum matching algorithms (FMM and BMM). The second class is the two algorithms (DSFMM and DSBMM) that use the forward semantic similarity dictionary to recognize subsequent words automatically; when recognition fails they still fall back to maximum matching, so they continue to follow the maximum matching principle to some extent. The third class is the SIMM algorithm, which adds word frequency and semantics to the matching algorithm to form a new calculation method, the incomplete maximum matching word segmentation method. The line graph shows the following:

Horizontally, the slope of the first-class algorithms is the largest and that of the third-class algorithm is the smallest. This shows that the first-class algorithms are more strongly affected by word length. Because the maximum matching principle easily causes word adhesion, a segment is often a combination of several words; if the reference segmentation of a sample consists mostly of short words, the accuracy of the first-class algorithms drops.

Vertically, the F1 value of the first-class algorithms is the smallest and that of the third-class algorithm is the largest. The F1 value of the second-class algorithms is higher than that of the first class, which shows that the forward semantic similarity dictionary recognizes subsequent words correctly and effectively. The F1 value of the third-class algorithm is higher than that of the second class, which shows that redefining the segmentation principle with semantics and word frequency improves the accuracy of the word segmentation algorithm.

In the third set of experiments, E3, one algorithm was selected from each of the three classes; their time performance is shown in Table 8:

Table 8: Time taken by the three algorithms to process 100 paragraphs

The DSFMM algorithm takes less time than the FMM algorithm, which shows that adding subsequent word recognition based on the forward semantic similarity dictionary to the traditional maximum matching algorithm effectively improves efficiency. The SIMM algorithm proposed here takes somewhat longer than the traditional algorithm, because it must compute a weight for every matched candidate word. However, it also uses automatic recognition of subsequent words, which saves time, so its time performance is close to that of the traditional maximum matching algorithm and remains within an acceptable range.

Finally, it should be noted that: the embodiments described above are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced with equivalents; such modifications and substitutions do not depart from the spirit of the invention.

Claims

1. A semantics-based incomplete maximum matching word segmentation method, characterized in that the segmentation principle of the matching algorithm is redefined using the semantic elements in a semantic dictionary, the method specifically comprising the following steps:

S12, with $w_i$ as the key, storing $C_{Sim}$ and the mean $\overline{Sim}_i$, thereby recording the semantic information of adjacent entries in T, to obtain $D^{Sim}$;

S21, supposing there is a Chinese character string $S_n = w_1 w_2 \ldots w_n$ of length n to be segmented, and letting D denote the general dictionary containing all entries; in one round of the maximum matching algorithm, the set of all lengths h that match successfully is denoted $C_h = \{h \mid (1 \le h \le k) \cap w_1 w_2 \ldots w_h \in D\}$, wherein h denotes the entry length of the Chinese character string to be segmented in the general dictionary D, and k denotes the matched word length of the maximum matching algorithm, i.e. $w_1 w_2 \ldots w_k$ is the first segmentation result of the maximum matching algorithm; the words determined by the elements of $C_h$ all belong to the general dictionary D and are kept as candidate results;

wherein $\overline{Sim}_h$ denotes the average semantic similarity between the word $S_h$ and its subsequent words;

S23, taking the $S_h$ with the largest tri-feature weight as the segmentation result, denoted the initial word $S_h = w_1 w_2 \ldots w_h$;

S3, automatically recognizing subsequent words: for the Chinese character string $S_n = w_1 w_2 \ldots w_n$ to be segmented, the initial word $S_h = w_1 w_2 \ldots w_h$ ($h \le n$) is obtained by the segmentation method of S2, and the set of all subsequent words of $S_h$ is read from the dictionary $D^{Sim}$;

$S_{h2}$ is automatically recognized as the subsequent word of the initial word $S_h$;

if there are several such $S_{h2}$, the one with the largest semantic similarity to $S_h$ is taken as the subsequent word of the initial word $S_h$; if there is none, step S2 is executed again to segment the string following $S_h$, i.e. the input string becomes $S_{n-h} = w_{h+1} w_{h+2} \ldots w_n$;

2. The semantics-based incomplete maximum matching word segmentation method according to claim 1, wherein: in step S12, the storage structure of $D^{Sim}$ is shown in the following table:

3. The semantics-based incomplete maximum matching word segmentation method according to claim 1, wherein: among the three features of the weight, the semantic similarity feature and the word frequency feature are less than 1, and the word length feature is greater than 1.

4. The semantics-based incomplete maximum matching word segmentation method according to claim 3, wherein: the tri-feature weight is normalized by the maximum average semantic similarity and the maximum frequency, to eliminate chance effects in the data.

5. The semantics-based incomplete maximum matching word segmentation method according to claim 3, further comprising step S5 of verifying the result: accuracy, recall, their harmonic mean $F_1$ and time overhead are used as evaluation indexes; suppose that, among the words output by the algorithm, the number of correct results is $r_1$ and the number of wrong results is $f$, and that the number of words given by the sample is $r_2$; the accuracy P, the recall R and their harmonic mean $F_1$ are calculated as:

$$P = \frac{r_1}{r_1 + f}, \qquad R = \frac{r_1}{r_2}, \qquad F_1 = \frac{2PR}{P + R}$$