[go: up one dir, main page]

CN113536807B - Incomplete maximum matching word segmentation method based on semantics - Google Patents

Incomplete maximum matching word segmentation method based on semantics Download PDF

Info

Publication number
CN113536807B
CN113536807B CN202110888301.2A CN202110888301A CN113536807B CN 113536807 B CN113536807 B CN 113536807B CN 202110888301 A CN202110888301 A CN 202110888301A CN 113536807 B CN113536807 B CN 113536807B
Authority
CN
China
Prior art keywords
word
subsequent
maximum
semantic
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110888301.2A
Other languages
Chinese (zh)
Other versions
CN113536807A (en
Inventor
苏航
周汉清
吕海熊
张春雷
丁新
刘勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Aero Polytechnology Establishment
Original Assignee
China Aero Polytechnology Establishment
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Aero Polytechnology Establishment filed Critical China Aero Polytechnology Establishment
Priority to CN202110888301.2A priority Critical patent/CN113536807B/en
Publication of CN113536807A publication Critical patent/CN113536807A/en
Application granted granted Critical
Publication of CN113536807B publication Critical patent/CN113536807B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Character Discrimination (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a semantic-based incomplete maximum matching word segmentation method, which comprises the following steps of: s1, utilizing training set corpus T and synonym Lin Goujian forward semantic similarity dictionary D Sim The method comprises the steps of carrying out a first treatment on the surface of the S2, segmenting the words to obtain initial words; s3, automatic recognition of subsequent words: for the Chinese character string S to be segmented n =w 1 w 2 ......w n S is obtained through the segmentation method of S2 h =w 1 w 2 ......w h (h.ltoreq.n), in dictionary D Sim Reads all S h Subsequent word sets of (a)
Figure DDA0003195009740000011
If there is S h Is successfully matched with a certain subsequent word, namely
Figure DDA0003195009740000012
Will S h2 Automatically identified as S h Subsequent words of (a); s4, repeatedly executing the steps S2-S3, and finally cutting to finish the Chinese character string S n . The invention combines two methods based on rules and statistics, provides a semantic-based incomplete maximum matching word segmentation method, solves the defect of word adhesion of the traditional maximum matching algorithm by using three-feature weight calculation, and improves the word segmentation accuracy.

Description

Incomplete maximum matching word segmentation method based on semantics
Technical Field
The invention relates to a phrase distribution method, in particular to a semantic-based incomplete maximum matching word segmentation method.
Background
The Chinese word segmentation technology is used as an initial stage of text processing, and directly influences the accuracy of the whole data mining process. The high-precision Chinese word segmentation technology provides a high-quality text preprocessing basis for the fields of semantic disambiguation, keyword extraction, information retrieval and the like, and has important significance for promoting the development of natural language. At present, the research of Chinese word segmentation technology is mainly advanced in two directions of accuracy and timeliness: in improving algorithm timeliness, the main way is by defining a dictionary and a high-performance data structure. Such as loading a dictionary with a character tree or grouping a character tree. In terms of improving accuracy, most research is focused on improvement of word segmentation algorithms. The commonly used Chinese word segmentation methods can be divided into the following two categories:
rule-based word segmentation algorithm:
the maximum matching algorithm is typically based on rule-based lexical approach. According to the word segmentation dictionary, the problem of the text field is not needed to be considered, so that the method has the advantages of field independence and high timeliness. But is difficult to process ambiguous words and word adhesion is easy to occur. Many improvements to the maximum matching algorithm have emerged in the industry, such as: dynamically intercepting input strings by using dictionary entries, improving word segmentation efficiency by applying a hash technology, and the like.
Statistical-based word segmentation algorithm:
statistical word segmentation focuses on stable combinations of words and commonly used co-occurrence ratios of adjacent words simulate the likelihood that they constitute words. Word segmentation is achieved by means of the statistical word occurrence frequency.
However, the existing two methods are easy to adhere words and cannot guarantee the accuracy of word segmentation.
Disclosure of Invention
In order to solve the defects in the prior art, the invention provides a semantic-based incomplete maximum matching word segmentation method, which can construct a forward semantic similarity dictionary, and the dictionary can record the association strength among words, so that the defect of word adhesion of a maximum matching algorithm is overcome on the basis of guaranteeing time expenditure, and the accuracy of a word segmentation algorithm is improved. On one hand, the dictionary is applied to realize the recognition of subsequent words, so that the word segmentation accuracy is improved, the circulation is reduced, and the efficiency is improved. On the other hand, the invention provides a three-feature weight calculation formula, which redefines the segmentation principle of the word segmentation algorithm and solves the defect of word adhesion of the traditional algorithm.
Specifically, the invention provides a semantic-based incomplete maximum matching word segmentation method, which redefines a segmentation principle of a matching algorithm by using semantic elements in a semantic dictionary, and specifically comprises the following steps:
s1, constructing a forward semantic similarity dictionary: forward semantic similarity dictionary D using training set corpus T and synonym Lin Goujian Sim The construction process specifically comprises the following substeps:
s11, for the word w in the corpus T of the training set i ,n i The set of the subsequent entries is C w ={w ij ,1≤j≤n i By w i And w is equal to ij The set of semantic similarity components of (a) is C Sim ={w ij :Sim ij ,1≤j≤n i -w is ij Representing w i Is the j-th subsequent entry, sim ij Representing w i And w is equal to ij Semantic similarity in word forests,
Figure BDA0003195009720000021
representing vocabulary entry w i Average value of semantic similarity with all subsequent terms, namely: />
Figure BDA0003195009720000022
S12, w i Store C for keys Sim And mean value
Figure BDA0003195009720000023
Recording semantic information of adjacent entries in T to obtain D Sim
S2, segmenting the words to obtain initial words, wherein the method specifically comprises the following substeps:
s21, supposing that a Chinese character string S with the length of n to be segmented exists n =w 1 w 2 ......w n Counting a general dictionary containing all entries as D; in a round of maximum matching algorithm, the set of h components that all match successfully is counted as C h ={h|(1≤h≤k)∩w 1 w 2 ......w h E D, k represents the matching word length of the maximum matching algorithm, i.e., w 1 w 2 ......w k Is the first segmentation result of the maximum matching algorithm, set C h The word segmentation formed by each element in the dictionary is carried out in the general dictionary D, and is used as a standby result;
s22, calculating a tri-feature weight WE of each word, wherein the calculation formula is as follows:
Figure BDA0003195009720000024
wherein S is h Representing the result of the set C h The length of the word segmentation determined by the medium element is h;
Figure BDA0003195009720000025
representation word S h Average semantic similarity to subsequent words;
Figure BDA0003195009720000026
Representing the maximum average semantic similarity; p is p h Representing the frequency of occurrence of words of word length h in the general dictionary D; p is p max Represented by p in the general dictionary D h Is the maximum value of (2);
s23, taking the S with the maximum weight of three characteristics h As a word segmentation result, count the initial word S h =w 1 w 2 ......w h
S3, automatic recognition of subsequent words: for the Chinese character string S to be segmented n =w 1 w 2 ......w n S is obtained through the segmentation method of S2 h =w 1 w 2 ......w h (h.ltoreq.n), in dictionary D Sim Reads all S h Subsequent word sets of (a)
Figure BDA0003195009720000031
If there is S h Is successfully matched with a certain subsequent word, i.e. +.>
Figure BDA0003195009720000032
Will S h2 Automatically identified as S h Subsequent words of (a);
if there are a plurality of S h2 Taking and S h The semantic similarity is the largest; if there is no continuous execution of step S2 to segment S h Subsequent strings of (i.e. input string is S) n-h =w h+1 w h+2 ......w n
S4, repeatedly executing the steps S2-S3, and finally cutting to finish the Chinese character string S n
Preferably, in step S1, for term w in T i ,n i The set of the subsequent entries is C w ={w ij ,1≤j≤n i -a }; from w i And w is equal to ij The set of semantic similarity components of (a) is C Sim ={w ij :Sim ij ,1≤j≤n i -w is ij Representing w i Is the j-th subsequent entry, sim ij Representing w i And w is equal to ij Semantic similarity in word forests,
Figure BDA0003195009720000033
representing vocabulary entry w i Average value of semantic similarity with all subsequent terms, namely:
Figure BDA0003195009720000034
D Sim in w i Store C for keys Sim And mean value
Figure BDA0003195009720000035
Semantic information of adjacent entries in T is recorded.
Preferably, the semantic similarity and word frequency characteristics of the three characteristic weights are smaller than 1, and the word length characteristics are larger than 1.
Preferably, the three feature weights are normalized by using the maximum average semantic similarity and the maximum frequency to eliminate data contingency.
Preferably, the method further comprises step S5 of verifying the calculation result: using accuracy, recall, harmonic mean F1 and time overhead as evaluation indexes, assuming that in the word segmentation after algorithm segmentation, the correct result is r 1 And the error result isf, the number of the word segmentation given by the sample is r 2 The calculation formulas of the accuracy, the recall rate and F1 are as follows:
Figure BDA0003195009720000036
Figure BDA0003195009720000037
Figure BDA0003195009720000038
compared with the prior art, the invention has the following beneficial effects:
(1) The invention combines two methods based on rules and statistics, and provides an incomplete maximum matching word segmentation method based on semantics. The defect of word adhesion of the traditional maximum matching algorithm is overcome by using the three-feature weight calculation method. Meanwhile, the forward semantic similarity dictionary is applied to realize automatic recognition of subsequent words, so that the algorithm performance is improved, and the time cost is reduced. The accuracy of the algorithm is improved on the basis of guaranteeing the time expenditure of the algorithm.
(2) The invention focuses on the association of semantic layers among words in a mode of carrying out subsequent word recognition through semantics, and achieves the effect of solving word ambiguity to a certain extent. Therefore, the invention not only provides an accurate and efficient word segmentation method, but also provides convenience for the subsequent disambiguation step of text processing.
Drawings
FIG. 1 is a schematic block diagram of a flow of the present invention;
FIG. 2 is a schematic flow chart of the present invention;
FIG. 3 is a comparative line graph of F1 in an embodiment of the invention.
Detailed Description
Exemplary embodiments, features and aspects of the present invention will be described in detail below with reference to the attached drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Although various aspects of the embodiments are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
Specifically, the invention provides a semantic-based incomplete maximum matching word segmentation method, as shown in fig. 1 and 2, which comprises the following steps:
s1, constructing a forward semantic similarity dictionary: forward semantic similarity dictionary D using training set corpus T and synonym Lin Goujian Sim The construction process specifically comprises the following substeps:
s11, for the word w in the corpus T of the training set i ,n i The set of the subsequent entries is C w ={w ij ,1≤j≤n i By w i And w is equal to ij The set of semantic similarity components of (a) is C Sim ={w ij :Sim ij ,1≤j≤n i -w is ij Representing w i Is the j-th subsequent entry, sim ij Representing w i And w is equal to ij Semantic similarity in word forests,
Figure BDA0003195009720000041
representing vocabulary entry w i Average value of semantic similarity with all subsequent terms, namely:
Figure BDA0003195009720000042
s12, w i Store C for keys Sim And mean value
Figure BDA0003195009720000043
Recording semantic information of adjacent entries in T to obtain D Sim
S2, segmenting the words to obtain initial words, wherein the method specifically comprises the following substeps:
s21, supposing that a Chinese character string S with the length of n to be segmented exists n =w 1 w 2 ......w n Counting a general dictionary containing all entries as D; in a round of maximum matching algorithm, all h groups successfully matched form a setAggregate as C h ={h|(1≤h≤k)∩w 1 w 2 ......w h E D, k represents the matching word length of the maximum matching algorithm, i.e., w 1 w 2 ......w k Is the first segmentation result of the maximum matching algorithm, set C h The word segmentation formed by each element in the dictionary is carried out in the general dictionary D, and is used as a standby result;
s22, calculating a tri-feature weight WE of each word, wherein the calculation formula is as follows:
Figure BDA0003195009720000051
wherein S is h Representing the result of the set C h The length of the word segmentation determined by the medium element is h;
Figure BDA0003195009720000052
representation word S h Average semantic similarity to subsequent words;
Figure BDA0003195009720000053
Representing the maximum average semantic similarity; p is p h Representing the frequency of occurrence of words of word length h in the general dictionary D; p is p max Represented by p in the general dictionary D h Is the maximum value of (2);
s23, taking the S with the maximum weight of three characteristics h As a word segmentation result, count the initial word S h =w 1 w 2 ......w h
S3, automatic recognition of subsequent words: for the Chinese character string S to be segmented n =w 1 w 2 ……w n S is obtained through the segmentation method of S2 h =w 1 w 2 ......w h (h.ltoreq.n), in dictionary D Sim Reads all S h Subsequent word sets of (a)
Figure BDA0003195009720000054
If there is S h Is successfully matched with a certain subsequent word, i.e. +.>
Figure BDA0003195009720000055
Will S h2 Automatically identified as S h Subsequent words of (a);
if there are a plurality of S h2 Taking and S h The semantic similarity is the largest; if there is no continuous execution of step S2 to segment S h Subsequent strings of (i.e. input string is S) n-h =w h+1 w h+2 ……w n
S4, repeatedly executing the steps S2-S3, and finally cutting to finish the Chinese character string S n
Preferably, in step S1, for term w in T i ,n i The set of the subsequent entries is C w ={w ij ,1≤j≤n i }. From w i And w is equal to ij The set of semantic similarity components of (a) is C Sim ={w ij :Sim ij ,1≤j≤n i -w is ij Representing w i Is the j-th subsequent entry, sim ij Representing w i And w is equal to ij Semantic similarity in word forests,
Figure BDA0003195009720000056
representing vocabulary entry w i Average value of semantic similarity with all subsequent terms, namely:
Figure BDA0003195009720000061
D Sim in w i Store C for keys Sim And mean value
Figure BDA0003195009720000062
Semantic information of adjacent entries in T is recorded.
Preferably, the semantic similarity and word frequency characteristics of the three characteristic weights are smaller than 1, and the word length characteristics are larger than 1.
Preferably, the three feature weights are normalized by using the maximum average semantic similarity and the maximum frequency to eliminate data contingency.
DETAILED DESCRIPTION OF EMBODIMENT (S) OF INVENTION
S1, segmentation and calculationA method of manufacturing the same. Assume that there is a string S of Chinese characters of length n to be segmented n =w 1 w 2 ……w n . Counting a general dictionary containing all entries as D; in a round of maximum matching algorithm, the set of h components that all match successfully is counted as C h ={h|(1≤h≤k)∩w 1 w 2 ……w h E D }. k represents the matching word length of the maximum matching algorithm, i.e. w 1 w 2 ……w k Is the first segmentation result of the maximum matching algorithm. Set C h The word segmentation of each element in the dictionary D is the general dictionary D, and they may be the final result. The tri-feature weight WE for each word is calculated as follows:
Figure BDA0003195009720000063
wherein S is h Representing the result of the set C h The length of the word segmentation determined by the medium element is h;
Figure BDA0003195009720000064
representation word S h Average semantic similarity to subsequent words;
Figure BDA0003195009720000065
Representing the maximum average semantic similarity; p is p h Representing the frequency of occurrence of words of word length h in the general dictionary D; p is p max Represented by p in the general dictionary D h Is a maximum value of (a).
Finally, S with the maximum weight of three features is taken h As a result of the word segmentation. Is counted as initial word S h =w 1 w 2 ……w h
S2, automatic recognition of subsequent words. For the Chinese character string S to be segmented n =w 1 w 2 ......w n S is obtained through the segmentation method of S1 h =w 1 w 2 ......w h (h.ltoreq.n), in dictionary D Sim Reads all S h Subsequent word sets of (a)
Figure BDA0003195009720000066
If there is S h Is successfully matched with a certain subsequent word, i.e. +.>
Figure BDA0003195009720000067
Will S h2 Automatically identified as S h Subsequent words of (a). If there are a plurality of S h2 Taking and S h The semantic similarity is the largest; if there is no continuous S2 segmentation S h Subsequent strings of (i.e. input string is S) n-h =w h+1 w h+2 ......w n
Repeating 1 and 2 to finally cut to complete Chinese character string S n
After word segmentation is completed, the performance of the algorithm in accuracy and time overhead is verified through experiments.
The sources of experimental data mainly include three parts:
dictionary: the general dictionary D consists of a hundred-degree word segmentation word stock, a dog search word stock and a wubi word stock, and comprises 41.2736 ten thousand words after arrangement, and meanwhile, synonym word forest is introduced for calculating semantic similarity.
Training library text: 7243 paragraphs in different fields are included, and 37.6519 ten thousand segmentation words are added. The method is mainly used for constructing a forward semantic similarity dictionary.
Test library text: contains 3147 segments of different fields, amounting to 12.3928 ten thousand words. And selecting part of the sections from the test library for experimental testing.
To verify the accuracy and efficiency of the algorithm, accuracy, recall, harmonic mean F1, and time overhead are used herein as evaluation indicators. It is assumed that in the word segmentation after the algorithm segmentation, the correct result is r 1 The number of wrong results is f, and the number of word segmentation given by the sample is r 2 . The calculation formulas of the accuracy rate, the recall rate and F1 are as follows:
Figure BDA0003195009720000071
Figure BDA0003195009720000072
Figure BDA0003195009720000073
to examine the performance of the algorithm, three sets of comparative experiments were performed herein for five word segmentation algorithms, each of which is:
FMM: forward maximum matching word segmentation method;
BMM: a backward maximum matching word segmentation method;
DSFMM based on D Sim Realizing a forward maximum matching algorithm for the recognition of subsequent words;
DSBMM based on D Sim Realizing a backward maximum matching algorithm for the recognition of the subsequent words;
SIMM, namely the incomplete maximum matching word segmentation method based on semantic feature improvement.
The basic information of three sets of experiments is shown in tables 2, 3, 4:
TABLE 2 first set of experimental basic information
Figure BDA0003195009720000074
TABLE 3 second set of experimental basic information
Figure BDA0003195009720000075
Figure BDA0003195009720000081
Table 4: third group of experimental basic information
Figure BDA0003195009720000082
Experimental results and analysis
The five algorithms were tested separately. For ease of comparison, the results of experiments E1 and E2 are presented in summary. The statistics of the accuracy, recall and the harmonic mean F1 of experiments E1 and E2 are shown in tables 5, 6 and 7, and the comparison line diagram of F1 is shown in FIG. 3:
table 5: e1 and E2 accuracy comparison table
Grouping C1/C1′ C2/C2′ C3/C3′ C4/C4′ C5/C5′ C6/C6′ Average value of
FMM(E1) 84% 83.333% 81.395% 84% 80.488% 83.720% 82.823%
FMM(E2) 88.372% 82.5% 86.275% 85.714% 78.049% 84.314% 83.704%
BMM(E1) 86.275% 86.047% 83.720% 82.353% 82.927% 80.952% 83.712%
BMM(E2) 84.091% 78.049% 84.314% 81.633% 80.952% 78.571% 81.268%
DSFMM(E1) 88.235% 88.372% 79.545% 86.275% 83.721% 86.364% 85.419%
DSFMM(E2) 86.667% 83.721% 88.462% 86.275% 83.721% 86.364% 85.868%
DSBMM(E1) 88.235% 86.364% 79.545% 84.615% 84.091% 88.889% 85.290%
DSBMM(E2) 88.889% 81.818% 88.235% 86.538% 86.047% 86.792% 86.387%
SIMM(E1) 85.455% 87.037% 92% 86.885% 91.489% 85.714% 88.097%
SIMM(E2) 86.275% 88.679% 85.714% 86.441% 91.489% 87.272% 87.645%
Table 6: recall ratio comparison table for E1 and E2
Figure BDA0003195009720000083
Figure BDA0003195009720000091
Table 7: f1 comparative Table of E1 and E2
Grouping C1/C1′ C2/C2′ C3/C3′ C4/C4′ C5/C5′ C6/C6′ Average value of
FMM(E1) 83.858% 83.820% 82.373% 84.683% 81.858% 83.535% 83.355%
FMM(E2) 87.309% 83.032% 85.777% 84.806% 80.003% 84.316% 84.207%
BMM(E1) 86.125% 86.305% 83.870% 82.956% 82.360% 82.363% 83.997%
BMM(E2) 84.003% 78.939% 84.322% 81.885% 80.602% 78.870% 81.437%
DSFMM(E1) 85.660% 86.815% 79.570% 86.064% 82.846% 86.113% 84.511%
DSFMM(E2) 86.164% 84.233% 88.162% 86.281% 84.088% 86.280% 85.868%
DSBMM(E1) 87.682% 86.482% 79.693% 84.914% 84.173% 88.186% 85.178%
DSBMM(E2) 88.561% 82.085% 88.140% 86.618% 86.157% 86.924% 86.414%
SIMM(E1) 85.402% 87.154% 90.418% 86.494% 89.833% 86.124% 87.571%
SIMM(E2) 86.319% 87.850% 86.188% 86.855% 90.134% 87.751% 87.516%
Five algorithms are classified into three categories: the first is the traditional maximum matching algorithm (FMM algorithm and BMM algorithm); the second category is two algorithms (DSFMM algorithm and DSBMM algorithm) that implement automatic recognition of subsequent words using a forward semantic similarity dictionary. If not, the maximum match is still made. The second class of algorithms still to some extent follow the principle of maximum matching; the third class of algorithm is SIMM algorithm, which adds word frequency and semantics into word segmentation matching algorithm to form new calculation method, which is incomplete maximum matching word segmentation method. From the line graph, it can be seen that:
in the transverse direction, the slope of one type of algorithm is the largest, and the slopes of three types of algorithms are the smallest. This illustrates that one class of algorithms is more affected by word segmentation length. Because the word adhesion phenomenon is easily caused according to the maximum matching principle, the segmentation result is often a combination of a plurality of words, and if the word segmentation result of an experimental sample is mostly short words, the accuracy of one type of algorithm is reduced.
In the longitudinal direction, the F1 value of one type of algorithm is minimum, and the F1 value of three types of algorithms is maximum. The second-class algorithm F1 value is larger than the first-class algorithm F1 value, and the introduced forward semantic similarity dictionary can correctly and effectively identify the subsequent words. The F1 value of the three kinds of algorithms is larger than that of the two kinds of algorithms, which shows that the segmentation principle redefined by the semantics and the word frequency is used to improve the accuracy of the word segmentation algorithm.
In the third set of tests E3, experiments were performed from one of three classes of algorithms, the time performance of which are shown in table 8:
table 8: three algorithms process 100 paragraph time-consuming comparison tables
Figure BDA0003195009720000101
The DSFMM algorithm is shorter in time consumption than the FMM algorithm, and the fact that the forward semantic similarity dictionary is introduced to conduct subsequent word recognition on the basis of the traditional maximum matching algorithm can effectively improve algorithm efficiency. The SIMM algorithm presented herein is somewhat more time-consuming than the conventional algorithm because it requires computation of each word that is matched, consuming a lot of time. But simultaneously adopts the automatic recognition of the subsequent words, thereby saving time. Making the SIMM algorithm nearly equal in time performance to the traditional maximum matching algorithm. Also within acceptable limits.
Finally, it should be noted that: the embodiments described above are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced with equivalents; such modifications and substitutions do not depart from the spirit of the invention.

Claims (5)

1. The incomplete maximum matching word segmentation method based on the semantics is characterized by comprising the following steps of: redefining a segmentation principle of a matching algorithm by using semantic elements in a semantic dictionary, wherein the segmentation principle specifically comprises the following steps:
s1, constructing a forward semantic similarity dictionary: forward semantic similarity dictionary D using training set corpus T and synonym Lin Goujian Sim The construction process specifically comprises the following substeps:
s11, for the word w in the corpus T of the training set i ,n i The set of the subsequent entries is C w ={w ij ,1≤j≤n i By w i And w is equal to ij The set of semantic similarity components of (a) is C Sim ={w ij :Sim ij ,1≤j≤n i -w is ij Representing w i Is the j-th subsequent entry, sim ij Representing w i And w is equal to ij Semantic similarity in word forests,
Figure QLYQS_1
representing vocabulary entry w i Average value of semantic similarity with all subsequent terms, namely:
Figure QLYQS_2
s12, w i Store C for keys Sim And mean value
Figure QLYQS_3
Recording semantic information of adjacent entries in T to obtain D Sim
S2, segmenting the words to obtain initial words, wherein the method specifically comprises the following substeps:
S21、assume that there is a string S of Chinese characters of length n to be segmented n =w 1 w 2 ……w n Counting a general dictionary containing all entries as D; in a round of maximum matching algorithm, the set of h components that all match successfully is counted as C h ={h|(1≤h≤k)∩w 1 w 2 ......w h E D, wherein h represents the entry length of the Chinese character string to be segmented in the general dictionary D, and k represents the matching word length of the maximum matching algorithm, i.e. w 1 w 2 ……w k Is the first segmentation result of the maximum matching algorithm, set C h The word segmentation formed by each element in the dictionary belongs to a general dictionary D and is used as a standby result;
s22, calculating a tri-feature weight WE of each word, wherein the calculation formula is as follows:
Figure QLYQS_4
wherein S is h Representing the result of the set C h The length of the word segmentation determined by the medium element is h;
Figure QLYQS_5
representation word S h Average semantic similarity to subsequent words;
Figure QLYQS_6
Representing the maximum average semantic similarity; p is p h Representing the frequency of occurrence of words of word length h in the general dictionary D; p is p max Represented by p in the general dictionary D h Is the maximum value of (2);
s23, taking the S with the maximum weight of three characteristics h As a word segmentation result, count the initial word S h =w 1 w 2 ……w h
S3, automatically identifying the following words: for the Chinese character string S to be segmented n =w 1 w 2 ......w n S is obtained through the segmentation method of S2 h =w 1 w 2 ......w h (h.ltoreq.n), in dictionary D Sim Reads all S h Subsequent words of (2)Aggregation
Figure QLYQS_7
If there is S h Is successfully matched with a certain subsequent word, i.e. +.>
Figure QLYQS_8
Will S h2 Automatically recognized as an initial word S h Subsequent words of (a);
if there are a plurality of S h2 Taking and S h The initial word S with the maximum semantic similarity h Subsequent words of (a); if there is no continuous execution of step S2 to segment S h Subsequent strings of (i.e. input string is S) n-h =w h+1 w h+2 ......w n
S4, repeatedly executing the steps S2-S3, and finally cutting to finish the Chinese character string S n
2. The semantic-based incomplete maximum-match word segmentation method according to claim 1, wherein: in step S12, D Sim The stored structure of (2) is shown in the following table:
Figure QLYQS_9
3. the semantic-based incomplete maximum-match word segmentation method according to claim 1, wherein: the semantic similarity and word frequency characteristics of the three characteristic weights are smaller than 1, and the word length characteristics are larger than 1.
4. The semantic-based incomplete maximum-match word segmentation method according to claim 3, wherein: and carrying out normalization processing on the three characteristic weights by using the maximum average semantic similarity and the maximum frequency, and eliminating data contingency.
5. The semantic-based incomplete maximum-match word segmentation method according to claim 3, wherein: and the method further comprises the step S5 of verifying the calculation result: make the following stepsWith accuracy, recall, and harmonic mean F 1 Taking time overhead as an evaluation index; it is assumed that in the word segmentation after the algorithm segmentation, the correct result is r 1 The number of wrong results is f, and the number of word segmentation given by the sample is r 2 Accuracy P, recall R and a harmonic mean F of the two 1 The calculation formula of (2) is as follows:
Figure QLYQS_10
Figure QLYQS_11
Figure QLYQS_12
CN202110888301.2A 2021-08-03 2021-08-03 Incomplete maximum matching word segmentation method based on semantics Active CN113536807B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110888301.2A CN113536807B (en) 2021-08-03 2021-08-03 Incomplete maximum matching word segmentation method based on semantics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110888301.2A CN113536807B (en) 2021-08-03 2021-08-03 Incomplete maximum matching word segmentation method based on semantics

Publications (2)

Publication Number Publication Date
CN113536807A CN113536807A (en) 2021-10-22
CN113536807B true CN113536807B (en) 2023-05-05

Family

ID=78090363

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110888301.2A Active CN113536807B (en) 2021-08-03 2021-08-03 Incomplete maximum matching word segmentation method based on semantics

Country Status (1)

Country Link
CN (1) CN113536807B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109960786A (en) * 2019-03-27 2019-07-02 北京信息科技大学 Chinese word similarity calculation method based on fusion strategy
CN111832299A (en) * 2020-07-17 2020-10-27 成都信息工程大学 Chinese word segmentation system
WO2021093755A1 (en) * 2019-11-14 2021-05-20 华为技术有限公司 Matching method and apparatus for questions, and reply method and apparatus for questions

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7503035B2 (en) * 2003-11-25 2009-03-10 Software Analysis And Forensic Engineering Corp. Software tool for detecting plagiarism in computer source code

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109960786A (en) * 2019-03-27 2019-07-02 北京信息科技大学 Chinese word similarity calculation method based on fusion strategy
WO2021093755A1 (en) * 2019-11-14 2021-05-20 华为技术有限公司 Matching method and apparatus for questions, and reply method and apparatus for questions
CN111832299A (en) * 2020-07-17 2020-10-27 成都信息工程大学 Chinese word segmentation system

Also Published As

Publication number Publication date
CN113536807A (en) 2021-10-22

Similar Documents

Publication Publication Date Title
CN112069298B (en) Man-machine interaction method, device and medium based on semantic web and intention recognition
CN110442760B (en) A synonym mining method and device for question answering retrieval system
CN108197117B (en) A Chinese text keyword extraction method based on document topic structure and semantics
CN111125334B (en) Search question-answering system based on pre-training
CN105786991B (en) Method and system for Chinese emotional new word recognition combined with user emotional expression
CN109522547B (en) Chinese synonym iteration extraction method based on pattern learning
CN112632287B (en) Electric power knowledge graph construction method and device
CN110674296B (en) Information abstract extraction method and system based on key words
CN111897917B (en) Rail transit industry term extraction method based on multi-modal natural language features
CN114579729B (en) FAQ question-answer matching method and system fusing multi-algorithm models
CN111191051B (en) Method and system for constructing emergency knowledge map based on Chinese word segmentation technology
CN102063424A (en) Method for Chinese word segmentation
CN114491062B (en) Short text classification method integrating knowledge graph and topic model
CN113360647A (en) 5G mobile service complaint source-tracing analysis method based on clustering
CN113988053A (en) Hot word extraction method and device
CN111881678B (en) A domain word discovery method based on unsupervised learning
CN109033066A (en) A kind of abstract forming method and device
CN115858787B (en) A Hotspot Extraction and Mining Method Based on Problem Appeal Information in Road Transportation
CN117973381A (en) Method for automatically extracting text keywords
CN114266249A (en) A Massive Text Clustering Method Based on Birch Clustering
CN110851593A (en) Complex value word vector construction method based on position and semantics
CN118797005A (en) Intelligent question-answering method, device, electronic device, storage medium and product
CN114879945B (en) Diversified API sequence recommendation method and device for long tail distribution characteristics
CN113536807B (en) Incomplete maximum matching word segmentation method based on semantics
CN115221871B (en) Multi-feature fusion method for extracting keywords from English scientific literature

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant