CN113076750A

CN113076750A - Cross-domain Chinese word segmentation system and method based on new word discovery

Info

Publication number: CN113076750A
Application number: CN202110463683.4A
Authority: CN
Inventors: 张军; 李�学; 宁更新; 杨萃; 冯义志; 余华; 陈芳炯; 季飞
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2021-04-26
Filing date: 2021-04-26
Publication date: 2021-07-06
Anticipated expiration: 2041-04-26
Also published as: CN113076750B

Abstract

The invention discloses a cross-domain Chinese word segmentation system and method based on new word discovery. The system includes: a new word discovery module, which uses enhanced mutual information combined with statistical information and semantic information to realize a new word discovery algorithm, and is used for Mining new vocabulary vocabulary in the labeled corpus; automatic tagging module, using the new vocabulary vocabulary combined with the reverse maximum matching algorithm to realize the initial segmentation of the unlabeled corpus, and obtain the incompletely segmented corpus, and use the word segmentation model to segment incompletely. The corpus is completely segmented, and the automatically labeled corpus is obtained; the cross-domain word segmentation module uses the adversarial method to realize the cross-domain Chinese word segmentation algorithm, and uses the labeled source domain corpus and the automatically labeled corpus for adversarial training. The invention optimizes the new word discovery algorithm by using enhanced mutual information, improves the accuracy rate of new word discovery and the domain of the vocabulary; in the cross-domain word segmentation algorithm, the utilization rate of unlabeled corpus is improved, and the recall rate of word segmentation is optimized. and accuracy.

Description

Cross-domain Chinese word segmentation system and method based on new word discovery

Technical Field

The invention relates to the technical field of natural language, in particular to a cross-domain Chinese word segmentation system and a cross-domain Chinese word segmentation method based on new word discovery.

Background

The Chinese text takes the Chinese characters as the minimum writing unit, the Chinese characters are mutually combined to form words, and finally the words form the Chinese text. The words are the minimum structural units which contain semantic information and can be independently used in the Chinese text, but are different from languages such as English and the like, no explicit separators exist among the Chinese words, the Chinese text is divided into words by using a certain technical method so as to be convenient for a computer to understand, and the process is the Chinese word segmentation. Chinese segmentation is the most basic task in Chinese natural language processing, which is the cornerstone of natural language processing tasks such as text classification, text generation, and emotion analysis. Therefore, the quality of the Chinese word segmentation result directly influences the result of the downstream task.

The traditional Chinese word segmentation method mainly comprises two types: mechanical word segmentation method and Chinese word segmentation method based on statistics. The mechanical word segmentation method is used for segmenting words by taking an existing dictionary as a basis and combining a certain manual rule, and the recognition capability Of unknown words (Out Of Vocalburry, OOV, words which do not appear in a word segmentation dictionary) is very low; the word segmentation method based on statistics is limited in context in a very small range, global features cannot be counted, and the recognition capability of unknown words is as low as that of the unknown words, so that the accuracy and recall rate of the two word segmentation methods are poor, and the two word segmentation methods cannot be used as practical word segmentation technologies at present. With the development of deep learning, applying the deep learning technology to Chinese word segmentation becomes a current research hotspot. The existing neural network model treats Chinese word segmentation as a sequence labeling problem, the model is trained by a manually labeled data set, a Chinese dictionary and an artificial construction rule do not need to be obtained, a characteristic template does not need to be artificially designed, and the accuracy and the recall rate are far higher than those of the traditional method, so that the neural network model becomes a mainstream method of Chinese word segmentation at present. At present, a model trained on large-scale artificial labeling data (called a source field) has a good effect on a word segmentation task, but when the model segments a cross-field corpus (a target field), the effect is sharply reduced, and Chinese word segmentation in the source field and the target field, which do not belong to the same field, is called cross-field Chinese word segmentation. The effect of cross-domain Chinese word segmentation is mainly limited by the problems of Expression gaps (Expression gaps) and unknown words of texts between a target domain and a source domain, wherein the Expression gaps refer to different segmentation modes of the same text in different domains, and the unknown words refer to words which only appear in the target domain but not in the source domain, and mainly include names of people, place names and professional nouns. The best method for solving the expression gap is to artificially label the corpora of the target field and then mix the corpora of the two fields to retrain the model, however, large-scale artificial labeling needs to consume a large amount of manpower and material resources, and artificial labeling of all the fields is not possible, so that the method has no practical feasibility; the best method for solving the problem of unknown words is to enable a professional to extract words which are not appeared in a source field corpus from a target field corpus and place the words as training corpuses into a model for training, however, on one hand, a large amount of manpower and material resources are consumed for selecting the unknown words, and on the other hand, all the unknown words cannot be selected by manpower because various new words are listed in the current society endlessly. Therefore, the cross-domain Chinese word segmentation is difficult to achieve a good effect all the time.

Disclosure of Invention

The invention aims to solve the defects in the prior art and provides a cross-domain Chinese word segmentation system and a cross-domain Chinese word segmentation method based on new word discovery. The cross-domain Chinese word segmentation system is divided into three modules, namely a new word discovery module, an automatic labeling module and a cross-domain word segmentation module. The difference between the invention and the traditional Chinese word segmentation method is that: (1) the relevance based on semantic information is added on the basis of a traditional new word discovery algorithm based on mutual information and adjacent entropy to form vector enhanced mutual information, so that the internal condensation degree of a character string can be measured more accurately, and the generation of junk character strings is reduced; (2) compared with the traditional cross-domain Chinese word segmentation method based on dictionary and source domain labeled corpus training, the method realizes automatic labeling of the target domain corpus without labels based on the domain dictionary, can obviously reduce the unknown word rate of the test corpus, and then converts the cross-domain word segmentation into the intra-domain word segmentation by utilizing the automatically labeled corpus training model, and can obviously reduce the influence of the expression gap on the model. The invention can be widely applied to various fields without large-scale language materials, such as the medical field, the scientific field, the biological field, the novel field and the like.

The first purpose of the invention can be achieved by adopting the following technical scheme:

a cross-domain Chinese word segmentation system based on new word discovery is composed of a new word discovery module, an automatic labeling module and a cross-domain word segmentation module, wherein the three modules are connected in sequence. The above modules are as follows:

(1) the new word discovery module: the method is used for extracting new words which do not appear in the source field, namely unknown words, from the target field linguistic data without labels. The new word discovery module is composed of a candidate word extraction sub-module, an enhanced mutual information extraction sub-module, an adjacent entropy extraction sub-module and a candidate word filtering sub-module, wherein the candidate word extraction sub-module, the enhanced mutual information extraction sub-module and the candidate word filtering sub-module are sequentially connected, the candidate word extraction sub-module is used for extracting all candidate words from the target field corpus, the enhanced mutual information extraction sub-module is used for extracting enhanced mutual information of all candidate words, and the candidate word filtering sub-module is used for filtering the candidate words; the candidate word filtering submodule, the adjacent entropy extracting submodule and the candidate word filtering submodule are sequentially connected, and the adjacent entropy extracting submodule is used for extracting the adjacent entropy of all candidate words; and the candidate word extracting submodule is connected with the candidate word filtering submodule.

(2) An automatic labeling module: and automatically labeling the linguistic data in the non-labeled target field by using a new word list and a word segmentation algorithm obtained in the new word discovery module. The automatic labeling module is composed of a first Chinese word segmentation sub-module and a second Chinese word segmentation sub-module, the first Chinese word segmentation sub-module matches the corpus based on the new word vocabulary, if the matching is successful, segmentation is carried out, otherwise segmentation is not carried out, and incomplete segmentation of the corpus in the target field is realized; and the second Chinese word segmentation sub-module performs segmentation on the unsingulated corpus in the first Chinese word segmentation sub-module by using a GCNN-CRF word segmentation algorithm based on source field corpus training to realize complete segmentation on the corpus in the target field.

(3) A cross-domain word segmentation module: and training the confrontation type deep neural network by using the labeled source field linguistic data and the automatically labeled target field linguistic data, and converting cross-field participles into in-field participles to realize the participles of the target field. The cross-domain word segmentation module comprises a source domain feature extraction submodule, a public feature extraction submodule, a target domain feature extraction submodule, a source domain lexeme labeling submodule, a text classification submodule and a target domain lexeme labeling submodule, wherein the source domain feature extraction submodule and the source domain lexeme labeling submodule are connected to form a branch I; the public feature extraction submodule is respectively connected with the text classification submodule, the source field lexeme labeling submodule and the target field lexeme labeling submodule to form a branch II, the public feature extraction submodule is used for extracting public features of source field linguistic data and target field linguistic data, the text classification submodule is used for judging which field the input comes from, and the target field lexeme labeling submodule is connected and used for performing lexeme labeling on the target field linguistic data; and the target field feature extraction submodule and the target field lexeme labeling submodule are connected to form a branch III, and the target field feature extraction submodule is used for extracting the unique features of the target field corpus.

Furthermore, the source domain feature extraction submodule, the target domain feature extraction submodule and the public feature extraction submodule all adopt GCNN as a feature extractor, the GCNN comprises 4 CNN layers and 1 activation layer, input vectors parallelly enter the 4 CNN layers, feature extraction is carried out through the CNN to obtain 4 feature vectors, the feature vectors of the first CNN layer are input into the activation layers to be activated, the dimensionality is kept unchanged, numbers in the vectors are limited between 0 and 1 to serve as weight vectors, the vectors obtained by multiplying the weight vectors and the feature vectors output by the other 3 CNN layers are the final feature vectors, and the activation function is sigmoid.

The second purpose of the invention can be achieved by adopting the following technical scheme:

a cross-domain Chinese word segmentation method based on new word discovery is used for achieving word segmentation of linguistic data in different domains by adopting the following steps:

step S1: and mining a new word list of the field from the target field corpus by using a new word discovery module.

Step S2: and (3) automatically labeling the linguistic data of the non-labeled target field by using an automatic labeling module in combination with the new word list of the field obtained in the step (1).

Step S3: extracting the characteristics of the source field and the target field corpora through three branches of the cross-field word segmentation module, wherein the first branch uses the source field characteristic extraction submodule to extract the source field characteristic H from the source field corpora_src(ii) a The second branch circuit uses a common feature extraction submodule to extract the common features H of the corpora of the source field and the target field_shr(ii) a Branch three-use target field feature extraction submodule for extracting target field features H from target field corpus_tgt；

Step S4: h obtained in step S3_srcAnd H_shrInputting the predicted values into a source field lexeme labeling submodule to predict a source field lexeme label, and performing step S3 on the obtained H_tgtAnd H_shrInputting the predicted target domain word position label into a target domain word position labeling submodule to predict a target domain word position label, and performing the step S3 on the obtained H_shrInput into a text classification sub-module to predict domain labels for the input text.

Further, in step S1, the step of using the new word discovery module to extract the new word list of the domain from the target domain corpus includes the following steps: step S1.1: and extracting all candidate words with the length not exceeding n in the domain corpus from the unmarked target domain corpus by using a candidate word extraction submodule.

Step S1.2: randomly segmenting the candidate word C into a front internal segment A and a rear internal segment B, and counting the times of C, A and B as n_C、n_AAnd n_BThe mutual information MI of C is calculated by the following method_C。

Wherein n is_wIndicating the number of occurrences of an arbitrary character string w in the corpus.

Step S1.3: training Word2Vec model by using target field corpus to obtain any character c_jWord vector of

The word vector Vec of the internal segment A is calculated by the following method_AAnd the word vector Vec of the inner segment B_B：

Wherein i represents the number of Chinese characters in A, m represents the number of Chinese characters in B, a_pAnd b_qRepresents the value of the word vector at positions p and q, n represents the vector dimension;

step S1.4: word vectors Vec from the internal segment a and the internal segment B in step S1.3_A、Vec_BIs measured by the following methodCalculating the semantic relevance sim (A, B) of the internal segment A and the internal segment B:

step S1.5: according to mutual information MI in step S1.2_CAnd semantic relevance sim (A, B) of step S1.4, calculating the enhanced mutual information ENMI of the candidate word C by the following method_C：

Wherein beta is₁And weight coefficients representing semantic relevance in enhancing mutual information.

Step S1.6: respectively finding out all left adjacent characters [ L ] of candidate word C from target field linguistic data₁，...L_u...L_H]And the right adjacent word [ R₁，...R_v...R_D]Where H and D represent the number of left and right adjacent words, respectively, and the number of occurrences of each left adjacent word on the left side of the candidate word [ n (L) ]₁),...n(L_u)...n(L_p)]And the number of occurrences of each right adjacent word to the right of the candidate word [ n (R)₁),...n(R_d)...n(R_q)]The probability of each adjacent word is calculated separately using the following formula:

step S1.7: according to the probability p (L) of the left and right adjacent words in step S1.6_u) And p (R)_v) Calculating the left adjacent entropy H of the candidate word C by using the following formula_l(C) And right adjacent entropy H_r(C)：

Step S1.8: according to the left adjacent entropy H of the candidate word C in step S1.7_l(C) And right adjacent entropy H_r(C) And calculating the adjacency entropy of the candidate word C by adopting the following method:

step S1.9: according to the enhanced mutual information ENMI in step S1.4_CAnd the adjacency entropy BE in step S1.5_CCalculating the overall score of the candidate word C by adopting the following method:

score(C)＝sigmoid(β₂*ENMI_C+BE_C)

wherein, beta₂In order to enhance the weight of mutual information in the overall score, sigmoid represents normalization, and the calculation method comprises the following steps:

step S1.10: and setting a candidate word score threshold, comparing the overall score (C) of the step C in the step S1.6 with the threshold, if score (C) is greater than the threshold, determining that the candidate word is a reasonable word, and if not, removing the candidate word from the candidate word list to obtain a new word list.

Further, in step S2, the automatic labeling module combines the new word list of the field obtained in step S1 with the first chinese word segmentation module and the second chinese word segmentation module to realize automatic labeling of the linguistic data of the target field without labeling, and the process is as follows:

step S2.1: the first Chinese word segmentation module adopts a reverse maximum matching algorithm, sets a maximum matching length N, takes out a character string with the length of N from the last character of a sentence, inquires whether the character string is in a new word list, if the character string is segmented, moves the current character position to the left by N distances, if the character string is not segmented, subtracts 1 from the matching length, continues to match, if the matching length is still unsuccessful after subtracting 1 from the matching length, moves the current character position to the left by a distance, continues to match until the whole sentence is matched, and realizes the primary segmentation of the target corpus;

step S2.2: segmenting labeled source field corpus into words according to spaces, marking each character in the words with a lexeme label according to the length of each word, taking an input text as an input, and taking the lexeme label as an output to construct a training data set, wherein the lexeme labels comprise B, M, E and S, B represents a starting character of a multi-word, M represents a middle character of the multi-word, E represents an ending character of the multi-word, and S represents an independent word-forming character;

step S2.3: the second Chinese word segmentation module calculates the cost function by adopting the following method:

wherein Y ═ Y₁,y₂,y₃,y₄) Real lexeme labels, y, representing characters_s∈{0,1}，y_sA value of 1 indicates that the character tag is the s-th lexeme tag,

representing the output lexeme labels through the model,

representing the probability that the model predicts the character tag belongs to the s-th lexeme tag;

step S2.4: training data is input into the model for training until loss₁Is less than a preset value;

step S2.5: and (4) segmenting the unsingulated part in the incompletely segmented corpus obtained in the first Chinese segmentation module by using the second Chinese segmentation module obtained by training in the step (S2.4) to obtain the completely segmented target field automatic labeling corpus.

Further, the step S3 process is as follows:

step S3.1: segmenting the labeled source field corpus and the automatically labeled target field corpus into words, and marking a lexeme label B, M, E or S for each word according to the length of each word, wherein B represents the initial character of a multi-word, M represents the middle character of the multi-word, E represents the end character of the multi-word, S represents the character of an independent word, the corpus is used as input, and the lexeme labels are used as output to respectively construct a source field training set and a target field training set;

step S3.2: inputting the text of the source field training set obtained in the step S1 into the source field feature extraction submodule, where the vector output by the source field feature extraction submodule is the source field feature H unique to the source field corpus_src；

Step S3.3: inputting the text of the target field training set obtained in the step S1 into the target field feature extraction submodule, where the vector output by the target field feature extraction submodule is the unique target field feature H of the target field corpus_tgt；

Step S3.4: the texts in the source field training set and the target field corpus obtained in the step S1 are sequentially input into a common feature extraction submodule, and the vector output by the common feature extraction submodule is the feature H common to the source field and the target field_shr。

Further, the step S4 process is as follows:

step S4.1: using source domain feature H_srcAnd target Domain characteristic H_tgtRespectively with a common feature H_shrSplicing is carried out by adopting the following method to obtain the general characteristics of the source field

And general characteristics of the target area

Step S4.2: the cost functions of the source field lexeme labeling submodule, the target field lexeme labeling submodule and the text classification submodule are loss respectively_src、loss_tgtAnd loss_shrThe following method is adopted for calculation:

wherein Y ═ Y₁,y₂,y₃,y₄) Real lexeme labels, y, representing characters_t∈{0,1}，y_tWhen the value is 1, the character label is the t-th lexeme label, y_tA value of 0 indicates that the character tag is not the tth lexeme tag,

representing the output lexeme labels through the model,

representing the probability that the model predicts the character tag belongs to the t-th lexeme tag; y ═ Y'₁,y′₂,y′₃，y′₄) Real lexeme tag, y 'representing a character'_k∈{0，1}，y′_kWhen the number is 1, the character label is the kth lexeme label, y'_kA value of 0 indicates that the character tag is not the kth lexeme tag,

representing output lexemes through a modelThe number of the labels is such that,

representing the probability that the model predicts that the character tag belongs to the ith lexeme tag; y ″ (Y ″)₁，y″₂) Real world tag, y ″, representing an input sample_l∈{0，1}，y″_lA sample of 1 indicates that the sample is from the source domain, a sample of 0 indicates that the sample is from the target domain,

a sample domain label representing the output through the model,

representing the probability that the sample of the model output belongs to the l-th domain;

step S4.3: performing countermeasure training by using a source field lexeme labeling submodule, a target field lexeme labeling submodule and a text classification submodule, and realizing cross-field Chinese word segmentation from the source field to the target field, wherein the total cost function is as follows:

loss＝loss_srC+β₃*loss_tgt+β₄*loss_shrwherein, beta₃And beta₄Respectively represent loss_tgtAnd loss_shrThe weight occupied in the total cost function;

step S4.4: the model is trained until the variation of loss is less than a predetermined value.

Compared with the prior art, the invention has the following advantages and effects:

1. the invention improves the new word discovery algorithm by using vector enhanced mutual information, effectively combines statistical information and semantic information in the corpus, not only obviously improves the accuracy of new word discovery, reduces garbage word strings in the new word list, but also obviously enhances the field of the new word list.

2. The invention uses a new word discovery algorithm to extract a new word vocabulary of the corpus relative to the corpus of the source field from the target field corpus without labels, which is also called an unknown word vocabulary. The method has the advantages that the target domain linguistic data without labels are automatically labeled based on the new word vocabulary, and the unknown word rate of the test linguistic data relative to the training linguistic data is remarkably reduced.

3. The method uses the labeled source field corpus and the automatically labeled target field corpus to train the Chinese word segmentation algorithm based on the antagonistic training, reduces the influence of noise samples in the automatically labeled corpus on model training, improves the effect of the cross-field Chinese word segmentation algorithm, and has the effect superior to the effect of the Chinese word segmentation algorithm without the antagonistic training.

Drawings

FIG. 1 is a block diagram of a cross-domain Chinese segmentation system based on new word discovery as disclosed in an embodiment of the present invention;

FIG. 2 is a block diagram of a new word discovery module according to an embodiment of the present invention;

FIG. 3 is a block diagram of an automatic labeling module according to an embodiment of the present invention;

FIG. 4 is a block diagram of a domain-wide Chinese segmentation module in an embodiment of the present invention;

FIG. 5 is a schematic diagram of a GCNN-CRF model network according to an embodiment of the present invention;

FIG. 6 is a diagram of a network structure of a TextCNN model according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Example one

A structural block diagram of a cross-domain chinese word segmentation system based on new word discovery disclosed in this embodiment is shown in fig. 1, and is composed of a new word discovery module, an automatic labeling module, and a cross-domain word segmentation module, where the new word discovery module, the automatic labeling module, and the cross-domain word segmentation module are connected in sequence, and are respectively used for mining new words from non-labeled target domain corpora, automatically labeling the non-labeled target domain corpora, and training a neural network for cross-domain chinese word segmentation.

A structural block diagram of a new word discovery module in this embodiment is shown in fig. 2, and the new word discovery module is composed of a candidate word extraction sub-module, an enhanced mutual information extraction sub-module, an adjacent entropy extraction sub-module, and a candidate word filtering sub-module, where the candidate word extraction sub-module, the enhanced mutual information extraction sub-module, and the candidate word filtering sub-module are connected in sequence, the candidate word extraction sub-module is used to extract all candidate words from a corpus of a target field, the enhanced mutual information extraction sub-module is used to extract enhanced mutual information of all candidate words, and the candidate word filtering sub-module is used to filter candidate words; the candidate word extraction sub-module, the adjacent entropy extraction sub-module and the candidate word filtering sub-module are sequentially connected, and the adjacent entropy extraction sub-module is used for extracting the adjacent entropy of all candidate words; and the candidate word extracting submodule is connected with the candidate word filtering submodule.

In the above embodiment, the candidate word extraction sub-module extracts all candidate words in the corpus from the unmarked target domain corpus; the enhanced mutual information extraction sub-module considers the semantic information of the candidate words on the basis of the mutual information, and adds the semantic similarity of the candidate words into the mutual information to calculate the enhanced mutual information of the candidate words; on the basis of the existing calculation of the adjacent entropy, the adjacent entropy extraction sub-module fully considers the left and right adjacent entropies of the candidate word, gives certain weight to the left and right adjacent entropies, and calculates and obtains the adjacent entropies of all the candidate words by utilizing the information contained in the left and right adjacent entropies to the maximum extent; and the candidate word filtering submodule adds the enhanced mutual information and the adjacent entropy of the candidate word according to a certain weight, balances the importance of the enhanced mutual information and the adjacent entropy to obtain the final score of the candidate word, and is used for filtering the candidate word to obtain a new word list of the corpus.

The structural block diagram of the auto-tagging module in this embodiment is shown in fig. 3, and is composed of a first chinese participle sub-module and a second chinese participle sub-module, where the first chinese participle sub-module is used to perform incomplete segmentation on a corpus of a target field, and the second chinese participle sub-module is used to perform complete segmentation on the incompletely segmented corpus.

In the above embodiment, the automatic labeling module is composed of a first chinese word segmentation sub-module and a second chinese word segmentation sub-module, the first chinese word segmentation sub-module matches the corpus based on the new word vocabulary, if the matching is successful, the corpus is segmented, otherwise, the corpus is not segmented, and the incomplete segmentation of the corpus in the target field is realized; and the second Chinese word segmentation sub-module performs segmentation on the unsingulated corpus in the first Chinese word segmentation sub-module by using a GCNN-CRF word segmentation algorithm based on source field corpus training to realize complete segmentation on the corpus in the target field. The module realizes automatic labeling of the non-labeled corpus in two steps, can better label the non-labeled target field corpus by using the new word vocabulary, and improves the labeling accuracy.

A cross-domain word segmentation module structure block diagram in this embodiment is shown in fig. 4, and is composed of a source domain feature extraction sub-module, a common feature extraction sub-module, a target domain feature extraction sub-module, a source domain lexeme tagging sub-module, a text classification sub-module, and a target domain lexeme tagging sub-module, where the source domain feature extraction sub-module is connected with the source domain lexeme tagging sub-module, the source domain feature extraction sub-module is used for extracting unique features of a source domain corpus, and the source domain lexeme tagging sub-module is used for performing lexeme tagging on the source domain corpus; the public feature extraction submodule is respectively connected with the text classification submodule, the source field lexeme labeling submodule and the target field lexeme labeling submodule, the public feature extraction submodule is used for extracting public features of source field linguistic data and target field linguistic data, the text classification submodule is used for judging which field the input comes from, and the target field lexeme labeling submodule is connected and used for performing lexeme labeling on the target field linguistic data; the target field feature extraction submodule is connected with the target field lexeme labeling submodule and is used for extracting the unique features of the target field linguistic data; the public feature extraction submodule is connected with the source field lexeme labeling submodule; and the public characteristic extraction submodule is connected with the target field lexeme labeling submodule.

In the above embodiment, the cross-domain word segmentation module introduces countermeasure training, which includes three branches, and the source domain feature extraction sub-module and the source domain lexeme tagging sub-module are connected to form a branch one; the public feature extraction submodule is respectively connected with the text classification submodule, the source field lexeme labeling submodule and the target field lexeme labeling submodule to form a branch II; and the target field feature extraction submodule and the target field lexeme labeling submodule are connected to form a branch III. The cross-domain word segmentation module can effectively reduce the influence of noise in the automatic labeling corpus on model training by giving certain weight to the automatic labeling corpus in model training, and improves the word segmentation effect of the model.

Example two

The embodiment provides a cross-domain Chinese word segmentation method based on a cross-domain Chinese word segmentation system discovered based on new words, and the method adopts the following steps to realize word segmentation of linguistic data in different domains:

step S1: and mining a new word list of the field from the target field corpus by using a new word discovery module. In the step S1, the method for mining a new word vocabulary of the field from the target field corpus using the new word discovery module includes the following steps:

step S1.1: and extracting all candidate words with the length not exceeding n in the domain corpus from the unmarked target domain corpus by using a candidate word extraction submodule.

In this embodiment, the candidate word extraction sub-module segments the corpus according to the non-chinese characters, sets the maximum candidate word length to 6, extracts all candidate words having a length not exceeding 6 from the sentences of the segmented corpus, stores the extracted candidate words in the candidate word set, and counts the occurrence number of the candidate words according to the corpus, and stores the candidate words in the dictionary D.

Wherein n is_wRepresenting the number of occurrences of any character string w in the corpus;

in this embodiment, the minimum length of both a and B is 1.

step S1.4: word vectors Vec from the internal segment a and the internal segment B in step S1.3_A、Vec_BCalculating the semantic relevance sim (A, B) of the internal segment A and the internal segment B by adopting the following method:

Wherein beta is₁Weight coefficients representing semantic relevance in enhancing mutual information;

in the above embodiment, the weight coefficient β₁Is 300.

Step S1.6: respectively finding out all left adjacent characters [ L ] of candidate word C from target field linguistic data₁，...L_u...L_H]And the right adjacent word [ R₁，...R_v...R_D]Where H and D represent the number of left and right adjacent words, respectively, and the number of occurrences of each left adjacent word on the left side of the candidate word [ n (L) ]₁)，...n(L_u)...n(L_p)]And the number of occurrences of each right adjacent word to the right of the candidate word [ n (R)₁)，...n(R_d)...n(R_q)]The probability of each adjacent word is calculated separately using the following formula:

Step S1.8: according to the left adjacent entropy H of the candidate word C in step S1.7_l(C) And right adjacent entropy H_r(C) And calculating the adjacency entropy value of C by adopting the following method:

score(C)＝sigmoid(β₂*ENMI_C+BE_C)

in this embodiment, the weight β occupied by the enhanced mutual information in the overall score is₂Is 60.

Step S1.10: setting a candidate word score threshold, comparing the overall score (C) of the step C in the step 1.6 with the threshold, if score (C) is greater than the threshold, considering the candidate word as a reasonable word, otherwise, removing the candidate word from the candidate word list, and finally obtaining a new word list.

In the present embodiment, the score threshold is set to 0.9.

Step S2: and the automatic labeling module automatically labels the linguistic data of the non-labeled target field by combining the new word list of the field obtained in the step S1.

In the step S2, the automatic labeling of the linguistic data of the target unmarked domain is realized by using the new domain word list obtained in the step S1 and combining the first chinese word segmentation module and the second chinese word segmentation module, which includes the following steps:

step S2.1: the first Chinese word segmentation module adopts a reverse maximum matching algorithm and sets a maximum matching length N. Starting from the last character of the sentence, taking out the character string with the length of N, inquiring whether the character string is in a new word list, if the character string is segmented, moving the current character position to the left by N distances, if the character string is not segmented, subtracting 1 from the matching length, continuing to match, if the matching length is not successful after subtracting 1, moving the current character position to the left by a distance, continuing to match until the whole sentence is matched, and realizing the preliminary segmentation of the target corpus.

In the above embodiment, the maximum matching length N is set to 6 (typically set to the maximum length of an entry in a vocabulary).

Step S2.2: dividing the labeled source field corpus into words according to spaces, and marking a lexeme label B, M, E, S for each character in the words according to the length of each word, wherein B represents the initial character of a multi-word, M represents the middle character in the multi-word, E represents the end character in the multi-word, and S represents the character which is independently formed into words, taking the input text as input, and taking the lexeme label as output to construct a training data set.

representing the output lexeme labels through the model,

in the above embodiment, a structural block diagram of the second chinese word segmentation module is shown in fig. 5, and includes 3 GCNN layers, 1 fully-connected layer, and 1 CRF layer. The input text vector becomes a feature vector with dimension of 200 after passing through 3 GCNNs, the feature vector becomes an output vector with dimension of 4 after passing through a full connection layer, and the output vector obtains a lexeme label of the character after passing through CRF. The size of the convolution kernel of the GCNN1 is 3, the number of the convolution kernels is 200, the size of the convolution kernel of the GCNN2 is 4, the number of the convolution kernels is 200, the size of the convolution kernel of the GCNN3 is 5, the number of the convolution kernels is 200, and the number of nodes of the full connection layer is 4.

Step S2.4: training data is input into the model for training until loss₁Is less than a preset value.

In the above embodiment, the preset value is set to 0.01.

Step S2.5: and (4) segmenting the unsingulated part in the incompletely segmented corpus obtained in the first Chinese word segmentation module by using the second Chinese word segmentation module in the step (S2.4) to obtain the completely segmented target field automatic labeling corpus.

The implementation of the above step S3 includes the following steps:

step S3.2: inputting the text of the source field training set obtained in the step S1 into the source field feature extraction submodule, where the vector output by the source field feature extraction submodule is the source field feature H unique to the source field corpus_src。

Step S3.3: the text of the target domain training set obtained in the step S1The vector output by the target field characteristic extraction submodule is the unique target field characteristic H of the target field corpus_tgt。

The source field feature extraction submodule, the target field feature extraction submodule and the public feature extraction submodule are completely consistent in structure and comprise 3 GCNN layers and 1 activation layer. The input text vector sequentially passes through 3 GCNNs to form a feature vector with the dimension of 200, the dimension of the feature vector remains unchanged after passing through an activation layer, the size of each number in the vector is between 0 and 1, the sizes of convolution kernels of the 3 GCNNs are 3, 4 and 5, the number of the convolution kernels is 200, and the activation function of the activation layer is sigmoid.

In the step S4, the label prediction using the feature includes the steps of:

And general characteristics of the target area

wherein Y ═ Y₁，y₂,y₃,y₄) Real lexeme labels, y, representing characters_t∈{0,1}，y_tWhen the value is 1, the character label is the t-th lexeme label, y_tA value of 0 indicates that the character tag is not the tth lexeme tag,

representing the output lexeme labels through the model,

representing the probability that the model predicts the character tag belongs to the t-th lexeme tag; y ═ Y'₁,y′₂,y′₃,y′₄) Real lexeme tag, y 'representing a character'_k∈{0,1}，y′_kWhen the number is 1, the character label is the kth lexeme label, y'_kA value of 0 indicates that the character tag is not the kth lexeme tag,

representing output lexemes passed through a modelThe label is a paper label with a color,

a sample domain label representing the output through the model,

the probability that a sample representing the model output belongs to the l-th domain.

In the above embodiment, a structural block diagram of the text classification sub-module is shown in fig. 6, and includes 3 convolutional layers, 3 pooling layers, a splicing layer, and a full-link layer. The method comprises the steps that three feature vectors are obtained through three parallel convolutional layers of input text vectors respectively, then the three feature vectors are sequentially input into a pooling layer, then the three feature vectors output by the pooling layer are input into a splicing layer to be spliced, finally the spliced vectors are input into a full-connection layer, the sizes of convolution kernels of CNN1, CNN2 and CNN3 are respectively 3, 4 and 5, the number of the convolution kernels is 200, the pooling layer uses the maximum pooling, and the number of nodes of the full-connection layer is 2.

loss＝loss_srC+α₃*loss_tgt+α₄*loss_shrwherein, beta₃And beta₄Respectively represent loss_tgtAnd loss_shrThe weight occupied in the total cost function.

In this embodiment, the default value of loss is 0.01.

The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims

1. A cross-domain Chinese word segmentation system based on new word discovery is characterized by comprising a new word discovery module, an automatic labeling module and a cross-domain word segmentation module which are sequentially connected, wherein,

the new word discovery module is used for extracting unknown words which do not appear in the source field from the target field linguistic data without labels to obtain a new word list of the field;

the automatic labeling module is used for carrying out initial segmentation on the non-labeled corpus by using a reverse maximum matching algorithm based on a new word vocabulary to obtain a corpus which is not completely segmented; completely segmenting the part which is not segmented in the corpus after the initial segmentation by using a GCNN-CRF word segmentation algorithm based on corpus training of the source field, and realizing the automatic segmentation of the corpus of the target field without the label;

the cross-domain word segmentation module trains the confrontation type deep neural network by using the labeled source domain linguistic data and the automatically labeled target domain linguistic data, and converts the cross-domain word segmentation into the in-domain word segmentation to realize the word segmentation of the target domain.

2. The system according to claim 1, wherein the new word discovery module comprises a candidate word extraction sub-module, an enhanced mutual information extraction sub-module, an adjacent entropy extraction sub-module, and a candidate word filtering sub-module, wherein the candidate word extraction sub-module, the enhanced mutual information extraction sub-module, and the candidate word filtering sub-module are connected in sequence, the candidate word extraction sub-module is configured to extract all candidate words from the target domain corpus, the enhanced mutual information extraction sub-module is configured to extract enhanced mutual information of all candidate words, and the candidate word filtering sub-module is configured to filter the candidate words; the candidate word filtering submodule, the adjacent entropy extracting submodule and the candidate word filtering submodule are sequentially connected, and the adjacent entropy extracting submodule is used for extracting the adjacent entropy of all candidate words; and the candidate word extracting submodule is connected with the candidate word filtering submodule.

3. The system according to claim 1, wherein the automatic labeling module comprises a first Chinese word segmentation submodule and a second Chinese word segmentation submodule, the first Chinese word segmentation submodule matches the corpus based on the new word vocabulary, if the matching is successful, segmentation is performed, otherwise segmentation is not performed, and incomplete segmentation of the corpus in the target field is realized; and the second Chinese word segmentation sub-module performs segmentation on the unsingulated corpus in the first Chinese word segmentation sub-module by using a GCNN-CRF word segmentation algorithm based on source field corpus training to realize complete segmentation on the corpus in the target field.

4. The system for Chinese word segmentation across fields based on new word discovery according to claim 1, wherein the module for word segmentation across fields comprises a source field feature extraction submodule, a common feature extraction submodule, a target field feature extraction submodule, a source field lexeme labeling submodule, a text classification submodule and a target field lexeme labeling submodule, wherein the source field feature extraction submodule and the source field lexeme labeling submodule are connected to form a first branch, the source field feature extraction submodule is used for extracting unique features of a corpus of a source field, and the source field lexeme labeling submodule is used for word position labeling of the corpus of the source field; the public feature extraction submodule is respectively connected with the text classification submodule, the source field lexeme labeling submodule and the target field lexeme labeling submodule to form a branch II, the public feature extraction submodule is used for extracting public features of source field linguistic data and target field linguistic data, the text classification submodule is used for judging which field the input comes from, and the target field lexeme labeling submodule is connected and used for performing lexeme labeling on the target field linguistic data; and the target field feature extraction submodule and the target field lexeme labeling submodule are connected to form a branch III, and the target field feature extraction submodule is used for extracting the unique features of the target field corpus.

5. The cross-domain Chinese word segmentation system based on new word discovery as claimed in claim 1, wherein the source domain feature extraction sub-module, the target domain feature extraction sub-module and the common feature extraction sub-module all use GCNN as a feature extractor, GCNN includes 4 CNN layers and 1 activation layer, input vectors enter 4 CNN layers in parallel, feature extraction is performed through CNN to obtain 4 feature vectors, the feature vector of the first CNN layer is input into the activation layer to be activated, the dimension is kept unchanged, numbers in the vectors are limited between 0 and 1 to serve as a weight vector, the vectors obtained by multiplying the weight vector and feature vectors output by the other 3 CNN layers are the final feature vector, and the activation function is sigmoid.

6. A word segmentation method of a cross-domain chinese word segmentation system based on new word discovery according to any one of claims 1 to 5, characterized by implementing word segmentation for linguistic data of different domains by adopting the following steps:

Step S2: and (5) automatically labeling the linguistic data of the non-labeled target field by using an automatic labeling module in combination with the new word list of the field obtained in the step (S).

Step S4: h obtained in step S3_shcAnd H_shrInputting the predicted values into a source field lexeme labeling submodule to predict a source field lexeme label, and performing step S3 on the obtained H_tgtAnd H_shrInputting the predicted target domain word position label into a target domain word position labeling submodule to predict a target domain word position label, and performing the step S3 on the obtained H_shrInput into a text classification sub-module to predict domain labels for the input text.

7. The method as claimed in claim 6, wherein in step S1, the new word discovery module is used to extract a new word list of the domain from the target domain corpus as follows: step S1.1: extracting all candidate words with the length not exceeding n on the domain corpus from the unmarked target domain corpus by using a candidate word extraction submodule;

step S1.6: respectively finding out all left adjacent characters [ L ] of candidate word C from target field linguistic data₁，...L_u...L_H]And the right adjacent word [ R₁，...R_v...R_D]Where H and D represent the number of left and right adjacent words, respectively, and the number of occurrences of each left adjacent word on the left side of the candidate word [ n (L) ]₁)，...n(L_u)...n(L_p)]And the number of occurrences of each right adjacent word to the right of the candidate word [ n (R)₁),...n(R_d)...n(R_q)]The outline of each adjacent word is calculated by the following formulaRate:

score(C)＝sigmoid(β₂*ENMI_C+BE_C)

8. The method of claim 6, wherein in step S2, the automatic labeling module combines the domain new word vocabulary obtained in step S1 and the first chinese word segmentation module and the second chinese word segmentation module to realize automatic labeling of the linguistic data of the target domain without labeling, and the process is as follows:

wherein Y ═ Y₁,y₂,y₃,y₄) Real lexeme labels, y, representing characters_s∈{0，1}，y_sA value of 1 indicates that the character tag is the s-th lexeme tag,

representing the output lexeme labels through the model,

9. The method for Chinese segmentation across fields based on new word discovery as claimed in claim 6, wherein the step S3 is as follows:

Step S3.3: training the target area obtained in step S1Inputting the text of the set into a target field feature extraction submodule, wherein the vector output by the target field feature extraction submodule is the unique target field feature H of the target field corpus_tgt；

10. The method for Chinese segmentation across fields based on new word discovery as claimed in claim 6, wherein the step S4 is as follows: