Disclosure of Invention
The invention aims to overcome defects of the prior art by providing a cross-domain Chinese word segmentation system and method based on new word discovery. The system is divided into three modules: a new word discovery module, an automatic labeling module and a cross-domain word segmentation module. The invention differs from traditional Chinese word segmentation methods in two respects: (1) relevance based on semantic information is added to the traditional new word discovery algorithm based on mutual information and adjacent entropy, forming vector-enhanced mutual information, so that the internal cohesion of a character string can be measured more accurately and the generation of garbage strings is reduced; (2) compared with traditional cross-domain Chinese word segmentation methods trained on a dictionary and a labeled source domain corpus, the method automatically labels the unlabeled target domain corpus on the basis of a domain dictionary, which markedly reduces the unknown word rate of the test corpus; the automatically labeled corpus is then used to train the model, converting cross-domain word segmentation into in-domain word segmentation and markedly reducing the influence of the expression gap between domains on the model. The invention can be widely applied to fields that lack large-scale labeled corpora, such as the medical, scientific, biological and fiction fields.
The first purpose of the invention can be achieved by adopting the following technical scheme:
a cross-domain Chinese word segmentation system based on new word discovery is composed of a new word discovery module, an automatic labeling module and a cross-domain word segmentation module, wherein the three modules are connected in sequence. The above modules are as follows:
(1) The new word discovery module: used for extracting new words that do not appear in the source domain, namely unknown words, from the unlabeled target domain corpus. The new word discovery module is composed of a candidate word extraction submodule, an enhanced mutual information extraction submodule, an adjacent entropy extraction submodule and a candidate word filtering submodule. The candidate word extraction submodule, the enhanced mutual information extraction submodule and the candidate word filtering submodule are connected in sequence: the candidate word extraction submodule extracts all candidate words from the target domain corpus, the enhanced mutual information extraction submodule extracts the enhanced mutual information of all candidate words, and the candidate word filtering submodule filters the candidate words. The candidate word extraction submodule, the adjacent entropy extraction submodule and the candidate word filtering submodule are likewise connected in sequence, the adjacent entropy extraction submodule extracting the adjacent entropy of all candidate words; and the candidate word extraction submodule is connected with the candidate word filtering submodule.
(2) The automatic labeling module: automatically labels the unlabeled target domain corpus using the new word list obtained by the new word discovery module together with a word segmentation algorithm. The automatic labeling module is composed of a first Chinese word segmentation submodule and a second Chinese word segmentation submodule. The first Chinese word segmentation submodule matches the corpus against the new word list; where a match succeeds the text is segmented, otherwise it is left unsegmented, realizing the incomplete segmentation of the target domain corpus. The second Chinese word segmentation submodule segments the text left unsegmented by the first submodule with a GCNN-CRF word segmentation algorithm trained on the source domain corpus, realizing the complete segmentation of the target domain corpus.
(3) The cross-domain word segmentation module: trains an adversarial deep neural network on the labeled source domain corpus and the automatically labeled target domain corpus, converting cross-domain word segmentation into in-domain word segmentation to realize segmentation of the target domain. The cross-domain word segmentation module comprises a source field feature extraction submodule, a common feature extraction submodule, a target field feature extraction submodule, a source field lexeme labeling submodule, a text classification submodule and a target field lexeme labeling submodule. The source field feature extraction submodule and the source field lexeme labeling submodule are connected to form branch one. The common feature extraction submodule is connected with the text classification submodule, the source field lexeme labeling submodule and the target field lexeme labeling submodule respectively to form branch two; the common feature extraction submodule extracts features common to the source domain and target domain corpora, the text classification submodule judges which domain an input comes from, and the target field lexeme labeling submodule performs lexeme labeling on the target domain corpus. The target field feature extraction submodule and the target field lexeme labeling submodule are connected to form branch three, the target field feature extraction submodule extracting the features unique to the target domain corpus.
Furthermore, the source field feature extraction submodule, the target field feature extraction submodule and the common feature extraction submodule all adopt a GCNN as feature extractor. The GCNN comprises 4 CNN layers and 1 activation layer: the input vector enters the 4 CNN layers in parallel and each CNN extracts a feature vector; the feature vector of the first CNN layer is passed through the activation layer, which keeps the dimension unchanged and limits the numbers in the vector to between 0 and 1 so that it serves as a weight vector; the vectors obtained by multiplying this weight vector with the feature vectors output by the other 3 CNN layers are the final feature vectors. The activation function is sigmoid.
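By way of illustration, a minimal PyTorch sketch of such a gated extractor follows. The class name, dimensions and the concatenation of the three gated outputs are illustrative assumptions, since the text does not specify how the gated feature vectors are combined:

```python
import torch
import torch.nn as nn

class GatedConvBlock(nn.Module):
    """Sketch of the GCNN described above: 4 parallel CNN layers; the
    feature vector of the first layer is passed through a sigmoid to give
    a weight vector in (0, 1) that multiplies the other 3 feature vectors.
    Combining the 3 gated vectors by concatenation is an assumption."""
    def __init__(self, in_dim, out_dim, kernel_size=3):
        super().__init__()
        self.gate_conv = nn.Conv1d(in_dim, out_dim, kernel_size, padding="same")
        self.feat_convs = nn.ModuleList(
            nn.Conv1d(in_dim, out_dim, kernel_size, padding="same") for _ in range(3)
        )

    def forward(self, x):  # x: (batch, in_dim, seq_len)
        gate = torch.sigmoid(self.gate_conv(x))            # dims unchanged, values in (0, 1)
        gated = [gate * conv(x) for conv in self.feat_convs]
        return torch.cat(gated, dim=1)                     # final feature vector
```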
The second purpose of the invention can be achieved by adopting the following technical scheme:
a cross-domain Chinese word segmentation method based on new word discovery is used for achieving word segmentation of linguistic data in different domains by adopting the following steps:
Step S1: mining a new word list of the domain from the target domain corpus with the new word discovery module.
Step S2: automatically labeling the unlabeled target domain corpus with the automatic labeling module in combination with the new word list of the domain obtained in step S1.
Step S3: extracting features of the source domain and target domain corpora through the three branches of the cross-domain word segmentation module: branch one uses the source field feature extraction submodule to extract the source domain feature H_src from the source domain corpus; branch two uses the common feature extraction submodule to extract the feature H_shr common to the source domain and target domain corpora; branch three uses the target field feature extraction submodule to extract the target domain feature H_tgt from the target domain corpus;
Step S4: inputting H_src and H_shr obtained in step S3 into the source field lexeme labeling submodule to predict the source domain lexeme labels, inputting H_tgt and H_shr obtained in step S3 into the target field lexeme labeling submodule to predict the target domain lexeme labels, and inputting H_shr obtained in step S3 into the text classification submodule to predict the domain label of the input text.
Further, in step S1, mining the new word list of the domain from the target domain corpus with the new word discovery module includes the following steps:
Step S1.1: extracting, with the candidate word extraction submodule, all candidate words of length not exceeding n from the unlabeled target domain corpus.
Step S1.2: segmenting the candidate word C at an arbitrary position into a front internal segment A and a rear internal segment B, counting the occurrences of C, A and B as n_C, n_A and n_B, and calculating the mutual information MI_C of C as:

$$\mathrm{MI}_C = \log\frac{p(C)}{p(A)\,p(B)}, \qquad p(w) = \frac{n_w}{N},$$

where n_w denotes the number of occurrences of an arbitrary character string w in the corpus and N denotes the total number of counted strings.
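A minimal Python sketch of this computation; the function name and the explicit total count N are illustrative assumptions needed to turn counts into probabilities:

```python
import math

def mutual_information(n_C, n_A, n_B, N):
    """MI_C = log(p(C) / (p(A) * p(B))) estimated from corpus counts.
    n_C, n_A, n_B: occurrence counts of C, A, B; N: total string count."""
    return math.log((n_C / N) / ((n_A / N) * (n_B / N)))
```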
Step S1.3: training a Word2Vec model on the target domain corpus to obtain the character vector vec(c_j) of any character c_j, and calculating the vector Vec_A of the internal segment A and the vector Vec_B of the internal segment B as:

$$\mathrm{Vec}_A = \frac{1}{i}\sum_{j=1}^{i}\mathrm{vec}(c_j), \qquad \mathrm{Vec}_B = \frac{1}{m}\sum_{j=1}^{m}\mathrm{vec}(c_j),$$

where i denotes the number of Chinese characters in A, m denotes the number of Chinese characters in B, a_p and b_q denote the values of Vec_A and Vec_B at positions p and q, and n denotes the vector dimension;
Step S1.4: from the vectors Vec_A and Vec_B of step S1.3, calculating the semantic relevance sim(A, B) of the internal segment A and the internal segment B as the cosine similarity:

$$\mathrm{sim}(A,B) = \frac{\sum_{p=1}^{n} a_p\,b_p}{\sqrt{\sum_{p=1}^{n} a_p^{2}}\;\sqrt{\sum_{q=1}^{n} b_q^{2}}}$$
Step S1.5: according to the mutual information MI_C of step S1.2 and the semantic relevance sim(A, B) of step S1.4, calculating the enhanced mutual information ENMI_C of the candidate word C as:

$$\mathrm{ENMI}_C = \mathrm{MI}_C + \beta_1\cdot\mathrm{sim}(A,B),$$

where β_1 denotes the weight coefficient of the semantic relevance in the enhanced mutual information.
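A sketch of steps S1.3 to S1.5 in Python. The character-vector lookup (e.g. a trained gensim Word2Vec model's key-vector table) and the additive form of the enhancement are assumptions consistent with the weight-coefficient wording above:

```python
import numpy as np

def segment_vector(segment, char_vecs):
    """Vec of an internal segment: average of its character vectors (step S1.3).
    char_vecs maps a character to its trained Word2Vec vector (assumed)."""
    return np.mean([char_vecs[c] for c in segment], axis=0)

def semantic_relevance(vec_a, vec_b):
    """sim(A, B): cosine similarity of the two segment vectors (step S1.4)."""
    return float(vec_a @ vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))

def enhanced_mi(mi_c, sim_ab, beta1=300.0):
    """ENMI_C = MI_C + beta1 * sim(A, B) (step S1.5); beta1 = 300 follows
    the embodiment given later."""
    return mi_c + beta1 * sim_ab
```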
Step S1.6: respectively finding out all left adjacent characters [ L ] of candidate word C from target field linguistic data1,...Lu...LH]And the right adjacent word [ R1,...Rv...RD]Where H and D represent the number of left and right adjacent words, respectively, and the number of occurrences of each left adjacent word on the left side of the candidate word [ n (L) ]1),...n(Lu)...n(Lp)]And the number of occurrences of each right adjacent word to the right of the candidate word [ n (R)1),...n(Rd)...n(Rq)]The probability of each adjacent word is calculated separately using the following formula:
Step S1.7: from the probabilities p(L_u) and p(R_v) of step S1.6, calculating the left adjacent entropy H_l(C) and the right adjacent entropy H_r(C) of the candidate word C as:

$$H_l(C) = -\sum_{u=1}^{H} p(L_u)\log p(L_u), \qquad H_r(C) = -\sum_{v=1}^{D} p(R_v)\log p(R_v)$$
Step S1.8: combining the left adjacent entropy H_l(C) and the right adjacent entropy H_r(C) of step S1.7 with a certain weight to obtain the adjacency entropy BE_C of the candidate word C;
Step S1.9: according to the enhanced mutual information ENMI_C of step S1.5 and the adjacency entropy BE_C of step S1.8, calculating the overall score of the candidate word C as:

score(C) = sigmoid(β_2 · ENMI_C + BE_C),

where β_2 denotes the weight of the enhanced mutual information in the overall score and sigmoid denotes normalization, calculated as:

$$\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}$$
Step S1.10: setting a candidate word score threshold and comparing the overall score(C) of step S1.9 with the threshold; if score(C) is greater than the threshold the candidate word is considered a reasonable word, otherwise it is removed from the candidate word list, yielding the new word list.
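The adjacent-entropy and scoring steps (S1.6 to S1.10) can be sketched as below; since only "a certain weight" is specified for combining the two entropies, an equal-weight sum is assumed:

```python
import math

def adjacent_entropy(neighbor_counts):
    """Entropy of the left (or right) neighbor distribution (steps S1.6-S1.7)."""
    total = sum(neighbor_counts)
    return -sum((c / total) * math.log(c / total) for c in neighbor_counts)

def adjacency_entropy(h_left, h_right, w=0.5):
    """BE_C as a weighted combination of the two entropies (step S1.8);
    the exact weighting is not given, so equal weights are assumed."""
    return w * h_left + (1.0 - w) * h_right

def overall_score(enmi, be, beta2=60.0):
    """score(C) = sigmoid(beta2 * ENMI_C + BE_C); beta2 = 60 per the embodiment (S1.9)."""
    return 1.0 / (1.0 + math.exp(-(beta2 * enmi + be)))

def filter_candidates(scores, threshold=0.9):
    """Keep candidates scoring above the threshold (step S1.10)."""
    return [word for word, s in scores.items() if s > threshold]
```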
Further, in step S2, the automatic labeling module combines the new word list of the domain obtained in step S1 with the first Chinese word segmentation submodule and the second Chinese word segmentation submodule to automatically label the unlabeled target domain corpus, as follows:
Step S2.1: the first Chinese word segmentation submodule adopts a reverse maximum matching algorithm with a maximum matching length N. Starting from the last character of a sentence, a character string of length N is taken and looked up in the new word list; if it is found, it is segmented off and the current position moves N characters to the left; if not, the matching length is reduced by 1 and matching continues; if matching still fails once the length has been reduced to 1, the current position moves one character to the left. Matching continues in this way until the whole sentence has been processed, realizing the preliminary segmentation of the target corpus;
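A Python sketch of this reverse maximum matching; the function name and the output format (a list of segments, with unmatched characters emitted one by one for the second segmenter to handle) are illustrative assumptions:

```python
def backward_max_match(sentence, vocab, max_len=6):
    """Reverse maximum matching (step S2.1): scan from the sentence end,
    try the longest substring first, shrink on failure; single characters
    that never match are emitted on their own (left for step S2.5)."""
    result = []
    pos = len(sentence)
    while pos > 0:
        for length in range(min(max_len, pos), 0, -1):
            piece = sentence[pos - length:pos]
            if length == 1 or piece in vocab:
                result.append(piece)
                pos -= length
                break
    return list(reversed(result))
```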
Step S2.2: segmenting the labeled source domain corpus into words by the spaces, and marking each character in a word with a lexeme label according to the word's length; the lexeme labels comprise B, M, E and S, where B denotes the starting character of a multi-character word, M a middle character of a multi-character word, E the ending character of a multi-character word, and S a character that forms a word on its own. The input text serves as input and the lexeme labels as output to construct a training data set;
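The BMES labeling of step S2.2 can be sketched as follows (the helper name is an assumption):

```python
def bmes_labels(words):
    """Map space-separated words to per-character lexeme labels:
    S for a single-character word, otherwise B + M*(len-2) + E."""
    labels = []
    for w in words:
        if len(w) == 1:
            labels.append("S")
        else:
            labels.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return labels

# e.g. bmes_labels(["ABC", "D"]) -> ["B", "M", "E", "S"]
```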
Step S2.3: the second Chinese word segmentation submodule calculates the cost function as:

$$\mathrm{loss}_1 = -\sum_{s=1}^{4} y_s \log \hat{y}_s,$$

where Y = (y_1, y_2, y_3, y_4) denotes the true lexeme label of a character, y_s ∈ {0, 1}, y_s = 1 indicating that the character's tag is the s-th lexeme tag; Ŷ = (ŷ_1, ŷ_2, ŷ_3, ŷ_4) denotes the lexeme label output by the model, ŷ_s denoting the probability predicted by the model that the character's tag is the s-th lexeme tag;
Step S2.4: training data is input into the model for training until loss_1 is less than a preset value;
Step S2.5: segmenting the parts left unsegmented in the incompletely segmented corpus obtained by the first Chinese word segmentation submodule with the second Chinese word segmentation submodule trained in step S2.4, obtaining the completely segmented, automatically labeled target domain corpus.
Further, the step S3 process is as follows:
Step S3.1: segmenting the labeled source domain corpus and the automatically labeled target domain corpus into words, and marking each character with a lexeme label B, M, E or S according to the length of its word, where B denotes the starting character of a multi-character word, M a middle character, E the ending character, and S a character forming a word on its own; the corpus serves as input and the lexeme labels as output to construct a source domain training set and a target domain training set respectively;
Step S3.2: inputting the text of the source domain training set obtained in step S3.1 into the source field feature extraction submodule; the vector it outputs is the source domain feature H_src unique to the source domain corpus.
Step S3.3: inputting the text of the target domain training set obtained in step S3.1 into the target field feature extraction submodule; the vector it outputs is the target domain feature H_tgt unique to the target domain corpus.
Step S3.4: inputting the texts of the source domain and target domain training sets obtained in step S3.1 in turn into the common feature extraction submodule; the vector it outputs is the feature H_shr common to the source domain and the target domain.
Further, the step S4 process is as follows:
Step S4.1: splicing the source domain feature H_src and the target domain feature H_tgt each with the common feature H_shr to obtain the general source domain feature H̃_src and the general target domain feature H̃_tgt:

$$\tilde{H}_{src} = [H_{src}; H_{shr}], \qquad \tilde{H}_{tgt} = [H_{tgt}; H_{shr}]$$
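In PyTorch terms, this splicing is a concatenation along the feature dimension. A sketch with assumed tensor shapes:

```python
import torch

batch, seq_len, hidden = 8, 32, 200          # illustrative sizes
H_src = torch.randn(batch, seq_len, hidden)  # source-specific features
H_tgt = torch.randn(batch, seq_len, hidden)  # target-specific features
H_shr = torch.randn(batch, seq_len, hidden)  # shared features

H_src_general = torch.cat([H_src, H_shr], dim=-1)  # spliced general source features
H_tgt_general = torch.cat([H_tgt, H_shr], dim=-1)  # spliced general target features
```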
Step S4.2: the cost functions of the source field lexeme labeling submodule, the target field lexeme labeling submodule and the text classification submodule are loss respectivelysrc、losstgtAnd lossshrThe following method is adopted for calculation:
wherein Y ═ Y
1,y
2,y
3,y
4) Real lexeme labels, y, representing characters
t∈{0,1},y
tWhen the value is 1, the character label is the t-th lexeme label, y
tA value of 0 indicates that the character tag is not the tth lexeme tag,
representing the output lexeme labels through the model,
representing the probability that the model predicts the character tag belongs to the t-th lexeme tag; y ═ Y'
1,y′
2,y′
3,y′
4) Real lexeme tag, y 'representing a character'
k∈{0,1},y′
kWhen the number is 1, the character label is the kth lexeme label, y'
kA value of 0 indicates that the character tag is not the kth lexeme tag,
representing output lexemes through a modelThe number of the labels is such that,
representing the probability that the model predicts that the character tag belongs to the ith lexeme tag; y ″ (Y ″)
1,y″
2) Real world tag, y ″, representing an input sample
l∈{0,1},y″
lA sample of 1 indicates that the sample is from the source domain, a sample of 0 indicates that the sample is from the target domain,
a sample domain label representing the output through the model,
representing the probability that the sample of the model output belongs to the l-th domain;
Step S4.3: performing adversarial training with the source field lexeme labeling submodule, the target field lexeme labeling submodule and the text classification submodule to realize cross-domain Chinese word segmentation from the source domain to the target domain, the total cost function being:
loss = loss_src + β_3 · loss_tgt + β_4 · loss_shr,

where β_3 and β_4 denote the weights of loss_tgt and loss_shr in the total cost function respectively;
step S4.4: the model is trained until the variation of loss is less than a predetermined value.
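A sketch of one training step with the total cost of step S4.3. The weight values and the optimizer are illustrative assumptions; the adversarial signal enters through loss_shr from the text classification submodule:

```python
import torch

def training_step(loss_src, loss_tgt, loss_shr, optimizer, beta3=0.5, beta4=0.1):
    """Combine the three branch losses with the weights of step S4.3 and
    update the model; beta3/beta4 values here are assumptions."""
    loss = loss_src + beta3 * loss_tgt + beta4 * loss_shr
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```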
Compared with the prior art, the invention has the following advantages and effects:
1. The invention improves the new word discovery algorithm with vector-enhanced mutual information, effectively combining the statistical and semantic information in the corpus; this not only markedly improves the accuracy of new word discovery and reduces garbage strings in the new word list, but also markedly strengthens the domain specificity of the new word list.
2. The invention uses the new word discovery algorithm to extract from the unlabeled target domain corpus the list of words that are new relative to the source domain corpus, also called the unknown word list. Automatically labeling the unlabeled target domain corpus on the basis of this new word list markedly reduces the unknown word rate of the test corpus relative to the training corpus.
3. The method trains the Chinese word segmentation algorithm on the labeled source domain corpus and the automatically labeled target domain corpus using adversarial training, reducing the influence of noisy samples in the automatically labeled corpus on model training and improving the cross-domain Chinese word segmentation result over the same algorithm trained without adversarial training.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one
A structural block diagram of the cross-domain Chinese word segmentation system based on new word discovery disclosed in this embodiment is shown in fig. 1. It is composed of a new word discovery module, an automatic labeling module and a cross-domain word segmentation module connected in sequence, respectively used for mining new words from the unlabeled target domain corpus, automatically labeling the unlabeled target domain corpus, and training a neural network for cross-domain Chinese word segmentation.
A structural block diagram of a new word discovery module in this embodiment is shown in fig. 2, and the new word discovery module is composed of a candidate word extraction sub-module, an enhanced mutual information extraction sub-module, an adjacent entropy extraction sub-module, and a candidate word filtering sub-module, where the candidate word extraction sub-module, the enhanced mutual information extraction sub-module, and the candidate word filtering sub-module are connected in sequence, the candidate word extraction sub-module is used to extract all candidate words from a corpus of a target field, the enhanced mutual information extraction sub-module is used to extract enhanced mutual information of all candidate words, and the candidate word filtering sub-module is used to filter candidate words; the candidate word extraction sub-module, the adjacent entropy extraction sub-module and the candidate word filtering sub-module are sequentially connected, and the adjacent entropy extraction sub-module is used for extracting the adjacent entropy of all candidate words; and the candidate word extracting submodule is connected with the candidate word filtering submodule.
In the above embodiment, the candidate word extraction sub-module extracts all candidate words in the corpus from the unmarked target domain corpus; the enhanced mutual information extraction sub-module considers the semantic information of the candidate words on the basis of the mutual information, and adds the semantic similarity of the candidate words into the mutual information to calculate the enhanced mutual information of the candidate words; on the basis of the existing calculation of the adjacent entropy, the adjacent entropy extraction sub-module fully considers the left and right adjacent entropies of the candidate word, gives certain weight to the left and right adjacent entropies, and calculates and obtains the adjacent entropies of all the candidate words by utilizing the information contained in the left and right adjacent entropies to the maximum extent; and the candidate word filtering submodule adds the enhanced mutual information and the adjacent entropy of the candidate word according to a certain weight, balances the importance of the enhanced mutual information and the adjacent entropy to obtain the final score of the candidate word, and is used for filtering the candidate word to obtain a new word list of the corpus.
A structural block diagram of the automatic labeling module in this embodiment is shown in fig. 3. It is composed of a first Chinese word segmentation sub-module and a second Chinese word segmentation sub-module, where the first sub-module performs the incomplete segmentation of the target domain corpus and the second sub-module completely segments the incompletely segmented corpus.
In the above embodiment, the automatic labeling module is composed of a first Chinese word segmentation sub-module and a second Chinese word segmentation sub-module. The first sub-module matches the corpus against the new word list; where a match succeeds the text is segmented, otherwise it is left unsegmented, realizing the incomplete segmentation of the target domain corpus. The second sub-module segments the text left unsegmented by the first sub-module with a GCNN-CRF word segmentation algorithm trained on the source domain corpus, realizing the complete segmentation of the target domain corpus. The module thus realizes automatic labeling of the unlabeled corpus in two steps, labeling the unlabeled target domain corpus better by virtue of the new word list and improving labeling accuracy.
A structural block diagram of the cross-domain word segmentation module in this embodiment is shown in fig. 4. It is composed of a source field feature extraction sub-module, a common feature extraction sub-module, a target field feature extraction sub-module, a source field lexeme labeling sub-module, a text classification sub-module and a target field lexeme labeling sub-module. The source field feature extraction sub-module is connected with the source field lexeme labeling sub-module; the former extracts the features unique to the source domain corpus and the latter performs lexeme labeling on the source domain corpus. The common feature extraction sub-module is connected with the text classification sub-module, the source field lexeme labeling sub-module and the target field lexeme labeling sub-module respectively; the common feature extraction sub-module extracts the features common to the source domain and target domain corpora, the text classification sub-module judges which domain the input comes from, and the target field lexeme labeling sub-module performs lexeme labeling on the target domain corpus. The target field feature extraction sub-module is connected with the target field lexeme labeling sub-module and extracts the features unique to the target domain corpus.
In the above embodiment, the cross-domain word segmentation module introduces adversarial training and comprises three branches: the source field feature extraction submodule and the source field lexeme labeling submodule are connected to form branch one; the common feature extraction submodule is connected with the text classification submodule, the source field lexeme labeling submodule and the target field lexeme labeling submodule respectively to form branch two; and the target field feature extraction submodule and the target field lexeme labeling submodule are connected to form branch three. By weighting the automatically labeled corpus appropriately during model training, the cross-domain word segmentation module effectively reduces the influence of noise in the automatically labeled corpus on model training and improves the segmentation performance of the model.
Example two
This embodiment provides a cross-domain Chinese word segmentation method based on the above cross-domain Chinese word segmentation system based on new word discovery. The method realizes segmentation of corpora in different domains through the following steps:
Step S1: mining a new word list of the domain from the target domain corpus with the new word discovery module. Step S1 includes the following steps:
Step S1.1: extracting, with the candidate word extraction submodule, all candidate words of length not exceeding n from the unlabeled target domain corpus.
In this embodiment, the candidate word extraction sub-module segments the corpus at non-Chinese characters, sets the maximum candidate word length to 6, extracts all candidate words of length not exceeding 6 from the sentences of the segmented corpus, stores them in the candidate word set, counts the occurrences of each candidate word in the corpus, and stores the counts in the dictionary D.
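A Python sketch of this extraction under the embodiment's settings; the regular expression for "non-Chinese characters" (the CJK Unified Ideographs range) is an assumption:

```python
import re
from collections import Counter

def extract_candidates(corpus, max_len=6):
    """Split the raw corpus at non-Chinese characters, then collect every
    substring of length <= max_len with its occurrence count (dictionary D)."""
    D = Counter()
    for sentence in re.split(r"[^\u4e00-\u9fff]+", corpus):
        for i in range(len(sentence)):
            for j in range(i + 1, min(i + max_len, len(sentence)) + 1):
                D[sentence[i:j]] += 1
    return D
```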
Step S1.2: segmenting the candidate word C at an arbitrary position into a front internal segment A and a rear internal segment B, counting the occurrences of C, A and B as n_C, n_A and n_B, and calculating the mutual information MI_C of C as:

$$\mathrm{MI}_C = \log\frac{p(C)}{p(A)\,p(B)}, \qquad p(w) = \frac{n_w}{N},$$

where n_w denotes the number of occurrences of an arbitrary character string w in the corpus and N denotes the total number of counted strings;
In this embodiment, the minimum length of both A and B is 1.
Step S1.3: training a Word2Vec model on the target domain corpus to obtain the character vector vec(c_j) of any character c_j, and calculating the vector Vec_A of the internal segment A and the vector Vec_B of the internal segment B as:

$$\mathrm{Vec}_A = \frac{1}{i}\sum_{j=1}^{i}\mathrm{vec}(c_j), \qquad \mathrm{Vec}_B = \frac{1}{m}\sum_{j=1}^{m}\mathrm{vec}(c_j),$$

where i denotes the number of Chinese characters in A, m denotes the number of Chinese characters in B, a_p and b_q denote the values of Vec_A and Vec_B at positions p and q, and n denotes the vector dimension;
Step S1.4: from the vectors Vec_A and Vec_B of step S1.3, calculating the semantic relevance sim(A, B) of the internal segment A and the internal segment B as the cosine similarity:

$$\mathrm{sim}(A,B) = \frac{\sum_{p=1}^{n} a_p\,b_p}{\sqrt{\sum_{p=1}^{n} a_p^{2}}\;\sqrt{\sum_{q=1}^{n} b_q^{2}}}$$
Step S1.5: according to the mutual information MI_C of step S1.2 and the semantic relevance sim(A, B) of step S1.4, calculating the enhanced mutual information ENMI_C of the candidate word C as:

$$\mathrm{ENMI}_C = \mathrm{MI}_C + \beta_1\cdot\mathrm{sim}(A,B),$$

where β_1 denotes the weight coefficient of the semantic relevance in the enhanced mutual information.
In the above embodiment, the weight coefficient β_1 is 300.
Step S1.6: finding from the target domain corpus all left adjacent characters [L_1, ..., L_u, ..., L_H] and all right adjacent characters [R_1, ..., R_v, ..., R_D] of the candidate word C, where H and D denote the numbers of left and right adjacent characters respectively, together with the number of occurrences of each left adjacent character to the left of the candidate word, [n(L_1), ..., n(L_u), ..., n(L_H)], and the number of occurrences of each right adjacent character to the right of the candidate word, [n(R_1), ..., n(R_v), ..., n(R_D)], and calculating the probability of each adjacent character as:

$$p(L_u) = \frac{n(L_u)}{\sum_{u=1}^{H} n(L_u)}, \qquad p(R_v) = \frac{n(R_v)}{\sum_{v=1}^{D} n(R_v)}$$
Step S1.7: from the probabilities p(L_u) and p(R_v) of step S1.6, calculating the left adjacent entropy H_l(C) and the right adjacent entropy H_r(C) of the candidate word C as:

$$H_l(C) = -\sum_{u=1}^{H} p(L_u)\log p(L_u), \qquad H_r(C) = -\sum_{v=1}^{D} p(R_v)\log p(R_v)$$
Step S1.8: combining the left adjacent entropy H_l(C) and the right adjacent entropy H_r(C) of step S1.7 with a certain weight to obtain the adjacency entropy BE_C of the candidate word C;
Step S1.9: according to the enhanced mutual information ENMI_C of step S1.5 and the adjacency entropy BE_C of step S1.8, calculating the overall score of the candidate word C as:

score(C) = sigmoid(β_2 · ENMI_C + BE_C),

where β_2 denotes the weight of the enhanced mutual information in the overall score and sigmoid denotes normalization, calculated as:

$$\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}$$

In this embodiment, the weight β_2 of the enhanced mutual information in the overall score is 60.
Step S1.10: setting a candidate word score threshold and comparing the overall score(C) of step S1.9 with the threshold; if score(C) is greater than the threshold the candidate word is considered a reasonable word, otherwise it is removed from the candidate word list, finally yielding the new word list.
In the present embodiment, the score threshold is set to 0.9.
Step S2: automatically labeling the unlabeled target domain corpus with the automatic labeling module in combination with the new word list of the domain obtained in step S1.
In step S2, the new word list of the domain obtained in step S1 is combined with the first Chinese word segmentation submodule and the second Chinese word segmentation submodule to automatically label the unlabeled target domain corpus, through the following steps:
Step S2.1: the first Chinese word segmentation submodule adopts a reverse maximum matching algorithm and sets a maximum matching length N. Starting from the last character of a sentence, a character string of length N is taken and looked up in the new word list; if it is found, it is segmented off and the current position moves N characters to the left; if not, the matching length is reduced by 1 and matching continues; if matching still fails once the length has been reduced to 1, the current position moves one character to the left. Matching continues in this way until the whole sentence has been processed, realizing the preliminary segmentation of the target corpus.
In the above embodiment, the maximum matching length N is set to 6 (typically set to the maximum length of an entry in a vocabulary).
Step S2.2: dividing the labeled source domain corpus into words by the spaces, and marking each character in a word with a lexeme label B, M, E or S according to the word's length, where B denotes the starting character of a multi-character word, M a middle character, E the ending character, and S a character forming a word on its own; the input text serves as input and the lexeme labels as output to construct a training data set.
Step S2.3: the second Chinese word segmentation submodule calculates the cost function as:

$$\mathrm{loss}_1 = -\sum_{s=1}^{4} y_s \log \hat{y}_s,$$

where Y = (y_1, y_2, y_3, y_4) denotes the true lexeme label of a character, y_s ∈ {0, 1}, y_s = 1 indicating that the character's tag is the s-th lexeme tag; Ŷ = (ŷ_1, ŷ_2, ŷ_3, ŷ_4) denotes the lexeme label output by the model, ŷ_s denoting the probability predicted by the model that the character's tag is the s-th lexeme tag;
in the above embodiment, a structural block diagram of the second chinese word segmentation module is shown in fig. 5, and includes 3 GCNN layers, 1 fully-connected layer, and 1 CRF layer. The input text vector becomes a feature vector with dimension of 200 after passing through 3 GCNNs, the feature vector becomes an output vector with dimension of 4 after passing through a full connection layer, and the output vector obtains a lexeme label of the character after passing through CRF. The size of the convolution kernel of the GCNN1 is 3, the number of the convolution kernels is 200, the size of the convolution kernel of the GCNN2 is 4, the number of the convolution kernels is 200, the size of the convolution kernel of the GCNN3 is 5, the number of the convolution kernels is 200, and the number of nodes of the full connection layer is 4.
Step S2.4: training data is input into the model for training until loss_1 is less than a preset value.
In the above embodiment, the preset value is set to 0.01.
Step S2.5: segmenting the parts left unsegmented in the incompletely segmented corpus obtained by the first Chinese word segmentation submodule with the second Chinese word segmentation submodule of step S2.4, obtaining the completely segmented, automatically labeled target domain corpus.
Step S3: extracting features of the source domain and target domain corpora through the three branches of the cross-domain word segmentation module: branch one uses the source field feature extraction submodule to extract the source domain feature H_src from the source domain corpus; branch two uses the common feature extraction submodule to extract the feature H_shr common to the source domain and target domain corpora; branch three uses the target field feature extraction submodule to extract the target domain feature H_tgt from the target domain corpus.
The implementation of the above step S3 includes the following steps:
Step S3.1: segmenting the labeled source domain corpus and the automatically labeled target domain corpus into words, and marking each character with a lexeme label B, M, E or S according to the length of its word, where B denotes the starting character of a multi-character word, M a middle character, E the ending character, and S a character forming a word on its own; the corpus serves as input and the lexeme labels as output to construct a source domain training set and a target domain training set respectively;
Step S3.2: inputting the text of the source domain training set obtained in step S3.1 into the source field feature extraction submodule; the vector it outputs is the source domain feature H_src unique to the source domain corpus.
Step S3.3: inputting the text of the target domain training set obtained in step S3.1 into the target field feature extraction submodule; the vector it outputs is the target domain feature H_tgt unique to the target domain corpus.
Step S3.4: inputting the texts of the source domain and target domain training sets obtained in step S3.1 in turn into the common feature extraction submodule; the vector it outputs is the feature H_shr common to the source domain and the target domain.
The source field feature extraction submodule, the target field feature extraction submodule and the common feature extraction submodule are identical in structure, each comprising 3 GCNN layers and 1 activation layer. The input text vector passes through the 3 GCNN layers in turn to become a feature vector of dimension 200; after the activation layer the dimension remains unchanged and every number in the vector lies between 0 and 1. The convolution kernels of the 3 GCNN layers have sizes 3, 4 and 5, each with 200 kernels, and the activation function of the activation layer is sigmoid.
Step S4: h obtained in step S3srcAnd HshrInputting the predicted values into a source field lexeme labeling submodule to predict a source field lexeme label, and performing step S3 on the obtained HtgtAnd HshrInputting the predicted target domain word position label into a target domain word position labeling submodule to predict a target domain word position label, and performing the step S3 on the obtained HshrInput into a text classification sub-module to predict domain labels for the input text.
In the step S4, the label prediction using the feature includes the steps of:
Step S4.1: splicing the source domain feature H_src and the target domain feature H_tgt each with the common feature H_shr to obtain the general source domain feature H̃_src and the general target domain feature H̃_tgt:

$$\tilde{H}_{src} = [H_{src}; H_{shr}], \qquad \tilde{H}_{tgt} = [H_{tgt}; H_{shr}]$$
Step S4.2: the cost functions of the source field lexeme labeling submodule, the target field lexeme labeling submodule and the text classification submodule, denoted loss_src, loss_tgt and loss_shr respectively, are calculated as:

$$\mathrm{loss}_{src} = -\sum_{t=1}^{4} y_t \log \hat{y}_t, \qquad \mathrm{loss}_{tgt} = -\sum_{k=1}^{4} y'_k \log \hat{y}'_k, \qquad \mathrm{loss}_{shr} = -\sum_{l=1}^{2} y''_l \log \hat{y}''_l,$$

where Y = (y_1, y_2, y_3, y_4) denotes the true lexeme label of a character, y_t ∈ {0, 1}, y_t = 1 indicating that the character's tag is the t-th lexeme tag and y_t = 0 that it is not, and ŷ_t denotes the probability predicted by the model that the character's tag is the t-th lexeme tag; Y' = (y'_1, y'_2, y'_3, y'_4) denotes the true lexeme label of a character, y'_k ∈ {0, 1}, y'_k = 1 indicating that the character's tag is the k-th lexeme tag and y'_k = 0 that it is not, and ŷ'_k denotes the probability predicted by the model that the character's tag is the k-th lexeme tag; Y'' = (y''_1, y''_2) denotes the true domain label of an input sample, y''_l ∈ {0, 1}, y''_l = 1 indicating that the sample comes from the source domain and y''_l = 0 that it comes from the target domain, and ŷ''_l denotes the probability output by the model that the sample belongs to the l-th domain.
In the above embodiment, a structural block diagram of the text classification sub-module is shown in fig. 6; it comprises 3 convolutional layers, 3 pooling layers, a splicing layer and a fully-connected layer. The input text vector passes through the three parallel convolutional layers to obtain three feature vectors, each of which is input into a pooling layer; the three pooled vectors are spliced in the splicing layer, and the spliced vector is finally input into the fully-connected layer. The convolution kernels of CNN1, CNN2 and CNN3 have sizes 3, 4 and 5 respectively, each with 200 kernels; the pooling layers use max pooling, and the fully-connected layer has 2 nodes.
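A PyTorch sketch of this domain discriminator; the class name and the max-over-time pooling reading of "maximum pooling" are illustrative assumptions:

```python
import torch
import torch.nn as nn

class TextClassifier(nn.Module):
    """Sketch of the text classification sub-module of fig. 6: three parallel
    conv layers (kernel sizes 3, 4, 5; 200 kernels each), max pooling over
    time, splicing, and a 2-way fully-connected layer (source vs. target)."""
    def __init__(self, in_dim=200, channels=200):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(in_dim, channels, k, padding="same") for k in (3, 4, 5)
        )
        self.fc = nn.Linear(3 * channels, 2)

    def forward(self, h_shr):                      # h_shr: (batch, seq_len, in_dim)
        x = h_shr.transpose(1, 2)                  # (batch, in_dim, seq_len)
        pooled = [conv(x).max(dim=-1).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=-1))  # (batch, 2) domain logits
```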
Step S4.3: performing adversarial training with the source field lexeme labeling submodule, the target field lexeme labeling submodule and the text classification submodule to realize cross-domain Chinese word segmentation from the source domain to the target domain, the total cost function being:
loss = loss_src + β_3 · loss_tgt + β_4 · loss_shr,

where β_3 and β_4 denote the weights of loss_tgt and loss_shr in the total cost function respectively.
Step S4.4: the model is trained until the variation of loss is less than a predetermined value.
In this embodiment, the predetermined value for the variation of loss is 0.01.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.