
CN113076750A - Cross-domain Chinese word segmentation system and method based on new word discovery - Google Patents


Info

Publication number
CN113076750A
CN113076750A
Authority
CN
China
Prior art keywords
word
corpus
submodule
module
lexeme
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110463683.4A
Other languages
Chinese (zh)
Other versions
CN113076750B (en)
Inventor
张军
李�学
宁更新
杨萃
冯义志
余华
陈芳炯
季飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202110463683.4A priority Critical patent/CN113076750B/en
Publication of CN113076750A publication Critical patent/CN113076750A/en
Application granted granted Critical
Publication of CN113076750B publication Critical patent/CN113076750B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30 Semantic analysis
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a cross-domain Chinese word segmentation system and method based on new word discovery. The system comprises: a new word discovery module, which implements a new word discovery algorithm using enhanced mutual information that combines statistical and semantic information, and mines a new word list from an unlabeled corpus; an automatic labeling module, which uses the new word list together with a reverse maximum matching algorithm to perform an initial, incomplete segmentation of the unlabeled corpus, and then applies a word segmentation model to complete the segmentation, yielding an automatically labeled corpus; and a cross-domain word segmentation module, which implements a cross-domain Chinese word segmentation algorithm with an adversarial method, trained adversarially on the labeled source-domain corpus and the automatically labeled corpus. By using enhanced mutual information, the invention optimizes the new word discovery algorithm, improving the accuracy of new word discovery and the domain specificity of the word list; in the cross-domain word segmentation algorithm it improves the utilization of unlabeled corpora and optimizes the recall and accuracy of segmentation.

Description

Cross-domain Chinese word segmentation system and method based on new word discovery
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a cross-domain Chinese word segmentation system and method based on new word discovery.
Background
Chinese text takes the Chinese character as its minimum writing unit; characters combine to form words, and words form the text. Words are the smallest structural units of Chinese text that carry semantic information and can be used independently. Unlike languages such as English, however, there are no explicit separators between Chinese words, so the text must be divided into words by some technical method before a computer can understand it; this process is Chinese word segmentation. Chinese word segmentation is the most basic task in Chinese natural language processing and is the cornerstone of tasks such as text classification, text generation, and sentiment analysis. The quality of the segmentation result therefore directly affects the results of downstream tasks.
Traditional Chinese word segmentation methods fall into two main types: mechanical (dictionary-based) methods and statistics-based methods. Mechanical methods segment text against an existing dictionary combined with hand-crafted rules, and their ability to recognize unknown words (Out-Of-Vocabulary, OOV: words that do not appear in the segmentation dictionary) is very low. Statistics-based methods are limited to context within a very small window, cannot capture global features, and recognize unknown words equally poorly. The accuracy and recall of both are therefore poor, and neither serves as a practical segmentation technique today. With the development of deep learning, applying deep learning to Chinese word segmentation has become a research hotspot. Existing neural network models treat segmentation as a sequence labeling problem; the model is trained on a manually labeled dataset, requires no Chinese dictionary, hand-built rules, or hand-designed feature templates, and achieves accuracy and recall far higher than traditional methods, making it the current mainstream approach. A model trained on large-scale manually labeled data (called the source domain) performs well on segmentation within that domain, but its performance drops sharply when segmenting a corpus from another field (the target domain). Chinese word segmentation in which the source and target domains differ is called cross-domain Chinese word segmentation.
The effect of cross-domain Chinese word segmentation is limited mainly by two problems between the target and source domains: expression gaps and unknown words. An expression gap means the same text is segmented differently in different domains; unknown words are words that appear only in the target domain and not in the source domain, chiefly person names, place names, and technical terms. The most direct remedy for the expression gap is to manually label target-domain corpora and retrain the model on a mixture of both domains, but large-scale manual labeling consumes enormous manpower and resources, and labeling every domain is impossible, so this approach is not practically feasible. The most direct remedy for unknown words is to have experts extract, from the target-domain corpus, words absent from the source-domain corpus and add them to the training data; however, selecting unknown words likewise consumes great effort, and since new words emerge endlessly, no manual effort can select them all. Cross-domain Chinese word segmentation has therefore long struggled to achieve good results.
Disclosure of Invention
The invention aims to remedy the defects of the prior art and provides a cross-domain Chinese word segmentation system and method based on new word discovery. The system is divided into three modules: a new word discovery module, an automatic labeling module, and a cross-domain word segmentation module. The invention differs from traditional Chinese word segmentation methods in that: (1) relevance based on semantic information is added to the traditional new word discovery algorithm based on mutual information and adjacency entropy, forming vector-enhanced mutual information, which measures the internal cohesion of a string more accurately and reduces the generation of junk strings; (2) compared with traditional cross-domain methods trained on a dictionary and labeled source-domain corpora, the invention automatically labels the unlabeled target-domain corpus based on the domain dictionary, which markedly reduces the unknown word rate of the test corpus; the automatically labeled corpus is then used to train the model, converting cross-domain segmentation into in-domain segmentation and markedly reducing the influence of the expression gap on the model. The invention can be widely applied to fields lacking large-scale labeled corpora, such as the medical, scientific, biological, and fiction fields.
The first purpose of the invention can be achieved by adopting the following technical scheme:
a cross-domain Chinese word segmentation system based on new word discovery is composed of a new word discovery module, an automatic labeling module and a cross-domain word segmentation module, wherein the three modules are connected in sequence. The above modules are as follows:
(1) The new word discovery module: used for extracting new words that do not appear in the source domain, i.e. unknown words, from the unlabeled target-domain corpus. The new word discovery module is composed of a candidate word extraction submodule, an enhanced mutual information extraction submodule, an adjacency entropy extraction submodule, and a candidate word filtering submodule. The candidate word extraction submodule, the enhanced mutual information extraction submodule, and the candidate word filtering submodule are connected in sequence: the candidate word extraction submodule extracts all candidate words from the target-domain corpus, the enhanced mutual information extraction submodule extracts the enhanced mutual information of all candidate words, and the candidate word filtering submodule filters the candidates. The candidate word extraction submodule, the adjacency entropy extraction submodule, and the candidate word filtering submodule are likewise connected in sequence, the adjacency entropy extraction submodule extracting the adjacency entropy of all candidate words. The candidate word extraction submodule is also connected directly to the candidate word filtering submodule.
(2) The automatic labeling module: automatically labels the unlabeled target-domain corpus using the new word list obtained by the new word discovery module together with a word segmentation algorithm. The module is composed of a first and a second Chinese word segmentation submodule. The first submodule matches the corpus against the new word list; where a match succeeds the text is segmented, otherwise it is left unsegmented, producing an incomplete segmentation of the target-domain corpus. The second submodule segments the still-unsegmented parts from the first submodule using a GCNN-CRF word segmentation algorithm trained on the source-domain corpus, completing the segmentation of the target-domain corpus.
(3) The cross-domain word segmentation module: trains an adversarial deep neural network with the labeled source-domain corpus and the automatically labeled target-domain corpus, converting cross-domain segmentation into in-domain segmentation to realize segmentation of the target domain. The module comprises a source-domain feature extraction submodule, a common feature extraction submodule, a target-domain feature extraction submodule, a source-domain lexeme labeling submodule, a text classification submodule, and a target-domain lexeme labeling submodule. The source-domain feature extraction submodule and the source-domain lexeme labeling submodule are connected to form branch I. The common feature extraction submodule is connected to the text classification submodule, the source-domain lexeme labeling submodule, and the target-domain lexeme labeling submodule to form branch II; the common feature extraction submodule extracts the features common to the source-domain and target-domain corpora, the text classification submodule judges which domain an input comes from, and the target-domain lexeme labeling submodule performs lexeme labeling on the target-domain corpus. The target-domain feature extraction submodule and the target-domain lexeme labeling submodule are connected to form branch III; the target-domain feature extraction submodule extracts the features unique to the target-domain corpus.
Furthermore, the source-domain, target-domain, and common feature extraction submodules all adopt a GCNN as the feature extractor. The GCNN comprises 4 CNN layers and 1 activation layer. The input vector enters the 4 CNN layers in parallel, and feature extraction by the CNNs yields 4 feature vectors. The feature vector of the first CNN layer is fed into the activation layer (the activation function is sigmoid), which keeps its dimensionality unchanged and limits its values to between 0 and 1, producing a weight vector; the vectors obtained by multiplying this weight vector element-wise with the feature vectors output by the other 3 CNN layers are the final feature vectors.
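As a hedged illustration (not the patent's exact implementation), the gating described above can be sketched in NumPy: one convolution branch is squashed by a sigmoid into a weight vector that multiplies the other branches element-wise. The kernel shapes, "same" padding, and concatenation of the three gated branches are assumptions for this sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1d(x, w):
    """'Same'-padded 1-D convolution over a (seq_len, dim) input.
    w has shape (k, dim, dim): kernel width k, equal in/out dims."""
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        window = xp[t:t + k]                       # (k, dim)
        out[t] = np.einsum("kd,kde->e", window, w)  # sum over k and d
    return out

def gcnn(x, weights):
    """4 parallel CNN branches. Branch 0 is passed through a sigmoid to
    form the weight (gate) vector, which multiplies the outputs of the
    other 3 branches element-wise; concatenating the 3 gated branches
    is an assumption about how they are combined."""
    feats = [conv1d(x, w) for w in weights]
    gate = sigmoid(feats[0])            # values strictly in (0, 1)
    gated = [gate * f for f in feats[1:]]
    return np.concatenate(gated, axis=-1)
```

Because the gate stays in (0, 1), each output channel is a softly attenuated copy of the corresponding convolution feature.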
The second purpose of the invention can be achieved by adopting the following technical scheme:
A cross-domain Chinese word segmentation method based on new word discovery realizes the segmentation of corpora in different domains through the following steps:
Step S1: mine the domain's new word list from the target-domain corpus using the new word discovery module.
Step S2: automatically label the unlabeled target-domain corpus using the automatic labeling module combined with the new word list obtained in step S1.
Step S3: extract features of the source-domain and target-domain corpora through the three branches of the cross-domain word segmentation module. Branch I uses the source-domain feature extraction submodule to extract the source-domain feature H_src from the source-domain corpus; branch II uses the common feature extraction submodule to extract the common feature H_shr of the source-domain and target-domain corpora; branch III uses the target-domain feature extraction submodule to extract the target-domain feature H_tgt from the target-domain corpus.
Step S4: h obtained in step S3srcAnd HshrInputting the predicted values into a source field lexeme labeling submodule to predict a source field lexeme label, and performing step S3 on the obtained HtgtAnd HshrInputting the predicted target domain word position label into a target domain word position labeling submodule to predict a target domain word position label, and performing the step S3 on the obtained HshrInput into a text classification sub-module to predict domain labels for the input text.
Further, in step S1, mining the domain's new word list from the target-domain corpus with the new word discovery module comprises the following steps:
Step S1.1: use the candidate word extraction submodule to extract, from the unlabeled target-domain corpus, all candidate words of length not exceeding n.
Step S1.2: split the candidate word C at an arbitrary position into a front internal segment A and a rear internal segment B, and count the occurrences of C, A, and B as n_C, n_A, and n_B. The mutual information MI_C of C is calculated as:

MI_C = log2( p(C) / ( p(A) · p(B) ) ),  with  p(w) = n_w / N

where n_w denotes the number of occurrences of an arbitrary string w in the corpus and N is the total count over the corpus.
Step S1.3: train a Word2Vec model on the target-domain corpus to obtain the character vector v(c_j) of any character c_j. The vector Vec_A of internal segment A and the vector Vec_B of internal segment B are the averages of their character vectors:

Vec_A = (1/i) · Σ_{j=1}^{i} v(c_j),  Vec_B = (1/m) · Σ_{j=1}^{m} v(c_j)

where i is the number of Chinese characters in A, m is the number of Chinese characters in B, a_p and b_q denote the values of Vec_A and Vec_B at positions p and q, and n is the vector dimension.
Step S1.4: from the vectors Vec_A and Vec_B of step S1.3, calculate the semantic relevance sim(A, B) of the two segments as their cosine similarity:

sim(A, B) = ( Σ_{p=1}^{n} a_p · b_p ) / ( sqrt( Σ_{p=1}^{n} a_p^2 ) · sqrt( Σ_{q=1}^{n} b_q^2 ) )
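Steps S1.3 and S1.4 amount to averaging character vectors and taking a cosine similarity; a self-contained sketch (plain lists stand in for Word2Vec output here):

```python
import math

def segment_vector(char_vectors):
    """Average the character vectors of an internal segment (step S1.3)."""
    dim = len(char_vectors[0])
    return [sum(v[d] for v in char_vectors) / len(char_vectors)
            for d in range(dim)]

def cosine_similarity(a, b):
    """Semantic relevance sim(A, B) of two segment vectors (step S1.4)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Segments whose averaged vectors point the same way score near 1, signalling that the two halves of the candidate are semantically related.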
Step S1.5: from the mutual information MI_C of step S1.2 and the semantic relevance sim(A, B) of step S1.4, calculate the enhanced mutual information ENMI_C of the candidate word C:

ENMI_C = MI_C + β1 · sim(A, B)

where β1 is the weight coefficient of the semantic relevance in the enhanced mutual information.
Step S1.6: from the target-domain corpus, find all left-adjacent characters [L_1, …, L_u, …, L_H] and right-adjacent characters [R_1, …, R_v, …, R_D] of the candidate word C, where H and D are the numbers of distinct left and right adjacent characters, along with the number of occurrences of each left-adjacent character to the left of C, [n(L_1), …, n(L_u), …, n(L_H)], and of each right-adjacent character to the right of C, [n(R_1), …, n(R_v), …, n(R_D)]. The probability of each adjacent character is:

p(L_u) = n(L_u) / Σ_{u=1}^{H} n(L_u),  p(R_v) = n(R_v) / Σ_{v=1}^{D} n(R_v)
Step S1.7: from the probabilities p(L_u) and p(R_v) of step S1.6, calculate the left adjacency entropy H_l(C) and the right adjacency entropy H_r(C) of the candidate word C:

H_l(C) = − Σ_{u=1}^{H} p(L_u) · log p(L_u),  H_r(C) = − Σ_{v=1}^{D} p(R_v) · log p(R_v)
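Steps S1.6 and S1.7 can be sketched as one small function over the list of observed neighbor characters (the base-2 logarithm is an assumption of this sketch):

```python
import math
from collections import Counter

def adjacency_entropy(neighbors):
    """Entropy of the distribution of characters adjacent to a candidate.
    `neighbors` is the list of left (or right) neighbor characters
    observed in the corpus; higher entropy means freer combination,
    which supports treating the candidate as a word."""
    counts = Counter(neighbors)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values())
```

A candidate flanked by four equally frequent neighbors gets entropy log2(4) = 2, while a candidate that always follows the same character gets 0.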
Step S1.8: from the left adjacency entropy H_l(C) and right adjacency entropy H_r(C) of step S1.7, calculate the adjacency entropy BE_C of the candidate word C as a weighted combination of the left and right adjacency entropies (the exact formula appears only as an image in the source).
Step S1.9: from the enhanced mutual information ENMI_C of step S1.5 and the adjacency entropy BE_C of step S1.8, calculate the overall score of the candidate word C:

score(C) = sigmoid(β2 · ENMI_C + BE_C)

where β2 is the weight of the enhanced mutual information in the overall score, and sigmoid normalizes the result:

sigmoid(x) = 1 / (1 + e^(−x))
Step S1.10: set a candidate word score threshold and compare the overall score score(C) from step S1.9 against it. If score(C) exceeds the threshold, the candidate is kept as a reasonable word; otherwise it is removed from the candidate list. The surviving candidates form the new word list.
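Steps S1.9 and S1.10 can be sketched as follows (the β2 and threshold values here are illustrative, not taken from the patent):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def filter_candidates(candidates, beta2=1.0, threshold=0.9):
    """candidates: dict mapping candidate word -> (ENMI, BE).
    Keep words whose sigmoid-normalized score exceeds the threshold."""
    return [word for word, (enmi, be) in candidates.items()
            if sigmoid(beta2 * enmi + be) > threshold]
```

Because sigmoid maps any real score into (0, 1), the threshold can be chosen on a fixed scale regardless of the raw magnitudes of ENMI and BE.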
Further, in step S2, the automatic labeling module combines the new word list obtained in step S1 with the first and second Chinese word segmentation submodules to automatically label the unlabeled target-domain corpus, as follows:
Step S2.1: the first Chinese word segmentation submodule adopts a reverse maximum matching algorithm. Set a maximum matching length N and take the string of length N ending at the last character of the sentence, then look it up in the new word list. If it is found, segment it and move the current position N characters to the left; if not, reduce the matching length by 1 and match again; if no match succeeds even at length 1, move the current position one character to the left and continue matching. Repeat until the whole sentence has been processed, yielding the preliminary (incomplete) segmentation of the target corpus.
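A hedged sketch of reverse maximum matching as described above. To mirror the incomplete-segmentation behavior, dictionary hits become tokens while runs of unmatched characters are left together as unsegmented chunks:

```python
def backward_max_match(sentence, vocab, max_len=4):
    """Reverse (backward) maximum matching against `vocab`.
    Returns tokens in left-to-right order; consecutive characters that
    match nothing in the vocabulary are kept as one unsegmented chunk."""
    tokens = []
    pos = len(sentence)
    pending = ""                      # unmatched characters, right-to-left
    while pos > 0:
        matched = None
        for length in range(min(max_len, pos), 0, -1):
            piece = sentence[pos - length:pos]
            if piece in vocab:
                matched = piece
                break
        if matched:
            if pending:               # flush the unsegmented chunk
                tokens.append(pending)
                pending = ""
            tokens.append(matched)
            pos -= len(matched)
        else:                         # no match at any length: shift by one
            pending = sentence[pos - 1] + pending
            pos -= 1
    if pending:
        tokens.append(pending)
    tokens.reverse()
    return tokens
```

For example, with the (hypothetical) new word list {"新冠", "疫苗"}, the sentence "新冠疫苗接种" is split into the two dictionary words plus the unmatched chunk "接种", which the second submodule would then segment.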
Step S2.2: split the labeled source-domain corpus into words at the spaces, and tag each character in a word with a lexeme label according to the word's length; take the input text as input and the lexeme labels as output to construct a training dataset. The lexeme labels are B, M, E, and S, where B marks the starting character of a multi-character word, M a middle character, E the ending character, and S a character that forms a word by itself.
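The word-to-label conversion above is mechanical and can be sketched in a few lines:

```python
def bmes_labels(words):
    """Map a list of segmented words to per-character BMES lexeme labels:
    B = start of a multi-character word, M = middle, E = end,
    S = single-character word."""
    labels = []
    for word in words:
        if len(word) == 1:
            labels.append("S")
        else:
            labels.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return labels
```

For instance, the segmented sentence "研究 生命 的 起源" yields the label sequence B E B E S B E, one label per character.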
Step S2.3: the second Chinese word segmentation submodule uses the cross-entropy cost function

loss_1 = − Σ_{s=1}^{4} y_s · log( ŷ_s )

where Y = (y_1, y_2, y_3, y_4) is the true lexeme label of a character, y_s ∈ {0, 1}, with y_s = 1 indicating that the character's label is the s-th lexeme label; Ŷ = (ŷ_1, ŷ_2, ŷ_3, ŷ_4) is the lexeme label output by the model, and ŷ_s is the model's predicted probability that the character's label is the s-th lexeme label.
Step S2.4: input the training data into the model and train until loss_1 is less than a preset value.
Step S2.5: use the second Chinese word segmentation submodule trained in step S2.4 to segment the unsegmented parts of the incompletely segmented corpus produced by the first submodule, obtaining the completely segmented, automatically labeled target-domain corpus.
Further, the process of step S3 is as follows:
Step S3.1: split the labeled source-domain corpus and the automatically labeled target-domain corpus into words, and tag each character with a lexeme label B, M, E, or S according to the word's length (B: starting character of a multi-character word; M: middle character; E: ending character; S: single-character word). Take the corpora as input and the lexeme labels as output to construct a source-domain training set and a target-domain training set, respectively.
Step S3.2: input the text of the source-domain training set obtained in step S3.1 into the source-domain feature extraction submodule; its output vector is the source-domain feature H_src unique to the source-domain corpus.
Step S3.3: input the text of the target-domain training set obtained in step S3.1 into the target-domain feature extraction submodule; its output vector is the target-domain feature H_tgt unique to the target-domain corpus.
Step S3.4: input the texts of the source-domain and target-domain training sets obtained in step S3.1, in turn, into the common feature extraction submodule; its output vector is the feature H_shr common to the source and target domains.
Further, the process of step S4 is as follows:
Step S4.1: concatenate the source-domain feature H_src and the target-domain feature H_tgt with the common feature H_shr to obtain the overall source-domain feature H'_src and the overall target-domain feature H'_tgt:

H'_src = [H_src ; H_shr],  H'_tgt = [H_tgt ; H_shr]
Step S4.2: the cost functions of the source-domain lexeme labeling submodule, the target-domain lexeme labeling submodule, and the text classification submodule are loss_src, loss_tgt, and loss_shr respectively, each a cross-entropy:

loss_src = − Σ_{t=1}^{4} y_t · log( ŷ_t )
loss_tgt = − Σ_{k=1}^{4} y'_k · log( ŷ'_k )
loss_shr = − Σ_{l=1}^{2} y''_l · log( ŷ''_l )

where Y = (y_1, y_2, y_3, y_4) is the true lexeme label of a character, y_t ∈ {0, 1}, y_t = 1 indicating that the character's label is the t-th lexeme label and y_t = 0 that it is not, and ŷ_t is the model's predicted probability that the character's label is the t-th lexeme label; Y' = (y'_1, y'_2, y'_3, y'_4) is the true lexeme label of a character in the target domain, y'_k ∈ {0, 1} defined analogously, and ŷ'_k is the model's predicted probability of the k-th lexeme label; Y'' = (y''_1, y''_2) is the true domain label of an input sample, y''_l ∈ {0, 1}, indicating whether the sample comes from the source domain or the target domain, and ŷ''_l is the model's predicted probability that the sample belongs to the l-th domain.
Step S4.3: perform adversarial training with the source-domain lexeme labeling submodule, the target-domain lexeme labeling submodule, and the text classification submodule to realize cross-domain Chinese word segmentation from the source domain to the target domain, with the total cost function

loss = loss_src + β3 · loss_tgt + β4 · loss_shr

where β3 and β4 are the weights of loss_tgt and loss_shr, respectively, in the total cost function.
Step S4.4: train the model until the variation of loss is less than a predetermined value.
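The cost functions of steps S2.3 and S4.2 are plain cross-entropies over one-hot labels, combined as in step S4.3; a minimal sketch (the β3 and β4 values here are illustrative, not from the patent):

```python
import math

def cross_entropy(y_true, y_pred):
    """Cross-entropy between a one-hot true label and a predicted
    probability distribution; only the true class contributes."""
    return -sum(y * math.log(p) for y, p in zip(y_true, y_pred) if y)

def total_loss(loss_src, loss_tgt, loss_shr, beta3=0.5, beta4=0.1):
    """Weighted combination of the three submodule losses (step S4.3)."""
    return loss_src + beta3 * loss_tgt + beta4 * loss_shr
```

A confident correct prediction (probability near 1 on the true class) drives the cross-entropy toward 0, so the weighted total is dominated by whichever submodule is still predicting poorly.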
Compared with the prior art, the invention has the following advantages and effects:
1. The invention improves the new word discovery algorithm with vector-enhanced mutual information, effectively combining the statistical and semantic information in the corpus; this not only markedly improves the accuracy of new word discovery and reduces junk strings in the new word list, but also markedly strengthens the domain specificity of the list.
2. The invention uses the new word discovery algorithm to extract, from the unlabeled target-domain corpus, a vocabulary of words new relative to the source-domain corpus, also called the unknown word vocabulary. Automatically labeling the unlabeled target-domain corpus based on this vocabulary markedly reduces the unknown word rate of the test corpus relative to the training corpus.
3. The invention trains the Chinese word segmentation algorithm with adversarial training on the labeled source-domain corpus and the automatically labeled target-domain corpus, reducing the influence of noisy samples in the automatically labeled corpus on model training and improving the cross-domain segmentation result beyond that of the same algorithm without adversarial training.
Drawings
FIG. 1 is a block diagram of a cross-domain Chinese segmentation system based on new word discovery as disclosed in an embodiment of the present invention;
FIG. 2 is a block diagram of a new word discovery module according to an embodiment of the present invention;
FIG. 3 is a block diagram of an automatic labeling module according to an embodiment of the present invention;
FIG. 4 is a block diagram of the cross-domain Chinese word segmentation module in an embodiment of the present invention;
FIG. 5 is a schematic diagram of a GCNN-CRF model network according to an embodiment of the present invention;
FIG. 6 is a diagram of a network structure of a TextCNN model according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one
A structural block diagram of a cross-domain chinese word segmentation system based on new word discovery disclosed in this embodiment is shown in fig. 1, and is composed of a new word discovery module, an automatic labeling module, and a cross-domain word segmentation module, where the new word discovery module, the automatic labeling module, and the cross-domain word segmentation module are connected in sequence, and are respectively used for mining new words from non-labeled target domain corpora, automatically labeling the non-labeled target domain corpora, and training a neural network for cross-domain chinese word segmentation.
A structural block diagram of a new word discovery module in this embodiment is shown in fig. 2, and the new word discovery module is composed of a candidate word extraction sub-module, an enhanced mutual information extraction sub-module, an adjacent entropy extraction sub-module, and a candidate word filtering sub-module, where the candidate word extraction sub-module, the enhanced mutual information extraction sub-module, and the candidate word filtering sub-module are connected in sequence, the candidate word extraction sub-module is used to extract all candidate words from a corpus of a target field, the enhanced mutual information extraction sub-module is used to extract enhanced mutual information of all candidate words, and the candidate word filtering sub-module is used to filter candidate words; the candidate word extraction sub-module, the adjacent entropy extraction sub-module and the candidate word filtering sub-module are sequentially connected, and the adjacent entropy extraction sub-module is used for extracting the adjacent entropy of all candidate words; and the candidate word extracting submodule is connected with the candidate word filtering submodule.
In the above embodiment, the candidate word extraction sub-module extracts all candidate words in the corpus from the unmarked target domain corpus; the enhanced mutual information extraction sub-module considers the semantic information of the candidate words on the basis of the mutual information, and adds the semantic similarity of the candidate words into the mutual information to calculate the enhanced mutual information of the candidate words; on the basis of the existing calculation of the adjacent entropy, the adjacent entropy extraction sub-module fully considers the left and right adjacent entropies of the candidate word, gives certain weight to the left and right adjacent entropies, and calculates and obtains the adjacent entropies of all the candidate words by utilizing the information contained in the left and right adjacent entropies to the maximum extent; and the candidate word filtering submodule adds the enhanced mutual information and the adjacent entropy of the candidate word according to a certain weight, balances the importance of the enhanced mutual information and the adjacent entropy to obtain the final score of the candidate word, and is used for filtering the candidate word to obtain a new word list of the corpus.
The structural block diagram of the automatic labeling module in this embodiment is shown in fig. 3, and is composed of a first Chinese word segmentation sub-module and a second Chinese word segmentation sub-module, where the first Chinese word segmentation sub-module is used to perform incomplete segmentation on the corpus of the target field, and the second Chinese word segmentation sub-module is used to perform complete segmentation on the incompletely segmented corpus.
In the above embodiment, the automatic labeling module is composed of a first Chinese word segmentation sub-module and a second Chinese word segmentation sub-module. The first Chinese word segmentation sub-module matches the corpus against the new word vocabulary; if the matching succeeds, the matched string is segmented off, otherwise no segmentation is performed, realizing incomplete segmentation of the corpus of the target field. The second Chinese word segmentation sub-module segments the still-unsegmented parts left by the first sub-module using a GCNN-CRF word segmentation algorithm trained on the source field corpus, realizing complete segmentation of the corpus of the target field. The module thus realizes automatic labeling of the unlabeled corpus in two steps; by exploiting the new word vocabulary, the unlabeled target field corpus can be labeled more accurately, improving the labeling accuracy.
A structural block diagram of the cross-domain word segmentation module in this embodiment is shown in fig. 4. The module is composed of a source field feature extraction submodule, a common feature extraction submodule, a target field feature extraction submodule, a source field lexeme labeling submodule, a text classification submodule and a target field lexeme labeling submodule. The source field feature extraction submodule is connected with the source field lexeme labeling submodule; the source field feature extraction submodule is used for extracting the unique features of the source field corpus, and the source field lexeme labeling submodule is used for performing lexeme labeling on the source field corpus. The common feature extraction submodule is respectively connected with the text classification submodule, the source field lexeme labeling submodule and the target field lexeme labeling submodule; the common feature extraction submodule is used for extracting the common features of the source field corpus and the target field corpus, and the text classification submodule is used for judging which field an input comes from. The target field feature extraction submodule is connected with the target field lexeme labeling submodule; the target field feature extraction submodule is used for extracting the unique features of the target field corpus, and the target field lexeme labeling submodule is used for performing lexeme labeling on the target field corpus.
In the above embodiment, the cross-domain word segmentation module introduces adversarial training and includes three branches: the source field feature extraction submodule and the source field lexeme labeling submodule are connected to form branch one; the common feature extraction submodule is respectively connected with the text classification submodule, the source field lexeme labeling submodule and the target field lexeme labeling submodule to form branch two; and the target field feature extraction submodule and the target field lexeme labeling submodule are connected to form branch three. By giving a certain weight to the automatically labeled corpus during model training, the cross-domain word segmentation module can effectively reduce the influence of noise in the automatically labeled corpus on model training and improve the word segmentation performance of the model.
Example two
This embodiment provides a cross-domain Chinese word segmentation method based on the cross-domain Chinese word segmentation system based on new word discovery, and the method adopts the following steps to realize word segmentation of corpora in different fields:
step S1: and mining a new word list of the field from the target field corpus by using a new word discovery module. In the step S1, the method for mining a new word vocabulary of the field from the target field corpus using the new word discovery module includes the following steps:
step S1.1: and extracting all candidate words with the length not exceeding n in the domain corpus from the unmarked target domain corpus by using a candidate word extraction submodule.
In this embodiment, the candidate word extraction sub-module segments the corpus according to the non-chinese characters, sets the maximum candidate word length to 6, extracts all candidate words having a length not exceeding 6 from the sentences of the segmented corpus, stores the extracted candidate words in the candidate word set, and counts the occurrence number of the candidate words according to the corpus, and stores the candidate words in the dictionary D.
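As an illustrative sketch only (not part of the claimed embodiment), the candidate extraction of step S1.1 can be realized in a few lines of Python; the function name and the toy corpus are hypothetical:

```python
import re
from collections import Counter

def extract_candidates(corpus, max_len=6):
    """Split the corpus on non-Chinese characters, then collect every
    substring of length 1..max_len as a candidate word with its count."""
    counts = Counter()  # plays the role of the dictionary D in the text
    for fragment in re.split(r'[^\u4e00-\u9fff]+', corpus):
        for i in range(len(fragment)):
            for j in range(i + 1, min(i + max_len, len(fragment)) + 1):
                counts[fragment[i:j]] += 1
    return counts

D = extract_candidates("华南理工大学，华南理工。", max_len=6)
```

The returned Counter maps each candidate word to its number of occurrences, as the dictionary D does above.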
Step S1.2: segment the candidate word C randomly into a front internal segment A and a rear internal segment B, count the numbers of occurrences of C, A and B as n_C, n_A and n_B, and calculate the mutual information MI_C of C by the following method:

MI_C = log2( p(C) / ( p(A) · p(B) ) )

p(w) = n_w / N

wherein n_w represents the number of occurrences of any character string w in the corpus, and N is the total number of character strings counted in the corpus;
In this embodiment, the minimum length of both A and B is 1.
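A minimal sketch of the mutual information calculation of step S1.2, assuming the candidate counts have already been collected into a dictionary; the counts and the normalizer `total` are illustrative:

```python
import math

def mutual_information(counts, candidate, split, total):
    """MI_C = log2(p(C) / (p(A) * p(B))) for one split of the candidate
    into a front segment A and a rear segment B (minimum length 1 each)."""
    a, b = candidate[:split], candidate[split:]
    p_c = counts[candidate] / total
    p_a = counts[a] / total
    p_b = counts[b] / total
    return math.log2(p_c / (p_a * p_b))

counts = {"华南理工": 2, "华南": 2, "理工": 3}  # illustrative counts
total = sum(counts.values())                     # illustrative normalizer N
mi = mutual_information(counts, "华南理工", 2, total)
```

A high MI_C indicates that the candidate occurs far more often than its two segments would by chance, which supports treating it as one word.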
Step S1.3: train a Word2Vec model with the target field corpus to obtain the word vector Vec(c_j) of any character c_j, and calculate the word vector Vec_A of the internal segment A and the word vector Vec_B of the internal segment B by the following method:

Vec_A = (1/i) · Σ_{j=1..i} Vec(c_j)

Vec_B = (1/m) · Σ_{j=1..m} Vec(c_j)

wherein i represents the number of Chinese characters in A, m represents the number of Chinese characters in B, a_p and b_q represent the values of Vec_A and Vec_B at positions p and q respectively, and n represents the vector dimension;
step S1.4: word vectors Vec from the internal segment a and the internal segment B in step S1.3A、VecBCalculating the semantic relevance sim (A, B) of the internal segment A and the internal segment B by adopting the following method:
Figure BDA0003040022090000143
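Steps S1.3 and S1.4 can be sketched as follows; the hand-made 3-dimensional character vectors stand in for the output of a trained Word2Vec model:

```python
import math

def segment_vector(chars, char_vecs):
    """Average the character vectors of a segment: Vec_A = (1/i) * sum Vec(c_j)."""
    dim = len(next(iter(char_vecs.values())))
    return [sum(char_vecs[c][p] for c in chars) / len(chars) for p in range(dim)]

def cosine_sim(a, b):
    """sim(A, B): cosine of the angle between the two segment vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# hand-made vectors standing in for a trained Word2Vec model
char_vecs = {"华": [1.0, 0.0, 1.0], "南": [0.0, 1.0, 1.0],
             "理": [1.0, 1.0, 0.0], "工": [1.0, 0.0, 0.0]}
vec_a = segment_vector("华南", char_vecs)   # averages to [0.5, 0.5, 1.0]
vec_b = segment_vector("理工", char_vecs)   # averages to [1.0, 0.5, 0.0]
sim = cosine_sim(vec_a, vec_b)
```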
Step S1.5: according to the mutual information MI_C in step S1.2 and the semantic relevance sim(A, B) in step S1.4, calculate the enhanced mutual information ENMI_C of the candidate word C by the following method:

ENMI_C = MI_C + β1 · sim(A, B)

wherein β1 represents the weight coefficient of the semantic relevance in the enhanced mutual information;
in the above embodiment, the weight coefficient β1Is 300.
Step S1.6: find from the target field corpus all left adjacent characters [L_1, ..., L_u, ..., L_H] and all right adjacent characters [R_1, ..., R_v, ..., R_D] of the candidate word C, where H and D represent the numbers of left and right adjacent characters respectively, together with the number of occurrences of each left adjacent character on the left side of the candidate word, [n(L_1), ..., n(L_u), ..., n(L_H)], and the number of occurrences of each right adjacent character on the right side of the candidate word, [n(R_1), ..., n(R_v), ..., n(R_D)], and calculate the probability of each adjacent character separately using the following formulas:

p(L_u) = n(L_u) / Σ_{u=1..H} n(L_u)

p(R_v) = n(R_v) / Σ_{v=1..D} n(R_v)
Step S1.7: according to the probabilities p(L_u) and p(R_v) of the left and right adjacent characters in step S1.6, calculate the left adjacent entropy H_l(C) and the right adjacent entropy H_r(C) of the candidate word C using the following formulas:

H_l(C) = − Σ_{u=1..H} p(L_u) · log p(L_u)

H_r(C) = − Σ_{v=1..D} p(R_v) · log p(R_v)
Step S1.8: according to the left adjacent entropy H_l(C) and the right adjacent entropy H_r(C) of the candidate word C in step S1.7, calculate the adjacency entropy BE_C of C by the following method:

BE_C = log( ( H_l(C) · e^(H_r(C)) + H_r(C) · e^(H_l(C)) ) / | H_l(C) − H_r(C) | )
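A sketch of steps S1.6 to S1.8; the neighbor counts are illustrative, and the weighted combination in `adjacency_entropy` is an assumed reading of the weighting the description alludes to, including a guard for equal entropies:

```python
import math
from collections import Counter

def entropy(neighbor_counts):
    """Shannon entropy of the left (or right) neighbor distribution of a candidate."""
    total = sum(neighbor_counts.values())
    return -sum(n / total * math.log(n / total) for n in neighbor_counts.values())

def adjacency_entropy(h_left, h_right):
    """Assumed weighted combination: each side weighted by the exponential of
    the other side, normalised by the entropy gap (guarded when equal)."""
    if h_left == h_right:
        return h_left
    return math.log((h_left * math.exp(h_right) + h_right * math.exp(h_left))
                    / abs(h_left - h_right))

left = Counter({"在": 3, "去": 1})            # illustrative left-neighbor counts
right = Counter({"的": 2, "里": 1, "外": 1})  # illustrative right-neighbor counts
be = adjacency_entropy(entropy(left), entropy(right))
```

A candidate that appears in many different contexts on both sides has high left and right entropies, and therefore a high combined adjacency entropy.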
Step S1.9: according to the enhanced mutual information ENMI_C in step S1.5 and the adjacency entropy BE_C in step S1.8, calculate the overall score of the candidate word C by the following method:

score(C) = sigmoid( β2 · ENMI_C + BE_C )

wherein β2 is the weight of the enhanced mutual information in the overall score, and sigmoid denotes normalization, calculated as:

sigmoid(x) = 1 / ( 1 + e^(−x) )
in this embodiment, the weight β occupied by the enhanced mutual information in the overall score is2Is 60.
Step S1.10: set a candidate word score threshold and compare the overall score score(C) of the candidate word C obtained in step S1.9 with the threshold; if score(C) is greater than the threshold, the candidate word is considered a reasonable word, otherwise it is removed from the candidate word list, finally obtaining the new word list.
In the present embodiment, the score threshold is set to 0.9.
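Steps S1.9 and S1.10 with β2 = 60 and threshold 0.9 as in this embodiment; the (ENMI, BE) values of the three candidates are invented for illustration:

```python
import math

def score(enmi, be, beta2=60):
    """score(C) = sigmoid(beta2 * ENMI_C + BE_C), with beta2 = 60 as in the embodiment."""
    x = beta2 * enmi + be
    return 1 / (1 + math.exp(-x))

def filter_candidates(candidates, threshold=0.9):
    """Keep candidates whose overall score exceeds the threshold (0.9 here)."""
    return [w for w, enmi, be in candidates if score(enmi, be) > threshold]

# illustrative (candidate, ENMI, BE) triples
cands = [("华南理工", 0.08, 2.1), ("的了", -0.20, 0.3), ("理工大", 0.01, 0.9)]
new_words = filter_candidates(cands)
```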
Step S2: and the automatic labeling module automatically labels the linguistic data of the non-labeled target field by combining the new word list of the field obtained in the step S1.
In step S2, automatic labeling of the unlabeled target field corpus is realized by using the new word list of the field obtained in step S1 in combination with the first Chinese word segmentation module and the second Chinese word segmentation module, which includes the following steps:
step S2.1: the first Chinese word segmentation module adopts a reverse maximum matching algorithm and sets a maximum matching length N. Starting from the last character of the sentence, taking out the character string with the length of N, inquiring whether the character string is in a new word list, if the character string is segmented, moving the current character position to the left by N distances, if the character string is not segmented, subtracting 1 from the matching length, continuing to match, if the matching length is not successful after subtracting 1, moving the current character position to the left by a distance, continuing to match until the whole sentence is matched, and realizing the preliminary segmentation of the target corpus.
In the above embodiment, the maximum matching length N is set to 6 (typically set to the maximum length of an entry in a vocabulary).
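A sketch of the reverse maximum matching of step S2.1; here, characters matching no vocabulary entry are kept together as one unsegmented run for the second word segmentation module, which is one possible reading of the step, and the sketch only matches multi-character entries:

```python
def reverse_max_match(sentence, vocab, max_len=6):
    """Reverse maximum matching against the new-word vocabulary: scan from the
    end of the sentence, trying the longest match first; characters matching
    no entry are accumulated into an unsegmented run."""
    result, end = [], len(sentence)
    buffer = ""  # run of characters matching no vocabulary entry
    while end > 0:
        for length in range(min(max_len, end), 1, -1):
            piece = sentence[end - length:end]
            if piece in vocab:
                if buffer:
                    result.append(buffer)
                    buffer = ""
                result.append(piece)
                end -= length
                break
        else:  # no entry matched: prepend this character to the unmatched run
            buffer = sentence[end - 1] + buffer
            end -= 1
    if buffer:
        result.append(buffer)
    return list(reversed(result))

pieces = reverse_max_match("我在华南理工研究新词发现", {"华南理工", "新词发现"})
```

Here `pieces` keeps the matched new words as whole units while "我在" and "研究" remain unsegmented runs for the second module.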
Step S2.2: split the labeled source field corpus into words according to spaces and, according to the length of each word, mark each character in the word with a lexeme label B, M, E or S, where B represents the initial character of a multi-character word, M represents a middle character of a multi-character word, E represents the end character of a multi-character word, and S represents a single-character word; take the text as input and the lexeme labels as output to construct a training data set.
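The B/M/E/S labeling of step S2.2 can be sketched as:

```python
def bmes_tags(words):
    """Map each character of a space-segmented sentence to a lexeme label:
    B = begin, M = middle, E = end of a multi-character word, S = single-character word."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags

# "我 爱 华南理工" segmented by spaces, as in the labeled source corpus
tags = bmes_tags("我 爱 华南理工".split())  # ["S", "S", "B", "M", "M", "E"]
```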
Step S2.3: the second Chinese word segmentation module calculates its cost function by the following method:

loss_1 = − Σ_{s=1..4} y_s · log(ŷ_s)

wherein Y = (y_1, y_2, y_3, y_4) represents the real lexeme label of a character, y_s ∈ {0, 1}, y_s = 1 indicating that the character's label is the s-th lexeme label; Ŷ = (ŷ_1, ŷ_2, ŷ_3, ŷ_4) represents the lexeme label output by the model, with ŷ_s the probability predicted by the model that the character's label is the s-th lexeme label;
in the above embodiment, a structural block diagram of the second chinese word segmentation module is shown in fig. 5, and includes 3 GCNN layers, 1 fully-connected layer, and 1 CRF layer. The input text vector becomes a feature vector with dimension of 200 after passing through 3 GCNNs, the feature vector becomes an output vector with dimension of 4 after passing through a full connection layer, and the output vector obtains a lexeme label of the character after passing through CRF. The size of the convolution kernel of the GCNN1 is 3, the number of the convolution kernels is 200, the size of the convolution kernel of the GCNN2 is 4, the number of the convolution kernels is 200, the size of the convolution kernel of the GCNN3 is 5, the number of the convolution kernels is 200, and the number of nodes of the full connection layer is 4.
Step S2.4: training data is input into the model for training until loss_1 is less than a preset value.
In the above embodiment, the preset value is set to 0.01.
Step S2.5: use the second Chinese word segmentation module trained in step S2.4 to segment the unsegmented parts of the incompletely segmented corpus obtained by the first Chinese word segmentation module, obtaining the completely segmented, automatically labeled target field corpus.
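The two-stage assembly of steps S2.1 to S2.5 can be sketched with a stub in place of the trained GCNN-CRF segmenter; the stub and all inputs are illustrative:

```python
def full_segmentation(partial, new_words, base_segmenter):
    """Pieces already matched against the new-word vocabulary are kept as-is;
    every still-unsegmented run is handed to the second segmenter."""
    result = []
    for piece in partial:
        if piece in new_words:
            result.append(piece)
        else:
            result.extend(base_segmenter(piece))
    return result

# stub standing in for the GCNN-CRF model trained on the source field corpus
toy_segmenter = lambda text: [text[i:i + 2] for i in range(0, len(text), 2)]
segs = full_segmentation(["我在", "华南理工"], {"华南理工"}, toy_segmenter)
```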
Step S3: extract the features of the source field and target field corpora through the three branches of the cross-domain word segmentation module: branch one uses the source field feature extraction submodule to extract the source field feature H_src from the source field corpus; branch two uses the common feature extraction submodule to extract the common feature H_shr of the source field and target field corpora; branch three uses the target field feature extraction submodule to extract the target field feature H_tgt from the target field corpus.
The implementation of the above step S3 includes the following steps:
Step S3.1: segment the labeled source field corpus and the automatically labeled target field corpus into words and, according to the length of each word, mark each character with a lexeme label B, M, E or S, where B represents the initial character of a multi-character word, M represents a middle character of a multi-character word, E represents the end character of a multi-character word, and S represents a single-character word; take the corpus as input and the lexeme labels as output to construct a source field training set and a target field training set respectively;
Step S3.2: input the text of the source field training set obtained in step S3.1 into the source field feature extraction submodule; the vector output by the source field feature extraction submodule is the source field feature H_src unique to the source field corpus.
Step S3.3: input the text of the target field training set obtained in step S3.1 into the target field feature extraction submodule; the vector output by the target field feature extraction submodule is the target field feature H_tgt unique to the target field corpus.
Step S3.4: input the texts of the source field training set and the target field training set obtained in step S3.1 into the common feature extraction submodule in sequence; the vector output by the common feature extraction submodule is the feature H_shr common to the source field and the target field.
The source field feature extraction submodule, the target field feature extraction submodule and the public feature extraction submodule are completely consistent in structure and comprise 3 GCNN layers and 1 activation layer. The input text vector sequentially passes through 3 GCNNs to form a feature vector with the dimension of 200, the dimension of the feature vector remains unchanged after passing through an activation layer, the size of each number in the vector is between 0 and 1, the sizes of convolution kernels of the 3 GCNNs are 3, 4 and 5, the number of the convolution kernels is 200, and the activation function of the activation layer is sigmoid.
Step S4: input H_src and H_shr obtained in step S3 into the source field lexeme labeling submodule to predict source field lexeme labels, input H_tgt and H_shr obtained in step S3 into the target field lexeme labeling submodule to predict target field lexeme labels, and input H_shr obtained in step S3 into the text classification submodule to predict the field label of the input text.
In step S4, label prediction using the extracted features includes the following steps:
Step S4.1: concatenate the source field feature H_src and the target field feature H_tgt respectively with the common feature H_shr by the following method to obtain the overall source field feature H'_src and the overall target field feature H'_tgt:

H'_src = [ H_src ; H_shr ]

H'_tgt = [ H_tgt ; H_shr ]
Step S4.2: the cost functions of the source field lexeme labeling submodule, the target field lexeme labeling submodule and the text classification submodule are loss_src, loss_tgt and loss_shr respectively, calculated by the following methods:
loss_src = − Σ_{t=1..4} y_t · log(ŷ_t)

loss_tgt = − Σ_{k=1..4} y′_k · log(ŷ′_k)

loss_shr = − Σ_{l=1..2} y″_l · log(ŷ″_l)

wherein Y = (y_1, y_2, y_3, y_4) represents the real lexeme label of a source field character, y_t ∈ {0, 1}, y_t = 1 indicating that the character's label is the t-th lexeme label and y_t = 0 that it is not; Ŷ = (ŷ_1, ŷ_2, ŷ_3, ŷ_4) represents the lexeme label output by the model, with ŷ_t the probability predicted by the model that the character's label is the t-th lexeme label. Y′ = (y′_1, y′_2, y′_3, y′_4) represents the real lexeme label of a target field character, y′_k ∈ {0, 1}, y′_k = 1 indicating that the character's label is the k-th lexeme label and y′_k = 0 that it is not; Ŷ′ = (ŷ′_1, ŷ′_2, ŷ′_3, ŷ′_4) represents the lexeme label output by the model, with ŷ′_k the probability predicted by the model that the character's label is the k-th lexeme label. Y″ = (y″_1, y″_2) represents the real field label of an input sample, y″_l ∈ {0, 1}, y″_l = 1 indicating that the sample comes from the source field and y″_l = 0 that it comes from the target field; Ŷ″ = (ŷ″_1, ŷ″_2) represents the sample field label output by the model, with ŷ″_l the probability output by the model that the sample belongs to the l-th field.
In the above embodiment, a structural block diagram of the text classification sub-module is shown in fig. 6, comprising 3 convolutional layers, 3 pooling layers, a splicing layer and a fully-connected layer. The input text vector passes through three parallel convolutional layers to obtain three feature vectors, each of which is input into its pooling layer; the three vectors output by the pooling layers are input into the splicing layer to be spliced, and the spliced vector is finally input into the fully-connected layer. The convolution kernel sizes of CNN1, CNN2 and CNN3 are 3, 4 and 5 respectively, the number of convolution kernels of each is 200, the pooling layers use max pooling, and the number of nodes of the fully-connected layer is 2.
Step S4.3: perform adversarial training using the source field lexeme labeling submodule, the target field lexeme labeling submodule and the text classification submodule to realize cross-domain Chinese word segmentation from the source field to the target field, where the total cost function is:

loss = loss_src + β3 · loss_tgt + β4 · loss_shr

wherein β3 and β4 represent the weights of loss_tgt and loss_shr in the total cost function respectively.
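A sketch of the cost functions of steps S4.2 and S4.3; the β3 and β4 defaults are illustrative, not values fixed by the embodiment:

```python
import math

def cross_entropy(true_onehot, predicted):
    """loss = -sum_t y_t * log(y_hat_t) over the label positions."""
    return -sum(y * math.log(p) for y, p in zip(true_onehot, predicted))

def total_loss(loss_src, loss_tgt, loss_shr, beta3=0.5, beta4=0.5):
    """loss = loss_src + beta3 * loss_tgt + beta4 * loss_shr (beta values illustrative)."""
    return loss_src + beta3 * loss_tgt + beta4 * loss_shr

loss_src = cross_entropy([0, 1, 0, 0], [0.1, 0.7, 0.1, 0.1])  # 4 lexeme labels
loss_shr = cross_entropy([1, 0], [0.6, 0.4])                  # 2 field labels
loss = total_loss(loss_src, loss_src, loss_shr)
```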
Step S4.4: the model is trained until the variation of loss is less than a predetermined value.
In this embodiment, the preset value for the variation of loss is 0.01.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (10)

1. A cross-domain Chinese word segmentation system based on new word discovery is characterized by comprising a new word discovery module, an automatic labeling module and a cross-domain word segmentation module which are sequentially connected, wherein,
the new word discovery module is used for extracting unknown words which do not appear in the source field from the target field linguistic data without labels to obtain a new word list of the field;
the automatic labeling module is used for carrying out initial segmentation on the non-labeled corpus by using a reverse maximum matching algorithm based on a new word vocabulary to obtain a corpus which is not completely segmented; completely segmenting the part which is not segmented in the corpus after the initial segmentation by using a GCNN-CRF word segmentation algorithm based on corpus training of the source field, and realizing the automatic segmentation of the corpus of the target field without the label;
the cross-domain word segmentation module trains an adversarial deep neural network by using the labeled source domain corpus and the automatically labeled target domain corpus, and converts cross-domain word segmentation into in-domain word segmentation to realize word segmentation of the target domain.
2. The system according to claim 1, wherein the new word discovery module comprises a candidate word extraction sub-module, an enhanced mutual information extraction sub-module, an adjacent entropy extraction sub-module, and a candidate word filtering sub-module, wherein the candidate word extraction sub-module, the enhanced mutual information extraction sub-module, and the candidate word filtering sub-module are connected in sequence, the candidate word extraction sub-module is configured to extract all candidate words from the target domain corpus, the enhanced mutual information extraction sub-module is configured to extract enhanced mutual information of all candidate words, and the candidate word filtering sub-module is configured to filter the candidate words; the candidate word extraction sub-module, the adjacent entropy extraction sub-module and the candidate word filtering sub-module are sequentially connected, and the adjacent entropy extraction sub-module is used for extracting the adjacent entropy of all candidate words; and the candidate word extraction sub-module is connected with the candidate word filtering sub-module.
3. The system according to claim 1, wherein the automatic labeling module comprises a first Chinese word segmentation submodule and a second Chinese word segmentation submodule, the first Chinese word segmentation submodule matches the corpus against the new word vocabulary, performing segmentation if the matching succeeds and otherwise performing no segmentation, realizing incomplete segmentation of the target field corpus; and the second Chinese word segmentation submodule segments the unsegmented corpus left by the first Chinese word segmentation submodule by using a GCNN-CRF word segmentation algorithm trained on the source field corpus, realizing complete segmentation of the target field corpus.
4. The system for Chinese word segmentation across fields based on new word discovery according to claim 1, wherein the cross-domain word segmentation module comprises a source field feature extraction submodule, a common feature extraction submodule, a target field feature extraction submodule, a source field lexeme labeling submodule, a text classification submodule and a target field lexeme labeling submodule, wherein the source field feature extraction submodule and the source field lexeme labeling submodule are connected to form branch one, the source field feature extraction submodule is used for extracting the unique features of the source field corpus, and the source field lexeme labeling submodule is used for performing lexeme labeling on the source field corpus; the common feature extraction submodule is respectively connected with the text classification submodule, the source field lexeme labeling submodule and the target field lexeme labeling submodule to form branch two, the common feature extraction submodule is used for extracting the common features of the source field corpus and the target field corpus, the text classification submodule is used for judging which field the input comes from, and the target field lexeme labeling submodule is used for performing lexeme labeling on the target field corpus; and the target field feature extraction submodule and the target field lexeme labeling submodule are connected to form branch three, the target field feature extraction submodule being used for extracting the unique features of the target field corpus.
5. The cross-domain Chinese word segmentation system based on new word discovery as claimed in claim 1, wherein the source domain feature extraction sub-module, the target domain feature extraction sub-module and the common feature extraction sub-module all use GCNN as a feature extractor, GCNN includes 4 CNN layers and 1 activation layer, input vectors enter 4 CNN layers in parallel, feature extraction is performed through CNN to obtain 4 feature vectors, the feature vector of the first CNN layer is input into the activation layer to be activated, the dimension is kept unchanged, numbers in the vectors are limited between 0 and 1 to serve as a weight vector, the vectors obtained by multiplying the weight vector and feature vectors output by the other 3 CNN layers are the final feature vector, and the activation function is sigmoid.
6. A word segmentation method of a cross-domain chinese word segmentation system based on new word discovery according to any one of claims 1 to 5, characterized by implementing word segmentation for linguistic data of different domains by adopting the following steps:
step S1: and mining a new word list of the field from the target field corpus by using a new word discovery module.
step S2: automatically labeling the unlabeled target field corpus by using the automatic labeling module in combination with the new word list of the field obtained in step S1.
step S3: extracting the features of the source field and target field corpora through the three branches of the cross-domain word segmentation module: branch one uses the source field feature extraction submodule to extract the source field feature H_src from the source field corpus; branch two uses the common feature extraction submodule to extract the common feature H_shr of the source field and target field corpora; branch three uses the target field feature extraction submodule to extract the target field feature H_tgt from the target field corpus.
step S4: inputting H_src and H_shr obtained in step S3 into the source field lexeme labeling submodule to predict source field lexeme labels, inputting H_tgt and H_shr obtained in step S3 into the target field lexeme labeling submodule to predict target field lexeme labels, and inputting H_shr obtained in step S3 into the text classification submodule to predict the field label of the input text.
7. The method as claimed in claim 6, wherein in step S1, the new word discovery module is used to mine a new word list of the field from the target field corpus as follows: step S1.1: extracting, by using the candidate word extraction submodule, all candidate words with length not exceeding n from the unlabeled target field corpus;
step S1.2: segmenting the candidate word C randomly into a front internal segment A and a rear internal segment B, counting the numbers of occurrences of C, A and B as n_C, n_A and n_B, and calculating the mutual information MI_C of C by the following method:

MI_C = log2( p(C) / ( p(A) · p(B) ) )

p(w) = n_w / N

wherein n_w represents the number of occurrences of any character string w in the corpus, and N is the total number of character strings counted in the corpus;
step S1.3: training a Word2Vec model with the target field corpus to obtain the word vector Vec(c_j) of any character c_j, and calculating the word vector Vec_A of the internal segment A and the word vector Vec_B of the internal segment B by the following method:

Vec_A = (1/i) · Σ_{j=1..i} Vec(c_j)

Vec_B = (1/m) · Σ_{j=1..m} Vec(c_j)

wherein i represents the number of Chinese characters in A, m represents the number of Chinese characters in B, a_p and b_q represent the values of Vec_A and Vec_B at positions p and q respectively, and n represents the vector dimension;
step S1.4: according to the word vectors Vec_A and Vec_B of the internal segments A and B in step S1.3, calculating the semantic relevance sim(A, B) of the internal segments A and B by the following method:

sim(A, B) = ( Σ_{p=1..n} a_p · b_p ) / ( √(Σ_{p=1..n} a_p²) · √(Σ_{q=1..n} b_q²) )
step S1.5: according to the mutual information MI_C in step S1.2 and the semantic relevance sim(A, B) in step S1.4, calculating the enhanced mutual information ENMI_C of the candidate word C by the following method:

ENMI_C = MI_C + β1 · sim(A, B)

wherein β1 represents the weight coefficient of the semantic relevance in the enhanced mutual information;
step S1.6: finding from the target field corpus all left adjacent characters [L_1, ..., L_u, ..., L_H] and all right adjacent characters [R_1, ..., R_v, ..., R_D] of the candidate word C, where H and D represent the numbers of left and right adjacent characters respectively, together with the number of occurrences of each left adjacent character on the left side of the candidate word, [n(L_1), ..., n(L_u), ..., n(L_H)], and the number of occurrences of each right adjacent character on the right side of the candidate word, [n(R_1), ..., n(R_v), ..., n(R_D)], and calculating the probability of each adjacent character using the following formulas:

p(L_u) = n(L_u) / Σ_{u=1..H} n(L_u)

p(R_v) = n(R_v) / Σ_{v=1..D} n(R_v)
step S1.7: from the probabilities p(L_u) and p(R_v) of step S1.6, the left adjacency entropy H_l(C) and the right adjacency entropy H_r(C) of candidate word C are calculated by the following formulas:

H_l(C) = − Σ_{u=1}^{H} p(L_u) · log p(L_u)
H_r(C) = − Σ_{v=1}^{D} p(R_v) · log p(R_v)
step S1.8: from the left adjacency entropy H_l(C) and the right adjacency entropy H_r(C) of step S1.7, the adjacency entropy BE_C of candidate word C is calculated by the following method:
BE_C = min( H_l(C), H_r(C) )
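Steps S1.6–S1.8 can be sketched as below; scanning occurrences with `str.find` and combining the two entropies with `min` are assumptions about details the claim leaves to its formula images:

```python
import math
from collections import Counter

def neighbor_entropy(corpus: str, word: str):
    """Left and right adjacency entropies of `word` (steps S1.6-S1.7)."""
    left, right = Counter(), Counter()
    start = corpus.find(word)
    while start != -1:
        if start > 0:
            left[corpus[start - 1]] += 1          # left-adjacent character
        end = start + len(word)
        if end < len(corpus):
            right[corpus[end]] += 1               # right-adjacent character
        start = corpus.find(word, start + 1)      # include overlapping hits

    def entropy(counts):
        total = sum(counts.values())
        return -sum((n / total) * math.log(n / total) for n in counts.values())

    return entropy(left), entropy(right)

def adjacency_entropy(corpus: str, word: str) -> float:
    """BE_C: here taken as min of the two entropies (a common choice, assumed)."""
    h_l, h_r = neighbor_entropy(corpus, word)
    return min(h_l, h_r)
```

A word that appears in varied contexts (high entropy on both sides) is more likely to be a free-standing word than a fragment.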
step S1.9: from the enhanced mutual information ENMI_C of step S1.5 and the adjacency entropy BE_C of step S1.8, the overall score of candidate word C is calculated by the following method:

score(C) = sigmoid( β_2 · ENMI_C + BE_C )
wherein β_2 is the weight of the enhanced mutual information in the overall score, and sigmoid denotes normalization, calculated as:

sigmoid(x) = 1 / (1 + e^(−x))
step S1.10: setting a candidate-word score threshold and comparing the overall score score(C) of step S1.9 with the threshold; if score(C) is greater than the threshold, the candidate word is judged to be a reasonable word; otherwise it is removed from the candidate word list, thereby obtaining the new word list.
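Steps S1.5 and S1.9–S1.10 can be sketched as follows; the additive form of ENMI_C, the weights β1 and β2, and the threshold value are assumptions, not taken from the claim:

```python
import math

def sigmoid(x: float) -> float:
    """Normalization used in step S1.9: sigmoid(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def overall_score(mi, sim_ab, be, beta1=1.0, beta2=1.0):
    """score(C) = sigmoid(beta2 * ENMI_C + BE_C),
    with ENMI_C = MI_C + beta1 * sim(A, B) assumed additive (step S1.5)."""
    enmi = mi + beta1 * sim_ab
    return sigmoid(beta2 * enmi + be)

def filter_candidates(scored, threshold=0.8):
    """Step S1.10: keep candidates whose overall score exceeds the threshold."""
    return [w for w, s in scored.items() if s > threshold]
```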
8. The method of claim 6, wherein in step S2 the automatic labeling module combines the domain new-word vocabulary obtained in step S1 with the first Chinese word segmentation module and the second Chinese word segmentation module to automatically label the unlabeled target-domain corpus, the process being as follows:
step S2.1: the first Chinese word segmentation module adopts a backward maximum matching algorithm: a maximum matching length N is set, and a string of length N is taken starting from the last character of a sentence and looked up in the new word list; if found, the string is segmented off and the current position moves N characters to the left; if not found, the matching length is reduced by 1 and matching continues; if matching still fails when the length has been reduced to 1, the current position moves one character to the left; this continues until the whole sentence has been matched, achieving a preliminary segmentation of the target corpus;
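The backward maximum matching of step S2.1 can be sketched as below; the toy lexicon stands in for the new word list, and unmatched single characters are emitted as-is for the second segmenter to handle later:

```python
def backward_max_match(sentence: str, lexicon: set, max_len: int = 4):
    """Backward (reverse) maximum matching: scan from the end of the
    sentence, trying the longest candidate first (step S2.1)."""
    result = []
    pos = len(sentence)
    while pos > 0:
        matched = False
        for length in range(min(max_len, pos), 0, -1):
            piece = sentence[pos - length:pos]
            if piece in lexicon:
                result.append(piece)
                pos -= length
                matched = True
                break
        if not matched:
            # no lexicon entry ends here: emit one character and move left
            result.append(sentence[pos - 1])
            pos -= 1
    result.reverse()  # segments were collected right-to-left
    return result
```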
step S2.2: segmenting the labeled source-domain corpus into words at the spaces and marking each character in a word with a lexeme tag according to the word's length, the text serving as input and the lexeme tags as output to construct a training data set; the lexeme tags comprise B, M, E and S, where B represents the initial character of a multi-character word, M a middle character of a multi-character word, E the final character of a multi-character word, and S a single character forming a word by itself;
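The lexeme-tag construction of step S2.2 (the same B/M/E/S scheme recurs in step S3.1) can be sketched as:

```python
def bmes_tags(words):
    """Map each character of a space-segmented sentence to a lexeme tag:
    B = begin, M = middle, E = end of a multi-character word; S = single."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags
```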
step S2.3: the second Chinese word segmentation module calculates its cost function by the following method:

loss_1 = − Σ_{s=1}^{4} y_s · log ŷ_s

wherein Y = (y_1, y_2, y_3, y_4) is the true lexeme label of a character, y_s ∈ {0, 1}, with y_s = 1 indicating that the character carries the s-th lexeme tag; Ŷ = (ŷ_1, ŷ_2, ŷ_3, ŷ_4) is the lexeme label output by the model, and ŷ_s is the model's predicted probability that the character carries the s-th lexeme tag;
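The cost function of step S2.3 (the same one-hot cross-entropy form recurs as loss_src, loss_tgt and loss_shr in step S4.2) can be sketched as:

```python
import math

def cross_entropy(y_true, y_pred):
    """loss = -sum_s y_s * log(y_hat_s) for a one-hot label over the
    four lexeme tags B, M, E, S (step S2.3)."""
    return -sum(y * math.log(p) for y, p in zip(y_true, y_pred) if y)
```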
step S2.4: training data is input into the model for training until loss_1 is less than a preset value;
step S2.5: the second Chinese word segmentation module trained in step S2.4 segments the still-unsegmented parts of the incompletely segmented corpus produced by the first Chinese word segmentation module, yielding the fully segmented, automatically labeled target-domain corpus.
9. The cross-domain Chinese word segmentation method based on new word discovery as claimed in claim 6, wherein step S3 is as follows:
step S3.1: segmenting the labeled source-domain corpus and the automatically labeled target-domain corpus into words and marking each character with a lexeme tag B, M, E or S according to the word's length, where B represents the initial character of a multi-character word, M a middle character, E the final character, and S a single character forming a word by itself; the corpus serves as input and the lexeme tags as output to construct a source-domain training set and a target-domain training set respectively;
step S3.2: inputting the text of the source-domain training set obtained in step S3.1 into the source-domain feature extraction submodule, the vector output by that submodule being the source-domain feature H_src unique to the source-domain corpus;
step S3.3: inputting the text of the target-domain training set obtained in step S3.1 into the target-domain feature extraction submodule, the vector output by that submodule being the target-domain feature H_tgt unique to the target-domain corpus;
step S3.4: sequentially inputting the texts of the source-domain training set and the target-domain training set obtained in step S3.1 into the common feature extraction submodule, the vector output by that submodule being the feature H_shr common to the source and target domains.
10. The cross-domain Chinese word segmentation method based on new word discovery as claimed in claim 6, wherein step S4 is as follows:
step S4.1: the source-domain feature H_src and the target-domain feature H_tgt are each concatenated with the common feature H_shr by the following method, obtaining the overall source-domain feature H′_src and the overall target-domain feature H′_tgt:

H′_src = [H_src ; H_shr]
H′_tgt = [H_tgt ; H_shr]
step S4.2: the cost functions loss_src, loss_tgt and loss_shr of the source-domain lexeme labeling submodule, the target-domain lexeme labeling submodule and the text classification submodule are calculated by the following method:
loss_src = − Σ_{t=1}^{4} y_t · log ŷ_t
loss_tgt = − Σ_{k=1}^{4} y′_k · log ŷ′_k
loss_shr = − Σ_{l=1}^{2} y″_l · log ŷ″_l
wherein Y = (y_1, y_2, y_3, y_4) is the true lexeme label of a source-domain character, y_t ∈ {0, 1}, with y_t = 1 indicating that the character carries the t-th lexeme tag and y_t = 0 that it does not; Ŷ = (ŷ_1, ŷ_2, ŷ_3, ŷ_4) is the lexeme label output by the model, and ŷ_t is the model's predicted probability that the character carries the t-th lexeme tag. Y′ = (y′_1, y′_2, y′_3, y′_4) is the true lexeme label of a target-domain character, y′_k ∈ {0, 1}, with y′_k = 1 indicating that the character carries the k-th lexeme tag and y′_k = 0 that it does not; Ŷ′ is the lexeme label output by the model, and ŷ′_k is the model's predicted probability that the character carries the k-th lexeme tag. Y″ = (y″_1, y″_2) is the true domain label of an input sample, y″_l ∈ {0, 1}, with y″_1 = 1 indicating that the sample comes from the source domain and y″_2 = 1 that it comes from the target domain; Ŷ″ is the domain label output by the model, and ŷ″_l is the model's predicted probability that the sample belongs to the l-th domain;
step S4.3: performing adversarial training with the source-domain lexeme labeling submodule, the target-domain lexeme labeling submodule and the text classification submodule to realize cross-domain Chinese word segmentation from the source domain to the target domain, the total cost function being:
loss = loss_src + β_3 · loss_tgt + β_4 · loss_shr
wherein β_3 and β_4 represent the weights of loss_tgt and loss_shr, respectively, in the total cost function;
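The weighted total cost of step S4.3 and the stopping test of step S4.4 can be sketched as below; the values of β3, β4 and the stopping tolerance are hypothetical:

```python
def total_loss(loss_src, loss_tgt, loss_shr, beta3=0.5, beta4=0.1):
    """loss = loss_src + beta3 * loss_tgt + beta4 * loss_shr (step S4.3)."""
    return loss_src + beta3 * loss_tgt + beta4 * loss_shr

def converged(prev_loss, curr_loss, tol=1e-4):
    """Step S4.4: stop training when the change in loss falls below a preset value."""
    return abs(prev_loss - curr_loss) < tol
```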
step S4.4: the model is trained until the variation of loss is less than a predetermined value.
CN202110463683.4A 2021-04-26 2021-04-26 Cross-domain Chinese word segmentation system and method based on new word discovery Expired - Fee Related CN113076750B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110463683.4A CN113076750B (en) 2021-04-26 2021-04-26 Cross-domain Chinese word segmentation system and method based on new word discovery

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110463683.4A CN113076750B (en) 2021-04-26 2021-04-26 Cross-domain Chinese word segmentation system and method based on new word discovery

Publications (2)

Publication Number Publication Date
CN113076750A true CN113076750A (en) 2021-07-06
CN113076750B CN113076750B (en) 2022-12-16

Family

ID=76618905

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110463683.4A Expired - Fee Related CN113076750B (en) 2021-04-26 2021-04-26 Cross-domain Chinese word segmentation system and method based on new word discovery

Country Status (1)

Country Link
CN (1) CN113076750B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114743551A (en) * 2022-03-17 2022-07-12 携程旅游信息技术(上海)有限公司 Method, system, device and medium for recognizing domain words in speech
CN115841154A (en) * 2022-11-28 2023-03-24 福建新大陆软件工程有限公司 New word mining method for telecommunication industry based on statistical learning and semantic fusion

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6879951B1 (en) * 1999-07-29 2005-04-12 Matsushita Electric Industrial Co., Ltd. Chinese word segmentation apparatus
CN108509425A (en) * 2018-04-10 2018-09-07 中国人民解放军陆军工程大学 Chinese new word discovery method based on novelty
CN110008338A (en) * 2019-03-04 2019-07-12 华南理工大学 A kind of electric business evaluation sentiment analysis method of fusion GAN and transfer learning
CN110196980A (en) * 2019-06-05 2019-09-03 北京邮电大学 A kind of field migration based on convolutional network in Chinese word segmentation task
US20200082221A1 (en) * 2018-09-06 2020-03-12 Nec Laboratories America, Inc. Domain adaptation for instance detection and segmentation
CN111507103A (en) * 2020-03-09 2020-08-07 杭州电子科技大学 Self-training neural network word segmentation model using partial label set
CN111950274A (en) * 2020-07-31 2020-11-17 中国工商银行股份有限公司 Chinese word segmentation method and device for linguistic data in professional field
US20210012199A1 (en) * 2019-07-04 2021-01-14 Zhejiang University Address information feature extraction method based on deep neural network model


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
CEN CHEN ET AL.: "Cross-Domain Review Helpfulness Prediction based on Convolutional Neural Networks with Auxiliary Domain Discriminators", PROCEEDINGS OF NAACL-HLT 2018 *
SHANG GAOHUI: "Research on and System Implementation of a Chinese New Word Discovery Algorithm Based on Mutual Information", China Masters' Theses Full-text Database, Information Science and Technology Series *
DU LIPING ET AL.: "Improving a Chinese Word Segmentation System via New Word Discovery Based on an Improved Mutual Information Algorithm", Journal of Peking University (Natural Science Edition) *
GUO XINGXING: "Research on Domain Adaptation of Chinese Word Segmentation Based on Model Transfer", China Masters' Theses Full-text Database, Information Science and Technology Series *


Also Published As

Publication number Publication date
CN113076750B (en) 2022-12-16

Similar Documents

Publication Publication Date Title
CN113011533B (en) Text classification method, apparatus, computer device and storage medium
Cao et al. A joint model for word embedding and word morphology
CN108984526B (en) A deep learning-based document topic vector extraction method
CN110362819B (en) Text emotion analysis method based on convolutional neural network
CN108268447B (en) A Labeling Method of Tibetan Named Entity
CN106886580B (en) A deep learning-based image sentiment polarity analysis method
CN113505200B (en) A method for sentence-level Chinese event detection combining key information of documents
CN116955699B (en) Video cross-mode search model training method, searching method and device
Jia Sentiment classification of microblog: A framework based on BERT and CNN with attention mechanism
CN112069312B (en) A text classification method and electronic device based on entity recognition
CN114818717B (en) Chinese named entity recognition method and system integrating vocabulary and syntax information
CN112417823B (en) A Chinese text word order adjustment and quantifier completion method and system
CN110852089B (en) Operation and maintenance project management method based on intelligent word segmentation and deep learning
CN111967267B (en) XLNET-based news text region extraction method and system
CN112861524A (en) Deep learning-based multilevel Chinese fine-grained emotion analysis method
CN114372470A (en) Chinese legal text entity identification method based on boundary detection and prompt learning
CN111159405B (en) Irony detection method based on background knowledge
Ye et al. Improving cross-domain Chinese word segmentation with word embeddings
CN113076750B (en) Cross-domain Chinese word segmentation system and method based on new word discovery
CN112528653A (en) Short text entity identification method and system
CN112287240A (en) Method and device for case microblog evaluation object extraction based on double-embedded multi-layer convolutional neural network
CN115994204A (en) A structured semantic analysis method for national defense science and technology texts suitable for few-sample scenarios
Shahade et al. Deep learning approach-based hybrid fine-tuned Smith algorithm with Adam optimiser for multilingual opinion mining
CN119807419A (en) Judgment document paragraph classification method, device, electronic device and storage medium
CN114298048A (en) Named Entity Recognition Method and Device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20221216