CN110059185B - An automatic tagging method for professional vocabulary in medical documents

Info

Publication number: CN110059185B
Application number: CN201910265223.3A
Authority: CN
Inventors: 王嫄; 高铭; 王栋; 赵婷婷; 赵青; 陈亚瑞; 史艳翠; 孔娜; 王洁
Original assignee: Tianjin University of Science and Technology
Current assignee: Beijing Contention Technology Co ltd
Priority date: 2019-04-03
Filing date: 2019-04-03
Publication date: 2022-10-04
Anticipated expiration: 2039-04-03
Also published as: CN110059185A

Abstract

The invention relates to an automatic labeling method for professional vocabulary in medical documents, comprising: preprocessing the input medical documents to obtain preprocessed medical document text; obtaining the letter-level feature vector, word-level feature vector and language feature vector of each word and fusing them into the word's encoding vector; labeling the words of the segmented medical document text by category to obtain a labeled data set; outputting a multi-dimensional vector for each word as the word's spatial representation; obtaining an augmented labeled data set; and training a model and finally outputting the labeling results. The invention is reasonably designed. It uses a semi-supervised learning algorithm to label a large amount of unlabeled data, overcomes the shortage of labeled data in the medical field, effectively increases the amount of data available to the model, and greatly improves the labeling accuracy for keywords and professional vocabulary. It can be widely used in medical literature processing.

Description

An automatic tagging method for professional vocabulary in medical documents

Technical field

The invention belongs to the technical field of machine learning, and in particular relates to an automatic labeling method for professional vocabulary in medical documents.

Background

As the medical research community grows, more and more papers are published each year. There is a growing need to find improved methods for handling these papers and to understand their key ideas automatically. However, because the domains are so varied and annotation resources are extremely limited, relatively little work has been done on extracting scientific information.

At the same time, as demand for medical resources and the number of medical documents and case records surge, researchers and medical staff need to organize patients' past medical data quickly. What helps medical staff make a quick judgment from a patient's case record is often a handful of professional terms or keywords. Sorting out these terms and keywords manually takes a great deal of time, and with limited manpower it is impossible to organize a large volume of case records and medical data quickly.

In summary, as demand for medical resources rises, how to automatically label professional vocabulary or keywords, so that medical staff can process case records and medical data faster and treat patients better, is a problem in urgent need of a solution.

Summary of the invention

The purpose of the invention is to overcome the deficiencies of the prior art by proposing an automatic labeling method for professional vocabulary in medical documents. The method uses a semi-supervised learning algorithm to expand the amount of data, overcoming the poor model performance previously caused by the shortage of labeled medical text, and ultimately improves the accuracy of identifying professional vocabulary and keywords in text.

The invention solves its technical problem by adopting the following technical solution:

An automatic labeling method for professional vocabulary in medical documents, comprising the following steps:

Step 1. Perform data preprocessing on the input medical document to obtain the preprocessed medical document text;

Step 2. Model the text with a BiLSTM to obtain the letter-level feature vector of each word;

Step 3. Model the text with word2vec to obtain the word-level feature vector of each word;

Step 4. Obtain the language feature vector of each word based on the linguistic and pragmatic characteristics of the text;

Step 5. Fuse the letter-level feature vector, word-level feature vector and language feature vector obtained in Steps 2, 3 and 4 into the encoding vector of the word;

Step 6. Label the words of the segmented medical document text as one of four types of medical entity: disease name, disease symptom, treatment method and drug name. For each entity type, IOBES indicates the position of the word within the entity. This yields the labeled data set;

Step 7. Use the text obtained in Step 1 and the word encoding vectors obtained in Step 5 as the input of a BiLSTM, and output a multi-dimensional vector for each word as the word's spatial representation;

Step 8. Use a label propagation algorithm to extend the labeled data set into an augmented labeled data set;

Step 9. Use the multi-dimensional vector from Step 7 as the word vector, input the augmented labeled data set from Step 8 into a conditional random field for training and modeling, and finally output the labeling result.

Further, Step 1 is implemented as follows: the input medical document is first segmented into words, forming an array that stores every word and punctuation mark in the text; stop words are then removed; finally, stemming and lemmatization are applied to obtain the base form of each word, producing an array of unlabeled words.

Further, Step 2 is implemented as follows: a BiLSTM encodes the letter-level features of the preprocessed medical document text, using the first five letters of each word, which yields a letter-level feature vector of length 5d.

Further, Step 3 is implemented as follows: Google's Word2Vec algorithm encodes the word-level features of the preprocessed medical document text, which yields a word-level feature vector of length d for each word.

Further, Step 4 is implemented as follows: based on the linguistic and pragmatic characteristics of the text, the following features are defined by hand for the preprocessed medical document text: capitalization of the first letter, word all lowercase, word all uppercase, part of speech and grammatical structure. Together they form a feature vector of length 21 in which each feature is 0 or 1.

Further, Step 5 is implemented as follows: the letter-level feature vector, word-level feature vector and language feature vector are concatenated to form a combined feature vector of length 5d+d+21 for each word.

Further, the labeled data set of Step 6 uses combined labels covering 20 categories.

Further, Step 7 is implemented as follows: the combined feature vectors formed from the three kinds of features in Step 5 are stacked for the whole word array into a training data matrix whose number of rows is the number of words in the array and whose number of columns is 5d+d+21. A BiLSTM is used, and the hidden layers of its forward and backward passes are passed as input to a linear layer, which projects them to the size of the label space, 20; the result serves as the input to the CRF layer.

Further, Step 8 is implemented as follows: first, a graph is built from the feature vectors of the words, which serve as the nodes of the graph; the similarity between feature vectors defines the distance between nodes and the edge weight $w_{uv}$, and the total number of nodes equals the number of unlabeled items plus the number of labeled items. Then the label propagation algorithm optimizes an objective function that minimizes Kullback-Leibler distances, so that the label distributions of adjacent nodes are as similar to each other as possible. In the end, the words corresponding to all nodes in the graph receive labels, giving the augmented data set.

Further, Step 9 is implemented as follows: the multi-dimensional spatial representation of each word obtained in Step 7 is used as the word vector. The BiLSTM ultimately outputs a labeling matrix P containing the probability distribution over the labels. P is fed into the CRF layer to obtain a label sequence y. The score $\phi(y; x, \theta)$ of y is computed, followed by the probability $P_\theta(y \mid x)$ of y among all label sequences. Finally, backpropagation is used to maximize the log-probability objective function, completing the supervised learning, and the resulting CRF model is output as the final result.

The advantages and positive effects of the invention are:

1. The invention divides the keywords in medical literature into four categories: disease name (disease), symptom (symptom), treatment method (treatment-method) and drug name (Drug-name). Using a semi-supervised labeling method, it labels the professional vocabulary in medical documents or case records. With very little expenditure of manpower and material resources, this helps medical staff and scholars grasp the content of a text quickly and make better medical decisions or carry out better research.

2. The invention uses a semi-supervised learning algorithm to label a large amount of unlabeled data. This overcomes the shortage of labeled data in the medical field, effectively increases the amount of data available to the model, and greatly improves the labeling accuracy for keywords and professional vocabulary. The method can be widely used in medical literature processing.

Description of the drawings

FIG. 1 is a processing flow chart of the invention.

Detailed description

The embodiments of the invention are described in further detail below with reference to the accompanying drawings.

The design idea of the invention is to use machine learning algorithms and techniques, together with a semi-supervised labeling method, to label professional vocabulary in medical documents or case records. The invention builds a three-layer hierarchical neural network to tag the text: (1) the words in the text are vectorized by three kinds of feature extraction: a BiLSTM extracts letter-based features, Word2Vec produces word embeddings, and further features are extracted from grammatical structure; (2) a BiLSTM extracts and encodes the context surrounding each word within the same sentence; (3) a CRF tagging layer uses the CRF objective function to model the words and their tags jointly and makes the final tag decision.

Based on the above design idea, the automatic labeling method for professional vocabulary in medical documents, as shown in FIG. 1, comprises the following steps:

Step 1: Perform data preprocessing on the input medical document to obtain the preprocessed medical document text.

In this step, the input is a medical document and the output is an array of words. The preprocessing works as follows. The medical document is first segmented into words, forming an array that stores every word and punctuation mark in the text. Stop words such as "is", "but", "shall" and "by" are then removed. Finally, stemming and lemmatization are applied to obtain the base form of each word. For example, stemming "running", "ran" and "runs" yields "run"; lemmatization is broadly similar and restores any inflected form to its general form. Preprocessing thus produces an unlabeled array of words in their general form.
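
The preprocessing of Step 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the stop-word set contains only the four examples given above, and `base_form` is a hypothetical toy stand-in for a real stemmer or lemmatizer, handling just enough cases to reproduce the "running", "ran", "runs" example.

```python
import re

# Assumption: only the stop words named in the text; a real list would be longer.
STOP_WORDS = {"is", "but", "shall", "by"}
# Assumption: a tiny table of irregular forms standing in for lemmatization.
IRREGULAR = {"ran": "run"}


def base_form(word: str) -> str:
    """Toy stand-in for stemming and lemmatization (hypothetical helper)."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    if w.endswith("ing") and len(w) > 5:
        w = w[:-3]
        # Undo consonant doubling, e.g. "runn" -> "run".
        if len(w) > 2 and w[-1] == w[-2]:
            w = w[:-1]
        return w
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]
    return w


def preprocess(text: str) -> list[str]:
    """Segment into words and punctuation, drop stop words, reduce to base forms."""
    tokens = re.findall(r"\w+|[^\w\s]", text)
    kept = [t for t in tokens if t.lower() not in STOP_WORDS]
    return [base_form(t) for t in kept]


words = preprocess("The patient is running, but he ran.")
```

Punctuation marks are kept as separate array entries, as the step describes, while "is" and "but" are dropped.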

Step 2: Model the text with a BiLSTM to obtain the letter-level feature vector of each word.

The input of this step is the word array produced by data preprocessing, and the output is a feature vector based on letter features, of length 5d.

The invention uses a BiLSTM to encode letter features; this is called character-based embedding. The letter-level features of a word are generated from the hidden-layer vectors of the BiLSTM's forward and backward passes. The advantage of a character-based embedding layer is that it can handle out-of-vocabulary words and formulas, which are common in this kind of data. The length of the generated feature vector is set to d. The invention, however, extracts features from the leading 5-gram: the first five letters of the word, read from left to right, are encoded, and a word with fewer than five letters is zero-padded to the full length. The final feature vector therefore has length 5d.
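
The leading 5-gram and its zero padding can be illustrated with a short sketch. The BiLSTM itself is not implemented here: `toy_letter_vector` is a hypothetical deterministic stand-in for the d-dimensional vector the BiLSTM would produce per letter, and `PAD` is an assumed padding symbol. The sketch only shows how five letter slots of size d give a vector of length 5d.

```python
PAD = "\0"  # assumed padding symbol for words shorter than five letters


def leading_five(word: str) -> list[str]:
    """Return the first five letters of `word`, right-padded to length five."""
    letters = list(word[:5])
    return letters + [PAD] * (5 - len(letters))


def toy_letter_vector(letter: str, d: int) -> list[float]:
    """Stand-in for a BiLSTM output: a fixed d-dimensional vector, zeros for padding."""
    if letter == PAD:
        return [0.0] * d
    return [((ord(letter) * (k + 1)) % 97) / 97.0 for k in range(d)]


def letter_level_features(word: str, d: int) -> list[float]:
    """Concatenate the five per-letter vectors into one vector of length 5d."""
    vector: list[float] = []
    for letter in leading_five(word):
        vector.extend(toy_letter_vector(letter, d))
    return vector
```

For a three-letter word such as "flu" and d = 4, the last 8 of the 20 entries are zeros, which corresponds to the zero padding described above.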

Step 3: Model the text with word2vec to obtain the word-level feature vector of each word.

In this step, words from a fixed vocabulary (plus an unknown-word token) are mapped into a vector space, initialized from Word2vec pre-training on different combinations of corpora. Words are encoded with Google's Word2Vec algorithm; this is called word embedding. The resulting feature vector for each word has length d.
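
The fixed vocabulary with an unknown-word token can be sketched as a plain lookup. The token name `UNK` and the toy embedding table are assumptions made for illustration; in the method described, the rows would come from Word2Vec pre-training rather than being written by hand.

```python
UNK = "<unk>"  # hypothetical name for the unknown-word token


def build_vocab(words: list[str]) -> dict[str, int]:
    """Assign an index to each distinct word; index 0 is reserved for UNK."""
    vocab = {UNK: 0}
    for word in words:
        vocab.setdefault(word, len(vocab))
    return vocab


def word_level_features(word: str, vocab: dict[str, int],
                        table: list[list[float]]) -> list[float]:
    """Look up the d-dimensional row for `word`, falling back to the UNK row."""
    return table[vocab.get(word, vocab[UNK])]


vocab = build_vocab(["fever", "aspirin", "fever"])
# Toy table with d = 3; real rows would come from Word2Vec pre-training.
table = [[0.0, 0.0, 0.0], [0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
```

Any word outside the fixed vocabulary maps to the single UNK row, so every word still receives a vector of length d.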

Step 4: Design the language feature vector of each word based on the linguistic and pragmatic characteristics of the text.

In this step, the input is the word array obtained from the original text by word segmentation only, and the output is a feature vector of length 21 designed from linguistic features. These features are not trained separately; they are defined by hand, and this is called feature embedding. The features defined here include whether the first letter is capitalized, whether the word is all lowercase, whether the word is all uppercase, part of speech, grammatical structure and so on, 21 features in total. The resulting feature vector has length 21, and each feature is 0 or 1, indicating whether the word has the corresponding property.
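
A sketch of the hand-defined binary features follows. Only the three capitalization features are named explicitly in the text, so only those are computed; the slot order is an assumption, and the remaining 18 slots (part of speech, grammatical structure and so on) are left at 0 as placeholders because their exact definitions are not given.

```python
NUM_FEATURES = 21  # length stated in the text


def language_features(word: str) -> list[int]:
    """Return a 0/1 vector of length 21; only slots 0-2 are filled in this sketch."""
    features = [0] * NUM_FEATURES
    features[0] = int(word[:1].isupper())  # first letter is capitalized
    features[1] = int(word.islower())      # word is all lowercase
    features[2] = int(word.isupper())      # word is all uppercase
    # Slots 3..20: part-of-speech and grammatical-structure indicators
    # (not enumerated in the text, so left as zeros here).
    return features
```

Note that these features depend on the original casing, which is why this step takes the segmented original text rather than the lowercased or stemmed word array.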

Step 5: Fuse the letter-level, word-level and language features of each word obtained in Steps 2, 3 and 4 into the word's encoding vector.

The inputs of this step are the letter-level feature vector, the word-level feature vector and the language feature vector of each word. The three vectors are concatenated to form a combined feature vector of length 5d+d+21 for each word.
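
The fusion is a plain concatenation, which the following sketch makes explicit together with the length bookkeeping. The function name `fuse` and the choice d = 50 in the example are illustrative assumptions.

```python
def fuse(letter_vec: list[float], word_vec: list[float],
         language_vec: list[int]) -> list[float]:
    """Concatenate the three feature vectors into one of length 5d + d + 21."""
    d = len(word_vec)
    if len(letter_vec) != 5 * d:
        raise ValueError("letter-level vector must have length 5d")
    if len(language_vec) != 21:
        raise ValueError("language feature vector must have length 21")
    return list(letter_vec) + list(word_vec) + list(language_vec)


# Example with d = 50: 250 + 50 + 21 = 321 entries per word.
encoding = fuse([0.0] * 250, [0.0] * 50, [0] * 21)
```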

Step 6: Label the data. The words of the segmented electronic medical record text are labeled with four types of medical entity (disease, symptom, treatment method, drug name). For each entity type, IOBES indicates the position of the word within the entity, giving 20 label categories in total and yielding the labeled data set.

To distinguish the spans of two consecutive key phrases of the same type, the invention assigns every word in a sentence a label that specifies its position in the phrase and the phrase's type. Working from the preprocessed data, each word is given a label indicating its position in its phrase and the corresponding category. Position is marked first. The invention uniformly uses IOBES (Inside, Outside, Beginning, End and Singleton) to describe the position of a word in a professional phrase or term: I means the word is inside the phrase, B that it begins the phrase, E that it ends the phrase, S that the word is a professional term on its own, and O that the word lies outside any such phrase in the sentence. The position tag is then combined with the category the invention assigns to the term or phrase, namely disease name (disease), symptom (symptom), treatment method (treatment-method) or drug name (Drug-name), to form a complete label. For example, "critically ill patients" is labeled "B-symptom I-symptom E-symptom". The combined labels formed this way fall into 20 categories in total. Because the training set is very large, the invention labels only part of it, so the data set consists of labeled data and unlabeled data.
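
The labeling scheme can be sketched as follows. The count of 20 is reproduced here under the assumption that it comes from crossing the five IOBES position tags with the four entity types; the text does not spell out how the Outside tag is counted. The helper `tag_span` is hypothetical and shows how a phrase of a given length receives its position tags.

```python
ENTITY_TYPES = ["disease", "symptom", "treatment-method", "Drug-name"]
POSITIONS = ["I", "O", "B", "E", "S"]

# Assumption: 20 combined labels = 5 position tags x 4 entity types.
LABELS = [f"{p}-{t}" for t in ENTITY_TYPES for p in POSITIONS]


def tag_span(num_tokens: int, entity_type: str) -> list[str]:
    """Return the IOBES labels for one entity phrase of `num_tokens` words."""
    if num_tokens < 1:
        raise ValueError("an entity phrase needs at least one word")
    if num_tokens == 1:
        return [f"S-{entity_type}"]
    middle = [f"I-{entity_type}"] * (num_tokens - 2)
    return [f"B-{entity_type}"] + middle + [f"E-{entity_type}"]


# "critically ill patients" as a three-word symptom phrase.
example = tag_span(3, "symptom")
```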

Step 7: Use the preprocessed text from Step 1 and the word encoding vectors from Step 5 as the input of a BiLSTM, set the output size to 20, and output a 20-dimensional vector for each word as the word's spatial representation.

In this step, the combined feature vectors formed from the three kinds of features in Step 5 are used. The feature vectors of the whole word array are stacked into a training data matrix whose number of rows is the number of words in the array and whose number of columns is 5d+d+21. The invention uses a BiLSTM and passes the hidden layers of its forward and backward passes as input to a linear layer, which projects them to the size of the label space, 20; the result serves as the input to the CRF layer.
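
The projection onto the 20-dimensional label space can be sketched as below. The BiLSTM is not implemented: its forward and backward hidden states are taken as given inputs, and the hidden size, weights and bias in the example are arbitrary assumptions. The sketch only shows that concatenating the two hidden states of each word and applying one linear layer gives one row of 20 scores per word.

```python
def linear(hidden: list[float], weight: list[list[float]],
           bias: list[float]) -> list[float]:
    """Apply y = W h + b, where W has one row per output label."""
    return [sum(w * x for w, x in zip(row, hidden)) + b
            for row, b in zip(weight, bias)]


def project_sentence(forward_states: list[list[float]],
                     backward_states: list[list[float]],
                     weight: list[list[float]],
                     bias: list[float]) -> list[list[float]]:
    """Concatenate forward and backward states per word and project to label scores."""
    return [linear(f + b, weight, bias)
            for f, b in zip(forward_states, backward_states)]


# Three words, hidden size 2 per direction, 20 output labels.
forward_states = [[1.0, 2.0]] * 3
backward_states = [[3.0, 4.0]] * 3
weight = [[0.5] * 4 for _ in range(20)]
bias = [0.0] * 20
scores = project_sentence(forward_states, backward_states, weight, bias)
```

The resulting matrix has one row per word and 20 columns, which is the shape the CRF layer expects.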

Step 8: Augment the labeled data set obtained in Step 6, using a label propagation algorithm to extend it into an augmented labeled data set.

This step has two parts. The first part builds a graph from the feature vectors of the words, which serve as the nodes of the graph; the similarity between feature vectors defines the distance between nodes and the edge weight $w_{uv}$. The total number of nodes equals the number of unlabeled items plus the number of labeled items. The second part applies the label propagation algorithm, which optimizes an objective function that minimizes Kullback-Leibler distances so that the label distributions of adjacent nodes are as similar to each other as possible. In the end, the words corresponding to all nodes in the graph receive labels, giving the augmented data set. The details are as follows.

The first part builds the graph required by the label propagation algorithm. The vertices of the graph correspond to the feature vectors of words, and the edges carry the distances between word features, capturing semantic similarity. The total size of the graph equals the sum of the amount of labeled data $V_l$ and the amount of unlabeled data $V_u$. Each node is modeled with a set of pre-trained word embeddings (of dimension d): the 5-gram formed by the first five letters of the current word, the word embedding closest to the verb, and a set of part-of-speech tags and capitalization features (one-hot vectors of 43 and 4 dimensions). The resulting feature vector of length 5d+d+43+4 is then projected to 100 dimensions with the PCA dimensionality-reduction algorithm. The invention defines the weight $w_{uv}$ of the edge between nodes u and v as $w_{uv} = d_e(u, v)$ if $v \in \kappa(u)$ or $u \in \kappa(v)$, where $\kappa(u)$ is the set of k-nearest neighbors of u and $d_e(u, v)$ is the Euclidean distance between any two nodes u and v in the graph.
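
The graph construction can be sketched as follows, using the edge rule stated above: nodes u and v are connected when either is among the other's k nearest neighbors, and the edge weight is their Euclidean distance. The PCA projection is omitted, and the vectors in the example are two-dimensional toy data chosen for illustration.

```python
import math


def euclidean(a: list[float], b: list[float]) -> float:
    """Euclidean distance d_e between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn(vectors: list[list[float]], u: int, k: int) -> set[int]:
    """Indices of the k nearest neighbors of node u (excluding u itself)."""
    others = [v for v in range(len(vectors)) if v != u]
    others.sort(key=lambda v: euclidean(vectors[u], vectors[v]))
    return set(others[:k])


def build_graph(vectors: list[list[float]], k: int) -> dict[tuple[int, int], float]:
    """Return edge weights w_uv for pairs where v is in kNN(u) or u is in kNN(v)."""
    neighbors = [knn(vectors, u, k) for u in range(len(vectors))]
    weights = {}
    for u in range(len(vectors)):
        for v in range(u + 1, len(vectors)):
            if v in neighbors[u] or u in neighbors[v]:
                weights[(u, v)] = euclidean(vectors[u], vectors[v])
    return weights


# Two tight pairs of points; with k = 1 each pair forms one edge.
graph = build_graph([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0], [5.0, 1.0]], k=1)
```

This brute-force neighbor search is quadratic in the number of nodes; a large corpus would need an approximate nearest-neighbor index, which is outside the scope of this sketch.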

For each node i in the graph, the invention computes the marginal probabilities $\{q_i\}$ with the forward and backward passes. Let $\theta_i$ denote the estimate of the CRF parameters after the n-th iteration. The invention computes the marginal probability of the IOBES tag at each token position i of sentence j, in both the labeled and the unlabeled data.

The second part uses the label propagation algorithm to augment the data by labeling the unlabeled items in the data set. The algorithm optimizes an objective function that minimizes Kullback-Leibler distances so that the label distributions of adjacent nodes are as similar to each other as possible. The distances minimized are: (i) for the word nodes in the graph, between the labeled-data distribution $r_u$ and the predicted label distribution $q_u$; (ii) for every node u and each of its neighbors v, between the distributions $q_u$ and $q_v$; (iii) for every node, between $q_u$ and the CRF marginal probabilities. If a node is not connected to any labeled vertex, the third term regularizes the predicted distribution toward the CRF prediction, ensuring that the algorithm performs at least as well as standard self-training.
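
The three terms of the objective can be made concrete with a sketch that evaluates them for given distributions. The exact formula is not reproduced in the text, so the form below is an assumption: the trade-off weights `mu` and `nu` are hypothetical, and the neighbor term uses a symmetrized KL distance because the edges here are undirected. The sketch evaluates the objective only; it does not perform the optimization.

```python
import math


def kl(p: list[float], q: list[float], eps: float = 1e-12) -> float:
    """Kullback-Leibler distance KL(p || q), with eps guarding against zeros in q."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def propagation_objective(r: dict[int, list[float]],
                          q: list[list[float]],
                          q_crf: list[list[float]],
                          weights: dict[tuple[int, int], float],
                          mu: float = 1.0, nu: float = 1.0) -> float:
    """Sum of the three KL terms: labeled fit, neighbor smoothness, CRF prior."""
    # (i) labeled nodes: reference distribution r_u against prediction q_u
    labeled_fit = sum(kl(r_u, q[u]) for u, r_u in r.items())
    # (ii) neighboring nodes: q_u against q_v, weighted by w_uv
    smoothness = sum(w * (kl(q[u], q[v]) + kl(q[v], q[u]))
                     for (u, v), w in weights.items())
    # (iii) every node: q_u against the CRF marginal for that node
    crf_prior = sum(kl(q_u, c_u) for q_u, c_u in zip(q, q_crf))
    return labeled_fit + mu * smoothness + nu * crf_prior


same = [[0.5, 0.5], [0.5, 0.5]]
at_optimum = propagation_objective({0: [0.5, 0.5]}, same, same, {(0, 1): 1.0})
shifted = propagation_objective({0: [0.5, 0.5]}, [[0.9, 0.1], [0.5, 0.5]],
                                same, {(0, 1): 1.0})
```

When every distribution agrees, all three terms vanish; moving one node's prediction away from its label, its neighbor and its CRF marginal increases the value, which is the behavior the minimization relies on.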

Step 9: Use the 20-dimensional spatial representation of each word from Step 7 as the word vector, and input the augmented labeled data set from Step 8 into a conditional random field for training and modeling.

With the 20-dimensional spatial representation from Step 7 as the word vector, the BiLSTM ultimately outputs a labeling matrix P, which essentially already contains the probability distribution over the labels. P is fed into the CRF layer, which produces a label sequence y. The score $\phi(y; x, \theta)$ of sequence y is computed, followed by the probability $P_\theta(y \mid x)$ of y among all label sequences. Finally, backpropagation is used to maximize the objective function, completing the supervised learning, and the resulting CRF model is output as the final result. The details are as follows.

Keyword classification is a task with strong dependencies between output labels; for example, I-disease cannot follow B-treatment-method. The invention therefore does not make an independent tagging decision for each output but models the outputs jointly with a conditional random field. For an input sentence $x = (x_1, x_2, x_3, \ldots, x_n)$, let P be the score matrix output by the BiLSTM network. P has size $n \times m$, where n is the number of tokens in the sentence and m is the number of distinct labels. $P_{t,i}$ is the score of the i-th label for the t-th word of the sentence. The invention uses a first-order Markov model and defines a transition matrix T, where $T_{i,j}$ is the score of the transition from label i to label j. The invention also adds $y_0$ and $y_n$ as start and end identifiers, so T becomes a square matrix of dimension m+2.

Given a candidate output y and the neural network parameters θ, the invention defines a score $\phi(y; x, \theta)$ for the sequence.

The probability of sequence y is obtained by applying a softmax over all possible label sequences:

$$P_\theta(y \mid x) = \frac{\exp(\phi(y; x, \theta))}{\sum_{y' \in Y} \exp(\phi(y'; x, \theta))}$$

where Y denotes the set of all possible label sequences. The normalization term can be computed efficiently with the forward algorithm.
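
The softmax over label sequences and the forward algorithm can be sketched as follows. The score formula itself is not reproduced in the text; the sketch assumes the form commonly used in BiLSTM-CRF taggers, in which the score of a path is the sum of its transition scores from T (including the start and end identifiers) and its emission scores from P. The index convention (start = m, end = m+1) is also an assumption.

```python
import itertools
import math


def logsumexp(values: list[float]) -> float:
    """Numerically stable log(sum(exp(v)))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def sequence_score(P, T, y) -> float:
    """Assumed score: transitions along start -> y -> end plus emissions P[t][y_t]."""
    m = len(P[0])
    path = [m] + list(y) + [m + 1]  # m = start identifier, m + 1 = end identifier
    transition = sum(T[a][b] for a, b in zip(path, path[1:]))
    emission = sum(P[t][label] for t, label in enumerate(y))
    return transition + emission


def log_partition(P, T) -> float:
    """Forward algorithm: log of the sum of exp(score) over all label sequences."""
    m = len(P[0])
    alpha = [T[m][j] + P[0][j] for j in range(m)]
    for t in range(1, len(P)):
        alpha = [logsumexp([alpha[i] + T[i][j] for i in range(m)]) + P[t][j]
                 for j in range(m)]
    return logsumexp([alpha[j] + T[j][m + 1] for j in range(m)])


def sequence_log_prob(P, T, y) -> float:
    """log P(y | x) = score(y) - log partition."""
    return sequence_score(P, T, y) - log_partition(P, T)


# Sanity check on toy scores: probabilities of all m**n sequences sum to 1.
P = [[0.5, 1.0, -0.3], [0.2, -1.0, 0.7]]
T = [[0.1 * (i - j) for j in range(5)] for i in range(5)]
total = sum(math.exp(sequence_log_prob(P, T, y))
            for y in itertools.product(range(3), repeat=2))
```

The forward recursion costs on the order of n·m² operations, whereas enumerating all sequences as in the sanity check costs mⁿ, which is why the forward algorithm is used for the normalization term.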

Finally, the labeled data in the data set are used for initial training. During training, the invention maximizes the log-probability $L(Y; X, \theta)$ of the correct label sequences for a given corpus $\{X, Y\}$. Backpropagation is performed using gradients computed from the total score of the sentence.

Once the CRF has been trained, it is combined with the feature extraction components built earlier, and text can then be labeled: given an input sentence $x = (x_1, x_2, x_3, \ldots, x_n)$, the model outputs a label sequence $y = (y_1, y_2, y_3, \ldots, y_n)$.
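
The text does not say how the output label sequence is selected at labeling time. A common choice for a first-order CRF is Viterbi decoding, which is assumed in the sketch below. It uses the same assumed conventions as a standard BiLSTM-CRF: P holds per-word label scores, T holds transition scores, and indices m and m+1 of T stand for the start and end identifiers.

```python
def viterbi(P: list[list[float]], T: list[list[float]]) -> list[int]:
    """Return the highest-scoring label sequence for emissions P and transitions T."""
    m = len(P[0])
    start, end = m, m + 1
    best = [T[start][j] + P[0][j] for j in range(m)]
    back = []  # back[t - 1][j] = best previous label when word t has label j
    for t in range(1, len(P)):
        pointers, scores = [], []
        for j in range(m):
            i_best = max(range(m), key=lambda i: best[i] + T[i][j])
            pointers.append(i_best)
            scores.append(best[i_best] + T[i_best][j] + P[t][j])
        back.append(pointers)
        best = scores
    last = max(range(m), key=lambda j: best[j] + T[j][end])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# With no transition preferences, each word takes its best-scoring label.
no_transitions = [[0.0] * 4 for _ in range(4)]
decoded = viterbi([[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]], no_transitions)

# A strongly penalized transition 0 -> 1 changes the decision for the second word.
penalized = [[0.0] * 4 for _ in range(4)]
penalized[0][1] = -10.0
decoded_penalized = viterbi([[1.0, 0.0], [0.0, 0.5]], penalized)
```

The second example shows the joint decision the CRF is meant to provide: the second word would prefer label 1 on its own, but the transition penalty makes the sequence (0, 0) score higher overall.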

The meanings of the parameters in the above formulas are explained as follows:

Matters not described in the invention are covered by the prior art.

It should be emphasized that the embodiments described are illustrative rather than restrictive. The invention therefore includes, but is not limited to, the embodiments described in the detailed description; any other embodiment that a person skilled in the art derives from the technical solution of the invention likewise falls within the scope of protection of the invention.

Claims

1. An automatic labeling method for professional vocabulary in medical documents, characterized by comprising the following steps:

Step 1. Perform data preprocessing on the input medical document to obtain the preprocessed medical document text;

Step 2. Model the text with a BiLSTM to obtain the letter-level feature vector of each word;

Step 3. Model the text with word2vec to obtain the word-level feature vector of each word;

Step 4. Obtain the language feature vector of each word based on the linguistic and pragmatic characteristics of the text;

Step 5. Fuse the letter-level feature vector, word-level feature vector and language feature vector obtained in Steps 2, 3 and 4 into the encoding vector of the word;

Step 6. Label the words of the segmented medical document text as one of four types of medical entity: disease name, disease symptom, treatment method and drug name. For each entity type, IOBES indicates the position of the word within the entity. This yields the labeled data set;

Step 7. Use the text obtained in Step 1 and the word encoding vectors obtained in Step 5 as the input of a BiLSTM, and output a multi-dimensional vector for each word as the word's spatial representation;

Step 8. Use a label propagation algorithm to extend the labeled data set into an augmented labeled data set;

Step 9. Input the multi-dimensional vector obtained in Step 7 and the augmented labeled data set obtained in Step 8 into a conditional random field for training and modeling, and finally output the labeling result.

2. The automatic labeling method for professional vocabulary in medical documents according to claim 1, characterized in that Step 1 is implemented as follows: the input medical document is first segmented into words, forming an array that stores every word and punctuation mark in the text; stop words are then removed; finally, stemming and lemmatization are applied to obtain the base form of each word, producing an array of unlabeled words.

3. The automatic labeling method for professional vocabulary in medical documents according to claim 1, characterized in that Step 2 is implemented as follows: a BiLSTM encodes the letter-level features of the preprocessed medical document text, using the first five letters of each word, which yields a letter-level feature vector of length 5d.

4. The automatic labeling method for professional vocabulary in medical documents according to claim 1, characterized in that Step 3 is implemented as follows: Google's Word2Vec algorithm encodes the word-level features of the preprocessed medical document text, which yields a word-level feature vector of length d for each word.

5. The automatic labeling method for professional vocabulary in medical documents according to claim 1, characterized in that Step 4 is implemented as follows: based on the linguistic and pragmatic characteristics of the text, the following features are defined by hand for the preprocessed medical document text: capitalization of the first letter, word all lowercase, word all uppercase, part of speech and grammatical structure, forming a feature vector of length 21 in which each feature is 0 or 1.

6. The automatic labeling method for professional vocabulary in medical documents according to claim 1, characterized in that Step 5 is implemented as follows: the letter-level feature vector, word-level feature vector and language feature vector are concatenated to form a combined feature vector of length 5d+d+21 for each word.

7. The automatic labeling method for professional vocabulary in medical documents according to claim 1, characterized in that the labeled data set of Step 6 uses combined labels covering 20 categories.

8. The automatic labeling method for professional vocabulary in medical documents according to claim 1, characterized in that Step 7 is implemented as follows: the combined feature vectors formed from the three kinds of features obtained in Step 5 are stacked for the whole word array into a training data matrix whose number of rows is the number of words in the array and whose number of columns is 5d+d+21; a BiLSTM is used, and the hidden layers of its forward and backward passes are passed as input to a linear layer, which projects them to the size of the label space, 20, and whose output serves as the input to the CRF layer.

9. The automatic labeling method for professional vocabulary in medical documents according to claim 1, characterized in that Step 8 is implemented as follows: first, a graph is built from the feature vectors of the words, which serve as the nodes of the graph; the similarity between feature vectors defines the distance between nodes and the edge weight $w_{uv}$, and the total number of nodes in the graph equals the number of unlabeled items plus the number of labeled items; then the label propagation algorithm optimizes an objective function that minimizes Kullback-Leibler distances, so that the label distributions of adjacent nodes are as similar to each other as possible; in the end, the words corresponding to all nodes in the graph receive labels, giving the augmented data set.