CN111078875B

CN111078875B - Method for extracting question-answer pairs from semi-structured document based on machine learning

Info

Publication number: CN111078875B
Application number: CN201911222877.4A
Authority: CN
Inventors: 黄少滨; 颜伟; 申林山; 李熔盛; 李轶; 余日昌; 张柏嘉; 何荣博
Original assignee: Harbin Engineering University
Current assignee: Harbin Engineering University
Priority date: 2019-12-03
Filing date: 2019-12-03
Publication date: 2022-12-13
Anticipated expiration: 2039-12-03
Also published as: CN111078875A

Abstract

The invention belongs to the technical field of natural language processing and relates to a machine-learning-based method for extracting question-answer pairs from semi-structured documents. Answer sentences are obtained from the semi-structured text by using Apriori for feature selection and a naive Bayes classifier for classification. Named entity recognition and dependency parsing are then combined to convert each answer sentence into a corresponding question. Named entity recognition uses a BiLSTM+CRF neural network model to recognize the entities in the answer sentences, which supplement the entities crawled from the web. Dependency parsing reveals the dependency relations among the words of a sentence, so that during question generation the words that depend on an entity can be replaced and a well-formed question obtained. By extracting high-quality question-answer pairs from semi-structured documents, the invention lays a foundation for building question answering systems.

Description

A machine learning-based method for extracting question-answer pairs from semi-structured documents

Technical field

The invention belongs to the technical field of natural language processing and relates to a machine-learning-based method for extracting question-answer pairs from semi-structured documents.

Background

The question-answer pairs of most current restricted-domain question answering systems come from interactive knowledge sharing platforms. The data sources used to build the knowledge base include Baidu Encyclopedia, question-and-answer communities, and domain websites. In a question answering system for cardiology consultation, questions covered by the knowledge base can usually be answered accurately, whereas questions about knowledge outside the knowledge base are usually hard to answer. The quality of the knowledge base therefore directly determines the performance of the question answering system. In medical question answering systems, the knowledge in the knowledge base bears on patient safety. Knowledge on interactive knowledge sharing platforms, however, mostly comes from answers by ordinary internet users and carries no guarantee of authority. Because the accuracy of question-answer pairs from online communities cannot be guaranteed, building the knowledge base of a medical question answering system requires more authoritative question-answer pairs.

Obtaining high-quality question-answer pairs is one of the basic tasks in building a medical question answering system. Current methods that use question-and-answer communities as the data source cannot guarantee the quality of the pairs. To obtain high-quality question-answer pairs, the invention proposes a method for extracting them from cardiology electronic medical records. Electronic medical records are filled in by professional doctors, so the accuracy of their content is assured and they can serve as a reliable source of question-answer pairs. The progress notes, doctor's orders, and diagnosis and treatment plans in cardiology electronic medical records contain the doctor's description of the disease, the diagnosis, and information about drugs and how to use them. Question-answer pairs extracted from these records will be highly accurate and can greatly improve the performance of a cardiology question answering system built on them.

As hospital informatization in China continues to improve, a large amount of clinical data has accumulated, and how to use it effectively has become one of the focal points of data science. Machine learning methods can be used to extract the required data from unstructured electronic medical records and reorganize it into structured text. Examples include the extraction of medical events and temporal information. Alternatively, the information to be extracted can be annotated, extraction templates induced from the annotations, extraction rules generated by rewriting the templates, and the rules then applied to extract information from unstructured electronic medical records and organize it into a form suitable for analysis. Context-aware methods have also been used to process the electronic medical records of diabetic patients and extract information on risk factors for heart disease.

Question generation (QG) aims to create natural questions from a given sentence or paragraph. The success of rule-based methods depends on whether well-designed rules exist for transforming declarative sentences into interrogative ones. Purely rule-based methods tend to rely on deep linguistic knowledge. Besides various NLP techniques, including term extraction and shallow parsing, they also draw on linguistic resources such as corpora and ontologies. Improved rule-based systems follow an overgenerate-and-rank approach: a rule-based component generates multiple questions for an input sentence, and a ranker trained by supervised learning then orders them. Deep learning methods borrow the encoder-decoder neural network model from machine translation. They do not depend on hand-crafted rules or complex NLP pipelines and are trained end to end by sequence-to-sequence learning to generate the interrogative form of a declarative sentence.

Summary of the invention

The purpose of the present invention is to provide a machine-learning-based method for extracting question-answer pairs from semi-structured documents.

This purpose is achieved by a technical solution comprising the following steps:

Step 1: Input the collection of semi-structured documents to be processed, and convert the semi-structured documents from PDF format into txt documents;

Step 2: Use regular expressions to extract the declarative sentences from the txt text; sort the resulting set of declarative sentences by document and shuffle the declarative sentences within each document; randomly sample some sentences from each document, judge whether each one is an answer sentence, label the answer sentences, and construct the training set;

Step 3: Apply the Apriori algorithm to the training set to extract frequent itemsets as features, and expand the features through association rules to obtain the feature set;

Step 4: Using the feature set, represent each sentence as a feature vector, and feed the feature vectors of the training set into a naive Bayes classification model for training to obtain the trained classification model;

Step 5: Use the trained classification model to classify the unlabeled sentences and extract the answer sentences, giving the answer sentence set;

Step 6: Crawl some domain entities with a web crawler; annotate some of the sentences in the answer sentence set character by character with BIO tags to build a training set for named entity recognition;

Step 7: Build a BiLSTM+CRF neural network model and train it on the training set; use the trained model to perform sequence labeling on the unlabeled sentences in the answer sentence set, and convert the labels into named entities; combine the domain entities crawled from the web with the recognized named entities to obtain the entity set;

Step 8: Perform word segmentation, part-of-speech tagging, and dependency parsing on the answer sentence set, and analyze the dependency relations between the words of each sentence; replace the entities in the sentence and the words that depend on them to obtain the question corresponding to the answer sentence, and output the question-answer pairs, completing the extraction.

The present invention may further include the following.

In Step 3, the feature set is obtained by applying the Apriori algorithm to the training set to extract frequent itemsets as features and expanding the features through association rules, specifically as follows:

The sentence set S contains all m extracted declarative sentences {s₁, s₂, …, s_m}. A word segmentation tool segments the declarative sentences to give the word set {x₁, x₂, …, x_n}. All declarative sentences are traversed and the support of each word is computed as

$$\mathrm{sup}(x) = \frac{\mathrm{num}(x)}{m}$$

where num(x) is the number of sentences in S that contain the word x and m is the total number of sentences in S. A threshold K is set, and every word x with sup(x) greater than K is put into the feature set. An association rule is written (x, y), meaning that if the word x is a feature then y is also a feature. Initially every pair of words is treated as a candidate association rule, and the confidence of each rule is computed as

$$\mathrm{conf}(x \Rightarrow y) = \frac{\mathrm{sup}(x \cup y)}{\mathrm{sup}(x)}$$

where sup(x∪y) is the probability that the words x and y appear together in a sentence. A threshold K₂ is set, and every association rule whose confidence exceeds K₂ is used as a feature expansion rule, that is, the word y is put into the feature set.

In Step 5, the trained classification model is used to classify the unlabeled sentences and extract the answer sentences, giving the answer sentence set, specifically as follows. Using the obtained feature set, each sentence in the training set is represented as a feature vector, which is the input of the naive Bayes classification model. The feature vector has dimension 1×n, where n is the number of features. If the sentence contains a feature, the corresponding position is 1; otherwise it is 0. The classifier is a naive Bayes model. Let x = {a₁, a₂, …, a_n} be the item to be classified and C = {y₁, y₂} the set of classes. Computing P(y₁|x) and P(y₂|x) gives the probability that x belongs to each of the two classes, and the class with the larger probability is chosen as the class of x: P(y_k|x) = max{P(y₁|x), P(y₂|x)}. P(y₁|x) and P(y₂|x) are obtained from Bayes' theorem. Once the probability of each class has been computed, the class with the highest probability is selected. The classification model thus classifies the declarative sentences, identifies the answer sentences among them, and yields the answer sentence set.

In Step 7, the BiLSTM+CRF neural network model is built as follows. The model consists of two layers, a BiLSTM layer and a CRF layer. The BiLSTM layer is a bidirectional LSTM network. Its input is the character vectors of a sentence, encoded as one-hot vectors, and its output is the probability of each label. For a whole sentence the output is a matrix P, where P_ij is the probability that the i-th character is tagged with the j-th label. The CRF layer is a conditional random field that learns constraints between labels from the input sequences, which makes the recognition of entities of multiple types more accurate. It holds a transition matrix A, where A_ij is the probability of a transition from the i-th label to the j-th label. For an input sentence X of T characters and a label sequence y = (y₁, …, y_T), the scoring function is

$$s(X, y) = \sum_{i=1}^{T} P_{i,\,y_i} + \sum_{i=1}^{T-1} A_{y_i,\,y_{i+1}}$$

Applying the softmax function over all possible label sequences ỹ gives the probability

$$p(y \mid X) = \frac{\exp\big(s(X, y)\big)}{\sum_{\tilde{y}} \exp\big(s(X, \tilde{y})\big)}$$

During training the likelihood p(y|X) is maximized, that is, the loss function is -log(p(y|X)).

The beneficial effects of the present invention are as follows:

The invention provides a machine-learning-based method for extracting question-answer pairs from semi-structured documents. Answer sentences are obtained from the semi-structured text by using Apriori for feature selection and a naive Bayes classifier for classification. Named entity recognition and dependency parsing are combined to convert answer sentences into corresponding questions. Named entity recognition uses a BiLSTM+CRF neural network model to recognize the entities in the answer sentences, which supplement the entities crawled from the web. Dependency parsing reveals the dependency relations among the words of a sentence, so that during question generation the words that depend on an entity can be replaced and a well-formed question obtained. By extracting high-quality question-answer pairs from semi-structured documents, the invention lays a foundation for building question answering systems.

Description of drawings

Fig. 1 is the overall flowchart of the present invention.

Detailed description

The present invention is further described below with reference to the accompanying drawing.

The invention discloses a machine-learning-based method for extracting question-answer pairs from semi-structured documents. The method comprises three modules. The data preprocessing module parses PDF documents and samples and labels declarative sentences, producing the sentences of the documents and a labeled training set. The answer sentence extraction module mines frequent word sets in short texts, expands the features, and classifies sentences with naive Bayes to extract the answer sentences. The question generation module performs named entity recognition with the BiLSTM-CRF model, crawls named entities from the web, and applies dependency parsing to convert answer sentences into questions. By extracting high-quality question-answer pairs from semi-structured documents, the method lays a foundation for building question answering systems.

The machine-learning-based method for extracting question-answer pairs from semi-structured documents comprises Steps 1 to 8 as set out in the summary above.

Example 1:

The present invention extracts question-answer pairs from semi-structured documents; the workflow is shown in Fig. 1. An electronic medical record is a document in which a doctor records a patient's condition. It contains diagnostic notes and examination reports and is a semi-structured document. It is used here as the example for describing a specific implementation of extracting question-answer pairs from semi-structured documents.

The specific steps of the method are as follows:

1) Convert the semi-structured documents from PDF format into txt documents.

Use a PDF parsing tool to parse each PDF document, decode its binary streams, extract the text, and save the document as a txt document.

2) Extract declarative sentences from the semi-structured documents and label some of them.

Extract the text content of the txt documents by regular expression matching and split it into sentences. Sort the resulting set of sentences by document, shuffle the sentences within each document, sample some of them, judge whether each sampled sentence is an answer sentence, and label the answer sentences.

3) Use the Apriori algorithm to mine frequent itemsets in the sentences and generate association rules to expand the features.

Take all words as the candidate feature set and all word pairs as candidate association rules. Traverse the sentence set, compute the support of each word, and apply a threshold to select the initial feature set. Traverse the sentence set again, compute the confidence of each association rule, and apply a threshold to keep the valid rules. Expand the initial feature set with these association rules to obtain the feature set.

4) Use a naive Bayes classification model to classify the declarative sentences and obtain the answer sentences.

Represent the sentences by their features, feed the feature vectors into the naive Bayes classification model for training, and use the trained model to classify the unlabeled sentences, giving the answer sentence set.

5) Crawl some domain entities from the web and annotate some of the answer sentences.

Crawl some named entities with a web crawler, and annotate some of the answer sentences character by character with BIO tags to build a training set for named entity recognition.

6) Build the BiLSTM+CRF neural network model, train it on the training set, and extract entities.

Build the BiLSTM+CRF neural network model, which takes character vectors as input and performs sequence labeling on sentences. Use the trained model to label the unlabeled sentences, then convert the labels into named entities to obtain the entity set.

7) Perform dependency parsing on the answer sentences and use the entity set to generate questions.

Perform word segmentation, part-of-speech tagging, and dependency parsing on the answer sentences, and analyze the dependency relations between the words of each sentence. Replace the entities in the sentence and the words that depend on them to obtain the question corresponding to the answer sentence.

In step 1), the semi-structured documents in PDF format are converted into txt documents. Electronic medical records are generally saved as PDF, which is convenient for doctors to record and print. To simplify later processing they need to be converted into txt documents. Parsing a PDF document involves the following: reading the locally stored PDF file; parsing its header, body, cross-reference table, and trailer; decoding the binary streams with the appropriate filters; extracting the strings; and saving them to a txt file. Commonly used parsing tools include pdfminer and PDFBox.

In step 2), declarative sentences are extracted from the semi-structured documents and some of them are labeled. Taking electronic medical records as the example, declarative sentences are extracted from two parts, "诊断依据" ("diagnostic basis") and "医患沟通记录" ("doctor-patient communication record"). Because these parts of the record are written by professional doctors, they usually contain a large amount of medical domain knowledge. A regular expression is written that takes "鉴别依据" ("differential basis") as the start and "分析" ("analysis") as the end, which extracts the differential basis part. The serial numbers "1.", "2.", and so on are then used as separators to split the text into groups of declarative sentences with complete meaning. Because this part of the text may run across pages, sentences containing "**大学" ("** University"), "姓名" ("Name"), and "医疗表格" ("medical form") need to be deleted. For the doctor-patient communication part, "沟通记录" ("communication record") marks the start of the text and "沟通结果" ("communication result") marks the end, and the full stop is used as the separator to split the text into declarative sentences. Sentences containing "入院后" ("after admission") and "初步诊断" ("preliminary diagnosis") are causally related and therefore need to be merged into one sentence. Once the declarative sentences have been obtained, some of them are labeled to build the training set.
The training set consists of two parts: declarative sentences that can serve as answer sentences, labeled 1, and declarative sentences that cannot, labeled 0. Sentences are drawn for the training set document by document, with several declarative sentences sampled at random from each document.
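The extraction rules of step 2) can be sketched with Python's standard `re` module. This is a minimal illustration under stated assumptions, not the patent's implementation: the function names `extract_statement_groups` and `split_communication` are hypothetical, page header text is assumed to sit on its own lines, and the item separator is tightened so that decimals such as "2.5" are not treated as serial numbers.

```python
import re

# Section markers quoted in the description; real records may use other wording.
SECTION_RE = re.compile(r"鉴别依据(.*?)分析", re.S)
COMM_RE = re.compile(r"沟通记录(.*?)沟通结果", re.S)
# Serial numbers "1.", "2.", ... (assumption: a digit after the dot means a decimal).
ITEM_SPLIT_RE = re.compile(r"(?<!\d)\d+\.(?!\d)")
# Keywords that mark page header/footer text carried across pages.
PAGE_NOISE = ("大学", "姓名", "医疗表格")


def extract_statement_groups(text):
    """Return the numbered statement groups of the differential-basis section."""
    match = SECTION_RE.search(text)
    if match is None:
        return []
    groups = []
    for piece in ITEM_SPLIT_RE.split(match.group(1)):
        lines = [line.strip() for line in piece.splitlines()]
        # Drop lines that contain header/footer keywords, keep the rest.
        kept = [ln for ln in lines if ln and not any(k in ln for k in PAGE_NOISE)]
        joined = "".join(kept).strip("：: ")
        if joined:
            groups.append(joined)
    return groups


def split_communication(text):
    """Split the doctor-patient communication section on the full stop."""
    match = COMM_RE.search(text)
    if match is None:
        return []
    sentences = []
    for piece in match.group(1).split("。"):
        piece = piece.strip("：: \n")
        if piece:
            sentences.append(piece + "。")
    return sentences
```

Merging the "入院后" and "初步诊断" sentences and sampling sentences per document for labeling would follow as separate passes over these lists.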

In step 3), the Apriori algorithm is used to mine frequent itemsets in the sentences, and association rules are generated to expand the features. The sentence set S contains all m extracted declarative sentences {s₁, s₂, …, s_m}. A word segmentation tool such as Jieba segments the declarative sentences to give the word set {x₁, x₂, …, x_n}. All declarative sentences are traversed and the support of each word is computed as

$$\mathrm{sup}(x) = \frac{\mathrm{num}(x)}{m}$$

where num(x) is the number of sentences in S that contain the word x and m is the total number of sentences in S. A threshold K is set, and every word x with sup(x) greater than K is put into the feature set. An association rule is written (x, y), meaning that if the word x is a feature then y is also a feature. Initially every pair of words is treated as a candidate association rule, and the confidence of each rule is computed as

$$\mathrm{conf}(x \Rightarrow y) = \frac{\mathrm{sup}(x \cup y)}{\mathrm{sup}(x)}$$

where sup(x∪y) is the probability that the words x and y appear together in a sentence. A threshold K₂ is set, and every association rule whose confidence exceeds K₂ is used as a feature expansion rule, that is, the word y is put into the feature set.
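The support and confidence computations of step 3) can be sketched in plain Python as below. The function name `select_features` and its parameters are hypothetical. Two readings are assumed rather than taken from the text: a rule (x, y) is only applied when x passed the support threshold, and expansion is done in a single pass.

```python
from itertools import permutations


def select_features(sentences, k_support, k_confidence):
    """sentences: list of token lists (already segmented).

    Returns the feature words after support filtering and rule expansion.
    """
    m = len(sentences)
    token_sets = [set(s) for s in sentences]
    vocab = set().union(*token_sets) if token_sets else set()

    # sup(x) = num(x) / m
    support = {x: sum(x in ts for ts in token_sets) / m for x in vocab}
    features = {x for x in vocab if support[x] > k_support}

    # conf(x => y) = sup(x ∪ y) / sup(x); add y when the rule is confident
    # and x is already a feature (assumption, see lead-in).
    expanded = set(features)
    for x, y in permutations(vocab, 2):
        if x not in features:
            continue
        joint = sum((x in ts) and (y in ts) for ts in token_sets) / m
        if joint / support[x] > k_confidence:
            expanded.add(y)
    return expanded
```

With four sentences in which "胸痛" occurs three times and "心电图" twice, always together with "胸痛", thresholds of 0.6 keep "胸痛" by support and add "心电图" through the rule, since its confidence is 0.5 / 0.75.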

In step 4), a naive Bayes classification model is used to classify the declarative sentences and obtain the answer sentences. Using the obtained feature set, each sentence in the training set is represented as a feature vector, which is the input of the naive Bayes classification model. The feature vector has dimension 1×n, where n is the number of features. If the sentence contains a feature, the corresponding position is 1; otherwise it is 0. Let x = {a₁, a₂, …, a_n} be the item to be classified and C = {y₁, y₂} the set of classes. Computing P(y₁|x) and P(y₂|x) gives the probability that x belongs to each of the two classes, and the class with the larger probability is chosen as the class of x: P(y_k|x) = max{P(y₁|x), P(y₂|x)}. P(y₁|x) and P(y₂|x) are obtained from Bayes' theorem. The naive Bayes model assumes that the feature attributes are mutually independent, so

$$P(y_i \mid x) = \frac{P(x \mid y_i)\,P(y_i)}{P(x)} = \frac{P(y_i)\prod_{j=1}^{n} P(a_j \mid y_i)}{P(x)}$$

Because the denominator is the same constant for every class, only the numerator needs to be maximized, that is

$$P(x \mid y_i)\,P(y_i) = P(y_i)\prod_{j=1}^{n} P(a_j \mid y_i)$$

Every term on the right-hand side can be estimated by counting, so the probability of each class can be computed and the class with the highest probability selected. The classification model thus classifies the declarative sentences, identifies the answer sentences among them, and yields the answer sentence set.
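The classifier of step 4) can be sketched as a small Bernoulli-style naive Bayes in plain Python. The names `vectorize`, `train_nb`, and `predict_nb` are hypothetical. Laplace smoothing (`alpha`) and the use of log probabilities are additions made here to avoid zero probabilities and underflow, and are not stated in the description.

```python
import math


def vectorize(tokens, features):
    """1 x n binary vector: 1 if the sentence contains the feature word, else 0."""
    present = set(tokens)
    return [1 if f in present else 0 for f in features]


def train_nb(vectors, labels, alpha=1.0):
    """Estimate P(y) and P(a_j = 1 | y) by counting.

    alpha is Laplace smoothing, an assumption not taken from the source.
    """
    n = len(vectors[0])
    model = {}
    for y in set(labels):
        rows = [v for v, lab in zip(vectors, labels) if lab == y]
        prior = len(rows) / len(vectors)
        cond = [
            (sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for j in range(n)
        ]
        model[y] = (prior, cond)
    return model


def predict_nb(model, vector):
    """Return the class maximising log P(y) + sum_j log P(a_j | y)."""

    def log_score(y):
        prior, cond = model[y]
        return math.log(prior) + sum(
            math.log(p if a == 1 else 1.0 - p) for a, p in zip(vector, cond)
        )

    return max(model, key=log_score)
```

Labels follow the training set convention of step 2): 1 for sentences that can serve as answer sentences and 0 for those that cannot.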

In step 5), some domain entities are crawled from the web and some answer sentences are annotated. A web crawler collects named entities such as diseases, drugs, and organs from medical websites to form an entity set. The crawled entities are sometimes insufficient, so the named entities in the sentence set also need to be recognized. A training set is obtained by annotating answer sentences. CRF-based named entity recognition treats the task as a sequence labeling problem: each character of a sentence is tagged B, I, or O, where B marks the beginning of an entity, I marks the inside of an entity, and O marks a character that does not belong to any entity. For diseases, drugs, and organs, the tags DISB, MEDB, and ORGB can be used to mark the beginning of the corresponding entity. Some sentences are drawn from the sentence set and annotated manually to obtain the training set for named entity recognition.
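The character-level BIO annotation described above, and the later conversion of predicted tags back into entities, can be sketched as follows. The helper names `bio_encode` and `bio_decode` are hypothetical, and the inside tags DISI, MEDI, and ORGI are assumed by analogy with the DISB, MEDB, and ORGB tags named in the text.

```python
def bio_encode(sentence, spans):
    """Tag each character of `sentence`.

    spans: list of (start, end, type) with end exclusive and type one of
    "DIS", "MED", "ORG". Inside tags such as "DISI" are an assumed convention.
    """
    tags = ["O"] * len(sentence)
    for start, end, etype in spans:
        tags[start] = etype + "B"
        for i in range(start + 1, end):
            tags[i] = etype + "I"
    return tags


def bio_decode(sentence, tags):
    """Turn a tag sequence back into (entity_text, type) pairs."""
    entities, start, etype = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel flushes the last entity
        continues = start is not None and tag == etype + "I"
        if start is not None and not continues:
            entities.append((sentence[start:i], etype))
            start, etype = None, None
        if tag != "O" and tag.endswith("B"):
            start, etype = i, tag[:-1]
    return entities
```

Encoding is what a human annotator's spans would be converted to for training, and decoding is what would be applied to the model's predicted tags to extend the crawled entity set.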

In step 6), the BiLSTM+CRF neural network model is built, trained on the training set, and used to extract entities. The model consists of two layers, a BiLSTM layer and a CRF layer. The BiLSTM layer is a bidirectional LSTM network. Its input is the character vectors of a sentence, encoded as one-hot vectors, and its output is the probability of each label. For a whole sentence the output is a matrix P, where P_ij is the probability that the i-th character is tagged with the j-th label. The CRF layer is a conditional random field that learns constraints between labels from the input sequences, which makes the recognition of entities of multiple types more accurate. It holds a transition matrix A, where A_ij is the probability of a transition from the i-th label to the j-th label. For an input sentence X of T characters and a label sequence y = (y₁, …, y_T), the scoring function is

$$s(X, y) = \sum_{i=1}^{T} P_{i,\,y_i} + \sum_{i=1}^{T-1} A_{y_i,\,y_{i+1}}$$

Applying the softmax function over all possible label sequences ỹ gives the probability

$$p(y \mid X) = \frac{\exp\big(s(X, y)\big)}{\sum_{\tilde{y}} \exp\big(s(X, \tilde{y})\big)}$$

During training the likelihood p(y|X) is maximized, that is, the loss function is -log(p(y|X)). The BiLSTM+CRF model can be built in Python on any of the open-source machine learning platforms and trained on the training set. The trained model then processes the remaining unlabeled sentences to obtain the named entities in the answer sentence set.
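The CRF scoring function and the loss -log(p(y|X)) can be illustrated without a deep learning framework. The sketch below treats the entries of P and A as unnormalised scores, which is how the softmax over label sequences uses them, and it omits start and end transitions. The function names are hypothetical, and P would in practice come from the BiLSTM layer rather than being written by hand.

```python
import math


def sequence_score(P, A, y):
    """s(X, y): emission scores P[i][y_i] plus transition scores A[y_i][y_{i+1}]."""
    emit = sum(P[i][tag] for i, tag in enumerate(y))
    trans = sum(A[a][b] for a, b in zip(y, y[1:]))
    return emit + trans


def _logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def log_partition(P, A):
    """Log of the sum of exp(s(X, y)) over all label sequences (forward algorithm)."""
    k = len(A)
    alpha = list(P[0])
    for i in range(1, len(P)):
        alpha = [
            P[i][j] + _logsumexp([alpha[h] + A[h][j] for h in range(k)])
            for j in range(k)
        ]
    return _logsumexp(alpha)


def neg_log_likelihood(P, A, y):
    """Training loss -log p(y | X) for one sentence."""
    return log_partition(P, A) - sequence_score(P, A, y)
```

The forward recursion keeps the cost linear in sentence length, whereas summing over every label sequence directly would grow exponentially with it.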

In step 7), dependency parsing is performed on the answer sentences, and the entity set is used to generate question sentences. Each answer sentence is first segmented into words, then part-of-speech tagged, and finally parsed for dependencies. The result gives the relation between each word and the other words in the sentence; for example, SBV denotes the subject-predicate relation. Once the grammatical structure of the sentence is known, the position of each domain entity in the sentence and the words that depend on it can be analyzed. Open-source natural language processing platforms such as ltp can be used for dependency parsing. Common approaches are generative dependency parsing, discriminative dependency parsing, deterministic dependency parsing, and hierarchical parsing based on sequence labeling. After the parse table of a sentence is obtained, the invention traverses it to find the words that appear in the domain entity set. Once a domain entity contained in the sentence is found, the disease entity is replaced with an interrogative word and the words that depend on that entity are deleted, which converts the answer sentence into a factoid question sentence.
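The replace-and-delete step can be sketched on a hand-written parse table. The token records, the example sentence, and the interrogative phrase below are assumptions made for illustration; the parse is written by hand and is not the output of ltp or any other parser. Each token stores its 1-based id, its head id (0 for the root), and a relation label.

```python
# Hand-written dependency parse of "严重的糖尿病会损害肾脏" (illustrative only).
tokens = [
    {"id": 1, "word": "严重", "head": 3, "rel": "ATT"},
    {"id": 2, "word": "的", "head": 1, "rel": "RAD"},
    {"id": 3, "word": "糖尿病", "head": 5, "rel": "SBV"},
    {"id": 4, "word": "会", "head": 5, "rel": "ADV"},
    {"id": 5, "word": "损害", "head": 0, "rel": "HED"},
    {"id": 6, "word": "肾脏", "head": 5, "rel": "VOB"},
]

# Hypothetical mapping from a domain entity to the interrogative that replaces it.
interrogatives = {"糖尿病": "什么疾病"}


def descendants(tokens, root_id):
    """Ids of all tokens that depend, directly or transitively, on root_id."""
    found = set()
    frontier = [root_id]
    while frontier:
        head = frontier.pop()
        for tok in tokens:
            if tok["head"] == head and tok["id"] not in found:
                found.add(tok["id"])
                frontier.append(tok["id"])
    return found


def make_question(tokens, interrogatives):
    """Replace the first known entity with its interrogative and drop its dependents."""
    for tok in tokens:
        if tok["word"] in interrogatives:
            drop = descendants(tokens, tok["id"])
            words = [
                interrogatives[t["word"]] if t["id"] == tok["id"] else t["word"]
                for t in tokens
                if t["id"] not in drop
            ]
            return "".join(words) + "？"
    return None  # no domain entity found in this sentence


question = make_question(tokens, interrogatives)
print(question)
```

Deleting the dependents transitively removes modifiers such as "严重的" that would otherwise give away part of the answer, while the rest of the sentence keeps its original word order.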

The innovations of the present invention are:

1. A method for extracting answer sentences is proposed. The invention applies machine learning to extract answer sentences from semi-structured documents. The semi-structured documents are processed by pdf parsing to extract the text that contains candidate answer sentences. Apriori is applied for feature selection and a naive Bayesian classifier is applied for classification, which yields the answer sentences in the semi-structured text.

2. A method for generating question sentences is proposed. The invention combines named entity recognition with dependency parsing to convert answer sentences into the corresponding question sentences. Named entity recognition uses the crf+BiLstm neural network model to recognize the entities in the answer sentences and add them to the entities crawled from the web. Dependency parsing reveals the dependency relations among the words in a sentence, so that the words dependent on an entity can be replaced when the question is generated, yielding a reasonable question sentence.

The above descriptions are only preferred embodiments of the present invention and are not intended to limit it. For those skilled in the art, the present invention may have various modifications and changes. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A method for extracting question-answer pairs from semi-structured documents based on machine learning, characterized by comprising the following steps:

Step 1: Input the collection of semi-structured documents to be extracted, and convert the semi-structured documents in pdf format into txt documents;

Step 2: Use regular expressions to extract the declarative sentences from the txt text, sort the resulting set of declarative sentences by document, and randomly order the declarative sentences within each document; randomly sample some sentences from each document, judge whether each sentence is an answer sentence, label the answer sentences, and construct the training set;

Step 3: On the training set, use the Apriori algorithm to extract frequent itemsets as features, expand the features through association rules, and obtain the feature set;

Step 4: According to the feature set, express the sentence as a feature vector, input the feature vector of the training set into the naive Bayesian classification model for training, and obtain the trained classification model;

Step 5: Use the trained classification model to classify the unlabeled sentences, extract the answer sentences, and obtain the answer sentence set;

Step 6: Crawl some domain entities with a web crawler; annotate some sentences in the answer sentence set, labeling the words with BIO tags, and establish a training set for named entity recognition;

Step 7: Build a BiLstm+crf neural network model and train it on the training set; use the trained model to perform sequence labeling on the unlabeled sentences in the answer sentence set, and convert the labels into named entities; combine the domain entities crawled by the web crawler with the recognized named entities to obtain the entity set;

The BiLstm+crf neural network model includes two layers, the BiLstm layer and the crf layer; the BiLstm layer is a bidirectional LSTM neural network whose input is the word vectors of a sentence, encoded as One-Hot vectors, and whose output is the probability of each label class; for a sentence, the output is given as a matrix $P$, where $P_{ij}$ is the probability that the $i$-th word is labeled with the $j$-th label; the crf layer is a conditional random field, which learns the label constraints from the input sequence so that the recognition of multi-class entities is more accurate; it holds a transition matrix $A$, where $A_{ij}$ is the probability of transitioning from the $i$-th label to the $j$-th label; for an input sentence sequence $X$ of length $n$ and a label sequence $y = (y_1, \ldots, y_n)$, the scoring function is:

$$s(X, y) = \sum_{i=1}^{n-1} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$$

Through the softmax function over all candidate label sequences $\tilde{y}$, the probability function is obtained:

$$p(y \mid X) = \frac{\exp\big(s(X, y)\big)}{\sum_{\tilde{y}} \exp\big(s(X, \tilde{y})\big)}$$

During training, the likelihood $p(y \mid X)$ is maximized, that is, the loss function is $-\log p(y \mid X)$;

Step 8: Perform word segmentation, part-of-speech tagging, and dependency parsing on the answer sentence set, analyze the relations between the words in each sentence, obtain the question sentence corresponding to each answer sentence by replacing the entity in the sentence and the words dependent on the entity, and output the question-answer pairs to complete the extraction of question-answer pairs.

2. The method for extracting question-answer pairs from semi-structured documents based on machine learning according to claim 1, characterized in that: in step 3, the method of using the Apriori algorithm to extract frequent itemsets from the training set as features and expanding the features through association rules to obtain the feature set is as follows:

The sentence set $S$ contains all the extracted declarative sentences $\{s_1, s_2, \ldots, s_m\}$, $m$ items in total; the declarative sentences are segmented with a word segmentation tool to obtain the word set $\{x_1, x_2, \ldots, x_n\}$; all declarative sentences are traversed and the support of each word is calculated as:

$$\mathrm{sup}(x) = \frac{\mathrm{num}(x)}{m}$$

where $\mathrm{num}(x)$ is the number of sentences in the sentence set $S$ that contain the word $x$, and $m$ is the total number of sentences in $S$; a threshold $K$ is set, and every word $x$ with $\mathrm{sup}(x)$ greater than $K$ is put into the feature set; an association rule is a pair $(x, y)$, meaning that if the word $x$ is a feature, then $y$ is also a feature; initially all pairs are taken as association rules, and the confidence of each association rule is calculated as:

$$\mathrm{conf}(x \Rightarrow y) = \frac{\mathrm{sup}(x \cup y)}{\mathrm{sup}(x)}$$

where $\mathrm{sup}(x \cup y)$ is the probability that the words $x$ and $y$ appear together in a sentence; a threshold $K_2$ is set, and the association rules with confidence greater than the threshold are used as feature expansion rules, that is, the word $y$ is put into the feature set.

3. The method for extracting question-answer pairs from semi-structured documents based on machine learning according to claim 1 or 2, characterized in that: in step 5, the method of using the trained classification model to classify the unlabeled sentences, extract the answer sentences, and obtain the answer sentence set is as follows: according to the obtained feature set, the sentences in the training set are expressed as feature vectors, which serve as the input of the naive Bayesian classification model; a feature vector has dimension $1 \times n$, where $n$ is the number of features; if a feature is contained in the sentence, the corresponding position is 1, and if it is not contained in the sentence, the corresponding position is 0; the classification model is the naive Bayesian classification model: let $x = \{a_1, a_2, \ldots, a_n\}$ be the item to be classified and $C = \{y_1, y_2\}$ the category set; the probabilities $P(y_1 \mid x)$ and $P(y_2 \mid x)$ that $x$ belongs to each of the two categories are calculated, and the category with the highest probability is selected as the category of $x$, that is, $P(y_k \mid x) = \max\{P(y_1 \mid x), P(y_2 \mid x)\}$, where $P(y_1 \mid x)$ and $P(y_2 \mid x)$ are obtained according to Bayes' theorem; by calculating the probability of each category, the category with the highest probability is found, the declarative sentences are classified by the classification model to find the answer sentences among them, and the answer sentence set is finally obtained.
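The feature selection of claim 2 and the classification of claim 3 can be illustrated together with a minimal sketch. The toy sentences, their labels, and the threshold values are invented; `k1` and `k2` stand for the thresholds K and K₂. The claims do not state which naive Bayes variant or smoothing is used, so the Bernoulli model with add-one (Laplace) smoothing below is an assumption.

```python
import math


def select_features(sentences, k1, k2):
    """Words with support > k1, expanded by rules x => y with confidence > k2."""
    m = len(sentences)
    sets = [set(s) for s in sentences]
    vocab = set().union(*sets)
    sup = {x: sum(x in s for s in sets) / m for x in vocab}
    features = {x for x in vocab if sup[x] > k1}
    for x in list(features):           # one expansion round from the initial features
        for y in vocab - features:
            sup_xy = sum(x in s and y in s for s in sets) / m
            if sup_xy / sup[x] > k2:   # conf(x => y) = sup(x ∪ y) / sup(x)
                features.add(y)
    return sorted(features)


def vectorize(sentence, features):
    """1 x n binary vector: 1 if the feature word occurs in the sentence, else 0."""
    words = set(sentence)
    return [1 if f in words else 0 for f in features]


def train_nb(vectors, labels):
    """Bernoulli naive Bayes with add-one smoothing (assumed variant)."""
    model = {}
    for c in set(labels):
        rows = [v for v, l in zip(vectors, labels) if l == c]
        prior = len(rows) / len(vectors)
        probs = [(sum(col) + 1) / (len(rows) + 2) for col in zip(*rows)]
        model[c] = (prior, probs)
    return model


def predict(model, vector):
    """Return the category with the highest posterior probability."""
    def log_posterior(c):
        prior, probs = model[c]
        return math.log(prior) + sum(
            math.log(p if a else 1 - p) for a, p in zip(vector, probs)
        )
    return max(model, key=log_posterior)


# Toy segmented sentences; label 1 = answer sentence, 0 = not an answer sentence.
train = [["症状", "包括", "发热"], ["症状", "包括", "咳嗽"],
         ["本章", "介绍", "概述"], ["治疗", "包括", "休息"]]
labels = [1, 1, 0, 1]

features = select_features(train, k1=0.5, k2=0.6)
model = train_nb([vectorize(s, features) for s in train], labels)
print(features)
print(predict(model, vectorize(["并发症", "包括", "失明"], features)))
```

In this toy run only "包括" passes the support threshold; "症状" enters the feature set through the association rule because it co-occurs with "包括" in two of the three sentences that contain "包括".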