CN109635108B

CN109635108B - A remote-supervised entity relation extraction method based on human-computer interaction

Info

Publication number: CN109635108B
Application number: CN201811396642.2A
Authority: CN
Inventors: 杨静; 李梦婷
Original assignee: East China Normal University
Current assignee: East China Normal University
Priority date: 2018-11-22
Filing date: 2018-11-22
Publication date: 2020-02-18
Anticipated expiration: 2038-11-22
Also published as: CN109635108A

Abstract

The invention discloses a remote-supervised entity relation extraction method based on human-computer interaction. Crowdsourcing is incorporated into model training for the relation extraction task: manual labeling reduces the effect of noisy data on model performance, the manually labeled data is fed into a deep learning model for training and evaluation, and the crowdsourcing strategy is adjusted promptly according to the model's feedback to obtain new data for the model, until all data has been cleaned or model performance no longer improves. Compared with the prior art, the invention yields high-quality crowdsourcing results at low cost, allows multiple workers to label in parallel without relying on experts, and produces labeled data that gives stronger guidance for model learning. This ensures the accuracy of relation extraction between entities and better addresses the problem of noisy data degrading model performance in remote-supervised relation extraction.

Description

A remote-supervised entity relation extraction method based on human-computer interaction

Technical field

The invention relates to the technical field of remote-supervised (distantly supervised) relation extraction between entities, and in particular to a remote-supervised entity relation extraction method based on human-computer interaction.

Background

Information extraction is a fundamental task in natural language processing: it processes unstructured text to extract structured information, which serves as input to downstream natural language processing tasks. In an era of knowledge explosion, people face massive amounts of data every day, so using information extraction systems to process text efficiently and extract useful information has become extremely important. Information extraction is itself composed of a series of subtasks, such as named entity recognition, relation extraction, and event extraction.

Early methods for relation extraction between entities relied mainly on rules and templates written by experts and obtained relations by matching. Later work turned to statistical machine learning methods, usually divided into feature-based and kernel-based methods, which learn features or kernel functions to obtain a model that predicts or classifies the relation between an entity pair. However, both feature extraction and the choice of kernel function are error-prone, these errors can propagate to later stages, and neither features nor kernel functions are guaranteed to capture all relevant information in the text. To address these problems, researchers adopted deep learning methods that learn features automatically during model training. Mainstream deep learning methods fall into two categories, supervised and remote-supervised. Supervised methods require large amounts of manually labeled data for training, so when little data is available and manual effort should be reduced, researchers have turned to remote supervision for training.

Most existing remote supervision techniques combine remote-supervised learning with multi-instance learning, aiming to obtain meaningful data from unlabeled datasets and add it to the training set. This effectively expands the existing dataset and partially improves the model. However, remotely supervised datasets contain a large amount of noisy data, which severely affects model performance during training. For example, among the sentences containing the same entity pair, some express a certain relation type between the two entities, but more of them express no relation at all, and deep learning models usually do not take this into account during training. Several multi-instance methods have been proposed to reduce the effect of noisy data on training, such as selecting the higher-confidence sentences among the instances for training, or weighting instances during training according to how well they match the relation type, and these methods have improved model performance to some extent. All of them, however, rely entirely on the learning ability of the deep learning model itself. Although they have achieved good results, they still cannot fully avoid interference from noisy data or the lack of effectively labeled data. For example, it may happen that none of the sentences containing a given entity pair expresses any relation for that pair; in that case all of these instances should be assigned to the negative sample set, but current models do not do so. It is therefore necessary to develop a method that uses crowdsourcing to clean remotely supervised datasets and extract relations between entities, so as to overcome the effect of noisy data on model performance.

Summary of the invention

The purpose of the invention is to provide, in view of the shortcomings of the prior art, a remote-supervised entity relation extraction method based on human-computer interaction. The method incorporates crowdsourcing into model training for the relation extraction task and uses manual labeling to reduce the effect of noisy data on model performance. The manually labeled data is fed into a deep learning model for training and evaluation, and the crowdsourcing strategy is adjusted promptly according to the model's feedback to obtain new data for the model, until all data has been cleaned or model performance no longer improves. The data cleaned by crowdsourcing serves as the real training set, which better addresses the problem of noisy data degrading model performance in remote-supervised relation extraction. The crowdsourcing strategy can be formulated from an analysis of users and data, improving the quality of crowdsourcing results while lowering cost. The method is also highly portable: after simple training it can be applied in different fields to solve the more common-sense problems of a specialized domain. It does not rely on experts for labeling and allows multiple workers to work in parallel, which greatly reduces time and monetary cost. The labeled data obtained by crowdsourcing gives stronger guidance for model learning and ensures the accuracy of relation extraction between entities.

The purpose of the invention is achieved as follows. A remote-supervised entity relation extraction method based on human-computer interaction, characterized in that crowdsourcing is incorporated into model training for the relation extraction task, manual labeling is used to reduce the effect of noisy data on model performance, the manually labeled data is fed into a deep learning model for training and evaluation, and the crowdsourcing strategy is adjusted promptly according to the model's feedback to obtain new data for the model, until all data has been cleaned or model performance no longer improves. The human-computer interactive remote-supervised entity relation extraction process comprises the following steps:

Step a: Take a dataset usable for entity relation extraction as the target text, and extract the entity pairs E(R) with relation type R together with the natural language sentences S(ep_i) corresponding to these entity pairs, where each ep contains a head entity and a tail entity. The datasets are the "freebase" dataset and the "NYT" dataset.

Step b: Select an unprocessed entity pair ep_k and its corresponding natural language sentence sequence S(ep_k) from E(R), and vectorize ep_k and S(ep_k) with word2vec to obtain the real-valued vectors V_k and V_sk (each sentence contains multiple words w). Use the "transE" model to compute similarity, and sort the resulting similarity scores M_kj between ep_k and each sentence V_skj.

Step c: Use machine learning to select the suspected noisy data that needs cleaning and submit it to crowdsourcing for relabeling, then feed the relabeled data into a convolutional neural network model as training data. Adjust the selection strategy promptly according to the model's feedback to clean iteratively, improving model performance step by step over the iterations. The specific process comprises the following steps:

(1) Setting the threshold

The threshold is tuned with a learning rate (step size) of 0.05 in iterative experiments over the range [0.5, 0.8]. In each iteration the threshold is automatically increased by 0.05, until raising it no longer improves the model or it reaches 0.8, which yields the optimal automatically tuned threshold. Sentences whose similarity M is above the threshold are taken as positive examples by default, and sentences below it are treated as suspected noisy data.
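The threshold search described above can be sketched in Python as follows. This is a minimal illustration and not the patented implementation: the function name tune_threshold and the evaluate callback (which stands in for "clean the data at this threshold, retrain, and measure model performance") are assumptions, and toy_evaluate is a made-up scoring function used only to make the sketch runnable.

```python
def tune_threshold(evaluate, start=0.5, stop=0.8, step=0.05):
    """Raise the threshold by `step` until the score stops improving or `stop` is reached."""
    n_steps = round((stop - start) / step)
    best_threshold = start
    best_score = evaluate(start)
    for i in range(1, n_steps + 1):
        threshold = round(start + i * step, 2)  # avoid floating-point drift
        score = evaluate(threshold)
        if score <= best_score:
            break  # raising the threshold no longer improves the model
        best_threshold, best_score = threshold, score
    return best_threshold, best_score


def toy_evaluate(threshold):
    # Hypothetical stand-in for "clean, retrain, evaluate"; it peaks at 0.65.
    return 1.0 - abs(threshold - 0.65)


best_threshold, best_score = tune_threshold(toy_evaluate)
```

With the toy scoring function the search stops at the first threshold whose score is not better than the previous best, so it never evaluates values beyond that point.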

(2) Crowdsourced labeling

Collect the suspected noisy data Z(ep_k) according to the threshold and write a crowdsourcing question generator that produces, in batches, single-choice questions asking whether the specified relation holds between the two entities contained in each sentence of Z(ep_k). When a worker chooses the affirmative answer, the worker must also state which word w in the sentence prompted that answer. This content serves as input data for the neural network model.

(3) Quality control of crowdsourcing results

Prepare a batch of questions with known correct answers as gold data, mix them into the data that actually needs crowdsourcing, and distribute them to workers together. When a worker's accuracy on the gold data is above 80%, the answers submitted by that worker are considered valid labels; otherwise the answers are considered to have quality problems. Each question is distributed to five or seven workers for labeling, and a majority vote is taken over the collected answers, with the answer given by more than half of the workers as the final result. As soon as one answer has more than half of the votes, answering can stop and the result can be collected.
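The two quality-control checks described above (gold-data accuracy and majority voting with early stopping) can be sketched as follows. The function names, the dictionary layout of the answers, and the gold question identifiers are illustrative assumptions. The text states "above 80%" for acceptance, so the sketch uses a strict comparison; how a worker with exactly 80% is treated is not specified in the source.

```python
from collections import Counter


def worker_is_valid(gold_answers, worker_answers, min_accuracy=0.8):
    """Accept a worker only if accuracy on the gold questions is above min_accuracy."""
    correct = sum(worker_answers.get(q) == a for q, a in gold_answers.items())
    return correct / len(gold_answers) > min_accuracy


def majority_vote(answers, n_workers):
    """Return the answer chosen by more than half of n_workers, or None if undecided."""
    if not answers:
        return None
    answer, count = Counter(answers).most_common(1)[0]
    return answer if count > n_workers / 2 else None


# Hypothetical gold questions and worker submissions.
gold = {"g1": "exists", "g2": "does not exist", "g3": "exists",
        "g4": "exists", "g5": "does not exist"}
careful_worker = dict(gold)                        # 5 of 5 correct
careless_worker = dict(gold, g1="does not exist")  # 4 of 5 correct

# Early stop: three matching answers out of five workers already decide the vote.
early_result = majority_vote(["exists", "exists", "exists"], n_workers=5)
```

Calling majority_vote on the answers received so far, after each new submission, gives the early-stopping behaviour: as soon as it returns a value other than None, the remaining assignments for that question can be cancelled.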

(4) Feeding the labeled data into the deep learning model

Collect and organize the final results of crowdsourced labeling and add them to the original training set, from which the data sent to crowdsourcing has been removed. The combined set is used as the training set for a convolutional neural network, which is trained to obtain a classification model for relations between entities. The input of the convolutional neural network is the word embedding of each sentence in the training set and the position embedding of the entity pair within the sentence. The output is the probability that a given type of relation holds between the entity pair contained in the sentence. The parameters of the convolutional neural network are adjusted iteratively by comparing the output with the reference labels.

(5) Model feedback mechanism

Feed the test set into the trained classification model for prediction and evaluate model performance from the predictions. If performance improves, the data cleaned by crowdsourcing is valid labeled data and can serve as the real training set. If performance does not improve, this batch of labeled data is considered to have no practical value: return to step c, reset the threshold as in step (1), and repeat steps (2) to (5) with the adjusted threshold to relabel the data. Iterative cleaning terminates according to the improvement in model performance and the feedback from crowdsourced labeling. When one round of iterative cleaning ends, return to step b and select the next unprocessed entity pair ep_k in E(R) and its corresponding sentence sequence S(ep_k) for a new round. This is repeated until every unprocessed entity pair ep_k in E(R) and its sentence sequence S(ep_k) have been cleaned by crowdsourcing, yielding the dataset with noisy data cleaned.
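The outer loop of the feedback mechanism, which visits each entity pair once and keeps a relabeled batch only when the model improves, can be sketched as follows. This is a deliberately simplified illustration: the per-pair threshold retries are omitted, and clean_pair, evaluate, and the quality_gain table are hypothetical stand-ins for the crowdsourcing round and the test-set evaluation.

```python
def iterative_cleaning(entity_pairs, clean_pair, evaluate, baseline):
    """Visit each entity pair once; keep a relabeled batch only if the score improves."""
    accepted = []
    best = baseline
    for pair in entity_pairs:
        batch = clean_pair(pair)               # steps (1)-(4) for this pair (hypothetical callback)
        score = evaluate(accepted + [batch])   # step (5): evaluate on the test set (hypothetical)
        if score > best:
            accepted.append(batch)             # valid labeled data, keep as real training data
            best = score
    return accepted, best


# Toy stand-ins: each batch shifts the score by a fixed, made-up amount.
quality_gain = {"pair_1": 0.05, "pair_2": -0.02, "pair_3": 0.03}


def toy_clean_pair(pair):
    return pair


def toy_evaluate(batches):
    return 0.60 + sum(quality_gain[b] for b in batches)


accepted, best = iterative_cleaning(list(quality_gain), toy_clean_pair, toy_evaluate, baseline=0.60)
```

In the toy run the second batch lowers the score and is discarded, while the first and third are kept.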

Compared with the prior art, the invention has the following advantages:

(1) The human-computer interaction method, implemented mainly through crowdsourcing, can effectively clean a semi-supervised training set containing a large amount of noisy data and improve model performance;

(2) A crowdsourcing strategy formulated from an analysis of users and data improves the quality of crowdsourcing results while lowering cost, avoiding the excessive expense of large-scale crowdsourced labeling;

(3) The method is highly portable: workers with different knowledge backgrounds can be recruited to solve problems in different fields, and after simple training these workers can solve the more common-sense problems of a specialized domain. It does not rely on experts for labeling and allows multiple workers to work in parallel, which greatly reduces time and monetary cost;

(4) The labeled data obtained by crowdsourcing gives stronger guidance for model learning and ensures the accuracy of relation extraction between entities.

Description of the drawings

Fig. 1 is a schematic flowchart of the invention;

Fig. 2 is a schematic diagram of examples of noisy data.

Detailed description

The invention takes the "Freebase" dataset, commonly used in entity relation extraction tasks, as the target text and maps it to the "NYT" dataset, which contains exact triple relations, to extract for each sentence the entity pair it contains and the relation type between the entities. For each sentence containing a given entity pair, a similarity score against the relation type obtained from remote-supervision labeling is computed, and the sentences are sorted by similarity. Sentences whose similarity is below a certain threshold are submitted to crowdsourcing for manual labeling to determine whether they express the given relation type, and the labeled data, together with additional information, is fed back into training. The experimental model uses a convolutional neural network as the relation type classifier. Its input is the real-valued vector obtained by processing the sentence with Google's word2vec, together with the position information of the entity pair, and its output is a score for each relation type that the entity pair in the sentence may have. The method specifically comprises the following steps:

Step a: Take a dataset usable for relation extraction between entities (such as the "freebase" dataset and the "NYT" dataset) as the target text, where each row contains a head entity, a tail entity, the relation type between them, and a sentence containing the entity pair. Because several different sentences may contain the same entity pair (with the same relation type), one relation R among the relation types is selected for processing. The entity pairs with relation type R, E(R) = {ep₁, ep₂, …, ep_n}, are extracted together with the natural language sentences corresponding to each pair, S(ep_i) = {s_i1, s_i2, …, s_im}, where ep_n contains the head entity e_hn and the tail entity e_tn.
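Step a amounts to filtering rows by relation type and grouping sentences by entity pair. A minimal Python sketch is given below, assuming each row is already available as a (head, tail, relation, sentence) tuple. The function name, the row layout, and the sample rows (apart from the queens / belle_harbor pair and the /location/location/contains relation, which appear in the embodiment) are illustrative placeholders.

```python
from collections import defaultdict


def extract_pairs_for_relation(rows, relation):
    """Return E(R) as an ordered list of entity pairs and S as a pair -> sentences mapping."""
    sentences_by_pair = defaultdict(list)
    for head, tail, rel, sentence in rows:
        if rel == relation:
            sentences_by_pair[(head, tail)].append(sentence)
    entity_pairs = list(sentences_by_pair)  # dict keys keep first-seen order
    return entity_pairs, dict(sentences_by_pair)


# Placeholder rows: (head entity, tail entity, relation type, sentence).
rows = [
    ("queens", "belle_harbor", "/location/location/contains", "placeholder sentence 1"),
    ("queens", "belle_harbor", "/location/location/contains", "placeholder sentence 2"),
    ("head_x", "tail_y", "/location/location/contains", "placeholder sentence 3"),
    ("head_p", "tail_q", "/some/other/relation", "placeholder sentence 4"),
]

E_R, S = extract_pairs_for_relation(rows, "/location/location/contains")
```

Here E_R plays the role of E(R), and S[ep] plays the role of S(ep_i) for each pair.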

Step b: Select an unprocessed entity pair ep_k and its corresponding sentence sequence S(ep_k) = {s_k1, s_k2, …, s_km} from E(R), and vectorize ep_k and S(ep_k) with word2vec to obtain the real-valued vectors V_k and V_sk (each sentence contains multiple words w). Compute similarity by formulas (I) and (II) below to obtain the similarity score M_kj between ep_k and each sentence V_skj, and sort the scores:

M_kj = cos(V_k, V_skj)    (I)

where cos() denotes the cosine similarity between two vector representations of the same dimension;

V_skj = w₁ + w₂ + … + w_n    (II)

where each w is the vector of a word contained in the sentence. Because the number of words in a sentence cannot be controlled, the sentences are normalized so that every sentence vector has the same dimension: the word vectors in a sentence are added element-wise, and their sum is taken as the vector representation of the whole sentence.
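Formulas (I) and (II) can be illustrated with a short, dependency-free Python sketch. It reads cos() as the standard cosine similarity, which is an assumption about the intended measure, and it takes the entity-pair vector V_k as given because the text does not spell out how it is assembled from word2vec output. The three-dimensional toy vectors are made up for illustration.

```python
import math


def sentence_vector(word_vectors):
    """Formula (II): element-wise sum of the word vectors of one sentence."""
    return [sum(components) for components in zip(*word_vectors)]


def cosine(u, v):
    """Formula (I): cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Toy 3-dimensional vectors standing in for word2vec output.
v_k = [1.0, 0.0, 1.0]                           # entity-pair vector V_k (assumed given)
words = [[0.5, 0.1, 0.2], [0.5, -0.1, 0.8]]     # word vectors w of one sentence
v_skj = sentence_vector(words)
m_kj = cosine(v_k, v_skj)
```

Summing word vectors keeps the sentence vector in the same space as the word vectors regardless of sentence length, which is what allows the cosine in formula (I) to be computed against V_k.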

Step 1: Setting the threshold

Sentences whose similarity M is above the threshold are assumed by default to express the relation and are positive examples, while sentences below the threshold are assumed not to express it and are negative examples. The sentences below the threshold, Z(ep_k) = {z_k1, z_k2, …, z_kl} (Z(ep_k) ⊆ S(ep_k)), are collected and submitted to the crowdsourcing platform for labeling. The automatic tuning process that determines the threshold uses a learning rate (step size) of 0.05 and keeps the threshold within [0.5, 0.8], and the threshold with the best experimental result is selected. This reduces the amount of data that needs crowdsourcing and lowers crowdsourcing cost, because sentences below the threshold will not be considered in the experiment.

Step 2: Crowdsourced labeling

The data Z(ep_k) that needs crowdsourcing is presented as single-choice questions. Each question asks whether, for the given sentence z_kl, the specified relation R holds between the two entities (e_hl, e_tl) it contains, with two options: exists and does not exist. When a worker chooses the affirmative answer, the worker must also provide one extra piece of information, namely which word w_pl in the sentence prompted the affirmative answer.
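A crowdsourcing question generator of the kind described above might look like the following sketch. The Question class, its field names, and the wording of the prompt are assumptions made for illustration; the patent only specifies a single-choice question with two options and a follow-up asking for the trigger word.

```python
from dataclasses import dataclass


@dataclass
class Question:
    sentence: str
    head: str
    tail: str
    relation: str
    options: tuple = ("exists", "does not exist")
    follow_up: str = "If it exists, which word in the sentence led you to that answer?"

    def prompt(self):
        """Render the single-choice question shown to a worker."""
        return (f"Sentence: {self.sentence}\n"
                f"Does the relation {self.relation} hold between "
                f"'{self.head}' and '{self.tail}'?")


def generate_questions(suspected_noise, head, tail, relation):
    """Build one question per suspected-noise sentence in Z(ep_k)."""
    return [Question(sentence, head, tail, relation) for sentence in suspected_noise]


questions = generate_questions(
    ["placeholder sentence A", "placeholder sentence B"],  # stands in for Z(ep_k)
    head="queens",
    tail="belle_harbor",
    relation="/location/location/contains",
)
```

Keeping the options and the follow-up prompt as fields of the question makes it straightforward to export the batch to whatever crowdsourcing platform is used.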

Step 3: Quality control of results

After the data to be manually labeled has been preprocessed, quality control is still needed to ensure the accuracy of the crowdsourcing results. A batch of questions with known correct answers is prepared as gold data, mixed into the data that actually needs crowdsourcing, and distributed to workers together, and the accuracy on the returned gold data is checked. When the accuracy is above 80%, the results submitted by that worker are considered valid labels, the data is used in the subsequent process, and the worker is paid. Otherwise the worker is informed that the answers have quality problems and cannot be regarded as valid work.

Step 4: Feeding the labeled data into the deep learning model

Collect and organize the final results of crowdsourced labeling and add them to the original training set, from which the data sent to crowdsourcing has been removed. The combined set is used as the training set for a convolutional neural network, which is trained to obtain a classification model for relations between entities. The input of the convolutional neural network is the word embedding of each sentence in the training set and the position embedding of the entity pair within the sentence. The output is the probability that a given type of relation holds between the entity pair contained in the sentence. The parameters of the convolutional neural network are adjusted iteratively by comparing the output with the reference labels.
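The text does not state how the position embedding is computed. One common choice in CNN-based relation extraction, shown here purely as an assumption, is to describe each token by its signed offset from the head entity and from the tail entity, and then look those offsets up in a learned embedding table. The sketch below computes only the offsets; the function name and the tokenization are illustrative.

```python
def relative_positions(tokens, head, tail):
    """For each token, return (offset from head entity, offset from tail entity)."""
    head_idx = tokens.index(head)  # first occurrence of the head entity token
    tail_idx = tokens.index(tail)  # first occurrence of the tail entity token
    return [(i - head_idx, i - tail_idx) for i in range(len(tokens))]


tokens = ["in", "belle_harbor", ",", "queens"]
positions = relative_positions(tokens, head="queens", tail="belle_harbor")
```

Each pair of offsets would then index two position-embedding tables whose vectors are concatenated with the word embedding of the token before the convolution layer.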

In addition, the extra annotation w_pl obtained earlier, that is, the word that prompted the annotator to choose the affirmative answer, can be put to use as an additional input feature of the convolutional neural network: at the position of each marked word in the sentence, the feature value changes from 0 to the number of times the word was marked. At the same time, each sentence fed into training is given a different weight α on the basis of its similarity, and an "attention" mechanism is used to adjust the contribution of different instances to model training. The weight α and the contribution P_k(θ) are computed by formulas (III) and (IV), respectively:

P_k(θ) = α₁s_k1 + α₂s_k2 + … + α_ns_kn    (III)

where α_j denotes the contribution, in the convolutional neural network model, of the j-th sentence in the set of sentences for a given relation type;

and P_k(θ) denotes, for a given entity pair k, the combined contribution to the final prediction of all sentences that contribute positively to the model's prediction.
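Formula (III) is a weighted sum of sentence representations. The sketch below illustrates it under two explicit assumptions: the weights α are obtained by a softmax over the similarity scores (the formula for α is not reproduced in the text above, so this is only one plausible choice), and each s_kj is a small made-up vector standing in for the sentence representation produced by the network.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def weighted_contribution(alphas, sentence_reprs):
    """Formula (III): P_k = alpha_1 * s_k1 + ... + alpha_n * s_kn, component-wise."""
    dim = len(sentence_reprs[0])
    return [sum(a * s[d] for a, s in zip(alphas, sentence_reprs)) for d in range(dim)]


similarities = [0.5, 0.34, 0.15, 0.42]     # similarity scores of s1..s4 from the embodiment
alphas = softmax(similarities)             # assumed weighting scheme, not from the source
reprs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]  # made-up sentence representations
p_k = weighted_contribution(alphas, reprs)
```

Under this assumed weighting, sentences with higher similarity to the relation type contribute more to P_k(θ), which matches the stated intent of weighting instances on the basis of similarity.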

Step 5: Model feedback mechanism

Feed the test set into the trained classification model for prediction to obtain the final results, and evaluate model performance from the predictions. If performance improves, the data cleaned by crowdsourcing is valid labeled data and can serve as the real training set. If performance does not improve, this batch of labeled data is considered to have no practical value and need not be used as training data. After the training set has been reorganized, reselect the relation type to process and return to Step 1 of step c to adjust the threshold for iterative cleaning, which yields the crowdsourcing-cleaned, denoised dataset for the specified entity pair.

The invention is further described in detail through the following specific embodiment.

Example 1

Referring to Fig. 1, large-scale free text is taken as input and then preprocessed.

Fig. 2 shows noisy data and correctly labeled data present in the remotely supervised dataset.

The procedure is as follows:

Step a: Take the "Freebase" dataset, aligned with the "NYT" dataset (which contains explicit triple relations), as the target text, where each row contains the head entity e_hn, the tail entity e_tn, the relation type between them, and a sentence containing the entity pair. Because several different sentences may contain the same entity pair (with the same relation type), one relation R among the sixty-three relation types is selected for processing, for example /location/location/contains. The entity pairs with relation type R, E(R) = {queens belle_harbor, ohio celina, …}, are extracted together with the natural language sentences corresponding to each pair, S(queens belle_harbor) = {but instead there was a funeral, at st.francis de sales roman catholic church, in belle_harbor, queens, the parish of his birth. ###END###, …}.

Step b: Select an unprocessed entity pair (queens belle_harbor) and its corresponding sentence sequence from E(R), and vectorize ep_k and S(ep_k) with word2vec to obtain the real-valued vectors V_k and V_sk (each sentence contains multiple words w). Compute the similarity score M_kj between ep_k and each sentence V_skj and sort the scores. For example, the similarities between the sentences {s₁, s₂, s₃, s₄} containing the entity pair (queens belle_harbor) and the relation type /location/location/contains are 0.5, 0.34, 0.15, and 0.42, respectively;

Step c: Use machine learning to select the suspected noisy data that needs cleaning and submit it to crowdsourcing for relabeling, then feed the relabeled data into the convolutional neural network model as training data, and adjust the selection strategy promptly according to the model's feedback to clean iteratively. The specific operations are as follows:

Step 1: Setting the threshold

Assume the initial threshold is 0.5. The sentences below this threshold, {s₂, s₃}, are collected and submitted to the crowdsourcing platform for labeling, and in each subsequent iteration the threshold is increased by 0.05, until raising it no longer improves model performance or it has reached 0.8. With the resulting threshold, the dataset is cleaned following Steps 2 to 5, until the manual labeling stage confirms that all potentially noisy data has been relabeled, yielding the denoised dataset.

第二步：众包标注Step 2: Crowdsource Labeling

对于需要进行众包的数据我们将其设置为单项选择题，题目内容为对于给出的句子，其包含的两个实体之间是否存在指定关系R，选项包含两种可能——存在和不存在。当工作者选择肯定答案时，需额外给出一个信息，即句子中的哪一个词促使你做出了肯定的回答，例如：句子1为“one was for st.francis de sales roman catholic church inbelle_harbor；another board studded with electromechanical magnets will gounder the pipes of an organ at the evangelical lutheran church of christ inrosedale,queens.###END###”，用户答案为“不存在”。For the data that needs to be crowdsourced, we set it as a single-choice question. The content of the question is whether there is a specified relationship R between the two entities contained in the given sentence. The options include two possibilities-existence and non-existence . When the worker chooses an affirmative answer, additional information is required, that is, which word in the sentence prompted you to make an affirmative answer, for example: sentence 1 is "one was for st.francis de sales roman catholic church inbelle_harbor; another board studded with electromechanical magnets will gounder the pipes of an organ at the evangelical lutheran church of christ inrosedale,queens.###END###", the user answered "no".

Sentence 2 is "but instead there was a funeral, at st.francis de sales roman catholic church, in belle_harbor, queens, the parish of his birth. ###END###", the worker's answer is "exists", and the additional annotation is the word "in".
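One possible shape for such a question and its answer check is sketched below. The field names, the prompt wording, and the rule that the keyword must be a whitespace-separated token of the sentence are all assumptions made for illustration; the patent does not specify a data format.

```python
OPTIONS = ("exists", "does not exist")


def build_question(sentence, head, tail, relation):
    """Build one single-choice question for a sentence and an entity pair."""
    return {
        "sentence": sentence,
        "prompt": (
            f"Does the relation {relation} hold between "
            f"'{head}' and '{tail}' in this sentence?"
        ),
        "options": list(OPTIONS),
    }


def is_valid_answer(question, choice, keyword=None):
    """An affirmative answer must name a keyword that occurs in the sentence."""
    if choice not in question["options"]:
        return False
    if choice == "exists":
        # Assumption: keywords are matched against whitespace-separated tokens.
        return keyword is not None and keyword in question["sentence"].split()
    return True


question = build_question(
    "but instead there was a funeral , at st.francis de sales roman catholic "
    "church , in belle_harbor , queens , the parish of his birth .",
    head="queens",
    tail="belle_harbor",
    relation="/location/location/contains",
)
print(is_valid_answer(question, "exists", "in"))
```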

Step 3: Quality control

After the data to be manually labeled has been preprocessed, quality control is needed to ensure the accuracy of the crowdsourcing results. A batch of questions with known correct answers is prepared as gold data, mixed in with the data that actually needs crowdsourcing, and distributed to the workers together. If a worker's accuracy on the gold data in the returned answers is below 80%, that worker's answers are not used. To further ensure accuracy, the same question is distributed to five workers for labeling. Here the collected answers are four "exists" and one "does not exist", so a majority vote gives the answer "exists".
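The two quality-control checks described above can be sketched as follows. The function names and the sample data are hypothetical. Treating exactly 80% as a pass is an assumption: the text only says that workers below 80% are rejected.

```python
from collections import Counter


def worker_passes_gold(answers, gold, min_accuracy=0.8):
    """Check a worker's accuracy on the gold questions they answered."""
    graded = [answers[q] == gold[q] for q in gold if q in answers]
    if not graded:
        return False  # no gold questions answered, so nothing to trust
    # Assumption: accuracy of exactly min_accuracy counts as a pass.
    return sum(graded) / len(graded) >= min_accuracy


def majority_vote(labels):
    """Return the most frequent label among the workers' answers."""
    return Counter(labels).most_common(1)[0][0]


gold = {"g1": "exists", "g2": "does not exist", "g3": "exists",
        "g4": "exists", "g5": "does not exist"}
careful_worker = dict(gold, q1="exists")                 # 5 of 5 gold correct
careless_worker = dict(gold, g1="does not exist",
                       g2="exists", q1="exists")         # 3 of 5 gold correct

votes = ["exists", "exists", "exists", "exists", "does not exist"]
print(worker_passes_gold(careful_worker, gold))
print(worker_passes_gold(careless_worker, gold))
print(majority_vote(votes))
```

With an odd number of workers and two options, a tie cannot occur, which is one practical reason to use five workers per question.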

Step 4: Train the deep learning model

The final results obtained through crowdsourced labeling are collected, organized, and fed into the convolutional neural network as the training set of the deep learning model, which is trained to produce a classification model for relations between entities. The input of the convolutional neural network consists of the word-vector representation of each sentence in the training set, the vector representation of the entity pair's position in the sentence, and the additional feature information. Here "in" was labeled three times and "of" was labeled once, so the feature vector is V = {0, 0, …, 3, 0, …, 1, 0, …}; that is, the entries at the positions of "in" and "of" in the sentence hold the number of times each word was labeled. In addition, each sentence used for training is given a weight α based on its similarity, and an attention mechanism adjusts how much each instance contributes to training the model; the weights for the sentences {s₁, s₂, s₃, s₄} are {α₁, α₂, α₃, α₄} respectively.
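The construction of the keyword feature vector V can be sketched as below. The function name and the shortened token list are hypothetical. Workers report a word rather than a position, so writing the count at every occurrence of that word is an assumption of this sketch.

```python
from collections import Counter


def keyword_feature_vector(tokens, keywords):
    """Per-token counts of how often workers named that token as a keyword."""
    counts = Counter(keywords)
    # Assumption: every occurrence of a keyword in the sentence gets its count.
    return [counts.get(token, 0) for token in tokens]


# Shortened, hypothetical tokenization of sentence 2.
tokens = ["a", "funeral", "in", "belle_harbor", "the", "parish", "of", "his", "birth"]
# Keywords reported by the four workers who answered "exists".
keywords = ["in", "in", "in", "of"]

print(keyword_feature_vector(tokens, keywords))
```

The vector has the same length as the sentence, so it can be aligned token by token with the word and position embeddings.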

Step 5: Adjust the crowdsourcing strategy according to the model's feedback

The processed data with relation type /location/location/contains is fed into the trained classification model for prediction to obtain the final result, and the model's performance is evaluated from the predicted values. Performance is found to have improved, so the preceding cleaning was an effective operation. Return to Step 1 of step c and continue adjusting the threshold until the data has been fully processed and no further labeling is needed; the best-performing threshold finally obtained is 0.75. At this point, the training set selected in step b, with relation type /location/location/contains and containing the previously unprocessed entity pair (queens, belle_harbor), has been successfully denoised. Then return to step b, select another unprocessed entity pair of relation type /location/location/contains from E(R), and perform a new round of step c data cleaning on it, following the same cleaning steps as for the training set of the entity pair (queens, belle_harbor). Iterative cleaning is repeated in this way until every unprocessed entity pair ep_k in E(R) and its corresponding natural sentence sequence S(ep_k) has been cleaned through crowdsourcing. That is, once all datasets of relation type /location/location/contains that need denoising have been crowdsourced, or the model's accuracy has stabilized and no longer improves, the crowdsourcing cleaning process can end, yielding the denoised dataset for the relation type being processed.
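The stopping rule for the threshold search can be sketched as a simple loop. Here `evaluate` is a hypothetical callback standing in for one full round of cleaning, crowdsourcing, training, and testing at a given threshold. The scores in `toy_scores` are made-up numbers used only to exercise the loop; they are not results from the patent.

```python
def find_best_threshold(evaluate, thresholds):
    """Raise the threshold until model performance stops improving."""
    best_threshold, best_score = None, float("-inf")
    for threshold in thresholds:
        score = evaluate(threshold)
        if score <= best_score:
            break  # no improvement: keep the previous threshold
        best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Made-up evaluation scores, for illustration only.
toy_scores = {0.5: 0.60, 0.55: 0.62, 0.6: 0.65, 0.65: 0.66,
              0.7: 0.68, 0.75: 0.70, 0.8: 0.69}
candidates = [0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8]

best = find_best_threshold(toy_scores.get, candidates)
print(best)
```

The same loop would be run once per entity pair selected in step b, with the upper bound of 0.8 enforced by the list of candidate thresholds.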

The above merely further illustrates the present invention and is not intended to limit this patent; all equivalent implementations of the present invention shall fall within the scope of the claims of this patent.

Claims

1. A remote-supervised entity relation extraction method based on human-computer interaction, characterized in that crowdsourcing is incorporated into model training for the relation extraction task; manual labeling is used to reduce the impact of noise data on model performance; the manually labeled data is fed into a deep learning model for training and evaluation; and, according to the feedback of the model, the crowdsourcing strategy is adjusted in time to obtain new data to feed into the model, until all data has been cleaned or model performance no longer improves. The remote-supervised entity relation extraction process includes the following steps:

Step a: Take a dataset that can be used for entity relation extraction as the target text, and extract the entity pairs E(R) that have relation type R together with the natural language sentences S(ep_i) corresponding to these entity pairs, where each ep contains a head entity and a tail entity, and the datasets are the "freebase" dataset and the "NYT" dataset;

Step b: Select an unprocessed entity pair ep_k and its corresponding natural sentence sequence S(ep_k) from E(R), and use word2vec to vectorize ep_k and S(ep_k), obtaining the corresponding real-valued vectors V_k and V_sk, where each natural sentence represented by V_sk contains multiple words; use the "transE" model to compute similarity, obtaining a similarity score M_kj between ep_k and each sentence V_skj, and sort by these scores;

Step c: Use machine learning to select the suspected noise data that needs cleaning and submit it to crowdsourcing for re-labeling, then feed the re-labeled data into the convolutional neural network model as training data, and adjust the selection strategy in time according to the feedback of the model for iterative cleaning, with model performance gradually improving over the iterations. The specific process includes the following steps:

(1) Set the similarity threshold

The similarity threshold is set with a learning rate of 0.05 and iterated over the range [0.5, 0.8]. The threshold is automatically increased by 0.05 in each iteration until increasing it no longer improves model performance or it reaches 0.8, which yields the optimal threshold from automatic parameter tuning. Sentences whose similarity M is higher than this threshold are treated as positive examples by default, while sentences below it are treated as suspected noise data;

(2) Crowdsourced labeling

Collect the suspected noise data Z(ep_k) according to the threshold, write a crowdsourcing question generator, and batch-generate single-choice questions asking whether the specified relation holds between the two entities in each sentence of Z(ep_k). When a worker gives an affirmative answer, the worker must also give the keyword w in the sentence that relates to the answer, and the above content is used as input data for the neural network model;

(3) Quality control of crowdsourcing results

Prepare a batch of questions with known correct answers as gold data, mix them with the data that actually needs crowdsourcing, and distribute them to the workers together. When a worker's accuracy on the gold data is higher than 80%, the answers submitted by that worker are considered valid labels; otherwise the answers are considered to have a quality problem. In the crowdsourcing, the same question is distributed to five or seven workers for labeling, and the majority answer is taken as the final result, whereupon the question can be closed immediately and the results collected;

(4) The labeled data is put into the deep learning model

Collect and organize the final results obtained through crowdsourced labeling, add them to the original training set from which the crowdsourced data was excluded, and feed the result into the convolutional neural network as the training set of the deep learning model, which is trained to obtain a classification model of relations between entities. The input of the convolutional neural network is the word-vector representation of each sentence in the training set (word embedding) and the vector representation of the entity pair's position in the sentence (position embedding); the output is the probability that a relation of a given type holds between the entity pair contained in the sentence; and the parameters of the convolutional neural network are adjusted iteratively according to the output result and the standard result;

(5) Model feedback mechanism

Feed the test set data into the trained classification model for prediction, and evaluate model performance from the predicted values. If model performance has improved, the data cleaned through crowdsourcing is valid labeled data and can be used as the real training set. If model performance has not improved, this batch of labeled data is considered to have no practical significance; in that case return to step c, reset the threshold according to step (1) for iterative cleaning, repeat steps (2) to (5) for re-labeling, and decide when to terminate iterative cleaning according to the improvement in model performance and the feedback from crowdsourced labeling. When one round of iterative cleaning is over, return to step b, select the next unprocessed entity pair ep_m in E(R) and its corresponding natural sentence sequence S(ep_m), and begin a new round of iterative cleaning. Repeat until all unprocessed entity pairs ep_m in E(R) and their corresponding natural sentence sequences S(ep_m) have been cleaned through crowdsourcing, which yields the dataset with noise data removed.