CN111160564B - A Chinese Knowledge Graph Representation Learning Method Based on Feature Tensor - Google Patents
- Publication number: CN111160564B (application CN201911300781.5A)
- Authority: CN (China)
- Prior art keywords: entity, vector, triples, triplet, matrix
- Prior art date
- Legal status: Expired - Fee Related (the legal status is an assumption and is not a legal conclusion)
Classifications
- G06N20/00 Machine learning (G Physics; G06 Computing or calculating; counting; G06N Computing arrangements based on specific computational models)
- G06F16/367 Ontology (G06F Electric digital data processing; G06F16/36 Creation of semantic tools, e.g. ontology or thesauri)
Abstract
The present invention provides a Chinese knowledge graph representation learning method based on feature tensors. The method comprises: preparing data; building data structures; constructing an entity feature vector matrix; defining the relation vectors and the distance formula of the labeled triples; obtaining a training set; training the knowledge graph representation learning model; updating the model parameters; training iteratively and using the model to predict relations for unlabeled triples; then iterating again until no new unlabeled triples can be learned. The invention proposes composing a feature tensor from Chinese pinyin, character information, word information and description information, and converting it into a feature vector, replacing the random initialization of entity vectors used in traditional knowledge representation learning and making full use of the characteristics of Chinese. In addition, a two-layer iteration scheme supplements the training corpus so that the relation matrix is continuously revised, improving the accuracy and convergence speed of the knowledge graph representation learning model.
Description
Technical Field
The present invention relates to the field of knowledge graphs, and in particular to a Chinese knowledge graph representation learning method based on feature tensors.
Background Art
Knowledge graphs describe the complex relationships among concepts and entities in the objective world in a structured form, providing a better way to organize, manage and understand the massive amount of information on the Internet. Knowledge graph technology usually comprises three lines of research: knowledge representation, knowledge graph construction and knowledge graph application. Knowledge representation is the foundation of construction and application: it reflects human cognition of the objective world and can express the semantics of the objective world at different levels and granularities. One must first understand how humans themselves represent knowledge and use it to solve problems, then formalize that into representations a computer can reason over and compute with, building knowledge-based systems that provide intelligent knowledge services. Knowledge representation must also exploit the computer's ability to represent, process and compute over symbols. The key problems it must solve are: 1) what form of representation can accurately reflect knowledge of the objective world; 2) what representation can carry semantic meaning; 3) how the representation can support efficient knowledge reasoning and computation, so that new knowledge can be inferred. Current knowledge representation methods can be divided into representations based on symbolic logic, open knowledge representations of Internet resources, and representation learning based on knowledge graphs.
1) Knowledge representation based on symbolic logic: this mainly includes logical representations, production rules and frame representations. Although symbolic-logic techniques can describe logical reasoning well, machines are weak at generating inference rules, acquiring such rules requires substantial manual effort, and the approach places high demands on data quality. In the current era of large-scale data, knowledge representation based on symbolic logic can no longer solve the representation problem well.
2) Knowledge representation of Web content: Tim Berners-Lee proposed the concept of the Semantic Web, in which all web content should have a definite meaning and be easily understood, acquired and integrated by computers. Web content representations include the semi-structured markup language XML, the RDF framework for semantic metadata about Web resources, and the description-logic-based OWL ontology language, as well as the triple-based knowledge graph representation now widely used in industry. A triple is written <h, r, t>, meaning that relation r holds between the head entity h and the tail entity t. These technologies allow semantic information that machines can understand and process to be published on the Web. However, Web content numbers in the hundreds of trillions of items, which poses a huge challenge for knowledge storage and knowledge representation learning.
3) Representation learning: the goal of representation learning is to represent the semantic information of the objects under study as dense, low-dimensional vectors via machine learning or deep learning, giving knowledge units of different granularities an implicit vectorized representation that supports fast computation over knowledge in big-data environments. The two main families are tensor reconstruction and potential-energy methods. Tensor reconstruction integrates information across the whole knowledge base, but in big-data settings the tensor dimension is very high and reconstruction is computationally expensive. Potential-energy methods treat a relation as a translation from the head entity to the tail entity; the TransE model proposed by Bordes et al. is the representative translation model, but it lacks explicit semantic information. This is especially limiting for Chinese, whose text carries pinyin, structural and character-word information: low-dimensional Chinese vectors obtained purely by machine or deep learning are mere parameter fits and lack interpretability.
In summary, symbolic-logic representations and open representations of Internet resources give knowledge explicit semantic definitions but suffer from data sparsity, making large-scale knowledge graph applications difficult; deep-learning-based representations can map knowledge units (entities, relations and rules) into a low-dimensional continuous real-valued space, but lack explicit semantic definitions.
In addition, knowledge graph representation learning has been studied extensively abroad, but only for English knowledge graphs. Owing to language differences, English words carry only simple string and phrase information, so random vector initialization suffices when learning their representations, whereas Chinese contains rich semantic information, and existing methods cannot achieve good results on Chinese knowledge graphs. Domestic work in China largely remains at the stage of building knowledge graphs and lacks research on representation learning for Chinese knowledge graphs.
Summary of the Invention
To address the above problems, the present invention proposes a Chinese knowledge graph representation learning method based on feature tensors. Instead of randomly initialized vectors, the invention introduces four features, namely Chinese pinyin, characters, words and description information, as explicit Chinese semantic information composing a feature tensor, which makes the learning process of the Chinese knowledge graph representation interpretable. Combined with deep learning, the learned knowledge representation is mapped into a low-dimensional continuous real-valued space, facilitating the learning of Chinese knowledge and the relationships within it.
The Chinese knowledge graph representation learning method based on feature tensors proposed by the present invention comprises the following steps:
Step 1) Data preparation
Data from zhishi.me, an open Chinese linked dataset, are organized into triple data consisting of a large number of triples of the form <h, r, t>, where h is the head entity, t is the tail entity, and r is the relation between the head entity h and the tail entity t;
Step 2) Build data structures
The triple data are divided into labeled triples and unlabeled triples, and the following data structures are built: a character dictionary, an entity dictionary, a relation dictionary, an entity pinyin matrix, a character embedding matrix, a word embedding matrix and a description matrix;
Step 3) Construct the entity feature vector matrix
For each entity in a labeled triple, a feature tensor is first composed from the entity's pinyin vector, character vector, word vector and description vector; the feature tensors of all entities in the labeled triples are then converted into entity feature vectors, and the entity feature vector matrix is assembled in the order of the entity dictionary;
Step 4) Take a labeled triple T_l = <h, r, t> and look up the feature vectors h_ft and t_ft of the head entity h and tail entity t in the entity feature vector matrix. To express that relation r holds between entity h and entity t, i.e. h + r = t, the relation vector of the labeled triple T_l = <h, r, t> can be written as:
r = t_ft - h_ft
To compute the distance between entity h and entity t, the relationship between entities is expressed through vector translation, and the distance of the triple <h, r, t> is defined using the Euclidean distance:

d(h + r, t) = ||h_ft + r - t_ft||_2^2

where the subscript "2" denotes the 2-norm, i.e. the Euclidean norm, and the superscript "2" denotes squaring;
Step 5) Use the labeled triples as the training set. Initialize the entity vectors (the entity feature vector matrix) and the relation vectors, and build the relation vector matrix in the same order as the relation dictionary; each relation is computed from the formula r = t_ft - h_ft. If several entity pairs share the same relation, the relation vector is the average of the vector differences over those pairs. After initialization, all relation vectors are normalized, which improves accuracy and strengthens convergence;
Step 6) Randomly select a positive triple <h, r, t> from the training set, select the corrupted triples <h′, r, t> and <h, r, t′> from the negative triples, and pair them with <h, r, t> to form a training batch T_batch = [(<h, r, t>, <h′, r, t>), (<h, r, t>, <h, r, t′>)]. Let S_p = {<h, r, t>} denote the positive triples and S_f = {<h′, r, t> | h′ ∈ E} ∪ {<h, r, t′> | t′ ∈ E} the negative triples, where E is the entity set. T_batch is fed to the knowledge graph representation learning model for training; combining the distance formula, the model's loss function is defined as:

L = Σ_{<h,r,t> ∈ S_p} Σ_{<h′,r,t′> ∈ S_f} [γ + d(h + r, t) - d(h′ + r, t′)]_+

where γ > 0 is a margin hyperparameter and [x]_+ denotes the positive-part function, i.e. [x]_+ = x when x > 0 and [x]_+ = 0 when x ≤ 0. This training scheme is the margin-based ranking criterion; its purpose is to separate positive and negative triples as far as possible and find the maximum-margin support vectors;
Step 7) Update the knowledge graph representation learning model parameters by stochastic gradient descent (SGD); the gradient update only requires computing the distances d(h + r, t) and d(h′ + r, t′). With |E| entities and |R| relations, entity vectors of length m and relation vectors of length n, a total of (|E|·m + |R|·n) parameters must be updated;
Step 8) Repeat steps 6) and 7) for iterative training. After the iterative training completes, use the learned model parameters to predict relations for unlabeled triples as follows: take any triple <h, r, t>_unlabel from the unlabeled triples and predict the relation r′ between h and t with the model; if r′ = r, the prediction is correct. The correctly predicted triples are taken as positive triples, negative triples are generated by randomly replacing their head or tail entities, and these new positive and negative triples are merged into the original labeled triples to form a new set of labeled triples;
Step 9) Repeat steps 4) to 8) with the new labeled triples until no new unlabeled triples can be learned. At that point the model can no longer learn additional Chinese knowledge features from the current training data; the entity vectors and relation vectors it outputs are the best Chinese knowledge graph representation for the Chinese linked dataset zhishi.me, and training is complete.
Addressing the inability of existing knowledge representation learning methods to incorporate Chinese character and word information, the present invention composes a feature tensor from Chinese pinyin, character information, word information and description information and converts it into a feature vector, replacing the random initialization of entity vectors in traditional knowledge representation learning and making full use of the characteristics of Chinese. Furthermore, the invention adopts a two-layer iteration to supplement the training corpus, so that the relation matrix is continuously corrected, improving the accuracy and convergence speed of the knowledge graph representation learning model.
Brief Description of the Drawings
FIG. 1 is a framework diagram of the processing pipeline of the method of the present invention.
FIG. 2 is a flow chart of the processing pipeline of the method of the present invention.
FIG. 3 shows the BiLSTM encoding of entity description vectors from the description matrix.
FIG. 4 is a schematic diagram of converting a tensor into a vector.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention.
As shown in FIG. 2, the Chinese knowledge graph representation learning method based on feature tensors proposed by the present invention comprises the following steps:
Step 1) Data preparation
The triple data used in the present invention come from zhishi.me, an open Chinese linked dataset consisting of a large number of triples of the form <h, r, t>, where h is the head entity, t is the tail entity, and r is the relation between the head entity h and the tail entity t.
Step 2) Build data structures
As shown in FIG. 1, the triple data are divided into labeled triples and unlabeled triples, and data structures such as a character dictionary, an entity dictionary, a relation dictionary, an entity pinyin matrix, a character embedding matrix, a word embedding matrix and a description matrix are constructed.
Labeled triples: triples are randomly sampled from the dataset zhishi.me to obtain a triple set, all of whose members are taken as positive triples. For each positive triple, the head or tail entity is removed and replaced by an entity randomly chosen from the entity dictionary that differs from the removed one, producing a negative triple; only one entity per triple is replaced at a time, so that positive and negative triples remain comparable. These triples are then labeled: positive triples are labeled 1 and negative triples 0.
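The labeling procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation; the helper names `corrupt` and `label_triples` are hypothetical.

```python
import random

def corrupt(triple, entities, rng):
    """Make one negative triple from a positive <h, r, t> by replacing
    exactly one of the head or the tail with a different random entity."""
    h, r, t = triple
    if rng.random() < 0.5:
        return (rng.choice([e for e in entities if e != h]), r, t)
    return (h, r, rng.choice([e for e in entities if e != t]))

def label_triples(positives, entities, seed=0):
    """Label positive triples as 1 and their corruptions as 0."""
    rng = random.Random(seed)
    labeled = [(tr, 1) for tr in positives]
    labeled += [(corrupt(tr, entities, rng), 0) for tr in positives]
    return labeled
```

Because only one entity is replaced, each negative triple differs from its positive counterpart in exactly one slot, which preserves the comparability the patent requires.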
Unlabeled triples: any unlabeled triples in the dataset zhishi.me.
Character dictionary: all characters appearing in the dataset zhishi.me, across all head entities, tail entities and relations, in the form "character: serial number", where the serial number is an integer starting from zero and increasing.
Entity dictionary: the set of entities in the dataset zhishi.me, covering all head and tail entities, in the form "entity name: serial number", with integer serial numbers starting from zero.
Relation dictionary: the set of relations in the dataset zhishi.me, in the form "relation name: serial number", with integer serial numbers starting from zero.
Entity pinyin matrix: to disambiguate polyphonic characters, the Baidu Translate API is called to obtain each entity's pinyin, and an entity pinyin matrix is built with as many rows as there are entities in the entity dictionary, each row being the entity's one-hot-encoded pinyin vector.
Character embedding matrix: one row per character in the character dictionary, each row a character vector obtained with word2vec.
Word embedding matrix: one row per entity in the entity dictionary, each row a word vector obtained with word2vec.
Description matrix: one row per entity in the entity dictionary. The Baidu Baike API is called to obtain each entity's description, which is fed into a bidirectional long short-term memory network (BiLSTM) to encode an entity description vector, as shown in FIG. 3. This vector introduces the entity's description information and helps resolve Chinese synonyms.
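The dictionaries and the one-hot pinyin rows described above can be sketched as follows. This is an illustrative Python sketch under simplifying assumptions: the real pinyin comes from the Baidu Translate API, whereas here a precomputed syllable index stands in, and `build_dicts` / `one_hot_pinyin` are hypothetical helper names.

```python
def build_dicts(triples):
    """Build the character, entity and relation dictionaries, each mapping
    a name to a serial number starting from zero, in first-seen order."""
    entities, relations = [], []
    for h, r, t in triples:
        for e in (h, t):
            if e not in entities:
                entities.append(e)
        if r not in relations:
            relations.append(r)
    chars = []
    for name in entities + relations:
        for ch in name:
            if ch not in chars:
                chars.append(ch)
    return ({c: i for i, c in enumerate(chars)},
            {e: i for i, e in enumerate(entities)},
            {r: i for i, r in enumerate(relations)})

def one_hot_pinyin(pinyin, syllable_index):
    """One row of the entity pinyin matrix: a one-hot (bag-of-syllables)
    vector over a fixed syllable index, from a space-separated pinyin string."""
    vec = [0.0] * len(syllable_index)
    for syl in pinyin.split():
        vec[syllable_index[syl]] = 1.0
    return vec
```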
Step 3) Construct the entity feature vector matrix
For each entity in a labeled triple, a feature tensor is first composed from the entity's pinyin vector, character vector, word vector and description vector; it serves in the subsequent steps as the entity's predefined feature tensor. The construction is as follows: let T_l denote a labeled triple, E the entity set of the knowledge graph and R the relation set. Take any entity e ∈ E in a labeled triple and look up the entity pinyin matrix to obtain its pinyin vector e_p. Let the entity name be c_1c_2...c_m, where c_m is the m-th character of the name; looking up the character embedding matrix character by character gives the entity's character vector e_c = c_1c_2...c_m; looking up the word embedding matrix gives the word vector e_w; and looking up the description matrix gives the description vector e_d. The entity's feature tensor is FeatureTensor = [e_p; e_c; e_w; e_d].
The entity's feature tensor is converted into the entity's feature vector, from which the entity feature vector matrix is built. As shown in FIG. 4, the different dimensions of the feature tensor are joined by vector concatenation: given vectors A = [x_1, x_2, x_3, ..., x_m] and B = [y_1, y_2, y_3, ..., y_n], concatenation yields C = [x_1, x_2, x_3, ..., x_m, y_1, y_2, y_3, ..., y_n]. Dropout is applied to randomly zero entries of the vector, preventing the knowledge representation learning from overfitting. In this way the pinyin vector e_p, character vector e_c, word vector e_w and description vector e_d of the same entity e are concatenated into the entity's feature vector e_ft = [e_p; e_c; e_w; e_d].
The feature tensors of all entities in the labeled triples are converted into entity feature vectors, and the entity feature vector matrix is built in the order of the entity dictionary.
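The tensor-to-vector conversion of FIG. 4 can be sketched as plain concatenation followed by dropout. A minimal Python sketch; the function name `to_feature_vector` is illustrative, and the usual inverted-dropout rescaling is omitted for simplicity.

```python
import random

def to_feature_vector(e_p, e_c, e_w, e_d, drop=0.0, rng=None):
    """Flatten the feature tensor [e_p; e_c; e_w; e_d] into a single
    feature vector e_ft by concatenation, then apply dropout: each entry
    is zeroed independently with probability `drop` (training only)."""
    rng = rng or random.Random(0)
    e_ft = list(e_p) + list(e_c) + list(e_w) + list(e_d)
    if drop > 0.0:
        e_ft = [0.0 if rng.random() < drop else x for x in e_ft]
    return e_ft
```

The resulting vector has the combined length of the four feature vectors, which is what lets rows of different feature matrices coexist in one entity feature vector matrix.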
Step 4) Take a labeled triple T_l = <h, r, t> and look up the feature vectors h_ft and t_ft of the head entity h and tail entity t in the entity feature vector matrix. To express that relation r holds between entity h and entity t, i.e. h + r = t, the relation vector of the labeled triple T_l = <h, r, t> can be written as:
r = t_ft - h_ft  (1)
To compute the distance between entity h and entity t, the relationship between entities is expressed through vector translation, and the distance of the triple <h, r, t> is defined using the Euclidean distance:

d(h + r, t) = ||h_ft + r - t_ft||_2^2  (2)

The subscript "2" in formula (2) denotes the 2-norm, i.e. the Euclidean norm, and the superscript "2" denotes squaring.
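Formula (2) is directly computable; a minimal Python sketch (the function name `distance` is illustrative):

```python
def distance(h_ft, r, t_ft):
    """d(h+r, t) = ||h_ft + r - t_ft||_2^2, the squared Euclidean norm
    of the translation residual, from formula (2)."""
    return sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h_ft, r, t_ft))
```

A perfect translation (h_ft + r = t_ft) gives distance 0, and the score grows as the triple becomes less plausible.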
Step 5) Use the labeled triples as the training set. Initialize the entity vectors (the entity feature vector matrix) and the relation vectors, and build the relation vector matrix in the same order as the relation dictionary, computing each relation with formula (1). If several entity pairs share the same relation, the relation vector is the average of the vector differences over those pairs. After initialization, all relation vectors are normalized, which increases accuracy and strengthens convergence.
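The relation initialization of Step 5), averaging t_ft - h_ft over entity pairs and then normalizing, can be sketched as follows (an illustrative sketch; `init_relation_vectors` is a hypothetical name, and L2 normalization is assumed since the patent does not name the norm):

```python
def init_relation_vectors(triples, entity_vecs):
    """Initialise each relation vector as the average of t_ft - h_ft
    (formula (1)) over all entity pairs sharing that relation,
    then L2-normalise the result."""
    sums, counts = {}, {}
    for h, r, t in triples:
        diff = [ti - hi for hi, ti in zip(entity_vecs[h], entity_vecs[t])]
        if r in sums:
            sums[r] = [a + b for a, b in zip(sums[r], diff)]
            counts[r] += 1
        else:
            sums[r], counts[r] = diff, 1
    rel = {}
    for r, s in sums.items():
        avg = [x / counts[r] for x in s]
        norm = sum(x * x for x in avg) ** 0.5 or 1.0
        rel[r] = [x / norm for x in avg]
    return rel
```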
Step 6) Randomly select a positive triple <h, r, t> from the training set, select <h′, r, t> and <h, r, t′> from the negative triples, and pair them with <h, r, t> to form a training batch T_batch = [(<h, r, t>, <h′, r, t>), (<h, r, t>, <h, r, t′>)]. Let S_p = {<h, r, t>} denote the positive triples and S_f = {<h′, r, t> | h′ ∈ E} ∪ {<h, r, t′> | t′ ∈ E} the negative triples. T_batch is fed to the knowledge graph representation learning model for training; combining formula (2), the model's loss function is defined as

L = Σ_{<h,r,t> ∈ S_p} Σ_{<h′,r,t′> ∈ S_f} [γ + d(h + r, t) - d(h′ + r, t′)]_+

where γ > 0 is a margin hyperparameter and [x]_+ denotes the positive-part function: [x]_+ = x when x > 0, and [x]_+ = 0 when x ≤ 0. This training scheme is the margin-based ranking criterion; its purpose is to separate positive and negative triples as far as possible and find the maximum-margin support vectors.
Step 7) Update the knowledge graph representation learning model parameters by stochastic gradient descent (SGD); the gradient update only requires computing the distances d(h + r, t) and d(h′ + r, t′). With |E| entities and |R| relations, entity vectors of length m and relation vectors of length n, a total of (|E|·m + |R|·n) parameters must be updated.
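The margin-based ranking loss of Step 6) and the SGD update of Step 7) reduce to a few lines. This is a minimal sketch, not the patent's training loop; `margin_loss`, `sgd_step` and `n_parameters` are illustrative names, and `sgd_step` shows a single generic gradient step rather than the full gradient derivation.

```python
def margin_loss(dist_pairs, gamma):
    """Margin-based ranking criterion: sum of [gamma + d_pos - d_neg]_+
    over (positive distance, negative distance) pairs in the batch."""
    return sum(max(0.0, gamma + d_pos - d_neg) for d_pos, d_neg in dist_pairs)

def sgd_step(vec, grad, lr=0.01):
    """One stochastic-gradient-descent update on a single parameter vector."""
    return [v - lr * g for v, g in zip(vec, grad)]

def n_parameters(n_entities, m, n_relations, n):
    """Total trainable parameters: |E|*m + |R|*n, as stated in Step 7)."""
    return n_entities * m + n_relations * n
```

A pair contributes zero loss once the negative triple is at least γ farther than the positive one, which is exactly the separation the criterion seeks.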
Step 8) Repeat steps 6) and 7) for iterative training. After the iterative training completes, the knowledge graph representation learning model is used to predict relations for unlabeled triples: take any triple <h, r, t>_unlabel from the unlabeled triples and predict the relation r′ between h and t with the model; if r′ = r, the prediction is correct. The correctly predicted triples are taken as positive triples, negative triples are generated by randomly replacing their head or tail entities, and these new positive and negative triples are merged into the original labeled triples to form a new set of labeled triples.
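One natural way to realize the relation prediction of Step 8) under the translation assumption h + r = t is to pick the relation whose vector is closest to t_ft - h_ft. This nearest-relation rule is an assumption, since the patent does not spell out the prediction rule; `predict_relation` is a hypothetical helper.

```python
def predict_relation(h_ft, t_ft, relation_vecs):
    """Predict r' for an entity pair (h, t) as the relation whose vector
    is closest (squared Euclidean distance) to the difference t_ft - h_ft."""
    diff = [ti - hi for hi, ti in zip(h_ft, t_ft)]
    best, best_d = None, float("inf")
    for r, rv in relation_vecs.items():
        d = sum((a - b) ** 2 for a, b in zip(diff, rv))
        if d < best_d:
            best, best_d = r, d
    return best
```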
Step 9) Repeat steps 4) to 8) with the new labeled triples until no new unlabeled triples can be learned. At that point the model can no longer learn additional Chinese knowledge features from the current training data; the entity vectors and relation vectors it outputs are the best Chinese knowledge graph representation for the dataset zhishi.me, and training is complete.
Chinese knowledge graph representation learning methods usually use link prediction as the evaluation task. The evaluation metrics include mean rank (MR), mean reciprocal rank (MRR), and the percentage of test cases in which the correct entity ranks in the top ten (Hits@10), in the top three (Hits@3), or first (Hits@1); lower is better for MR, and higher is better for MRR, Hits@10, Hits@3, and Hits@1. Some entities or relations are randomly removed from the triples in the zhishi.me dataset, and link prediction is the task of predicting those removed entities or relations.
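The metrics above can all be computed from the 1-based rank of the correct entity in each test case. A small sketch of the arithmetic (the function name `link_prediction_metrics` is illustrative):

```python
def link_prediction_metrics(ranks):
    """Compute MR, MRR and Hits@k from the 1-based rank of the
    correct entity in each link prediction test case."""
    n = len(ranks)
    return {
        "MR": sum(ranks) / n,                        # mean rank (lower is better)
        "MRR": sum(1.0 / r for r in ranks) / n,      # mean reciprocal rank
        "Hits@10": sum(r <= 10 for r in ranks) / n,  # fraction ranked in top 10
        "Hits@3": sum(r <= 3 for r in ranks) / n,    # fraction ranked in top 3
        "Hits@1": sum(r == 1 for r in ranks) / n,    # fraction ranked first
    }
```

For example, ranks [1, 2, 5, 20] give MR = 7.0, MRR = 0.4375, Hits@10 = 0.75, Hits@3 = 0.5, and Hits@1 = 0.25.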
In an embodiment of the present invention, representation learning is performed on the open-source Chinese dataset zhishi.me and evaluated on the link prediction task, and the results are compared with those of two existing knowledge graph representation learning methods, the TransE model and the TransR model, as shown in Table 1:
Table 1 Test results
The experimental results show that the present invention outperforms both the TransE model and the TransR model, reaching a usable level.
Although illustrative embodiments of the present invention have been described above to help those skilled in the art understand the invention, it should be clear that the invention is not limited to the scope of these specific embodiments. Variations obtained by equivalent substitution are obvious, and all inventions and creations that make use of the inventive concept fall within the scope of protection.
Claims (4)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201911300781.5A CN111160564B (en) | 2019-12-17 | 2019-12-17 | A Chinese Knowledge Graph Representation Learning Method Based on Feature Tensor |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN111160564A CN111160564A (en) | 2020-05-15 |
| CN111160564B true CN111160564B (en) | 2023-05-19 |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN106528610A (en) * | 2016-09-28 | 2017-03-22 | 厦门理工学院 | Knowledge graph representation learning method based on path tensor decomposition |
| CN106886543A (en) * | 2015-12-16 | 2017-06-23 | 清华大学 | The knowledge mapping of binding entity description represents learning method and system |