CN111160564B - A Chinese Knowledge Graph Representation Learning Method Based on Feature Tensor - Google Patents
- Publication number: CN111160564B (application CN201911300781.5A)
- Authority: CN (China)
- Prior art keywords: entity, vector, triples, triplet, matrix
- Prior art date
- Legal status: Expired - Fee Related (the legal status is an assumption and is not a legal conclusion)
Classifications
- G06N20/00 Machine learning (G Physics; G06 Computing or calculating; counting; G06N Computing arrangements based on specific computational models)
- G06F16/367 Ontology (G06F Electric digital data processing; G06F16/36 Creation of semantic tools, e.g. ontology or thesauri)
Abstract
The present invention provides a Chinese knowledge graph representation learning method based on feature tensors. The method comprises: preparing data; building data structures; constructing an entity feature vector matrix; defining the relation vectors and the distance formula of the labeled triples; obtaining a training set; training the knowledge graph representation learning model; updating the model parameters; training iteratively and using the model to predict relations for unlabeled triples; then iterating again until no new unlabeled triples can be learned. The invention proposes composing a feature tensor from Chinese pinyin, character information, word information and description information, and converting it into a feature vector, replacing the random initialization of entity vectors used in traditional knowledge representation learning and making full use of the characteristics of Chinese. In addition, a two-layer iteration scheme supplements the training corpus so that the relation matrix is continuously revised, improving the accuracy and convergence speed of the knowledge graph representation learning model.
Description
Technical Field
The present invention relates to the field of knowledge graphs, and in particular to a Chinese knowledge graph representation learning method based on feature tensors.
Background Art
Knowledge graphs describe the complex relationships among concepts and entities in the objective world in a structured form, providing a better way to organize, manage and understand the massive amount of information on the Internet. Knowledge graph technology usually comprises three lines of research: knowledge representation, knowledge graph construction and knowledge graph application. Knowledge representation is the foundation of construction and application: it reflects human cognition of the objective world and can express the semantics of the objective world at different levels and granularities. One must first understand how humans themselves represent knowledge and use it to solve problems, then formalize that into representations a computer can reason over and compute with, building knowledge-based systems that provide intelligent knowledge services. Knowledge representation must also exploit the computer's ability to represent, process and compute over symbols. The key problems it must solve are: 1) what form of representation can accurately reflect knowledge of the objective world; 2) what representation can carry semantic meaning; 3) how the representation can support efficient knowledge reasoning and computation, so that new knowledge can be inferred. Current knowledge representation methods can be divided into representations based on symbolic logic, open knowledge representations of Internet resources, and representation learning based on knowledge graphs.
1) Knowledge representation based on symbolic logic: this mainly includes logical representations, production rules and frame representations. Although symbolic-logic techniques can describe logical reasoning well, machines are weak at generating inference rules, acquiring such rules requires substantial manual effort, and the approach places high demands on data quality. In the current era of large-scale data, knowledge representation based on symbolic logic can no longer solve the representation problem well.
2) Knowledge representation of Web content: Tim Berners-Lee proposed the concept of the Semantic Web, in which all web content should have a definite meaning and be easily understood, acquired and integrated by computers. Web content representations include the semi-structured markup language XML, the RDF framework for semantic metadata about Web resources, and the description-logic-based OWL ontology language, as well as the triple-based knowledge graph representation now widely used in industry. A triple is written <h, r, t>, meaning that relation r holds between the head entity h and the tail entity t. These technologies allow semantic information that machines can understand and process to be published on the Web. However, Web content numbers in the hundreds of trillions of items, which poses a huge challenge for knowledge storage and knowledge representation learning.
3) Representation learning: the goal of representation learning is to represent the semantic information of the objects under study as dense, low-dimensional vectors via machine learning or deep learning, giving knowledge units of different granularities an implicit vectorized representation that supports fast computation over knowledge in big-data environments. The two main families are tensor reconstruction and potential-energy methods. Tensor reconstruction integrates information across the whole knowledge base, but in big-data settings the tensor dimension is very high and reconstruction is computationally expensive. Potential-energy methods treat a relation as a translation from the head entity to the tail entity; the TransE model proposed by Bordes et al. is the representative translation model, but it lacks explicit semantic information. This is especially limiting for Chinese, whose text carries pinyin, structural and character-word information: low-dimensional Chinese vectors obtained purely by machine or deep learning are mere parameter fits and lack interpretability.
In summary, symbolic-logic representations and open representations of Internet resources give knowledge explicit semantic definitions but suffer from data sparsity, making large-scale knowledge graph applications difficult; deep-learning-based representations can map knowledge units (entities, relations and rules) into a low-dimensional continuous real-valued space, but lack explicit semantic definitions.
In addition, knowledge graph representation learning has been studied extensively abroad, but only for English knowledge graphs. Owing to language differences, English words carry only simple string and phrase information, so random vector initialization suffices when learning their representations, whereas Chinese contains rich semantic information, and existing methods cannot achieve good results on Chinese knowledge graphs. Domestic work in China largely remains at the stage of building knowledge graphs and lacks research on representation learning for Chinese knowledge graphs.
Summary of the Invention
To address the above problems, the present invention proposes a Chinese knowledge graph representation learning method based on feature tensors. Instead of randomly initialized vectors, the invention introduces four features, namely Chinese pinyin, characters, words and description information, as explicit Chinese semantic information composing a feature tensor, which makes the learning process of the Chinese knowledge graph representation interpretable. Combined with deep learning, the learned knowledge representation is mapped into a low-dimensional continuous real-valued space, facilitating the learning of Chinese knowledge and the relationships within it.
The Chinese knowledge graph representation learning method based on feature tensors proposed by the present invention comprises the following steps:
Step 1) Data preparation
Data from zhishi.me, an open Chinese linked dataset, are organized into triple data consisting of a large number of triples of the form <h, r, t>, where h is the head entity, t is the tail entity, and r is the relation between the head entity h and the tail entity t;
Step 2) Build data structures
The triple data are divided into labeled triples and unlabeled triples, and the following data structures are built: a character dictionary, an entity dictionary, a relation dictionary, an entity pinyin matrix, a character embedding matrix, a word embedding matrix and a description matrix;
Step 3) Construct the entity feature vector matrix
For each entity in a labeled triple, a feature tensor is first composed from the entity's pinyin vector, character vector, word vector and description vector; the feature tensors of all entities in the labeled triples are then converted into entity feature vectors, and the entity feature vector matrix is assembled in the order of the entity dictionary;
Step 4) Take a labeled triple T_l = <h, r, t> and look up the feature vectors h_ft and t_ft of the head entity h and tail entity t in the entity feature vector matrix. To express that relation r holds between entity h and entity t, i.e. h + r = t, the relation vector of the labeled triple T_l = <h, r, t> can be written as:
r = t_ft - h_ft
To compute the distance between entity h and entity t, the relationship between entities is expressed through vector translation, and the distance of the triple <h, r, t> is defined using the Euclidean distance:

d(h + r, t) = ||h_ft + r - t_ft||_2^2

where the subscript "2" denotes the 2-norm, i.e. the Euclidean norm, and the superscript "2" denotes squaring;
Step 5) Use the labeled triples as the training set. Initialize the entity vectors (the entity feature vector matrix) and the relation vectors, and build the relation vector matrix in the same order as the relation dictionary; each relation is computed from the formula r = t_ft - h_ft. If several entity pairs share the same relation, the relation vector is the average of the vector differences over those pairs. After initialization, all relation vectors are normalized, which improves accuracy and strengthens convergence;
Step 6) Randomly select a positive triple <h, r, t> from the training set, select the corrupted triples <h′, r, t> and <h, r, t′> from the negative triples, and pair them with <h, r, t> to form a training batch T_batch = [(<h, r, t>, <h′, r, t>), (<h, r, t>, <h, r, t′>)]. Let S_p = {<h, r, t>} denote the positive triples and S_f = {<h′, r, t> | h′ ∈ E} ∪ {<h, r, t′> | t′ ∈ E} the negative triples, where E is the entity set. T_batch is fed to the knowledge graph representation learning model for training; combining the distance formula, the model's loss function is defined as:

L = Σ_{<h,r,t> ∈ S_p} Σ_{<h′,r,t′> ∈ S_f} [γ + d(h + r, t) - d(h′ + r, t′)]_+

where γ > 0 is a margin hyperparameter and [x]_+ denotes the positive-part function, i.e. [x]_+ = x when x > 0 and [x]_+ = 0 when x ≤ 0. This training scheme is the margin-based ranking criterion; its purpose is to separate positive and negative triples as far as possible and find the maximum-margin support vectors;
Step 7) Update the knowledge graph representation learning model parameters by stochastic gradient descent (SGD); the gradient update only requires computing the distances d(h + r, t) and d(h′ + r, t′). With |E| entities and |R| relations, entity vectors of length m and relation vectors of length n, a total of (|E|·m + |R|·n) parameters must be updated;
Step 8) Repeat steps 6) and 7) for iterative training. After the iterative training completes, use the learned model parameters to predict relations for unlabeled triples as follows: take any triple <h, r, t>_unlabel from the unlabeled triples and predict the relation r′ between h and t with the model; if r′ = r, the prediction is correct. The correctly predicted triples are taken as positive triples, negative triples are generated by randomly replacing their head or tail entities, and these new positive and negative triples are merged into the original labeled triples to form a new set of labeled triples;
Step 9) Repeat steps 4) to 8) with the new labeled triples until no new unlabeled triples can be learned. At that point the model can no longer learn additional Chinese knowledge features from the current training data; the entity vectors and relation vectors it outputs are the best Chinese knowledge graph representation for the Chinese linked dataset zhishi.me, and training is complete.
Addressing the inability of existing knowledge representation learning methods to incorporate Chinese character and word information, the present invention composes a feature tensor from Chinese pinyin, character information, word information and description information and converts it into a feature vector, replacing the random initialization of entity vectors in traditional knowledge representation learning and making full use of the characteristics of Chinese. Furthermore, the invention adopts a two-layer iteration to supplement the training corpus, so that the relation matrix is continuously corrected, improving the accuracy and convergence speed of the knowledge graph representation learning model.
Brief Description of the Drawings
FIG. 1 is a framework diagram of the processing pipeline of the method of the present invention.
FIG. 2 is a flow chart of the processing pipeline of the method of the present invention.
FIG. 3 shows the BiLSTM encoding of entity description vectors from the description matrix.
FIG. 4 is a schematic diagram of converting a tensor into a vector.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention.
As shown in FIG. 2, the Chinese knowledge graph representation learning method based on feature tensors proposed by the present invention comprises the following steps:
Step 1) Data preparation
The triple data used in the present invention come from zhishi.me, an open Chinese linked dataset consisting of a large number of triples of the form <h, r, t>, where h is the head entity, t is the tail entity, and r is the relation between the head entity h and the tail entity t.
Step 2) Build data structures
As shown in FIG. 1, the triple data are divided into labeled triples and unlabeled triples, and data structures such as a character dictionary, an entity dictionary, a relation dictionary, an entity pinyin matrix, a character embedding matrix, a word embedding matrix and a description matrix are constructed.
Labeled triples: triples are randomly sampled from the dataset zhishi.me to obtain a triple set, all of whose members are taken as positive triples. For each positive triple, the head or tail entity is removed and replaced by an entity randomly chosen from the entity dictionary that differs from the removed one, producing a negative triple; only one entity per triple is replaced at a time, so that positive and negative triples remain comparable. These triples are then labeled: positive triples are labeled 1 and negative triples 0.
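The labeling procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation; the helper names `corrupt` and `label_triples` are hypothetical.

```python
import random

def corrupt(triple, entities, rng):
    """Make one negative triple from a positive <h, r, t> by replacing
    exactly one of the head or the tail with a different random entity."""
    h, r, t = triple
    if rng.random() < 0.5:
        return (rng.choice([e for e in entities if e != h]), r, t)
    return (h, r, rng.choice([e for e in entities if e != t]))

def label_triples(positives, entities, seed=0):
    """Label positive triples as 1 and their corruptions as 0."""
    rng = random.Random(seed)
    labeled = [(tr, 1) for tr in positives]
    labeled += [(corrupt(tr, entities, rng), 0) for tr in positives]
    return labeled
```

Because only one entity is replaced, each negative triple differs from its positive counterpart in exactly one slot, which preserves the comparability the patent requires.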
Unlabeled triples: any unlabeled triples in the dataset zhishi.me.
Character dictionary: all characters appearing in the dataset zhishi.me, across all head entities, tail entities and relations, in the form "character: serial number", where the serial number is an integer starting from zero and increasing.
Entity dictionary: the set of entities in the dataset zhishi.me, covering all head and tail entities, in the form "entity name: serial number", with integer serial numbers starting from zero.
Relation dictionary: the set of relations in the dataset zhishi.me, in the form "relation name: serial number", with integer serial numbers starting from zero.
Entity pinyin matrix: to disambiguate polyphonic characters, the Baidu Translate API is called to obtain each entity's pinyin, and an entity pinyin matrix is built with as many rows as there are entities in the entity dictionary, each row being the entity's one-hot-encoded pinyin vector.
Character embedding matrix: one row per character in the character dictionary, each row a character vector obtained with word2vec.
Word embedding matrix: one row per entity in the entity dictionary, each row a word vector obtained with word2vec.
Description matrix: one row per entity in the entity dictionary. The Baidu Baike API is called to obtain each entity's description, which is fed into a bidirectional long short-term memory network (BiLSTM) to encode an entity description vector, as shown in FIG. 3. This vector introduces the entity's description information and helps resolve Chinese synonyms.
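The dictionaries and the one-hot pinyin rows described above can be sketched as follows. This is an illustrative Python sketch under simplifying assumptions: the real pinyin comes from the Baidu Translate API, whereas here a precomputed syllable index stands in, and `build_dicts` / `one_hot_pinyin` are hypothetical helper names.

```python
def build_dicts(triples):
    """Build the character, entity and relation dictionaries, each mapping
    a name to a serial number starting from zero, in first-seen order."""
    entities, relations = [], []
    for h, r, t in triples:
        for e in (h, t):
            if e not in entities:
                entities.append(e)
        if r not in relations:
            relations.append(r)
    chars = []
    for name in entities + relations:
        for ch in name:
            if ch not in chars:
                chars.append(ch)
    return ({c: i for i, c in enumerate(chars)},
            {e: i for i, e in enumerate(entities)},
            {r: i for i, r in enumerate(relations)})

def one_hot_pinyin(pinyin, syllable_index):
    """One row of the entity pinyin matrix: a one-hot (bag-of-syllables)
    vector over a fixed syllable index, from a space-separated pinyin string."""
    vec = [0.0] * len(syllable_index)
    for syl in pinyin.split():
        vec[syllable_index[syl]] = 1.0
    return vec
```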
Step 3) Construct the entity feature vector matrix
For each entity in a labeled triple, a feature tensor is first composed from the entity's pinyin vector, character vector, word vector and description vector; it serves in the subsequent steps as the entity's predefined feature tensor. The construction is as follows: let T_l denote a labeled triple, E the entity set of the knowledge graph and R the relation set. Take any entity e ∈ E in a labeled triple and look up the entity pinyin matrix to obtain its pinyin vector e_p. Let the entity name be c_1c_2...c_m, where c_m is the m-th character of the name; looking up the character embedding matrix character by character gives the entity's character vector e_c = c_1c_2...c_m; looking up the word embedding matrix gives the word vector e_w; and looking up the description matrix gives the description vector e_d. The entity's feature tensor is FeatureTensor = [e_p; e_c; e_w; e_d].
The entity's feature tensor is converted into the entity's feature vector, from which the entity feature vector matrix is built. As shown in FIG. 4, the different dimensions of the feature tensor are joined by vector concatenation: given vectors A = [x_1, x_2, x_3, ..., x_m] and B = [y_1, y_2, y_3, ..., y_n], concatenation yields C = [x_1, x_2, x_3, ..., x_m, y_1, y_2, y_3, ..., y_n]. Dropout is applied to randomly zero entries of the vector, preventing the knowledge representation learning from overfitting. In this way the pinyin vector e_p, character vector e_c, word vector e_w and description vector e_d of the same entity e are concatenated into the entity's feature vector e_ft = [e_p; e_c; e_w; e_d].
The feature tensors of all entities in the labeled triples are converted into entity feature vectors, and the entity feature vector matrix is built in the order of the entity dictionary.
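The tensor-to-vector conversion of FIG. 4 can be sketched as plain concatenation followed by dropout. A minimal Python sketch; the function name `to_feature_vector` is illustrative, and the usual inverted-dropout rescaling is omitted for simplicity.

```python
import random

def to_feature_vector(e_p, e_c, e_w, e_d, drop=0.0, rng=None):
    """Flatten the feature tensor [e_p; e_c; e_w; e_d] into a single
    feature vector e_ft by concatenation, then apply dropout: each entry
    is zeroed independently with probability `drop` (training only)."""
    rng = rng or random.Random(0)
    e_ft = list(e_p) + list(e_c) + list(e_w) + list(e_d)
    if drop > 0.0:
        e_ft = [0.0 if rng.random() < drop else x for x in e_ft]
    return e_ft
```

The resulting vector has the combined length of the four feature vectors, which is what lets rows of different feature matrices coexist in one entity feature vector matrix.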
Step 4) Take a labeled triple T_l = <h, r, t> and look up the feature vectors h_ft and t_ft of the head entity h and tail entity t in the entity feature vector matrix. To express that relation r holds between entity h and entity t, i.e. h + r = t, the relation vector of the labeled triple T_l = <h, r, t> can be written as:
r = t_ft - h_ft  (1)
To compute the distance between entity h and entity t, the relationship between entities is expressed through vector translation, and the distance of the triple <h, r, t> is defined using the Euclidean distance:

d(h + r, t) = ||h_ft + r - t_ft||_2^2  (2)

The subscript "2" in formula (2) denotes the 2-norm, i.e. the Euclidean norm, and the superscript "2" denotes squaring.
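Formula (2) is directly computable; a minimal Python sketch (the function name `distance` is illustrative):

```python
def distance(h_ft, r, t_ft):
    """d(h+r, t) = ||h_ft + r - t_ft||_2^2, the squared Euclidean norm
    of the translation residual, from formula (2)."""
    return sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h_ft, r, t_ft))
```

A perfect translation (h_ft + r = t_ft) gives distance 0, and the score grows as the triple becomes less plausible.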
Step 5) Use the labeled triples as the training set. Initialize the entity vectors (the entity feature vector matrix) and the relation vectors, and build the relation vector matrix in the same order as the relation dictionary, computing each relation with formula (1). If several entity pairs share the same relation, the relation vector is the average of the vector differences over those pairs. After initialization, all relation vectors are normalized, which increases accuracy and strengthens convergence.
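The relation initialization of Step 5), averaging t_ft - h_ft over entity pairs and then normalizing, can be sketched as follows (an illustrative sketch; `init_relation_vectors` is a hypothetical name, and L2 normalization is assumed since the patent does not name the norm):

```python
def init_relation_vectors(triples, entity_vecs):
    """Initialise each relation vector as the average of t_ft - h_ft
    (formula (1)) over all entity pairs sharing that relation,
    then L2-normalise the result."""
    sums, counts = {}, {}
    for h, r, t in triples:
        diff = [ti - hi for hi, ti in zip(entity_vecs[h], entity_vecs[t])]
        if r in sums:
            sums[r] = [a + b for a, b in zip(sums[r], diff)]
            counts[r] += 1
        else:
            sums[r], counts[r] = diff, 1
    rel = {}
    for r, s in sums.items():
        avg = [x / counts[r] for x in s]
        norm = sum(x * x for x in avg) ** 0.5 or 1.0
        rel[r] = [x / norm for x in avg]
    return rel
```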
Step 6) Randomly select a positive triple <h, r, t> from the training set, select <h′, r, t> and <h, r, t′> from the negative triples, and pair them with <h, r, t> to form a training batch T_batch = [(<h, r, t>, <h′, r, t>), (<h, r, t>, <h, r, t′>)]. Let S_p = {<h, r, t>} denote the positive triples and S_f = {<h′, r, t> | h′ ∈ E} ∪ {<h, r, t′> | t′ ∈ E} the negative triples. T_batch is fed to the knowledge graph representation learning model for training; combining formula (2), the model's loss function is defined as

L = Σ_{<h,r,t> ∈ S_p} Σ_{<h′,r,t′> ∈ S_f} [γ + d(h + r, t) - d(h′ + r, t′)]_+

where γ > 0 is a margin hyperparameter and [x]_+ denotes the positive-part function: [x]_+ = x when x > 0, and [x]_+ = 0 when x ≤ 0. This training scheme is the margin-based ranking criterion; its purpose is to separate positive and negative triples as far as possible and find the maximum-margin support vectors.
Step 7) Update the knowledge graph representation learning model parameters by stochastic gradient descent (SGD); the gradient update only requires computing the distances d(h + r, t) and d(h′ + r, t′). With |E| entities and |R| relations, entity vectors of length m and relation vectors of length n, a total of (|E|·m + |R|·n) parameters must be updated.
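The margin-based ranking loss of Step 6) and the SGD update of Step 7) reduce to a few lines. This is a minimal sketch, not the patent's training loop; `margin_loss`, `sgd_step` and `n_parameters` are illustrative names, and `sgd_step` shows a single generic gradient step rather than the full gradient derivation.

```python
def margin_loss(dist_pairs, gamma):
    """Margin-based ranking criterion: sum of [gamma + d_pos - d_neg]_+
    over (positive distance, negative distance) pairs in the batch."""
    return sum(max(0.0, gamma + d_pos - d_neg) for d_pos, d_neg in dist_pairs)

def sgd_step(vec, grad, lr=0.01):
    """One stochastic-gradient-descent update on a single parameter vector."""
    return [v - lr * g for v, g in zip(vec, grad)]

def n_parameters(n_entities, m, n_relations, n):
    """Total trainable parameters: |E|*m + |R|*n, as stated in Step 7)."""
    return n_entities * m + n_relations * n
```

A pair contributes zero loss once the negative triple is at least γ farther than the positive one, which is exactly the separation the criterion seeks.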
Step 8) Repeat steps 6) and 7) for iterative training. After the iterative training completes, the knowledge graph representation learning model is used to predict relations for unlabeled triples: take any triple <h, r, t>_unlabel from the unlabeled triples and predict the relation r′ between h and t with the model; if r′ = r, the prediction is correct. The correctly predicted triples are taken as positive triples, negative triples are generated by randomly replacing their head or tail entities, and these new positive and negative triples are merged into the original labeled triples to form a new set of labeled triples.
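One natural way to realize the relation prediction of Step 8) under the translation assumption h + r = t is to pick the relation whose vector is closest to t_ft - h_ft. This nearest-relation rule is an assumption, since the patent does not spell out the prediction rule; `predict_relation` is a hypothetical helper.

```python
def predict_relation(h_ft, t_ft, relation_vecs):
    """Predict r' for an entity pair (h, t) as the relation whose vector
    is closest (squared Euclidean distance) to the difference t_ft - h_ft."""
    diff = [ti - hi for hi, ti in zip(h_ft, t_ft)]
    best, best_d = None, float("inf")
    for r, rv in relation_vecs.items():
        d = sum((a - b) ** 2 for a, b in zip(diff, rv))
        if d < best_d:
            best, best_d = r, d
    return best
```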
Step 9) Repeat steps 4) to 8) with the new labeled triples until no new unlabeled triples can be learned. At that point the model can no longer learn additional Chinese knowledge features from the current training data; the entity vectors and relation vectors it outputs are the best Chinese knowledge graph representation for the dataset zhishi.me, and training is complete.
Chinese knowledge graph representation learning methods usually use link prediction as the evaluation task. The evaluation metrics include mean rank (MR), mean reciprocal rank (MRR), and the percentage of test cases in which the correct entity ranks in the top ten (Hits@10), in the top three (Hits@3), or first (Hits@1); lower is better for MR, and higher is better for MRR, Hits@10, Hits@3, and Hits@1. Some entities or relations are randomly removed from the triples in the zhishi.me dataset, and link prediction is the task of predicting those removed entities or relations.
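The metrics above can all be computed from the 1-based rank of the correct entity in each test case. A small sketch of the arithmetic (the function name `link_prediction_metrics` is illustrative):

```python
def link_prediction_metrics(ranks):
    """Compute MR, MRR and Hits@k from the 1-based rank of the
    correct entity in each link prediction test case."""
    n = len(ranks)
    return {
        "MR": sum(ranks) / n,                        # mean rank (lower is better)
        "MRR": sum(1.0 / r for r in ranks) / n,      # mean reciprocal rank
        "Hits@10": sum(r <= 10 for r in ranks) / n,  # fraction ranked in top 10
        "Hits@3": sum(r <= 3 for r in ranks) / n,    # fraction ranked in top 3
        "Hits@1": sum(r == 1 for r in ranks) / n,    # fraction ranked first
    }
```

For example, ranks [1, 2, 5, 20] give MR = 7.0, MRR = 0.4375, Hits@10 = 0.75, Hits@3 = 0.5, and Hits@1 = 0.25.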
In an embodiment of the present invention, representation learning is performed on the open-source Chinese dataset zhishi.me and evaluated on the link prediction task, and the results are compared with those of two existing knowledge graph representation learning methods, the TransE model and the TransR model, as shown in Table 1:
Table 1 Test results
The experimental results show that the present invention outperforms both the TransE model and the TransR model, reaching a usable level.
Although illustrative embodiments of the present invention have been described above to help those skilled in the art understand the invention, it should be clear that the invention is not limited to the scope of these specific embodiments. Variations obtained by equivalent substitution are obvious, and all inventions and creations that make use of the inventive concept fall within the scope of protection.
Claims (4)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201911300781.5A CN111160564B (en) | 2019-12-17 | 2019-12-17 | A Chinese Knowledge Graph Representation Learning Method Based on Feature Tensor |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN111160564A CN111160564A (en) | 2020-05-15 |
| CN111160564B true CN111160564B (en) | 2023-05-19 |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN106528610A (en) * | 2016-09-28 | 2017-03-22 | 厦门理工学院 | Knowledge graph representation learning method based on path tensor decomposition |
| CN106886543A (en) * | 2015-12-16 | 2017-06-23 | 清华大学 | The knowledge mapping of binding entity description represents learning method and system |