
CN117272073B - Text unit semantic distance precalculation method and device, query method and device - Google Patents

Text unit semantic distance precalculation method and device, query method and device

Info

Publication number
CN117272073B
Authority
CN
China
Prior art keywords
text unit
text
unit
units
query
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311569661.1A
Other languages
Chinese (zh)
Other versions
CN117272073A (en
Inventor
张晓东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Langmuda Information Technology Co ltd
Original Assignee
Hangzhou Langmuda Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Langmuda Information Technology Co ltd
Priority to CN202311569661.1A
Publication of CN117272073A
Application granted
Publication of CN117272073B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/338 Presentation of query results
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/042 Knowledge-based neural networks; Logical representations of neural networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text unit semantic distance precalculation method and device, and a query method and device. The precalculation method comprises: acquiring all text units in a precomputed knowledge base, and acquiring the associated text unit set of each text unit based on an associated unit acquisition mode; acquiring, through a preset object knowledge representation acquisition mode and based on the associated text unit sets, the knowledge representations of all object attribute text units in the precomputed knowledge base, and acquiring, through a preset category knowledge representation acquisition mode, the knowledge representations of all category attribute text units in the precomputed knowledge base; and calculating, based on the knowledge representations of the text units, the semantic distances of all text unit pairs through a text unit relationship determination mode, and collecting all text unit pairs for which a semantic distance was calculated, together with the corresponding semantic distances, into the semantic distance library of the precomputed knowledge base. The semantic distance calculation involves no vector embedding and no chunking, and does not damage the original data.

Description

Text unit semantic distance precalculation method and device, query method and device

Technical field

The invention relates to the technical field of knowledge base query, and in particular to a text unit semantic distance precalculation method and device, and a query method and device.

Background

Pre-training is a strategy for training deep learning models. Its purpose is to extract as many common features as possible from a large amount of data and apply them to a task-specific model, which is then fine-tuned with a small amount of annotated data from the relevant domain, so that the model starts from the common features and only has to learn the part that is specific to the task. However, pre-training requires a large amount of training data; because the result depends on the quality of that data, its accuracy is not necessarily high, and the training cost is also high. Moreover, large models are trained on historical data and cannot be updated in real time.

Some large model integration frameworks can already combine a large model with external data to achieve real-time updates, but their training and inference costs are relatively high. Specifically, the external data has to be converted into vectors and saved in a data store that supports vector search; vector similarity has to be computed at query time, and an index has to be built when the vectors are stored, so the amount of computation is large and the time and computing power costs are both high. In addition, both the embedding step and the step of splitting over-long data into chunks damage the original information: the semantic information and contextual relationships of the data are lost, so the final query results may be unrelated to the original data.

Therefore, for the data used by pre-training and large model integration frameworks, a data retrieval method is needed that is more convenient, faster and more accurate and that does not require splitting the data, so that the original information is not damaged and the semantic information and contextual relationships of the data are not lost.

Summary of the invention

The purpose of this application is to provide a text unit semantic distance precalculation method and device, and a query method and device, which reduce training cost and improve accuracy, and which solve the problems that existing large model integration frameworks require a large amount of computation on external data and must split the data, which damages the original information and loses the semantic information and contextual relationships of the data, so that the final query results are unrelated to the original data.

In a first aspect, this application provides a text unit semantic distance precalculation method, including:

acquiring all text units in a precomputed knowledge base, and acquiring the associated text unit set of each text unit based on an associated unit acquisition mode;

acquiring, through a preset object knowledge representation acquisition mode and based on the associated text unit sets, the knowledge representations of all object attribute text units in the precomputed knowledge base, and acquiring, through a preset category knowledge representation acquisition mode, the knowledge representations of all category attribute text units in the precomputed knowledge base;

acquiring all text unit pairs that can be formed from the text units, calculating the semantic distance of every text unit pair through a text unit relationship determination mode based on the knowledge representations of the text units, and collecting all text unit pairs for which a semantic distance was calculated, together with the corresponding semantic distances, into the semantic distance library of the precomputed knowledge base;

wherein an object attribute text unit is an object in the precomputed knowledge base, and a category attribute text unit is a category in the precomputed knowledge base.

In an embodiment of this application, acquiring the associated text unit set of a text unit based on the associated unit acquisition mode includes:

acquiring the description page of a regular text unit from the precomputed knowledge base;

taking the text units that appear in the description page as the internal text units of the regular text unit, and collecting the internal text units of all types of the regular text unit into the associated text unit set of the regular text unit;

wherein the regular text unit is any text unit in the precomputed knowledge base.
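The step above can be illustrated with a minimal Python sketch. The source does not say how internal text units are recognised inside a description page, so plain substring matching is an assumption made here for illustration, as are the function name `associated_sets`, the toy pages, and the choice to leave a unit out of its own set.

```python
from typing import Dict, Set


def associated_sets(description_pages: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each text unit to the other known text units found in its description page.

    Assumption: a unit counts as an internal text unit of a page when its
    text occurs in the page as a substring.
    """
    units = set(description_pages)
    return {
        unit: {other for other in units if other != unit and other in page}
        for unit, page in description_pages.items()
    }


# Toy knowledge base: text unit -> description page text (illustrative only).
pages = {
    "Peking University": "Peking University is a university in Beijing, China.",
    "Beijing": "Beijing is the capital of China.",
    "China": "China is a country in East Asia.",
}
assoc = associated_sets(pages)
print(assoc["Peking University"])
```

In a knowledge base whose description pages carry explicit links between entries, the links would likely be a more reliable source of internal text units than substring matching.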

In an embodiment of this application, acquiring the knowledge representation of an object attribute text unit in the precomputed knowledge base, through the preset object knowledge representation acquisition mode and based on the associated text unit sets, includes:

using the object attribute text unit as the screening unit to screen the associated text unit sets it faces, and collecting the text units whose associated text unit sets satisfy the screening condition into the knowledge representation of the object attribute text unit;

wherein the object attribute text unit is any object in the precomputed knowledge base; the associated text unit sets faced by the object attribute text unit are all associated text unit sets in the precomputed knowledge base other than the one corresponding to the object attribute text unit itself; and the screening condition is that the associated text unit set contains the screening unit.
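Read this way, the knowledge representation of an object is the set of text units whose description pages mention it. A minimal sketch under that reading follows; the function name and the toy data are hypothetical.

```python
from typing import Dict, Set


def object_knowledge_representation(obj: str, assoc: Dict[str, Set[str]]) -> Set[str]:
    """Collect every text unit whose associated set contains `obj`.

    The object's own associated set is skipped, matching the rule that only
    the sets of other text units are screened.
    """
    return {unit for unit, linked in assoc.items() if unit != obj and obj in linked}


# Toy associated text unit sets (illustrative only).
assoc = {
    "Peking University": {"Beijing", "China"},
    "Tsinghua University": {"Beijing", "China"},
    "Beijing": {"China"},
    "China": set(),
}
kr_beijing = object_knowledge_representation("Beijing", assoc)
kr_china = object_knowledge_representation("China", assoc)
print(kr_beijing, kr_china)
```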

In an embodiment of this application, acquiring the knowledge representation of a single category attribute text unit in the precomputed knowledge base through the preset category knowledge representation acquisition mode includes:

taking the object attribute text units that belong to the category attribute text unit as its object text units, and collecting the knowledge representations of all object text units of the category attribute text unit into the knowledge representation of the category attribute text unit;

wherein the category attribute text unit is any category in the precomputed knowledge base.
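A short sketch of the category case, assuming that "collecting" the members' knowledge representations means taking their set union (the source does not name the operation). The mapping names `members` and `object_kr` and the placeholder data are hypothetical.

```python
from typing import Dict, Iterable, Set


def category_knowledge_representation(
    category: str,
    members: Dict[str, Iterable[str]],
    object_kr: Dict[str, Set[str]],
) -> Set[str]:
    """Union of the knowledge representations of all objects in the category."""
    kr: Set[str] = set()
    for obj in members.get(category, ()):
        kr |= object_kr.get(obj, set())
    return kr


# Placeholder data: one category with two member objects.
members = {"category_1": ["object_a", "object_b"]}
object_kr = {
    "object_a": {"unit_1", "unit_2"},
    "object_b": {"unit_2", "unit_3"},
}
kr_category = category_knowledge_representation("category_1", members, object_kr)
print(sorted(kr_category))
```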

In an embodiment of this application, calculating the semantic distance of a text unit pair through the text unit relationship determination mode includes:

designating one text unit of the pair as the first text unit and the other as the second text unit;

determining whether the knowledge representation of the first text unit and the knowledge representation of the second text unit intersect; if so, the first and second text units are related, and the semantic distance between them is calculated from the two knowledge representations; otherwise, there is no relationship between the first and second text units.
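The relationship check and the assembly of the semantic distance library can be sketched as follows. The scoring function is passed in by the caller; the shared-element count used in the example is only a placeholder to keep the sketch self-contained and is not the measure named in the source. The function name and the data are hypothetical.

```python
from itertools import combinations
from typing import Callable, Dict, Set, Tuple

Library = Dict[Tuple[str, str], float]


def build_distance_library(
    kr: Dict[str, Set[str]],
    score: Callable[[Set[str], Set[str]], float],
) -> Library:
    """Score every unordered pair of text units whose knowledge representations intersect.

    Pairs with an empty intersection are treated as unrelated and are not stored.
    """
    library: Library = {}
    for first, second in combinations(sorted(kr), 2):
        if kr[first] & kr[second]:
            library[(first, second)] = score(kr[first], kr[second])
    return library


kr = {"x": {"p", "q"}, "y": {"q", "r"}, "z": {"s"}}
# Placeholder score: number of shared elements.
library = build_distance_library(kr, lambda a, b: float(len(a & b)))
print(library)
```

Enumerating all pairs is quadratic in the number of text units; because the result is computed once and stored, that cost is paid at precalculation time rather than at query time.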

In an embodiment of this application, the semantic distance between the first text unit and the second text unit is calculated from the knowledge representation of the first text unit and the knowledge representation of the second text unit using the Ochiai coefficient or the Jaccard index.
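For reference, both measures have standard set-based definitions: the Ochiai coefficient is |A ∩ B| / sqrt(|A| · |B|) and the Jaccard index is |A ∩ B| / |A ∪ B|. The sketch below applies them to two knowledge representations; returning 0.0 for empty inputs is an assumption made here to avoid division by zero.

```python
from math import sqrt
from typing import Set


def ochiai(a: Set[str], b: Set[str]) -> float:
    """Ochiai coefficient: |A ∩ B| / sqrt(|A| * |B|)."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard index: |A ∩ B| / |A ∪ B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


first_kr = {"u1", "u2", "u3", "u4"}
second_kr = {"u3", "u4", "u5", "u6"}
print(ochiai(first_kr, second_kr))   # 2 / sqrt(4 * 4) = 0.5
print(jaccard(first_kr, second_kr))  # 2 / 6
```

Both values lie in [0, 1] and grow with the overlap, so under this reading a larger stored value means a closer semantic relationship.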

In a second aspect, this application provides a text unit semantic distance precalculation device, including an associated text unit acquisition module, a knowledge representation acquisition module and a semantic distance library acquisition module:

the associated text unit acquisition module is used to acquire all text units in the precomputed knowledge base, and to acquire the associated text unit set of each text unit based on the associated unit acquisition mode;

the knowledge representation acquisition module is used to acquire the knowledge representations of all object attribute text units in the precomputed knowledge base based on the preset object knowledge representation acquisition mode, and to acquire the knowledge representations of all category attribute text units in the precomputed knowledge base based on the preset category knowledge representation acquisition mode;

the semantic distance library acquisition module is used to acquire all text unit pairs that can be formed from the text units, to calculate the semantic distance of every text unit pair through the text unit relationship determination mode based on the knowledge representations of the text units, and to collect all text unit pairs for which a semantic distance was calculated, together with the corresponding semantic distances, into the semantic distance library of the precomputed knowledge base;

wherein an object attribute text unit is an object in the precomputed knowledge base, and a category attribute text unit is a category in the precomputed knowledge base.

In a third aspect, this application provides a knowledge base text unit query method, including:

acquiring a text unit to be queried;

searching the semantic distance library of the knowledge base for all text unit pairs that contain the text unit to be queried, as the query text unit pairs of that text unit, and collecting, based on the corresponding semantic distances, the non-queried text units of all or some of the query text unit pairs into a query result list;

wherein the non-queried text unit of a query text unit pair is the text unit in the pair other than the text unit to be queried, and the semantic distance library of the knowledge base is obtained through the text unit semantic distance precalculation method described above.
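A minimal sketch of the lookup, assuming the library is held as a dictionary keyed by unordered pairs and that a larger stored value means a closer relationship (as with the Ochiai and Jaccard measures); the source only says that results are chosen "based on" the semantic distance. The function name, the `top_k` parameter and the data are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Library = Dict[Tuple[str, str], float]


def query(library: Library, unit: str, top_k: Optional[int] = None) -> List[Tuple[str, float]]:
    """Return the partner of `unit` in every stored pair, best score first."""
    hits = []
    for (first, second), score in library.items():
        if first == unit:
            hits.append((second, score))
        elif second == unit:
            hits.append((first, score))
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits if top_k is None else hits[:top_k]


library = {("a", "b"): 0.5, ("a", "c"): 0.9, ("b", "c"): 0.2}
print(query(library, "a"))
print(query(library, "b", top_k=1))
```

A linear scan keeps the sketch short; a stored library would more plausibly be indexed by text unit so that each query touches only the relevant pairs.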

In an embodiment of this application, the step of searching the semantic distance library for the entries of the text unit to be queried and, based on semantic distance, taking all or some of the text units from them to form the query result list includes:

denoting the language of the text unit to be queried as the first language, and the language of the knowledge base as the second language;

if the first language is the same as the second language, determining whether the text unit to be queried corresponds to a unique identifier; if so, searching the semantic distance library of the knowledge base for all text unit pairs that contain the text unit to be queried, as its query text unit pairs, and collecting, based on the corresponding semantic distances, the non-queried text units of all or some of those pairs into the query result list; otherwise, acquiring an advanced qualifying condition for the text unit to be queried, determining the identifier to be queried from all identifiers corresponding to the text unit based on that condition, searching the semantic distance library of the knowledge base for all text unit pairs that contain the identifier to be queried, as its query text unit pairs, and collecting, based on the corresponding semantic distances, the text units corresponding to the other identifiers in all or some of those pairs into the query result list;

if the first language differs from the second language, converting the text unit to be queried into its equivalent text unit in the second language, searching the semantic distance library of the knowledge base for all text unit pairs that contain the equivalent text unit, as its query text unit pairs, taking, based on the corresponding semantic distances, the other text units of all or some of those pairs as equivalent query units, converting every equivalent query unit into a query result in the first language, and collecting all query results into the query result list;

wherein every text unit meaning in the knowledge base has a unique identifier.
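The two branches can be sketched as below. Everything concrete here is an assumption made for illustration: the library is keyed by identifiers, `senses` maps a text unit to its candidate identifiers with a short description each, the advanced qualifying condition is a keyword matched against those descriptions, and `to_kb` / `from_kb` are hypothetical lookup tables standing in for the equivalence conversion between languages.

```python
from typing import Dict, List, Optional, Tuple

Library = Dict[Tuple[str, str], float]


def partners(library: Library, key: str) -> List[str]:
    """Other members of every pair containing `key`, highest stored value first."""
    hits = [(b if a == key else a, s) for (a, b), s in library.items() if key in (a, b)]
    return [unit for unit, _ in sorted(hits, key=lambda hit: hit[1], reverse=True)]


def query_same_language(
    text: str,
    library: Library,
    senses: Dict[str, Dict[str, str]],
    qualifier: Optional[str] = None,
) -> List[str]:
    """Resolve `text` to one identifier, asking for a qualifier when it is ambiguous."""
    options = senses[text]
    if len(options) == 1:
        identifier = next(iter(options))
    else:
        matches = [i for i, desc in options.items() if qualifier and qualifier in desc]
        if not matches:
            raise ValueError("a further qualifying condition is needed to pick one meaning")
        identifier = matches[0]
    return partners(library, identifier)


def query_cross_language(
    text: str,
    library: Library,
    to_kb: Dict[str, str],
    from_kb: Dict[str, str],
) -> List[str]:
    """Convert the query to its knowledge base equivalent, then convert results back."""
    equivalent = to_kb[text]
    return [from_kb.get(unit, unit) for unit in partners(library, equivalent)]


library = {
    ("apple_fruit", "banana"): 0.6,
    ("apple_company", "iphone"): 0.8,
    ("apple_fruit", "pear"): 0.7,
}
senses = {
    "apple": {"apple_fruit": "fruit", "apple_company": "technology company"},
    "banana": {"banana": "fruit"},
}
print(query_same_language("apple", library, senses, qualifier="company"))
print(query_same_language("banana", library, senses))
print(query_cross_language("banane", library, {"banane": "banana"}, {"apple_fruit": "pomme"}))
```

Results that have no entry in `from_kb` are returned unchanged in this sketch; how untranslatable results should be handled is not specified in the source.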

In a fourth aspect, this application provides a knowledge base text unit query device, including a query unit acquisition module and a query result acquisition module;

the query unit acquisition module is used to acquire the text unit to be queried;

the query result acquisition module is used to search the semantic distance library of the knowledge base for all text unit pairs that contain the text unit to be queried, as the query text unit pairs of that text unit, and to collect, based on the corresponding semantic distances, the non-queried text units of all or some of the query text unit pairs into a query result list;

wherein the non-queried text unit of a query text unit pair is the text unit in the pair other than the text unit to be queried, and the semantic distance library of the knowledge base is obtained through the text unit semantic distance precalculation method described above.

Compared with the prior art, one or more embodiments of the above solutions may have the following advantages or beneficial effects:

By applying the text unit semantic distance precalculation method provided by the embodiments of the invention, the internal text units in the description pages are looked up, all text units that share the same internal text unit are then brought together to obtain the knowledge representation of each text unit, and the semantic distance library of the text units is finally obtained from the knowledge representations. The process involves no vector embedding and no chunking, so the original information of the data is not damaged and accuracy is improved. Obtaining the semantic distance library is itself the precalculation of the text units: its results can be saved directly in various data stores or databases and queried from there. Compared with existing pre-training and large model integration frameworks, this saves the time spent on vector embedding and computation, that is, the time and computing power costs of large model training; it also helps achieve semantic alignment faster during training and improves the reasoning ability of large models.

Applying the knowledge base text unit query method provided by the embodiments of the invention enables text units to be queried. It further enables fast querying of text data in different languages, as well as fast querying of text units that have multiple meanings.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be apparent from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention may be realized and obtained by the structure particularly pointed out in the written description, claims and appended drawings.

Description of drawings

The accompanying drawings provide a further understanding of the invention and constitute a part of the specification. Together with the embodiments of the invention they serve to explain the invention and do not limit it. In the drawings:

Figure 1 is a schematic flowchart of the text unit semantic distance precalculation method according to an embodiment of this application.

Figure 2 is a schematic structural diagram of the text unit semantic distance precalculation device according to an embodiment of this application.

Figure 3 is a schematic flowchart of the knowledge base text unit query method according to an embodiment of this application.

Figure 4 is a schematic structural diagram of the knowledge base text unit query device according to an embodiment of this application.

Detailed description

The embodiments of the invention are described in detail below with reference to the accompanying drawings and examples, so that the way the invention applies technical means to solve technical problems and achieve technical effects can be fully understood and implemented. It should be noted that, as long as there is no conflict, the embodiments of the invention and the features of the embodiments can be combined with each other, and the resulting technical solutions all fall within the protection scope of the invention.

It should be noted that the drawings provided with the following embodiments only illustrate the basic concept of this application schematically. They show only the components relevant to this application and are not drawn according to the number, shape and size of components in an actual implementation; in practice the form, number and proportion of each component may vary freely, and the component layout may be more complex.

Data curation is the activity of organizing and integrating data collected from different sources in order to build a trustworthy data resource library. It covers the full life cycle of effective data management, including collection, screening, evaluation, preservation, maintenance and use. Data curation can be done manually or by machine. Curated data is usually of higher quality.

A knowledge base is a tool for storing and managing knowledge. It can support applications such as knowledge systems, knowledge discovery and knowledge sharing. Text units (tokens) include words, phrases and sentences. In a knowledge base they are the objects and categories, that is, the words, phrases and sentences that make them up. A text unit can be a specific token object, such as "Cogito, ergo sum", "Homo homini lupus", "知识就是力量" ("Knowledge is power") or "NBA Finals Most Valuable Player Award", or a token category, such as "NBA Finals" or "Chinese actors".

Knowledge representation (KR) means abstracting things, concepts and relationships in the real world into a form that a computer can process, so that the computer can understand and handle the information.

Semantic distance is a measure, in a semantic space, of the semantic similarity or difference between two words, phrases or sentences. It can be used for tasks such as word sense disambiguation, text classification and information retrieval.

The hallucination of a large model refers to content generated by the artificial intelligence model that is not based on any real-world data but is extrapolated by the model itself. Hallucination arises because the large model lacks the ability to perceive the real world and its training data may be scarce or of low quality, which may lead to overfitting and quantization errors during training, as well as missing prompt context.

An equivalence is a group of tokens that express the same meaning, including tokens that have the same meaning in different languages. For example, "北大" and "北京大学" are equivalent, and the Chinese "北京大学" and the English "Peking University" are equivalent.

A large model integration framework provides standard modular components, integrates different large models, and connects them to various external data sources and APIs.

The following embodiments of this application provide a text unit semantic distance precalculation method and device, and a query method and device, which solve the problems that existing large model integration frameworks require a large amount of computation on external data and must split the data, which damages the original information and loses the semantic information and contextual relationships of the data, so that the final query results are unrelated to the original data.

The principles and implementation of the text unit semantic distance precalculation method and device, and the query method and device, of this embodiment are described in detail below with reference to the accompanying drawings, so that those skilled in the art can understand them without creative effort.

如图1所示,本实施例提供一种文本单位语义距离预计算方法,包括如下步骤。As shown in Figure 1, this embodiment provides a text unit semantic distance precalculation method, which includes the following steps.

步骤S101,获取预计算知识库中的所有文本单位,并基于关联单位获取方式获取每个文本单位的关联文本单位集合。Step S101: Acquire all text units in the precomputed knowledge base, and obtain the associated text unit set of each text unit based on the associated unit acquisition method.

The knowledge base in this embodiment can serve as the external data collection of a large model integration framework. The knowledge base in this embodiment of the present invention has undergone prior data curation and contains most commonly used text units (i.e., tokens). A curated knowledge base has higher data quality, so similarities between text units calculated from it better reflect their true semantic relationships, and the results calculated from these text units are more accurate and are obtained faster than results learned through existing pre-training. A text unit (i.e., a token) can be a word, a phrase, or a sentence. In the knowledge base, text units refer to objects and categories, that is, the words, phrases, and sentences that make them up.

The knowledge base for which the semantic distances between text units are to be precalculated is taken as the precomputed knowledge base. All text units in the precomputed knowledge base are obtained first. The associated text unit set of each text unit in the precomputed knowledge base is then obtained using the associated unit acquisition method.

Further, the process of obtaining the associated text unit set of a single text unit in the precomputed knowledge base using the associated unit acquisition method is as follows. The text unit may be any text unit in the precomputed knowledge base; to distinguish it from other text units, it is referred to as the regular text unit. Because the text units in this embodiment are categories or objects, each text unit has a corresponding description page. First, the description page of the regular text unit is obtained from the precomputed knowledge base. Next, all text units that appear in the text of the description page are obtained as the internal text units of the regular text unit. Finally, the distinct internal text units of the regular text unit are collected to form its associated text unit set. The associated text unit sets of all text units in the precomputed knowledge base can be obtained in this way.

Note that when obtaining the internal text units of a regular text unit, the number of times an internal text unit appears in the description page does not need to be considered. For example, if the distinct text units appearing in the description page of text unit A are text unit B and text unit C, then B and C are the internal text units of A, which means that B and C are each associated with A. The associated text unit set of A can then be written as A->[B,C].
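The extraction of associated text unit sets in step S101 can be sketched as follows. This is a minimal illustration, not the method of the embodiment: the function name `associated_units`, the toy description pages, and the use of plain substring matching to detect a text unit in a page are all assumptions, as is the choice to leave a unit out of its own set.

```python
def associated_units(pages: dict[str, str], units: set[str]) -> dict[str, list[str]]:
    """Map each text unit to the sorted list of other known units found in its description page."""
    result = {}
    for unit, text in pages.items():
        # Presence only: a unit counts once, however often it appears in the page.
        # Substring matching is a stand-in for whatever matcher a real knowledge base uses.
        inner = {u for u in units if u != unit and u in text}
        result[unit] = sorted(inner)
    return result


# Hypothetical description pages for three text units.
pages = {
    "A": "A is related to B and to C; B appears twice.",
    "B": "B mentions A and C.",
    "C": "C mentions A and D.",
}
units = {"A", "B", "C", "D"}
assoc = associated_units(pages, units)
print(assoc["A"])  # the set written as A->[B,C] in the text
```

Using a set for the intermediate result reflects the statement that occurrence counts are ignored.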

Step S102: using the preset object knowledge representation acquisition method, obtain the knowledge representations of all object attribute text units in the precomputed knowledge base based on the associated text unit sets; using the preset category knowledge representation acquisition method, obtain the knowledge representations of all category attribute text units in the precomputed knowledge base.

Once the associated text unit sets of all text units have been obtained, the knowledge representation of each text unit can be derived from them. The text units in this embodiment are the objects and categories in the knowledge base, and the knowledge representation is obtained differently depending on whether the text unit is an object or a category.

Further, when the text unit is an object, obtaining the knowledge representation of a single object attribute text unit in the precomputed knowledge base using the preset object knowledge representation acquisition method includes: taking the object attribute text unit as the filtering unit; filtering the associated text unit sets that it faces to select every associated text unit set that satisfies the filtering condition; and finally collecting the text units to which the selected associated text unit sets belong, which yields the knowledge representation of the object attribute text unit.

Here, the object attribute text unit is any object in the precomputed knowledge base. The associated text unit sets that the object attribute text unit faces are all associated text unit sets in the precomputed knowledge base other than its own. The filtering condition is that the associated text unit set contains the object attribute text unit in question.

For example, if the associated text unit set of text unit B is B->[A,C] and that of text unit C is C->[A,D], then the knowledge representation of object attribute text unit A is KRa=[C,B], or KRa=[B,C] after sorting.

By taking each object attribute text unit in turn as the filtering unit, the knowledge representations of all object attribute text units can be obtained in this way.
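The filtering described above amounts to an inverted lookup over the associated text unit sets. The sketch below reproduces the B->[A,C], C->[A,D] example; the function name is hypothetical, and unit names stand in for the knowledge base identifiers that the text says are used for sorting.

```python
def object_knowledge_representation(assoc: dict[str, list[str]], target: str) -> list[str]:
    """Collect the owners of every associated set (other than the target's own) that contains the target."""
    return sorted(
        owner
        for owner, inner in assoc.items()
        if owner != target and target in inner
    )


assoc = {
    "B": ["A", "C"],  # B->[A,C]
    "C": ["A", "D"],  # C->[A,D]
}
kr_a = object_knowledge_representation(assoc, "A")
print(kr_a)  # ['B', 'C']
```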

Further, when the text unit is a category, obtaining the knowledge representation of a single category attribute text unit in the precomputed knowledge base using the preset category knowledge representation acquisition method includes: obtaining from the precomputed knowledge base all object attribute text units that belong to the category attribute text unit, taking them as its object text units, and collecting the knowledge representations of all of these object text units to obtain the knowledge representation of the category attribute text unit. Here, the category attribute text unit is any category in the precomputed knowledge base. The knowledge representations of all category attribute text units can be obtained in this way.

How the object attribute text units belonging to a category attribute text unit are obtained is determined by the configuration of the precomputed knowledge base itself and is not described further here.
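A category's knowledge representation can be sketched as the merge of its member objects' knowledge representations. The text only says that the representations are collected, so treating the result as a de-duplicated, sorted union is an assumption, as are the function name and the sample data.

```python
def category_knowledge_representation(
    members: list[str], object_kr: dict[str, list[str]]
) -> list[str]:
    """Merge the knowledge representations of all objects that belong to a category."""
    merged: set[str] = set()
    for obj in members:
        # Objects without a knowledge representation contribute nothing.
        merged.update(object_kr.get(obj, []))
    return sorted(merged)


# Hypothetical data: a category whose members are objects A and E.
object_kr = {"A": ["B", "C"], "E": ["C", "F"]}
kr_category = category_knowledge_representation(["A", "E"], object_kr)
print(kr_category)  # ['B', 'C', 'F']
```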

Because the text unit for each meaning in the knowledge base has a unique identifier (i.e., ID), the text units in each knowledge representation can be sorted by identifier value. This arrangement facilitates the subsequent determination of relationships between text units.

Step S103: obtain all text unit pairs that can be formed from all text units; based on the knowledge representations of the text units, calculate the semantic distances of all text unit pairs using the text unit relationship determination method; and collect all text unit pairs for which a semantic distance is calculated, together with the corresponding semantic distances, into the semantic distance library of the precomputed knowledge base.

Specifically, all text unit pairs that can be formed from the text units are obtained first, where a text unit pair consists of any two of the text units. For example, the text unit pairs formed from text units A, B, and C are (A, B), (B, C), and (A, C). Once the knowledge representations of all text units are available, the relationship between any two text units can be determined from them, and the semantic distance library of the precomputed knowledge base can then be obtained. This embodiment determines the relationship between the two text units of each pair using the text unit relationship determination method.

Further, one text unit of a text unit pair is designated the first text unit and the other the second text unit. Determining the relationship between the two text units using the text unit relationship determination method then proceeds as follows. First, it is determined whether the knowledge representation of the first text unit and that of the second text unit intersect. If they do, the two text units are related, and the semantic distance between them can be calculated using either the Ochiai coefficient or the Jaccard index. If the two knowledge representations do not intersect, the two text units are unrelated; no semantic distance needs to be calculated for them, that is, no semantic distance can be calculated between the first text unit and the second text unit.

The semantic distance between the first text unit and the second text unit is obtained using the Jaccard index as follows. Let the first text unit be A with knowledge representation KRa, and let the second text unit be B with knowledge representation KRb. The semantic distance between the two based on the Jaccard index is:

$$D(A,B)=\frac{|KR_a\cap KR_b|}{|KR_a\cup KR_b|}$$

The semantic distance between the first text unit and the second text unit is obtained using the Ochiai coefficient as follows. Let the first text unit be A with knowledge representation KRa, and let the second text unit be B with knowledge representation KRb. The semantic distance between the two based on the Ochiai coefficient is:

$$D(A,B)=\frac{|KR_a\cap KR_b|}{\sqrt{|KR_a|\,|KR_b|}}$$
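The two measures can be sketched in code using the standard set-based definitions of the Jaccard index and the Ochiai coefficient. Returning `None` when the knowledge representations do not intersect is an assumption that mirrors the statement that no semantic distance is calculated for unrelated text units; the function names are illustrative.

```python
import math
from typing import Optional


def jaccard_distance(kr_a: list[str], kr_b: list[str]) -> Optional[float]:
    """Jaccard index of two knowledge representations, or None if they share nothing."""
    a, b = set(kr_a), set(kr_b)
    shared = a & b
    if not shared:
        return None  # unrelated pair: no semantic distance is recorded
    return len(shared) / len(a | b)


def ochiai_distance(kr_a: list[str], kr_b: list[str]) -> Optional[float]:
    """Ochiai coefficient of two knowledge representations, or None if they share nothing."""
    a, b = set(kr_a), set(kr_b)
    shared = a & b
    if not shared:
        return None
    return len(shared) / math.sqrt(len(a) * len(b))


kr_a = ["B", "C"]
kr_b = ["C", "D"]
print(jaccard_distance(kr_a, kr_b))  # 1 shared unit out of 3 distinct units
print(ochiai_distance(kr_a, kr_b))   # 1 / sqrt(2 * 2) = 0.5
```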

The closer the semantic distance D, the closer the relationship between the tokens; as defined here, a larger value of D indicates a closer relationship. The semantic distance D is interpreted as follows: when D∈[0.6,1], the two text units are extremely similar; when D∈[0.4,0.6], the two text units are relatively similar; when D∈[0.2,0.4], the two text units are not very similar; when D∈[0,0.2], the two text units are dissimilar.
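The interpretation bands can be sketched as a small lookup. The intervals in the text share their endpoints, so this sketch assumes that a boundary value such as 0.6 falls into the higher band; that tie-breaking rule and the function name are assumptions.

```python
def similarity_band(d: float) -> str:
    """Label a semantic distance D in [0, 1] with the band it falls into."""
    if not 0.0 <= d <= 1.0:
        raise ValueError("D must lie in [0, 1]")
    # Assumption: shared endpoints (0.2, 0.4, 0.6) belong to the higher band.
    if d >= 0.6:
        return "extremely similar"
    if d >= 0.4:
        return "relatively similar"
    if d >= 0.2:
        return "not very similar"
    return "dissimilar"


print(similarity_band(0.5))  # relatively similar
```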

All text unit pairs for which a semantic distance can be calculated are obtained in this way. Finally, these text unit pairs and their corresponding semantic distances are collected into the semantic distance library of the precomputed knowledge base. When a text unit is queried, the results can be obtained quickly by looking up semantic distances directly in this library.
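Building the semantic distance library can be sketched as a pass over all unordered pairs of text units. The sketch assumes the Jaccard index as the measure and a dictionary keyed by sorted pairs as the storage layout; both choices, the function name, and the sample data are illustrative only.

```python
from itertools import combinations


def build_distance_library(kr: dict[str, list[str]]) -> dict[tuple[str, str], float]:
    """Compute a distance for every pair of text units whose knowledge representations intersect."""
    library = {}
    for a, b in combinations(sorted(kr), 2):
        set_a, set_b = set(kr[a]), set(kr[b])
        shared = set_a & set_b
        if shared:  # pairs with no shared units are left out of the library
            library[(a, b)] = len(shared) / len(set_a | set_b)
    return library


# Hypothetical knowledge representations for three text units.
kr = {"A": ["B", "C"], "B": ["C", "D"], "C": ["E"]}
library = build_distance_library(kr)
print(library)  # only the pair (A, B) shares a unit
```

Because every pair is visited once, the cost grows quadratically with the number of text units, which is consistent with performing this work once as a precalculation.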

Note that the semantic distance library obtained in this embodiment of the present invention is in fact a collection of semantic distances between text units and is not limited to a hardware database. It can be used with any storage and retrieval mechanism; for example, depending on the needs of the application, it can be stored in a high-speed storage medium or in local or remote storage.

The text unit semantic distance precalculation method provided by this embodiment of the present invention looks up the internal text units in the description pages, then groups together all text units that share the same internal text unit to obtain the knowledge representation of each text unit, and finally builds the semantic distance library from the knowledge representations. Obtaining the semantic distances involves no vector embedding or chunking, so the original information in the data is not damaged and accuracy improves. Moreover, obtaining the semantic distance library is itself the precalculation of the text units, and its results can be saved directly in various data stores or databases for querying. Compared with existing pre-training and large model integration frameworks, this saves the time spent on vector embedding and computation, that is, the time and computing cost of large model training; it also helps achieve semantic alignment faster during training and improves the reasoning ability of large models.

As shown in Figure 2, this embodiment provides a text unit semantic distance precalculation device, which includes an associated text unit acquisition module, a knowledge representation acquisition module, and a semantic distance library acquisition module.

The associated text unit acquisition module is used to obtain all text units in the precomputed knowledge base and to obtain the associated text unit set of each text unit using the associated unit acquisition method.

The knowledge representation acquisition module is used to obtain, using the preset object knowledge representation acquisition method and based on the associated text unit sets, the knowledge representations of all object attribute text units in the precomputed knowledge base, and to obtain, using the preset category knowledge representation acquisition method, the knowledge representations of all category attribute text units in the precomputed knowledge base.

The semantic distance library acquisition module is used to obtain all text unit pairs that can be formed from all text units, to calculate the semantic distances of all text unit pairs using the text unit relationship determination method based on the knowledge representations of the text units, and to collect all text unit pairs for which a semantic distance is calculated, together with the corresponding semantic distances, into the semantic distance library of the precomputed knowledge base.

Here, an object attribute text unit is an object in the precomputed knowledge base, and a category attribute text unit is a category in the precomputed knowledge base.

The text unit semantic distance precalculation device provided by this embodiment of the present invention looks up the internal text units in the description pages, then groups together all text units that share the same internal text unit to obtain the knowledge representation of each text unit, and finally builds the semantic distance library from the knowledge representations. Obtaining the semantic distances involves no vector embedding or chunking, so the original information in the data is not damaged and accuracy improves. Moreover, obtaining the semantic distance library is itself the precalculation of the text units, and its results can be saved directly in various data stores and databases for querying. Compared with existing pre-training and large model integration frameworks, this saves the time spent on vector embedding and computation, that is, the time and computing cost of large model training; it also helps achieve semantic alignment faster during training and improves the reasoning ability of large models.

As shown in Figure 3, this embodiment provides a knowledge base text unit query method, which includes the following steps.

Step S301: obtain the text unit to be queried.

The text unit to be queried is obtained through text box input or through selection.

Step S302: find in the semantic distance library of the knowledge base all text unit pairs that contain the text unit to be queried, and take them as the query text unit pairs of that text unit; then, according to the corresponding semantic distances, collect the other text units of all or some of the query text unit pairs into a query result list.

The query is carried out according to the specific circumstances of the text unit to be queried. The query types include same-language queries, multi-language queries, and polysemous-word queries. The specific procedure is as follows. The language of the text unit to be queried is designated the first language, and the language of the knowledge base is designated the second language. The semantic distance library of the knowledge base being queried is assumed to have been obtained by the text unit semantic distance precalculation method described above.

When the first language and the second language are the same, the query is a same-language query. In this case it must also be determined whether the text unit to be queried is polysemous, specifically by checking whether the identifier corresponding to the text unit is unique. If it is, the text unit to be queried has a single meaning. The text unit can then be used directly as the query condition: all text unit pairs containing it are found in the semantic distance library of the knowledge base and taken as its query text unit pairs; the other text units in these pairs are sorted in descending order of the corresponding semantic distance; and finally, according to the query conditions (for example, the number of results to display), all or some of the sorted text units are collected into the query result list. Here, the other text unit of a query text unit pair is the text unit in the pair that is not the text unit to be queried.
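The same-language lookup for a text unit with a single meaning can be sketched as follows. The dictionary layout of the library, the `limit` parameter standing in for the display condition, and the sample distances are assumptions made for illustration.

```python
from typing import Optional


def query_nearest(
    library: dict[tuple[str, str], float], unit: str, limit: Optional[int] = None
) -> list[tuple[str, float]]:
    """Return the partner of `unit` in every pair that contains it, largest D first."""
    hits = []
    for (a, b), distance in library.items():
        if unit == a:
            hits.append((b, distance))
        elif unit == b:
            hits.append((a, distance))
    # Descending order of D, as described in the text.
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits if limit is None else hits[:limit]


# Hypothetical precalculated library.
library = {("A", "B"): 0.5, ("A", "C"): 0.8, ("B", "C"): 0.1}
print(query_nearest(library, "A"))           # [('C', 0.8), ('B', 0.5)]
print(query_nearest(library, "A", limit=1))  # [('C', 0.8)]
```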

During the above query, if the text unit to be queried is not a text unit in the knowledge base, the query can be completed through an equivalent of that text unit that is in the knowledge base. For example, suppose the query asks for the text units closest to "Beijing university", which does not exist in the knowledge base. The corresponding text unit "Peking university" is found in the knowledge base through the equivalent mapping, and the list of query results closest to "Peking university" is returned.

Note that an equivalent is a group of text units that express the same meaning, including text units that have the same meaning in different languages. For example, "北大" and "北京大学" are equivalent, and the Chinese "北京大学" and the English "Peking University" are equivalent. Obtaining equivalents within a knowledge base in one language and obtaining equivalents between knowledge bases in different languages are both done by conventional means and are not described further here. The text unit for each meaning in the knowledge base has a unique identifier.
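The fallback to an equivalent can be sketched as a lookup performed before the query itself. How the equivalent mapping is built is left open by the text, so the `equivalents` dictionary, the function name, and the sample entries here are hypothetical.

```python
from typing import Optional


def resolve_unit(
    unit: str, known_units: set[str], equivalents: dict[str, list[str]]
) -> Optional[str]:
    """Return the unit itself if the knowledge base has it, else the first equivalent that it has."""
    if unit in known_units:
        return unit
    for candidate in equivalents.get(unit, []):
        if candidate in known_units:
            return candidate
    return None  # no usable equivalent: the query cannot be answered


# Hypothetical knowledge base contents and equivalent mapping.
known_units = {"Peking university", "Tsinghua university"}
equivalents = {"Beijing university": ["北京大学", "Peking university"]}
resolved = resolve_unit("Beijing university", known_units, equivalents)
print(resolved)  # Peking university
```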

If the identifier corresponding to the text unit to be queried is not unique, the text unit is polysemous. In this case, a supplementary query condition must be obtained from the user as the advanced qualification condition for the text unit. The identifier that satisfies the advanced qualification condition is then selected from all identifiers corresponding to the text unit and taken as the identifier of the unit to be queried. The query result list is then looked up in the semantic distance library of the knowledge base based on this identifier, as follows: all text unit pairs containing the identifier are found in the semantic distance library and taken as its query text unit pairs; the other text units in these pairs are sorted in descending order of the corresponding semantic distance; and finally, according to the query conditions, all or some of the sorted text units are collected into the query result list. Here, the other text unit of a query text unit pair is the text unit in the pair that does not correspond to the identifier being queried. Whether all of the sorted text units or only some of them are collected into the query result list depends on the display conditions set by the knowledge base itself. The advanced qualification condition can be entered as text or offered as options for the user to select.

For text units with multiple meanings, a unique feature can be added to tell them apart; the advanced qualification condition in this embodiment is this unique feature for distinguishing polysemous words. The unique feature can be, for example, a country, a field, or an industry, and it can be marked in parentheses or in another way. For example, the text unit "Northwestern University" may refer to text units such as "Northwestern University (United States)" and "Northwest University (China)"; here the country feature has been added and marked in parentheses. When querying the text unit "Northwestern University" located in "China", the text unit "Northwestern University" that is closest to the text unit "China" is the one being queried.
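Selecting an identifier for a polysemous text unit can be sketched as a filter over its candidate meanings. The sense inventory below, including the numeric identifiers and the `country` feature key, is entirely hypothetical and only illustrates the idea of an advanced qualification condition.

```python
from typing import Optional

# Hypothetical sense inventory: one surface form mapped to several identifiers,
# each tagged with a distinguishing feature.
SENSES = {
    "Northwestern University": [
        {"id": 101, "country": "United States"},
        {"id": 102, "country": "China"},
    ],
}


def resolve_identifier(
    surface: str, feature: Optional[str] = None, value: Optional[str] = None
) -> Optional[int]:
    """Pick the single identifier of `surface` that matches the qualification condition."""
    candidates = SENSES.get(surface, [])
    if len(candidates) == 1:
        return candidates[0]["id"]  # single meaning: no qualification needed
    matches = [c["id"] for c in candidates if feature and c.get(feature) == value]
    # Ambiguous or unmatched queries return None so the caller can ask the user again.
    return matches[0] if len(matches) == 1 else None


print(resolve_identifier("Northwestern University", "country", "China"))  # 102
```

The selected identifier would then be used as the key for the semantic distance lookup.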

When the first language and the second language differ, the query is a cross-language query. In this case the text unit to be queried must first be converted, through equivalent substitution, software translation, or similar means, into an equivalent text unit in the second language. All text unit pairs containing the equivalent text unit are then found in the semantic distance library of the knowledge base and taken as its query text unit pairs; the other text units in these pairs are sorted in descending order of the corresponding semantic distance; and finally, according to the query conditions, all or some of the sorted text units are taken as the equivalent query units. The set of equivalent query units obtained at this point is not yet the result of the query. Each equivalent query unit must be converted, again through equivalent substitution, software translation, or similar means, into a query result in the first language, and these query results are then collected into the query result list for the text unit to be queried. Whether all of the sorted text units or only some of them are collected into the query result list depends on the display conditions set by the knowledge base itself.

For example, suppose the knowledge base is in English but the semantic distances between Chinese text units are to be queried. To find the tokens closest to the token "我思,故我在" ("I think, therefore I am"), the corresponding equivalent in the English knowledge base, "Cogito, ergo sum", is found through the equivalent mapping, and the list of tokens with the closest semantic distance to that equivalent in the English knowledge base is retrieved. Each token in that list is then converted into Chinese through its equivalent, and the resulting Chinese list is returned.
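The cross-language flow can be sketched as three steps: map the query into the knowledge base language, look up the nearest text units, and map the results back. The two mapping dictionaries, the library entries, and the distances are hypothetical sample data; keeping a result unchanged when it has no equivalent in the query language is also an assumption.

```python
from typing import Optional


def cross_language_query(
    unit: str,
    to_kb: dict[str, str],
    from_kb: dict[str, str],
    library: dict[tuple[str, str], float],
    limit: Optional[int] = None,
) -> list[tuple[str, float]]:
    """Query a knowledge base in another language via equivalent mappings."""
    kb_unit = to_kb.get(unit)
    if kb_unit is None:
        return []  # no equivalent in the knowledge base language
    hits = []
    for (a, b), distance in library.items():
        if kb_unit == a:
            hits.append((b, distance))
        elif kb_unit == b:
            hits.append((a, distance))
    hits.sort(key=lambda hit: hit[1], reverse=True)
    if limit is not None:
        hits = hits[:limit]
    # Map each result back; fall back to the knowledge base term if no equivalent is known.
    return [(from_kb.get(other, other), distance) for other, distance in hits]


# Hypothetical mappings and distances, for illustration only.
query_unit = "我思,故我在"
to_kb = {query_unit: "Cogito, ergo sum"}
from_kb = {"René Descartes": "笛卡尔", "Rationalism": "理性主义"}
library = {
    ("Cogito, ergo sum", "Rationalism"): 0.5,
    ("Cogito, ergo sum", "René Descartes"): 0.7,
    ("Empiricism", "Rationalism"): 0.3,
}
results = cross_language_query(query_unit, to_kb, from_kb, library)
print(results)
```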

Cross-language querying of text units also applies to the following situation.

If the corpus for a language is of poor quality, increasing the size of the corpus cannot guarantee the quality of a large model's training results. In that case, training results for another language that have been verified to work well can be used to obtain training results for the required language through equivalent mapping. For example, suppose the corpus of Chinese knowledge base A is of poor quality and the large model training results are unsatisfactory, while the training results for English knowledge base B are good. The tokens in B can be mapped into Chinese through their equivalents, yielding the semantic distance relationships between Chinese tokens. This can greatly improve the capabilities of a Chinese language model.

The knowledge base text unit query method provided by this embodiment of the present invention enables text units to be queried. It also enables fast queries of text data in different languages and fast queries of text units that have multiple meanings.

如图4所示,本实施例提供一种知识库文本单位查询装置,包括查询单位获取模块和查询结果获取模块。As shown in Figure 4, this embodiment provides a knowledge base text unit query device, which includes a query unit acquisition module and a query result acquisition module.

查询单位获取模块用于获取待查询文本单位。The query unit acquisition module is used to obtain the text unit to be queried.

The query result acquisition module is used to find, in the semantic distance library of the knowledge base, all text unit pairs that contain the text unit to be queried, which serve as the query text unit pairs of that text unit. Based on the corresponding semantic distances, it then collects the non-queried text units of all or some of the query text unit pairs into a query result list.

Here, the non-queried text unit of a query text unit pair is the other text unit in the pair besides the text unit to be queried, and the semantic distance library of the knowledge base is obtained through the text unit semantic distance precalculation method described above.
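The behaviour of the query result acquisition module can be sketched as a lookup over a precomputed pair library. The version below first builds a per-unit index so that each query is a single dictionary lookup; that indexing step, the function names, the sample data, and the smaller-is-closer ordering are assumptions made for illustration. If the library stores a similarity-style score instead, the sort order would be reversed.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Library = Dict[Tuple[str, str], float]
Index = Dict[str, List[Tuple[float, str]]]


def build_index(library: Library) -> Index:
    """Group the pairs by text unit, with each unit's partners sorted by distance."""
    index = defaultdict(list)
    for (a, b), distance in library.items():
        index[a].append((distance, b))
        index[b].append((distance, a))
    for neighbours in index.values():
        neighbours.sort()  # assumption: smaller distance means closer
    return dict(index)


def query(unit: str, index: Index, top_k: Optional[int] = None) -> List[str]:
    """Return the non-queried units of all (or the top_k closest) pairs."""
    neighbours = index.get(unit, [])
    if top_k is not None:
        neighbours = neighbours[:top_k]
    return [other for _, other in neighbours]


# Hypothetical semantic distance library.
library: Library = {
    ("cat", "dog"): 0.3,
    ("cat", "tiger"): 0.1,
    ("dog", "wolf"): 0.2,
}
index = build_index(library)
print(query("cat", index))
print(query("dog", index, top_k=1))
```

The `top_k` parameter corresponds to returning "all or some" of the query text unit pairs; passing `None` returns every partner.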

The knowledge base text unit query device provided by this embodiment of the present invention enables querying of text units. It further enables fast querying of text data in different languages, as well as fast querying of text units that have multiple meanings.

Although the embodiments of the present invention are disclosed above, the content described is only an implementation adopted to facilitate understanding of the invention and is not intended to limit it. Any person skilled in the art to which the invention belongs may make any modifications and changes in the form and details of implementation without departing from the spirit and scope disclosed herein, but the scope of protection of the invention shall still be as defined by the appended claims.

Claims (6)

1. A text unit semantic distance precalculation method, comprising:
obtaining all text units in a precalculation knowledge base, and obtaining the associated text unit set of each text unit based on an associated unit acquisition method;
obtaining, through a preset object knowledge representation acquisition method and based on the associated text unit sets, the knowledge representations of all object attribute text units in the precalculation knowledge base, and obtaining, through a preset category knowledge representation acquisition method, the knowledge representations of all category attribute text units in the precalculation knowledge base;
obtaining all text unit pairs that can be formed from all the text units, calculating the semantic distance of every text unit pair through a text unit relationship determination method based on the knowledge representations of the text units, and collecting all text unit pairs for which a semantic distance has been calculated, together with the corresponding semantic distances, into the semantic distance library of the precalculation knowledge base;
wherein the object attribute text units are the objects in the precalculation knowledge base, and the category attribute text units are the categories in the precalculation knowledge base;
obtaining the associated text unit set of a text unit based on the associated unit acquisition method comprises:
obtaining the description page of a regular text unit from the precalculation knowledge base;
taking the text units in the description page as internal text units of the regular text unit, and collecting the internal text units of all types of the regular text unit into the associated text unit set of the regular text unit;
wherein the regular text unit is any text unit in the precalculation knowledge base;
obtaining, through the preset object knowledge representation acquisition method and based on the associated text unit sets, the knowledge representation of an object attribute text unit in the precalculation knowledge base comprises:
using the object attribute text unit as a filtering unit to filter the associated text unit sets it faces, and collecting the text units corresponding to the associated text unit sets that satisfy the filtering condition into the knowledge representation of the object attribute text unit;
wherein the object attribute text unit is any object in the precalculation knowledge base; the associated text unit sets faced by the object attribute text unit comprise all associated text unit sets in the precalculation knowledge base other than the one corresponding to the object attribute text unit itself; and the filtering condition is that the associated text unit set contains the filtering unit;
obtaining, through the preset category knowledge representation acquisition method, the knowledge representation of a single category attribute text unit in the precalculation knowledge base comprises:
obtaining the object attribute text units belonging to the category attribute text unit as object text units, and collecting the knowledge representations of all the object text units of the category attribute text unit into the knowledge representation of the category attribute text unit;
wherein the category attribute text unit is any category in the precalculation knowledge base.

2. The precalculation method according to claim 1, characterized in that calculating the semantic distance of a text unit pair through the text unit relationship determination method comprises:
designating one text unit of the text unit pair as the first text unit and the other as the second text unit;
determining whether the knowledge representation of the first text unit and the knowledge representation of the second text unit intersect; if so, the first text unit and the second text unit are related, and the semantic distance between them is calculated based on the two knowledge representations; otherwise, the first text unit and the second text unit are not related.

3. The precalculation method according to claim 2, characterized in that the semantic distance between the first text unit and the second text unit is calculated, based on the knowledge representation of the first text unit and the knowledge representation of the second text unit, through an Ochiai coefficient calculation method or a Jaccard index calculation method.

4. A text unit semantic distance precalculation device, characterized in that it comprises an associated text unit acquisition module, a knowledge representation acquisition module and a semantic distance library acquisition module:
the associated text unit acquisition module is used to obtain all text units in a precalculation knowledge base, and to obtain the associated text unit set of each text unit based on an associated unit acquisition method;
the knowledge representation acquisition module is used to obtain, through a preset object knowledge representation acquisition method and based on the associated text unit sets, the knowledge representations of all object attribute text units in the precalculation knowledge base, and to obtain, through a preset category knowledge representation acquisition method, the knowledge representations of all category attribute text units in the precalculation knowledge base;
the semantic distance library acquisition module is used to obtain all text unit pairs that can be formed from all the text units, to calculate the semantic distance of every text unit pair through a text unit relationship determination method based on the knowledge representations of the text units, and to collect all text unit pairs for which a semantic distance has been calculated, together with the corresponding semantic distances, into the semantic distance library of the precalculation knowledge base;
wherein the object attribute text units are the objects in the precalculation knowledge base, and the category attribute text units are the categories in the precalculation knowledge base;
obtaining the associated text unit set of a text unit based on the associated unit acquisition method comprises:
obtaining the description page of a regular text unit from the precalculation knowledge base;
taking the text units in the description page as internal text units of the regular text unit, and collecting the internal text units of all types of the regular text unit into the associated text unit set of the regular text unit;
wherein the regular text unit is any text unit in the precalculation knowledge base;
obtaining, through the preset object knowledge representation acquisition method and based on the associated text unit sets, the knowledge representation of an object attribute text unit in the precalculation knowledge base comprises:
using the object attribute text unit as a filtering unit to filter the associated text unit sets it faces, and collecting the text units corresponding to the associated text unit sets that satisfy the filtering condition into the knowledge representation of the object attribute text unit;
wherein the object attribute text unit is any object in the precalculation knowledge base; the associated text unit sets faced by the object attribute text unit comprise all associated text unit sets in the precalculation knowledge base other than the one corresponding to the object attribute text unit itself; and the filtering condition is that the associated text unit set contains the filtering unit;
obtaining, through the preset category knowledge representation acquisition method, the knowledge representation of a single category attribute text unit in the precalculation knowledge base comprises:
obtaining the object attribute text units belonging to the category attribute text unit as object text units, and collecting the knowledge representations of all the object text units of the category attribute text unit into the knowledge representation of the category attribute text unit;
wherein the category attribute text unit is any category in the precalculation knowledge base.

5. A knowledge base text unit query method, comprising:
obtaining a text unit to be queried;
designating the language type of the text unit to be queried as a first language and the language type of the knowledge base as a second language;
if the first language is the same as the second language, determining whether the identifier corresponding to the text unit to be queried is unique; if so, finding in the semantic distance library of the knowledge base all text unit pairs that contain the text unit to be queried, as the query text unit pairs of the text unit to be queried, and, based on the corresponding semantic distances, collecting the non-queried text units of all or some of the query text unit pairs into a query result list; otherwise, obtaining an advanced qualification condition for the text unit to be queried, determining, based on the advanced qualification condition, the identifier of the unit to be queried from among all identifiers corresponding to the text unit to be queried, finding in the semantic distance library of the knowledge base all text unit pairs that contain the identifier of the unit to be queried, as the query text unit pairs of that identifier, and, based on the corresponding semantic distances, collecting the text units corresponding to the identifiers other than the identifier of the unit to be queried in all or some of the query text unit pairs into a query result list;
if the first language differs from the second language, converting the text unit to be queried into an equivalent text unit in the second language, finding in the semantic distance library of the knowledge base all text unit pairs that contain the equivalent text unit, as the query text unit pairs of the equivalent text unit, taking, based on the corresponding semantic distances, the non-equivalent text units of all or some of the query text unit pairs as equivalent query units, converting all the equivalent query units into query results in the first language, and collecting all the query results into a query result list;
wherein each meaning of a text unit in the knowledge base has a unique identifier;
wherein the non-queried text unit of a query text unit pair is the other text unit in the pair besides the text unit to be queried, and the semantic distance library of the knowledge base is obtained by the text unit semantic distance precalculation method of any one of claims 1-3.

6. A knowledge base text unit query device, characterized in that it comprises a query unit acquisition module and a query result acquisition module;
the query unit acquisition module is used to obtain a text unit to be queried;
the query result acquisition module is used to designate the language type of the text unit to be queried as a first language and the language type of the knowledge base as a second language; if the first language is the same as the second language, to determine whether the identifier corresponding to the text unit to be queried is unique; if so, to find in the semantic distance library of the knowledge base all text unit pairs that contain the text unit to be queried, as the query text unit pairs of the text unit to be queried, and, based on the corresponding semantic distances, to collect the non-queried text units of all or some of the query text unit pairs into a query result list; otherwise, to obtain an advanced qualification condition for the text unit to be queried, to determine, based on the advanced qualification condition, the identifier of the unit to be queried from among all identifiers corresponding to the text unit to be queried, to find in the semantic distance library of the knowledge base all text unit pairs that contain the identifier of the unit to be queried, as the query text unit pairs of that identifier, and, based on the corresponding semantic distances, to collect the text units corresponding to the identifiers other than the identifier of the unit to be queried in all or some of the query text unit pairs into a query result list;
if the first language differs from the second language, to convert the text unit to be queried into an equivalent text unit in the second language, to find in the semantic distance library of the knowledge base all text unit pairs that contain the equivalent text unit, as the query text unit pairs of the equivalent text unit, to take, based on the corresponding semantic distances, the non-equivalent text units of all or some of the query text unit pairs as equivalent query units, to convert all the equivalent query units into query results in the first language, and to collect all the query results into a query result list;
wherein each meaning of a text unit in the knowledge base has a unique identifier;
the non-queried text unit of a query text unit pair is the other text unit in the pair besides the text unit to be queried, and the semantic distance library of the knowledge base is obtained by the text unit semantic distance precalculation method of any one of claims 1-3.

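The precalculation steps of claims 1-3 can be sketched as set operations. In this illustration, the knowledge representation of an object is the set of other text units whose associated sets contain it, a category's representation is the union of its members' representations, and only pairs with intersecting representations receive a value. The sample data and all function names are hypothetical. The sketch stores the raw Ochiai or Jaccard value, for which a larger value means more overlap; whether the method stores that value directly or a transform of it is not stated in the claims and is left open here.

```python
import math
from itertools import combinations
from typing import Callable, Dict, List, Set, Tuple

Sets = Dict[str, Set[str]]


def object_representations(associated: Sets) -> Sets:
    """Representation of X: every other unit whose associated set contains X."""
    return {
        obj: {unit for unit, units in associated.items() if unit != obj and obj in units}
        for obj in associated
    }


def category_representations(members: Dict[str, List[str]], reps: Sets) -> Sets:
    """Representation of a category: union of its member objects' representations."""
    return {
        category: set().union(*(reps.get(obj, set()) for obj in objects))
        for category, objects in members.items()
    }


def ochiai(a: Set[str], b: Set[str]) -> float:
    """Ochiai coefficient: |A ∩ B| / sqrt(|A| * |B|)."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard index: |A ∩ B| / |A ∪ B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_library(
    reps: Sets, coefficient: Callable[[Set[str], Set[str]], float] = ochiai
) -> Dict[Tuple[str, str], float]:
    """Score every pair whose representations intersect; unrelated pairs are omitted."""
    library = {}
    for u, v in combinations(sorted(reps), 2):
        if reps[u] & reps[v]:
            library[(u, v)] = coefficient(reps[u], reps[v])
    return library


# Hypothetical associated text unit sets (units found on each description page).
associated: Sets = {
    "Paris": {"France", "Seine"},
    "Lyon": {"France", "Rhône"},
    "France": {"Paris", "Lyon"},
    "Seine": {"Paris"},
    "Rhône": {"Lyon"},
}
reps = object_representations(associated)
category_reps = category_representations({"French cities": ["Paris", "Lyon"]}, reps)
library = build_library(reps)
print(library[("Lyon", "Paris")])
```

In this sample, "Paris" and "Lyon" are related because both appear on the "France" page, whereas "France" and "Paris" receive no entry because their representations do not intersect.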
CN202311569661.1A 2023-11-23 2023-11-23 Text unit semantic distance precalculation method and device, query method and device Active CN117272073B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311569661.1A CN117272073B (en) 2023-11-23 2023-11-23 Text unit semantic distance precalculation method and device, query method and device

Publications (2)

Publication Number Publication Date
CN117272073A CN117272073A (en) 2023-12-22
CN117272073B true CN117272073B (en) 2024-03-08

Family

ID=89220074

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311569661.1A Active CN117272073B (en) 2023-11-23 2023-11-23 Text unit semantic distance precalculation method and device, query method and device

Country Status (1)

Country Link
CN (1) CN117272073B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2025201409A1 (en) * 2024-03-26 2025-10-02 杭州朗目达信息科技有限公司 Semantic distance-based textual inference method and apparatus, storage medium, and terminal
CN117973544B (en) * 2024-03-26 2024-06-25 杭州朗目达信息科技有限公司 Text unit reasoning method device based on semantic distance, storage medium and terminal

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101599011A (en) * 2008-06-05 2009-12-09 北京书生国际信息技术有限公司 Document processing system and method
CN109643308A (en) * 2016-08-23 2019-04-16 伊路米纳有限公司 Determine the semantic distance system and method for relevant ontology data
CN112131883A (en) * 2020-09-30 2020-12-25 腾讯科技(深圳)有限公司 Language model training method and device, computer equipment and storage medium
CN112687397A (en) * 2020-12-31 2021-04-20 四川大学华西医院 Rare disease knowledge base processing method and device and readable storage medium
CN113761208A (en) * 2021-09-17 2021-12-07 福州数据技术研究院有限公司 Scientific and technological innovation information classification method and storage device based on knowledge graph
CN115248839A (en) * 2022-07-28 2022-10-28 中科极限元(杭州)智能科技股份有限公司 A knowledge system-based long text retrieval method and device
CN116578724A (en) * 2023-07-14 2023-08-11 杭州朗目达信息科技有限公司 Knowledge base knowledge structure construction method and device, storage medium and terminal
CN116701431A (en) * 2023-05-25 2023-09-05 东云睿连(武汉)计算技术有限公司 Data retrieval method and system based on large language model
CN116932694A (en) * 2023-07-19 2023-10-24 神思电子技术股份有限公司 Intelligent retrieval method, device and storage medium for knowledge base
CN117033744A (en) * 2023-08-10 2023-11-10 中国工商银行股份有限公司 Data query method, device, storage medium and electronic equipment

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070106499A1 (en) * 2005-08-09 2007-05-10 Kathleen Dahlgren Natural language search system
US20070288452A1 (en) * 2006-06-12 2007-12-13 D&S Consultants, Inc. System and Method for Rapidly Searching a Database
WO2008046104A2 (en) * 2006-10-13 2008-04-17 Collexis Holding, Inc. Methods and systems for knowledge discovery
CN103049532A (en) * 2012-12-21 2013-04-17 东莞中国科学院云计算产业技术创新与育成中心 Knowledge base engine construction and query method based on emergency management of emergencies

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Jung, Sangkeun et al..Semantic vector learning for natural language understanding.《Computer Speech and Language》.2019,第56卷第130-145页. *
Silva, F. et al..A knowledge-based retrieval model.《 Proceedings 21st International Conference on Software Engineering & Knowledge Engineering (SEKE 2009)》.2009,第558-63页. *
刘兴林等.基于互联网的词汇语义知识库构建框架研究.《计算机与现代化》.2010,(第10期),第8-11页. *
刘贤达.基于语义的文本关联性分析.《中国科学院机构知识库网格》.2015,全文. *
谢金峰等.基于多语义相似性的关系检测方法.《西北工业大学学报》.2021,第39卷(第6期),第1387-1394页. *

Also Published As

Publication number Publication date
CN117272073A (en) 2023-12-22

Similar Documents

Publication Publication Date Title
US12399893B2 (en) Question-answering system for answering relational questions by utilizing two paths where at least one path uses BERT model
CN111753099B (en) Method and system for enhancing relevance of archive entity based on knowledge graph
US11853352B2 (en) Method and apparatus for establishing image set for image recognition, network device, and storage medium
CN111680173A (en) A CMR Model for Unified Retrieval of Cross-Media Information
CN110399457A (en) An intelligent question answering method and system
CN111159330A (en) Database query statement generation method and device
CN117272073B (en) Text unit semantic distance precalculation method and device, query method and device
CN118917305B (en) A RAG system optimization method, system, electronic device and storage medium
CN108563773A (en) The accurate search ordering method of legal provision of knowledge based collection of illustrative plates
CN114330335A (en) Keyword extraction method, device, equipment and storage medium
CN117708308B (en) RAG natural language intelligent knowledge base management method and system
CN113946686A (en) Electric power marketing knowledge map construction method and system
CN118626611A (en) Retrieval method, device, electronic device and readable storage medium
CN115470396A (en) Information retrieval method, server, medium and product
CN115964468A (en) Rural information intelligent question-answering method and device based on multilevel template matching
US20090234852A1 (en) Sub-linear approximate string match
CN109522396B (en) Knowledge processing method and system for national defense science and technology field
CN120030172B (en) A method, device, equipment and medium for constructing a scientific and technological literature knowledge graph
CN120196735A (en) A method, device, equipment and storage medium for determining question and answer generated by retrieval enhancement
CN115617689A (en) A Software Defect Location Method Based on CNN Model and Domain Features
CN114780700A (en) Intelligent question-answering method, device, equipment and medium based on machine reading understanding
CN114328895A (en) News abstract generation method and device and computer equipment
CN113127650A (en) Technical map construction method and system based on map database
CN120045560A (en) Table lookup method, apparatus, device, computer readable storage medium and computer program product
CN119537672A (en) A search processing method and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant