
CN119903915A - A data reasoning method based on distributed system and related equipment - Google Patents


Info

Publication number
CN119903915A
CN119903915A
Authority
CN
China
Prior art keywords
data
matching degree
query information
node
nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202411959541.7A
Other languages
Chinese (zh)
Inventor
刘永超
洪春涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202411959541.7A
Publication of CN119903915A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 Computing arrangements using knowledge-based models
    • G06N 5/04 Inference or reasoning models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract


The present specification provides a data reasoning method based on a distributed system, and related equipment. The distributed system includes multiple nodes, each storing a different data slice. The method includes: obtaining target data selected by the multiple nodes from the data sets they each retrieved, where the data set retrieved by any node contains data from that node's data slice that matches query information input by a user; determining the matching degrees between the multiple target data and the query information, and determining a matching-degree threshold from the multiple matching degrees corresponding to the multiple target data, such that the number of data items across the retrieved data sets whose matching degree with the query information is higher than the threshold is greater than or equal to a preset screening number; obtaining, from each node, the inference data it screened from its data set whose matching degree with the query information is higher than the threshold; and assisting an LLM model in performing the inference task based on the inference data.

Description

Data reasoning method based on distributed system and related equipment
Technical Field
One or more embodiments of the present disclosure relate to the field of data reasoning technologies, and in particular, to a data reasoning method based on a distributed system and related devices.
Background
Retrieval-Augmented Generation (RAG), a technique combining information retrieval and text generation, aims to improve the generation quality of Large Language Models (LLMs) by using an external knowledge base, and has successfully alleviated problems such as hallucination, lack of domain-specific knowledge, and outdated information. RAG typically involves two stages, retrieval and generation: first, a number of data items relevant to the query information entered by the user are found in a large corpus; then the retrieved data is used to assist the reasoning of a pre-trained LLM model to generate an accurate query result.
In general, the above data retrieval stage can be implemented on a single machine; however, when the data scale is very large and a single machine cannot hold all the data, efficiently implementing data retrieval on a distributed system becomes particularly important.
Disclosure of Invention
In view of this, one or more embodiments of the present description provide a data reasoning method and related apparatus based on a distributed system.
In a first aspect, the present specification provides a data reasoning method based on a distributed system, where the distributed system includes a plurality of nodes each storing a different data slice. The method includes:
obtaining target data selected by the plurality of nodes from their respectively retrieved data sets, where the data set retrieved by any node includes a number of data items in the data slice stored by that node that match query information input by a user;
determining matching degrees between the plurality of target data selected by the plurality of nodes and the query information, and determining a matching-degree threshold from the plurality of matching degrees corresponding to the plurality of target data, where the number of data items, contained in the plurality of data sets retrieved by the plurality of nodes, whose matching degree with the query information is higher than the threshold is greater than or equal to a preset screening number;
acquiring the inference data screened out by each node from its data set, namely the data whose matching degree with the query information is higher than the threshold, and aggregating the acquired inference data into an inference data set;
assisting an LLM model in performing the reasoning task based on the inference data set, and generating a query result corresponding to the query information.
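The four steps above can be sketched end to end. The following Python sketch is illustrative only: the function names, the use of bare matching-degree scores in place of (data, score) pairs, and the stepped target-data sampling are assumptions, not details prescribed by the specification.

```python
def retrieve_top_k(shard_scores, k):
    # Each node retrieves its top-k matches from its own data slice
    # (matching-degree scores stand in for the actual data items).
    return sorted(shard_scores, reverse=True)[:k]

def distributed_screen(shards, k, sample_step, screen_count):
    # Step 1: per-node retrieval, then target-data selection by stepping
    # through each retrieved set at a fixed interval.
    datasets = [retrieve_top_k(s, k) for s in shards]
    samples = [ds[i] for ds in datasets for i in range(0, len(ds), sample_step)]

    # Step 2: choose the highest sampled score that still leaves at least
    # `screen_count` items above it across all retrieved sets.
    def above(t):
        return sum(1 for ds in datasets for s in ds if s > t)
    candidates = [t for t in sorted(samples, reverse=True) if above(t) >= screen_count]
    threshold = candidates[0] if candidates else min(samples)

    # Steps 3-4: each node filters locally; only the survivors would be
    # transmitted, aggregated, and handed to the LLM.
    inference_set = [s for ds in datasets for s in ds if s > threshold]
    return threshold, inference_set
```

Because the threshold is chosen from the sampled target-data scores, only those few samples (rather than every retrieved item) need to be exchanged before the final filtering step.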
In a second aspect, the present specification provides a data reasoning apparatus based on a distributed system, the distributed system comprising a plurality of nodes each storing a different data slice, the apparatus comprising:
an acquisition unit, configured to acquire target data selected by the plurality of nodes from their respectively retrieved data sets, where the data set retrieved by any node includes a number of data items in the data slice stored by that node that match query information input by a user;
a determining unit, configured to determine matching degrees between the plurality of target data selected by the plurality of nodes and the query information, and to determine a matching-degree threshold from the plurality of matching degrees corresponding to the plurality of target data, where the number of data items, contained in the plurality of data sets retrieved by the plurality of nodes, whose matching degree with the query information is higher than the threshold is greater than or equal to a preset screening number;
a data summarizing unit, configured to acquire the inference data screened out by each node from its data set, namely the data whose matching degree with the query information is higher than the threshold, and to aggregate the acquired inference data into an inference data set;
a model reasoning unit, configured to assist an LLM model in performing the reasoning task based on the inference data set and to generate a query result corresponding to the query information.
Correspondingly, the present specification also provides a computing device comprising a memory and a processor, where the memory stores computer programs/instructions executable by the processor, and the processor, when executing the computer programs/instructions, performs the data reasoning method based on a distributed system according to the first aspect.
Accordingly, the present specification also provides a computer readable storage medium having stored thereon a computer program/instruction which, when executed by a processor, performs a data reasoning method based on a distributed system as described in the first aspect above.
Accordingly, the present specification also provides a computer program product comprising computer programs/instructions which, when executed by a processor, perform the distributed system based data reasoning method as described in the first aspect above.
In summary, the nodes of the distributed system may retrieve, from their respectively stored data fragments, data sets matching the query information input by the user, and may further select target data from those retrieved data sets. The matching degrees between the selected target data and the query information are then determined, and from them a matching-degree threshold is determined for further screening partial data out of each data set. On this basis, each node can screen, out of the large amount of data in its retrieved data set, the small amount of data whose matching degree is higher than the threshold; only this small amount of screened data needs to be acquired and summarized subsequently, and the LLM model is assisted in performing the reasoning task based on it. Therefore, on the premise of ensuring data retrieval efficiency and LLM generation quality, unnecessary transmission and computation of poorly matching data are avoided, the data transmission and computation volumes in the distributed system are greatly reduced, and the computation and communication overhead is reduced accordingly.
Drawings
FIG. 1 is a schematic diagram of a system architecture provided by an exemplary embodiment;
FIG. 2 is a flow chart of a data reasoning method based on a distributed system provided by an exemplary embodiment;
FIG. 3 is a schematic diagram of a data reasoning apparatus based on a distributed system according to an exemplary embodiment;
FIG. 4 is a schematic diagram of a computing device according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with one or more embodiments of the present specification. Rather, they are merely examples of apparatus and methods consistent with aspects of one or more embodiments of the present description as detailed in the accompanying claims.
It should be noted that in other embodiments, the steps of the corresponding method are not necessarily performed in the order shown and described in this specification. In some other embodiments, the method may include more or fewer steps than described in this specification. Furthermore, a single step described in this specification may be described as being split into multiple steps in other embodiments, while multiple steps described in this specification may be described as being combined into a single step in other embodiments.
The term "plurality" as used herein means two or more.
Retrieval enhancement generation techniques typically include two phases, retrieval and generation. In the retrieval stage, a plurality of data related to query information input by a user are required to be found from a large amount of data, but when the data size is very huge and a single machine cannot accommodate all the data, the data retrieval needs to be realized based on a distributed system.
The distributed system comprises a plurality of nodes, the plurality of nodes can store a large-scale data set in a distributed mode, for example, the large-scale data set can be divided into a plurality of data fragments, and the plurality of nodes can respectively store different data fragments. In an illustrated embodiment, a plurality of nodes may each retrieve data related to query information entered by a user from a respective stored data slice by running a corresponding search algorithm. Further, a plurality of data respectively retrieved by a plurality of nodes can be summarized to one node (for example, a master node) through a distributed merging (Union) operation, and then the node can assist the pre-trained LLM model to execute corresponding reasoning tasks based on the summarized plurality of data, and output a query result corresponding to query information input by a user.
In an illustrated embodiment, the search algorithm may be an Approximate Nearest Neighbor Search (ANNS) algorithm. Each of the plurality of nodes may construct a corresponding index structure for its stored data slice by running the ANNS algorithm, and then quickly search, based on the index structure, for a preset number of nearest-neighbor data items in that slice, namely the top-k nearest neighbors, where k is the preset number and an integer greater than 1. It should be appreciated that a conventional Exact Nearest Neighbor Search (ENNS) may not meet practical needs when faced with large-scale data sets, as its computational cost rises dramatically with increasing data volume and dimensionality. By allowing a degree of approximation, ANNS significantly increases retrieval speed while still maintaining high accuracy in most cases, making it an effective tool for processing large, high-dimensional data.
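For contrast with ANNS, exact nearest-neighbor search can be sketched as a brute-force scan over a data slice; its per-query cost grows with data volume and dimensionality, which is precisely the cost an ANNS index structure avoids. The vectors and the choice of Euclidean distance below are illustrative assumptions:

```python
import math

def euclidean(a, b):
    # Euclidean distance between two equal-length vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def exact_top_k(query, shard, k):
    # Brute-force ENNS: score every vector in the slice, keep the k nearest.
    return sorted(shard, key=lambda v: euclidean(query, v))[:k]
```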
As described above, distributed technology can effectively solve the problems of data storage and retrieval efficiency for large-scale data sets. However, when the amount of data retrieved by each node is large (for example, k in the millions or even tens of millions), summarizing the data retrieved by all nodes requires transmitting and processing a large volume of data, generating substantial computation and communication overhead that cannot satisfy practical use requirements.
Based on the above, the present disclosure provides a technical solution, where partial data with higher matching degree is further screened from the data set matched with the query information retrieved by each of the plurality of nodes, and the screened partial data is summarized, so as to assist the LLM model in performing the reasoning task, thereby greatly reducing the data transmission amount and the calculation amount in the distributed system, and further reducing the overall calculation and communication overhead.
In practice, the present specification may be applied to a distributed system comprising a plurality of nodes each storing a different data slice. In an illustrated embodiment, target data selected by the plurality of nodes from their respectively retrieved data sets may be obtained, where the data set retrieved by any node contains a number of data items in the data slice stored by that node that match query information input by a user. The matching degrees between the selected target data and the query information can then be determined, and a matching-degree threshold determined from those matching degrees, such that the number of data items, contained in the plurality of data sets retrieved by the plurality of nodes, whose matching degree with the query information is higher than the threshold is greater than or equal to a preset screening number. Each node then screens, from its data set, the inference data whose matching degree with the query information is higher than the threshold; the screened inference data are obtained and aggregated into an inference data set. Finally, the LLM model can be assisted in performing the inference task based on the inference data set, generating a query result corresponding to the query information.
In the above technical solution, the plurality of nodes of the distributed system may retrieve, from their respectively stored data slices, data sets matching the query information input by the user, and may further select target data from those retrieved data sets. The matching degrees between the selected target data and the query information are then determined, and from them a matching-degree threshold is determined for further screening partial data out of each data set. On this basis, each node can screen, out of the large amount of data in its retrieved data set, the small amount of data whose matching degree is higher than the threshold; only this small amount of screened data needs to be acquired and summarized subsequently, and the LLM model is assisted in performing the reasoning task based on it. Therefore, on the premise of ensuring data retrieval efficiency and LLM generation quality, unnecessary transmission and computation of poorly matching data are avoided, the data transmission and computation volumes in the distributed system are greatly reduced, and the computation and communication overhead is reduced accordingly.
Referring to fig. 1, fig. 1 is a schematic diagram of a system architecture according to an exemplary embodiment. One or more embodiments provided herein may be embodied in the system architecture shown in fig. 1 or a similar system architecture. As shown in fig. 1, the distributed system 10 may include a plurality of nodes, such as a node 100a, a node 100b, a node 100c, and a node 100d, among which communication connections may be established by any possible manner, which is not specifically limited in this specification.
In an illustrated embodiment, nodes 100a, 100b, 100c, and 100d in distributed system 10 may be used to store large-scale data sets in a distributed manner. For example, a large-scale data set may be first divided into a plurality of data slices, and different data slices may be stored in nodes 100a, 100b, 100c, and 100d, respectively. For example, a data set may be evenly divided into a plurality of data slices and the plurality of data slices may be distributed to a plurality of nodes for storage. Or if the hardware conditions of the plurality of nodes are different, the data set may be divided into a plurality of data slices with different scales, and the data slices with larger scales (i.e. the data slices with larger data quantity) may be allocated to the nodes with better hardware conditions for storage, which is not particularly limited in the specification.
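The two placement strategies described above, even splitting and capacity-weighted splitting, can be sketched as follows. The proportional-rounding scheme is one possible implementation, not one prescribed by the specification:

```python
def shard_dataset(data, weights):
    # Split `data` into len(weights) slices whose sizes are proportional
    # to each node's capacity weight; equal weights give an even split.
    total = sum(weights)
    shards, start = [], 0
    for i, w in enumerate(weights):
        # The last slice absorbs any rounding remainder.
        end = len(data) if i == len(weights) - 1 else start + round(len(data) * w / total)
        shards.append(data[start:end])
        start = end
    return shards
```

With equal weights this reproduces the even division; with unequal weights, a better-provisioned node simply receives a proportionally larger slice.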
The data set may be a data set related to a specific field, or may be a data set related to a plurality of fields, such as a medical field, a financial field, a video entertainment field, and the like, which are not particularly limited in this specification.
In addition, the present specification is not particularly limited to the data types in the data set. In an embodiment, the data included in the data set may be documents (documents), image data, audio/video data, etc., which is not specifically limited in this specification.
In an illustrated embodiment, nodes 100a, 100b, 100c, and 100d in distributed system 10 may obtain query information entered by a user. Illustratively, the master node (e.g., node 100a) of the plurality of nodes may receive the query information input by the user and transmit it to the slave nodes (e.g., nodes 100b, 100c, and 100d), which is not particularly limited in this specification.
In an illustrated embodiment, the query information may be query text entered by a user via an input device (e.g., a keyboard or touch screen) on the user terminal. By way of example, the query text may be "Who won the 1943 Nobel Prize in Physics?", "What medicine can be taken for a cold with fever?", or "What major traffic accidents occurred in city A in the last two years?".
Further, after the node 100a, the node 100b, the node 100c, and the node 100d acquire the query information input by the user, the data sets matching the query information may be respectively retrieved from the data fragments stored in each. The dataset retrieved by any node may contain several data that match the query information in the data shards stored by that node.
Further, after each node retrieves its data set matching the query information, it can further screen out, from the large amount of data contained in that set, the partial data with a higher matching degree to the query information; this partial data may be called inference data and can subsequently be used to assist the LLM model in performing the inference task.
In an illustrated embodiment, each node may first select target data from its respectively retrieved data set. The matching degrees between the plurality of target data selected by the plurality of nodes and the query information are then determined, and a matching-degree threshold for further screening the inference data out of the retrieved data sets is determined from those matching degrees, as described in detail below with reference to the embodiment of FIG. 2.
Further, after determining the matching degree threshold, each node may further screen the inference data from the retrieved data set that has a matching degree with the query information higher than the matching degree threshold.
Further, through a distributed merging operation, the inference data further screened by the plurality of nodes from their respective data sets can be obtained and aggregated into one inference data set. Illustratively, the slave nodes (e.g., nodes 100b, 100c, and 100d) may send the inference data screened from their retrieved data sets to the master node (e.g., node 100a); the master node likewise screens inference data from its own retrieved data set, and finally aggregates all the screened inference data into the inference data set.
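The merging step at the master node can be sketched as a k-way merge, assuming each node returns its screened items already sorted by matching degree; the (score, item) tuple format is an illustrative assumption:

```python
import heapq

def gather_inference_set(per_node_results):
    # Each element of per_node_results is one node's screened list of
    # (score, item) pairs, sorted by descending score; heapq.merge then
    # yields a single globally ordered inference set without re-sorting.
    return list(heapq.merge(*per_node_results, reverse=True))
```

Because each node has already filtered against the shared threshold, the lists being merged are short, keeping both the transfer and the merge cheap.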
Further, the LLM model can be assisted in performing the inference task based on the summarized inference data set, so as to generate a query result corresponding to the query information. In an illustrated embodiment, a pre-trained LLM model may be deployed on the master node; after gathering the inference data set, the master node may assist the LLM model in performing the inference task based on it, as described in detail below with reference to the embodiment of FIG. 2.
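How the inference data set "assists" the LLM model is not fixed by the specification; one common realization is to splice the screened data into the model's prompt as grounding context. A minimal sketch, in which the prompt template is an assumption:

```python
def build_rag_prompt(query, inference_data):
    # Number each screened item and place it in a context block ahead of
    # the user's question, so the LLM can ground its answer in it.
    context = "\n".join(f"[{i}] {item}" for i, item in enumerate(inference_data, 1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
```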
Further, the distributed system 10 may return the query result generated by the LLM model to the user terminal, and accordingly, an output device (e.g., a display screen) on the user terminal may output and display the query result to the user, which is not limited in this specification.
As described above, when implementing retrieval-augmented generation, after the plurality of nodes in the distributed system 10 retrieve data sets matching the user's query information from their respectively stored data slices, the data sets are not summarized directly in full; instead, each node further screens, out of the large amount of data in its retrieved data set, the small amount of data whose matching degree with the query information is higher than the matching-degree threshold. Only this small amount of well-matching data screened by each node needs to be acquired and summarized subsequently, and the LLM model is assisted in performing the reasoning task based on it. Therefore, on the premise of ensuring data retrieval efficiency and LLM generation quality, unnecessary transmission and computation of poorly matching data are avoided, the data transmission and computation volumes in the distributed system are greatly reduced, and the computation and communication overhead is reduced.
Note that each node in the distributed system 10 may be a desktop computer, a server cluster including a plurality of servers, or the like, which have the above-described functions, and this description is not limited in detail. Furthermore, it should be noted that the system architecture shown in fig. 1 is merely illustrative, and in some possible embodiments, more or fewer devices than shown in the drawings may be further included in the distributed system 10, which is not specifically limited in this disclosure.
Referring to fig. 2, fig. 2 is a flowchart of a data reasoning method based on a distributed system according to an exemplary embodiment. The method may be applied to the distributed system shown in fig. 1, which includes a plurality of nodes each storing a different data slice. As shown in fig. 2, the method may specifically include the following steps S201 to S204.
Step S201, acquiring target data selected by the plurality of nodes from their respectively retrieved data sets, where the data set retrieved by any node includes a number of data items in the data slice stored by that node that match query information input by a user.
In one illustrated embodiment, after obtaining the query information entered by the user, the plurality of nodes may each retrieve a data set matching the query information from the respective stored data fragments. The data set retrieved by any node may include a number of data that matches the query information in the data fragments stored by the node.
The specific implementation of the data search is not particularly limited in this specification.
In an illustrated embodiment, the plurality of nodes may implement data retrieval by running an approximate nearest neighbor search algorithm, obtaining nearest-neighbor data sets matching the query information. Correspondingly, the nearest-neighbor data set retrieved by any node through the approximate nearest neighbor search algorithm may contain the preset number of data items with the highest matching degree to the query information in the data slice stored by that node, namely the top-k nearest neighbors, where k is the preset number and an integer greater than 1.
Wherein the approximate nearest neighbor search algorithm generally comprises two stages of index construction and data search. Correspondingly, when the data retrieval is realized through the approximate nearest neighbor search algorithm, each node can firstly construct a corresponding index structure for each stored data fragment, and then quickly search top-k nearest neighbor data which are most matched with query information input by a user from each stored data fragment based on the index structure, so that a nearest neighbor data set is obtained.
In an illustrated embodiment, the index construction method employed by the approximate nearest neighbor search algorithm may include any of a hashing method (Locality Sensitive Hashing, LSH), a neighbor-graph method (Hierarchical Navigable Small World graphs, HNSW), a similarity search library (Facebook AI Similarity Search, FAISS) method, and the like, which is not limited in detail in this specification.
By way of example, a data retrieval process of the approximate nearest neighbor search algorithm will be described below using a hash method as an example.
First, a series of hash function clusters needs to be defined, each composed of a plurality of hash functions. These hash functions are designed so that similar data in a data set (e.g., movies of the same genre or medical records of the same department) are more likely to be mapped into the same bucket, while dissimilar data are mapped into different buckets.
Further, each data item in the data set is hashed by the hash function cluster, and the hash value combination corresponding to each data item is computed. Data items with the same hash value combination are then allocated to the same bucket.
Further, the hash function cluster is used to hash the query information input by the user, and the hash value combination corresponding to the query information is computed. A bucket with the same hash value combination is then looked up among the plurality of buckets, and the relevant data are retrieved from that bucket. Finally, the preset number of data items with the highest matching degree can be selected from the retrieved data as the top-k nearest-neighbor data.
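The bucketing steps above can be sketched with random-hyperplane hashing, one common LSH family for vector data; the hyperplane family and bit-width below are illustrative choices, not details from the specification:

```python
import random

def make_hyperplanes(dim, n_bits, seed=0):
    # One hash-function cluster: n_bits random hyperplanes in `dim` dimensions.
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

def lsh_key(vec, planes):
    # Each bit is the sign of the dot product with one hyperplane; similar
    # vectors tend to share the same bit pattern and land in the same bucket.
    return tuple(int(sum(p * x for p, x in zip(plane, vec)) > 0) for plane in planes)

def build_buckets(vectors, planes):
    buckets = {}
    for v in vectors:
        buckets.setdefault(lsh_key(v, planes), []).append(v)
    return buckets
```

At query time, only the bucket whose key matches `lsh_key(query, planes)` needs to be scanned, rather than the whole data slice.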
In an illustrated embodiment, the matching degree between the query information and each data may be represented by a distance between a vector corresponding to the query information and a vector corresponding to each data, where the closer the distance between the vectors, the higher the matching degree. The distance may be, for example, a euclidean distance, a cosine distance, or the like, which is not particularly limited in this specification.
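As one concrete realization of the matching degree, cosine similarity between the query vector and a data vector can be computed as follows, treating higher similarity as higher matching degree; the embedding of text into vectors is assumed to happen elsewhere:

```python
import math

def cosine_match(q, d):
    # Cosine similarity: 1.0 for identical direction, 0.0 for orthogonal vectors.
    dot = sum(a * b for a, b in zip(q, d))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_d = math.sqrt(sum(b * b for b in d))
    return dot / (norm_q * norm_d)
```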
Or in some possible embodiments, each node may implement data retrieval by any other possible search algorithm besides the above-mentioned approximate nearest neighbor search algorithm, which is not specifically limited in this specification.
Further, after each node retrieves its data set matching the query information, it can further screen out, from the data contained in that set, the partial data with a higher matching degree to the query information; this partial data may be called inference data and can subsequently be used to assist the LLM model in performing the inference task.
In an illustrative embodiment, a matching degree threshold, used to further screen inference data out of the plurality of data sets retrieved by the plurality of nodes, may be determined based on a pre-configured screening rule.
In an embodiment, the screening rule may include a preset screening number. Accordingly, the matching degree threshold is determined such that the number of data items, among the plurality of data sets retrieved by the plurality of nodes, whose matching degree with the query information is higher than the matching degree threshold is greater than or equal to the screening number.
In the following, an implementation of determining the matching degree threshold will be explained taking an example that the screening rule includes a screening number.
First, each node may first select at least one target data from the respective retrieved data sets. In an embodiment, the target data may be data randomly selected by the node from all data included in the data set, or the target data may be data sequentially selected by the node from all data included in the data set according to a preset step length, or the like, which is not specifically limited in this specification.
Step S202, determining the matching degrees between the plurality of target data selected by the plurality of nodes and the query information, and determining a matching degree threshold from the plurality of matching degrees corresponding to the plurality of target data, where the number of data items, among the plurality of data sets retrieved by the plurality of nodes, whose matching degree with the query information is higher than the matching degree threshold is greater than or equal to a preset screening number.
Further, the matching degree between the plurality of target data selected by the plurality of nodes and the query information can be determined. Further, the matching degree threshold may be determined from a plurality of matching degrees corresponding to a plurality of target data according to a preset screening number. The data quantity, which is contained in the plurality of data sets retrieved by the plurality of nodes and has the matching degree with the query information higher than the matching degree threshold value, is larger than or equal to the preset screening quantity.
Specifically, determining the matching degree threshold from the plurality of matching degrees corresponding to the plurality of target data may proceed as follows. First, the plurality of matching degrees are sorted from high to low, and the sorted sequence is traversed starting from the first matching degree. For the target matching degree currently traversed, the number of data items contained in the plurality of data sets whose matching degree with the query information is higher than the target matching degree is counted, and it is determined whether the counted number is greater than or equal to the preset screening number. If so, the target matching degree is determined as the matching degree threshold and the traversal ends; if not, the traversal continues to the next matching degree, and so on.
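The traversal just described can be sketched as follows; this is a minimal single-process illustration in which matching degrees are plain floats and the per-node counting is collapsed into one list:

```python
def determine_threshold(target_scores, all_scores, screening_number):
    """Traverse candidate thresholds from the highest matching degree down.

    target_scores: matching degrees of the sampled target data.
    all_scores: matching degrees of every item in the retrieved data sets.
    Returns the first candidate under which at least `screening_number`
    items score strictly higher, or None if no candidate qualifies.
    """
    for candidate in sorted(target_scores, reverse=True):
        count = sum(1 for s in all_scores if s > candidate)
        if count >= screening_number:
            return candidate
    return None  # caller may sample new targets or fall back to -inf

all_scores = [0.9, 0.8, 0.75, 0.6, 0.5, 0.3]
threshold = determine_threshold([0.8, 0.5], all_scores, screening_number=3)
# Candidate 0.8: only one item (0.9) exceeds it; candidate 0.5: four do.
```

The `None` return corresponds to the fallback cases discussed below, where the nodes either sample additional target data or abandon screening altogether.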
In an embodiment, if, when the last matching degree is traversed, the counted number of data items contained in the plurality of data sets whose matching degree with the query information is higher than that last matching degree is still smaller than the preset screening number, the plurality of nodes may further select new target data from their respectively retrieved data sets, where the new matching degrees corresponding to the new target data selected by the plurality of nodes are all lower than the last matching degree. Correspondingly, the plurality of new target data further selected by the plurality of nodes may be obtained, the above traversal process is executed over the plurality of new matching degrees corresponding to them, and so on, until a new matching degree is found under which the counted number of data items whose matching degree with the query information is higher than that new matching degree is greater than or equal to the preset screening number; that new matching degree is then determined as the matching degree threshold.
Or, in an embodiment, if, when the last matching degree is traversed, the counted number of data items whose matching degree with the query information is higher than that last matching degree is still smaller than the preset screening number, the matching degree threshold may be directly set to negative infinity. A threshold of negative infinity is equivalent to performing no subsequent data screening: all data contained in the data set retrieved by each node is deemed to meet the matching degree requirement, which is not specifically limited in this specification.
In an illustrated embodiment, the plurality of nodes includes a master node and a slave node, and the master node may select target data from a data set retrieved by itself and determine a degree of matching between the target data and the query information. In addition, each slave node may send the matching degree to the master node after selecting the target data from the respective data set and determining the matching degree between the target data and the query information. Correspondingly, the master node can receive the matching degrees sent by all the slave nodes, and then determines a matching degree threshold value from the acquired matching degrees according to the steps.
Step S203, obtaining the inference data, screened by the plurality of nodes from their respective data sets, whose matching degree with the query information is higher than the matching degree threshold, and summarizing the obtained inference data screened by the plurality of nodes into an inference data set.
Further, after determining the matching degree threshold, each node may screen the inference data with the matching degree higher than the matching degree threshold from the respective retrieved data set. For example, each node may delete data in the dataset having a matching degree below the matching degree threshold, thereby obtaining a new dataset, and accordingly, the new dataset contains only inferred data having a matching degree above the matching degree threshold.
In an embodiment shown, the master node may send the determined matching degree threshold to the slave node, and correspondingly, the slave node may receive the matching degree threshold, screen out the reasoning data with the matching degree higher than the matching degree threshold from the data set retrieved by the slave node, and send the screened reasoning data (i.e. the new data set) to the master node. Accordingly, the master node may obtain all of the inferred data sent by the slave nodes. And the main node can screen out the reasoning data with the matching degree higher than the matching degree threshold value from the data set retrieved by the main node.
Further, the master node can combine and summarize all the obtained reasoning data to obtain a reasoning data set.
Step S204, based on the reasoning data set, the LLM model is assisted to execute the reasoning task, and a query result corresponding to the query information is generated.
Further, based on the collected reasoning data set, the LLM model can be assisted to execute the reasoning task, and a query result corresponding to the query information input by the user can be generated. In an embodiment, a pre-trained LLM model may be carried in the master node, and after the master node gathers the inference data sets, the master node may assist the LLM model to perform an inference task based on the inference data sets, and generate a query result corresponding to query information input by a user.
It should be noted that, the specific implementation manner of the LLM model for assisting in performing the reasoning task based on the reasoning data set is not particularly limited in this specification.
In an illustrated embodiment, a prompt term (prompt) may be constructed based on query information input by a user and the inference data set, and the prompt term is input into a pre-trained LLM model, and the LLM model performs a corresponding inference task based on the prompt term to generate a query result corresponding to the query information.
In an embodiment, a preset number of candidate data, such as top-k candidate data, with the highest matching degree with the query information input by the user may be further selected from the multiple pieces of reasoning data contained in the reasoning data set. Further, a prompt word may be constructed based on the query information and top-k candidate data, and the prompt word is input into a pre-trained LLM model, and the LLM model performs a corresponding reasoning task based on the prompt word, generates a query result corresponding to the query information, and the like, which is not specifically limited in this specification.
In an embodiment, the preset screening number may be greater than or equal to the preset number of candidate data, i.e., the screening number may be greater than or equal to k, which guarantees that the inference data set contains at least the k pieces of data from which the final candidates fed to the LLM model are selected.
The distributed system-based data reasoning method provided in this specification will be described below by way of several examples in connection with the method flow shown in fig. 2. In particular, the data reasoning method may comprise the following steps.
Step 1, data slicing.
A large-scale dataset is partitioned into a plurality of data slices. For example, a large-scale data set (denoted as X) may be uniformly divided into P data slices, i.e., X = {X1, X2, …, XP}, where the p-th data slice Xp may be allocated to a node p for processing, and a process p running in node p may perform data retrieval over the data slice Xp. Here, P is an integer greater than 1, and p is an integer greater than or equal to 1 and less than or equal to P.
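A uniform partition of this kind can be sketched as follows; this is a toy illustration in which a plain list stands in for the vector data, and the function name is an assumption:

```python
def shard(dataset, num_nodes):
    """Split the dataset into num_nodes contiguous, near-equal slices."""
    n = len(dataset)
    base, extra = divmod(n, num_nodes)
    slices, start = [], 0
    for p in range(num_nodes):
        # The first `extra` slices absorb the remainder, one item each.
        size = base + (1 if p < extra else 0)
        slices.append(dataset[start:start + size])
        start += size
    return slices

X = list(range(10))
shards = shard(X, 3)
# Node p stores shards[p]; here: [0..3], [4..6], [7..9].
```

Every item lands in exactly one slice, so the union of all per-node retrievals covers the whole dataset.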
Step 2, index construction
Each node builds a corresponding ANNS index structure for its stored data slice. Illustratively, a process p running in node p may build an ANNS index structure for the data slice Xp.
Step 3, data searching
Each node performs a data search in its stored data slice based on the built ANNS index structure to obtain the top-k nearest neighbor data matching the query information input by the user. For example, the process p in node p may convert the obtained query information into a d-dimensional query vector q, and perform a data search in the data slice Xp based on the constructed index structure to obtain the top-k nearest neighbor data closest to q (i.e., with the highest matching degree; the following steps all use distance to represent matching degree). For example, the process p may sort the top-k nearest neighbor data found, from small to large, by their distance to the query vector q, obtaining an array Ap. That is, the array Ap contains k elements, each element being a piece of nearest neighbor data, and the first element (e.g., denoted as Ap[1]) is the nearest neighbor data with the smallest distance to the query vector q among the k nearest neighbors.
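A brute-force stand-in for this per-node search step is shown below; it is an exact scan rather than a real ANNS index, which suffices to show the shape of the array Ap (all names are illustrative assumptions):

```python
import numpy as np

def node_topk(data_slice, q, k):
    """Return the k slice entries nearest to q, sorted by distance ascending."""
    dists = np.linalg.norm(data_slice - q, axis=1)
    order = np.argsort(dists)[:k]
    # Ap: list of (index-within-slice, distance), nearest first.
    return [(int(i), float(dists[i])) for i in order]

rng = np.random.default_rng(0)
Xp = rng.normal(size=(50, 4))   # one node's data slice
q = Xp[7].copy()                # a query identical to a stored vector
Ap = node_topk(Xp, q, k=5)
```

In a real deployment each node would answer this query with its ANNS index instead of a full scan, but the output contract is the same: k (item, distance) pairs in ascending distance order.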
Step 4, data sampling
Each node randomly selects several target nearest neighbor data from its own top-k nearest neighbors, and obtains the matching degree between those target nearest neighbors and the query information. Illustratively, the process p in node p randomly selects s target nearest neighbors from the k nearest neighbors contained in its array Ap (s may be much smaller than k; e.g., s = 2 whether k is 20 or 100). Then, the process p obtains the distances between the s target nearest neighbors and the query vector q, yielding s distances. These s distances are sorted from small to large to obtain an array Sp. That is, the array Sp contains s elements, each element being the distance corresponding to one target nearest neighbor, where the first element (e.g., Sp[1]) is the smallest of the s distances.
The processes on all nodes may then perform a distributed merge operation on all the arrays Sp (i.e., P arrays), forming a new array S containing |S| elements, likewise sorted by distance from small to large. If no value repeats across the Sp arrays of the processes, the array S contains P × s elements in total, i.e., |S| = P × s.
Step 5, data bucketing
It will be understood that the s elements (i.e., s distances) contained in each array Sp divide the array Ap into s + 1 distance intervals; correspondingly, the |S| elements contained in the merged array S divide the array {Ap} into |S| + 1 distance intervals. Here, the array {Ap} denotes the union of the arrays Ap of all processes, containing P × k elements in total.
Assuming the array Sp contains 2 elements, Sp[1] = 10 and Sp[2] = 20, the s + 1 distance intervals are, in order, the 3 intervals (-∞, 10], (10, 20], (20, +∞).
Assuming P is equal to 2 and the array S contains 4 elements, S[1] = 3, S[2] = 10, S[3] = 12, S[4] = 20, then the |S| + 1 distance intervals are, in order, the 5 intervals (-∞, 3], (3, 10], (10, 12], (12, 20], (20, +∞).
Based on this, the process p can determine which of the |S| + 1 distance intervals each of the k nearest neighbors in the array Ap falls into, according to its distance to the query vector q, and obtain an array Cp by counting. The array Cp contains |S| + 1 elements corresponding to the |S| + 1 distance intervals, and the element Cp[i] records the number of nearest neighbors in the array Ap belonging to the i-th distance interval, where i is an integer greater than or equal to 1 and less than or equal to |S| + 1.
The processes on all nodes may then perform a distributed accumulation over all arrays Cp (i.e., P arrays), adding the elements at corresponding positions of all Cp arrays to obtain a new array C. The new array C likewise contains |S| + 1 elements corresponding to the |S| + 1 distance intervals, and the element C[j] records the number of nearest neighbors in the array {Ap} belonging to the j-th distance interval, where j is an integer greater than or equal to 1 and less than or equal to |S| + 1.
For example, still assuming P is equal to 2, the array S contains 4 elements, S[1] = 3, S[2] = 10, S[3] = 12, S[4] = 20, and the |S| + 1 distance intervals are, in order, (-∞, 3], (3, 10], (10, 12], (12, 20], (20, +∞). In addition, assuming k is equal to 20, i.e., {Ap} contains 40 nearest neighbors in total, the array C finally obtained by counting the distances between these 40 nearest neighbors and the query vector q may be [2, 8, 12, 15, 3]. That is, among the 40 nearest neighbors, 2 have a distance to q in (-∞, 3], 8 in (3, 10], 12 in (10, 12], 15 in (12, 20], and 3 in (20, +∞).
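The per-node counting in this step amounts to histogramming each node's distances against the merged split points S; a single-process sketch over two illustrative nodes (the distance values are assumptions, not the source's worked numbers):

```python
import bisect

def bucket_counts(distances, split_points):
    """Count how many distances fall into each of the len(split_points)+1
    intervals (-inf, S[0]], (S[0], S[1]], ..., (S[-1], +inf)."""
    counts = [0] * (len(split_points) + 1)
    for d in distances:
        # bisect_left realises intervals open on the left, closed on the
        # right: a distance equal to a split point stays in that bucket.
        counts[bisect.bisect_left(split_points, d)] += 1
    return counts

S = [3, 10, 12, 20]
# Two nodes' 20 nearest-neighbor distances each (illustrative values).
dists_p = [1, 2, 4, 5, 6, 7, 8, 9, 10, 11, 11, 12, 13, 14, 15, 16, 17, 18, 19, 25]
dists_q = [0, 3, 3, 5, 10, 10, 11, 12, 12, 14, 14, 14, 15, 18, 20, 21, 22, 23, 24, 30]
Cp = bucket_counts(dists_p, S)
Cq = bucket_counts(dists_q, S)

# Distributed accumulation: element-wise sum of the per-node arrays.
C = [a + b for a, b in zip(Cp, Cq)]
```

The element-wise sum is exactly what the distributed accumulation produces, so `C` counts all P × k nearest neighbors per interval.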
Step 6, threshold calculation
According to the array C, its elements are accumulated starting from C[1], until the accumulated value first reaches the preset number (e.g., k). Assuming this condition is reached at C[j], the distance threshold (corresponding to the matching degree threshold) is determined to be S[j].
Taking the example in step 5 above, with the preset number k = 20: when the accumulation reaches C[3], 2 + 8 + 12 = 22, and the accumulated value is greater than 20. That is, the number of nearest neighbors in the array {Ap} whose distance is less than or equal to S[3] = 12 is greater than 20, so S[3] = 12 may be determined as the distance threshold.
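This accumulation over C can be written directly, using the numbers from the example; the function returns None when even the last split point does not cover enough neighbors, which is the fallback case discussed next (note Python is 0-indexed, so the prose's S[3] is `S[2]` here):

```python
def distance_threshold(C, S, preset_number):
    """Accumulate bucket counts C[0], C[1], ... until the running total
    first reaches preset_number; return the matching split point of S."""
    total = 0
    for j, count in enumerate(C):
        total += count
        if total >= preset_number:
            # The last bucket (beyond S[-1]) has no split point.
            return S[j] if j < len(S) else None
    return None

S = [3, 10, 12, 20]
# Worked example: 2 + 8 + 12 = 22 >= 20 at the third bucket -> threshold 12.
threshold = distance_threshold([2, 8, 12, 15, 3], S, preset_number=20)
```

When `None` comes back, the caller either sets the threshold to positive infinity (no screening) or samples further split points beyond S[-1], as the following paragraphs describe.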
Illustratively, assuming the finally counted array C is [1, 2, 1, 2, 34]: since the values of C[1], C[2], C[3] and C[4] are all very small, the accumulated value does not reach 20 until C[5] is added, where 1 + 2 + 1 + 2 + 34 = 40.
In this case, the distance threshold may be determined to be positive infinity (i.e., the matching degree threshold is negative infinity), which is equivalent to performing no subsequent data screening: the k nearest neighbors obtained by each node's query all meet the requirement.
Alternatively, in this case, each node may select 2 new target nearest neighbors from its array Ap whose distances exceed S[4] = 20, and merge them to obtain a new array S'. For example, the new array S' may also contain 4 elements: S'[1] = 21, S'[2] = 25, S'[3] = 30, S'[4] = 35. Accordingly, the |S'| + 1 distance intervals are, in order, the 5 intervals (-∞, 21], (21, 25], (25, 30], (30, 35], (35, +∞). Assuming the new array C' finally counted is [1, 24, 8, 2, 5], then based on C', when C'[1] is added to C'[2], 1 + 24 = 25, and the accumulated value is greater than 20. That is, the number of nearest neighbors in the array {Ap} whose distance is less than or equal to S'[2] = 25 is greater than 20, so S'[2] = 25 may be determined as the distance threshold.
For example, assuming that the new array C 'obtained by the final statistics is [8,2,1,4,25], since the values of C' [1], C '[2], C' [3] and C '[4] in the new array C' are also smaller until C '[5] is added, the accumulated value is greater than 20, and each node may select 2 new target nearest neighbor data from the array Ap after S' [4] = distance 35, and then combine to obtain a new array S ", where the new array S" may include, for example, S "[1] = distance 37, S" [2] = distance 45, S "[3] = distance 50, S" [4] = distance 55, and so on, which is not specifically limited in the present specification.
Step 7, data screening
The process p deletes the nearest neighbors whose distance is greater than the distance threshold from the array Ap to generate a new array Ap' (i.e., only neighbors within the threshold, those with the higher matching degrees, are retained).
Step 8, data merging
The new arrays Ap' obtained by all the processes are combined through a distributed merge operation to obtain an array {Ap'}. The final top-k candidate nearest neighbors, i.e., those closest to the query vector q, may then be further selected from the array {Ap'}. Then, a prompt is constructed based on the query vector q and the top-k candidate nearest neighbors, and the prompt is input into a pre-trained LLM model; the LLM model performs the corresponding reasoning task based on the prompt and generates the query result corresponding to the query information.
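Steps 7 and 8 together reduce to filtering each node's list by the threshold and merging; the single-process sketch below works over (id, distance) pairs, and the prompt template at the end is purely an illustrative assumption:

```python
def filter_and_merge(node_arrays, distance_threshold, k):
    """Keep only neighbors within the threshold (distance <= threshold),
    merge across nodes, and return the global top-k by distance."""
    merged = [item for Ap in node_arrays
              for item in Ap if item[1] <= distance_threshold]
    merged.sort(key=lambda item: item[1])
    return merged[:k]

# Two nodes' filtered candidate lists as (doc id, distance) pairs.
A1 = [("doc-a", 2.0), ("doc-b", 11.0), ("doc-c", 30.0)]
A2 = [("doc-d", 5.0), ("doc-e", 12.0), ("doc-f", 25.0)]
top = filter_and_merge([A1, A2], distance_threshold=12, k=3)

# Hypothetical prompt construction from the retrieved candidates.
context = "\n".join(doc_id for doc_id, _ in top)
prompt = f"Answer using the context below.\nContext:\n{context}\nQuestion: ..."
```

Because each node has already discarded everything beyond the threshold, only the small filtered lists cross the network before this final merge, which is the source of the communication savings the method claims.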
In summary, the nodes of the distributed system may retrieve data sets matching the query information input by the user from the respective stored data fragments, and may further select target data from the respective retrieved data sets. Then, the matching degree between a plurality of target data selected by a plurality of nodes and the query information is determined, and a matching degree threshold value for further screening partial data from each data set is determined from a plurality of matching degrees corresponding to the plurality of target data. Based on the data, each node can further screen out a small amount of data with the matching degree higher than the matching degree threshold value from a large amount of data contained in the data set retrieved by each node, only the small amount of data with the matching degree higher than the matching degree threshold value screened by each node is needed to be acquired and summarized later, and the LLM model is assisted to execute the reasoning task based on the small amount of data. Therefore, on the premise of ensuring the data retrieval efficiency and the LLM model generation quality, unnecessary data transmission and calculation with lower matching degree are avoided, the data transmission amount and calculation amount in the distributed system are greatly reduced, and further the calculation and communication expenditure is reduced.
Corresponding to the implementation of the method flow, the embodiment of the specification also provides a data reasoning device based on the distributed system. Referring to fig. 3, fig. 3 is a schematic structural diagram of a data inference apparatus based on a distributed system according to an exemplary embodiment. The apparatus 30 may be applied to the distributed system shown in fig. 1, which includes a plurality of nodes each storing a different data slice. As shown in fig. 3, the apparatus 30 includes:
An obtaining unit 301, configured to obtain target data selected by the plurality of nodes from each of the data sets retrieved by the plurality of nodes, where the data set retrieved by any node includes a plurality of data pieces stored in the node and matched with query information input by a user;
A determining unit 302, configured to determine a matching degree between a plurality of target data selected by the plurality of nodes and the query information, and determine a matching degree threshold from a plurality of matching degrees corresponding to the plurality of target data, where a number of data, which are contained in a plurality of data sets retrieved by the plurality of nodes and have matching degrees with the query information higher than the matching degree threshold, is greater than or equal to a preset screening number;
A data summarizing unit 303, configured to obtain inference data, which is screened by the plurality of nodes from respective data sets, and has a matching degree with the query information higher than the matching degree threshold, and summarize the obtained inference data screened by the plurality of nodes into an inference data set;
And the model reasoning unit 304 is used for assisting the LLM model to execute a reasoning task based on the reasoning data set and generating a query result corresponding to the query information.
In an illustrated embodiment, the data set retrieved by any node is a nearest neighbor data set, matching the query information, retrieved by that node from its stored data slice by running an approximate nearest neighbor search (ANNS) algorithm; the nearest neighbor data set contains the preset number of nearest neighbors, in the node's stored data slice, with the highest matching degree with the query information.
In an embodiment, the target data is selected randomly from all data contained in the data set, or the target data is selected sequentially from all data contained in the data set according to a preset step length.
In an illustrated embodiment, the determining unit 302 is specifically configured to:
sorting a plurality of matching degrees corresponding to the plurality of target data from high to low, and traversing from the first matching degree based on the sorted order;
counting the number of data which are contained in the plurality of data sets and have a matching degree with the query information higher than the target matching degree aiming at the target matching degree traversed currently, and
And determining whether the counted data quantity is larger than or equal to the screening quantity, if so, determining the target matching degree as the matching degree threshold value, ending the traversal, and if not, continuing the traversal to the next matching degree.
In an illustrated embodiment, the determining unit 302 is specifically configured to:
If the data quantity which is higher than the last matching degree and is contained in the plurality of data sets and is smaller than the screening quantity is counted when the last matching degree is traversed, further acquiring new target data which are selected by the plurality of nodes from the data sets which are searched respectively, wherein the new matching degree corresponding to the new target data which are selected by the plurality of nodes is lower than the last matching degree;
And so on, until the counted number of data items, among the plurality of data sets, whose matching degree with the query information is higher than the new matching degree is greater than or equal to the screening number, at which point the new matching degree is determined as the matching degree threshold.
In an illustrated embodiment, the determining unit 302 is specifically configured to:
and if the number of the data, which is included in the plurality of data sets and is higher than the last matching degree and smaller than the screening number, of the matching degree with the query information is counted when the last matching degree is traversed, determining a matching degree threshold as minus infinity.
In an embodiment shown, the plurality of nodes include a master node and a slave node, and the data summarizing unit 303 is specifically configured to:
The master node sends the determined matching degree threshold value to the slave node so that the slave node screens out the reasoning data with the matching degree higher than the matching degree threshold value from the searched data set and sends the reasoning data to the master node, and
And the master node screens out the reasoning data with the matching degree higher than the matching degree threshold value from the data set retrieved by the master node.
In an illustrated embodiment, the model inference unit 304 is specifically configured to:
selecting a preset number of candidate data with highest matching degree with the query information from a plurality of inference data contained in the inference data set;
And constructing a prompt word based on the query information and the candidate data, inputting the prompt word into a LLM model, and executing an reasoning task by the LLM model based on the prompt word to generate a query result corresponding to the query information.
In an embodiment, the matching degree between the query information and each data is represented by the distance between the vector corresponding to the query information and the vector corresponding to each data, wherein the closer the distance between the vectors is, the higher the matching degree is.
The implementation process of the functions and roles of the units in the above device 30 is specifically described in the above embodiments, and will not be described in detail herein. It should be understood that the above-mentioned apparatus 30 may be implemented by software, or may be implemented by hardware or a combination of hardware and software. Taking software implementation as an example, the device in a logic sense is formed by reading corresponding computer program instructions into a memory by a processor (CPU) of the device. In addition to the CPU and the memory, the device in which the above apparatus is located generally includes other hardware such as a chip for performing wireless signal transmission and reception, and/or other hardware such as a board for implementing a network communication function.
The apparatus embodiments described above are merely illustrative, where the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical modules, i.e., they may be located in one place or distributed over a plurality of network modules. Some or all of the units or modules may be selected according to actual needs to achieve the purposes of this specification. Those of ordinary skill in the art can understand and implement this without creative effort.
The apparatus, units, modules illustrated in the above embodiments may be implemented in particular by a computer chip or entity or by a product having a certain function. A typical implementation device is a computer, which may be in the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email device, game console, tablet computer, wearable device, vehicle-mounted computer, or a combination of any of these devices.
Corresponding to the method embodiments described above, embodiments of the present specification also provide a computing device. Referring to fig. 4, fig. 4 is a schematic structural diagram of a computing device according to an exemplary embodiment. The computing device shown in fig. 4 may be a computing device in the distributed system shown in fig. 1 described above, which includes a plurality of nodes that each store a different piece of data. As shown in fig. 4, the computing device includes a processor 1001 and memory 1002, and may further include an input device 1004 (e.g., keyboard, etc.) and an output device 1005 (e.g., display, etc.). The processor 1001, memory 1002, input devices 1004, and output devices 1005 may be connected by a bus or other means. As shown in fig. 4, the memory 1002 includes a computer-readable storage medium 1003, which computer-readable storage medium 1003 stores a computer program executable by the processor 1001. The processor 1001 may be a CPU, microprocessor, or integrated circuit for controlling the execution of the above method embodiments. 
When running the stored computer program, the processor 1001 may execute the steps of the distributed-system-based data reasoning method in this embodiment, including: obtaining target data selected by a plurality of nodes from their respectively retrieved data sets, where the data set retrieved by any node contains a plurality of data items, in the data slice stored by that node, that match query information input by a user; determining the matching degrees between the plurality of target data selected by the plurality of nodes and the query information, and determining a matching degree threshold from the plurality of matching degrees corresponding to the plurality of target data, where the number of data items, among the plurality of data sets retrieved by the plurality of nodes, whose matching degree with the query information is higher than the matching degree threshold is greater than or equal to a preset screening number; obtaining the inference data, screened by the plurality of nodes from their respective data sets, whose matching degree with the query information is higher than the matching degree threshold, and summarizing the obtained inference data into an inference data set; and assisting the LLM model to perform a reasoning task based on the inference data set, generating a query result corresponding to the query information, and so on. For a detailed description of each step of the distributed-system-based data reasoning method, please refer to the foregoing content, which is not repeated here.
Corresponding to the above-described method embodiments, embodiments of the present description also provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the distributed system-based data reasoning method in the embodiments of the present description. Please refer to the description of the above embodiments, and the details are not repeated here.
The foregoing description of the preferred embodiments is provided for the purpose of illustration only, and is not intended to limit the scope of the disclosure, since any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the disclosure are intended to be included within the scope of the disclosure.
In a typical configuration, the terminal device includes one or more CPUs, input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or nonvolatile memory, such as Read-Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data.
Examples of storage media for a computer include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a(n) ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
It will be appreciated by those skilled in the art that embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, embodiments of the present specification may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Moreover, embodiments of the present description may take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.

Claims (13)

1. A data reasoning method based on a distributed system, wherein the distributed system comprises a plurality of nodes, each of which stores a different data shard, the method comprising:
obtaining target data selected by the plurality of nodes from their respectively retrieved data sets, wherein the data set retrieved by any node comprises a number of data items, in the data shard stored by that node, that match query information input by a user;
determining matching degrees between the plurality of target data selected by the plurality of nodes and the query information, and determining a matching degree threshold from the plurality of matching degrees corresponding to the plurality of target data, wherein the number of data items, contained in the plurality of data sets retrieved by the plurality of nodes, whose matching degree with the query information is higher than the matching degree threshold is greater than or equal to a preset screening number;
obtaining inference data, screened out by the plurality of nodes from their respective data sets, whose matching degree with the query information is higher than the matching degree threshold, and aggregating the obtained inference data into an inference data set; and
based on the inference data set, assisting an LLM model in performing an inference task to generate a query result corresponding to the query information.

2. The method according to claim 1, wherein the data set retrieved by any node is a nearest neighbor data set, matching the query information, that the node retrieves from its stored data shard by running an approximate nearest neighbor search (ANNS) algorithm, the nearest neighbor data set containing a preset number of nearest neighbor data items having the highest matching degree with the query information in the data shard stored by that node.

3. The method according to claim 1, wherein the target data are data randomly selected from all data contained in a data set; or the target data are data selected sequentially, at a preset step size, from all data contained in a data set.

4. The method according to claim 1, wherein determining a matching degree threshold from the plurality of matching degrees corresponding to the plurality of target data comprises:
sorting the plurality of matching degrees corresponding to the plurality of target data from high to low, and traversing them in the sorted order starting from the first matching degree;
for the currently traversed target matching degree, counting the number of data items contained in the plurality of data sets whose matching degree with the query information is higher than the target matching degree; and
determining whether the counted number is greater than or equal to the screening number; if so, determining the target matching degree as the matching degree threshold and ending the traversal; if not, continuing to the next matching degree.

5. The method according to claim 4, further comprising:
if, upon traversing to the last matching degree, the counted number of data items in the plurality of data sets whose matching degree with the query information is higher than the last matching degree is less than the screening number, further obtaining new target data selected by the plurality of nodes from their respectively retrieved data sets, wherein the new matching degrees corresponding to the new target data selected by the plurality of nodes are lower than the last matching degree;
and so on, until the counted number of data items in the plurality of data sets whose matching degree with the query information is higher than a new matching degree is greater than or equal to the screening number, whereupon that new matching degree is determined as the matching degree threshold.

6. The method according to claim 4, further comprising:
if, upon traversing to the last matching degree, the counted number of data items in the plurality of data sets whose matching degree with the query information is higher than the last matching degree is less than the screening number, determining the matching degree threshold to be negative infinity.

7. The method according to claim 1, wherein the plurality of nodes comprise a master node and slave nodes, and obtaining the inference data, screened out by the plurality of nodes from their respective data sets, whose matching degree with the query information is higher than the matching degree threshold comprises:
the master node sending the determined matching degree threshold to the slave nodes, so that each slave node screens out, from its retrieved data set, inference data whose matching degree with the query information is higher than the matching degree threshold and sends the inference data to the master node; and
the master node screening out, from its own retrieved data set, inference data whose matching degree with the query information is higher than the matching degree threshold.

8. The method according to claim 1, wherein assisting an LLM model in performing an inference task based on the inference data set to generate a query result corresponding to the query information comprises:
selecting, from the plurality of inference data contained in the inference data set, a preset number of candidate data having the highest matching degree with the query information; and
constructing a prompt based on the query information and the candidate data, and inputting the prompt into the LLM model, so that the LLM model performs the inference task based on the prompt and generates the query result corresponding to the query information.

9. The method according to any one of claims 1-8, wherein the matching degree between the query information and each data item is represented by the distance between the vector corresponding to the query information and the vector corresponding to that data item, a shorter distance between vectors indicating a higher matching degree.

10. A data reasoning apparatus based on a distributed system, wherein the distributed system comprises a plurality of nodes, each of which stores a different data shard, the apparatus comprising:
an acquisition unit, configured to obtain target data selected by the plurality of nodes from their respectively retrieved data sets, wherein the data set retrieved by any node comprises a number of data items, in the data shard stored by that node, that match query information input by a user;
a determination unit, configured to determine matching degrees between the plurality of target data selected by the plurality of nodes and the query information, and determine a matching degree threshold from the plurality of matching degrees corresponding to the plurality of target data, wherein the number of data items, contained in the plurality of data sets retrieved by the plurality of nodes, whose matching degree with the query information is higher than the matching degree threshold is greater than or equal to a preset screening number;
a data aggregation unit, configured to obtain inference data, screened out by the plurality of nodes from their respective data sets, whose matching degree with the query information is higher than the matching degree threshold, and aggregate the obtained inference data into an inference data set; and
a model inference unit, configured to assist an LLM model in performing an inference task based on the inference data set to generate a query result corresponding to the query information.

11. A computing device, comprising a memory and a processor, the memory storing a computer program/instructions executable by the processor, wherein the processor, when running the computer program/instructions, performs the method according to any one of claims 1-9.

12. A computer-readable storage medium, having stored thereon a computer program/instructions which, when executed by a processor, implement the method according to any one of claims 1-9.

13. A computer program product, comprising a computer program/instructions which, when executed by a processor, implement the method according to any one of claims 1-9.
CN202411959541.7A 2024-12-27 2024-12-27 A data reasoning method based on distributed system and related equipment Pending CN119903915A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411959541.7A CN119903915A (en) 2024-12-27 2024-12-27 A data reasoning method based on distributed system and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411959541.7A CN119903915A (en) 2024-12-27 2024-12-27 A data reasoning method based on distributed system and related equipment

Publications (1)

Publication Number Publication Date
CN119903915A true CN119903915A (en) 2025-04-29

Family

ID=95463862

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411959541.7A Pending CN119903915A (en) 2024-12-27 2024-12-27 A data reasoning method based on distributed system and related equipment

Country Status (1)

Country Link
CN (1) CN119903915A (en)

Similar Documents

Publication Publication Date Title
Norouzi et al. Fast exact search in hamming space with multi-index hashing
CN103810237B (en) Data managing method and system
CN114840487B (en) Metadata management method and device for distributed file system
CN114281989B (en) Data deduplication method and device based on text similarity, storage medium and server
US11281645B2 (en) Data management system, data management method, and computer program product
CN113961514A (en) Data query method and device
CN115878824B (en) Image retrieval system, method and device
EP4685659A1 (en) Vector retrieval methods and apparatuses, devices, and storage media
Li et al. Fast distributed video deduplication via locality-sensitive hashing with similarity ranking
CN113971225A (en) Image retrieval system, method and device
Karri et al. AI-Driven Indexing Strategies
CN112199408B (en) Reference distance similarity search
EP4685660A1 (en) Vector retrieval methods and apparatuses, devices, and storage media
CN118467544B (en) Distributed vector indexing and retrieval method and system by using memory and disk in mixed mode
CN119961486A (en) Data processing method and electronic device
CN111625530A (en) Large-scale vector retrieval method and device
CN119903915A (en) A data reasoning method based on distributed system and related equipment
CN116304253B (en) Data storage method, data retrieval method and method for identifying similar videos
Nguyen Mau et al. Audio fingerprint hierarchy searching strategies on GPGPU massively parallel computer
CN120470158B (en) Database system and vector hybrid search method
Antaris et al. Similarity search over the cloud based on image descriptors' dimensions value cardinalities
CN120994760B (en) Document Retrieval Method Based on Multi-Field Information and Outlier Detection
CN115640426A (en) Data indexing method and system
CN117785889B (en) An index management method and related equipment for graph database
Gan et al. {SNARY}: A {High-Performance} and Generic {SmartNIC-accelerated} Retrieval System

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Applicant after: Alipay (Hangzhou) Digital Service Technology Co., Ltd.
Address after: Room 518, 5th Floor, Building 2, Building 1, No. 543-569 (continuous odd numbers) Xixi Road, Xihu District, Hangzhou City, Zhejiang Province, 310000, China
Country or region after: China
Applicant before: Alipay (Hangzhou) Information Technology Co., Ltd.
Address before: Room 518, 5th Floor, Building 2, Building 1, No. 543-569 Xixi Road, Xihu District, Hangzhou City, Zhejiang Province, 310000, China
Country or region before: China