Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with one or more embodiments of the present specification. Rather, they are merely examples of apparatus and methods consistent with aspects of one or more embodiments of the present description as detailed in the accompanying claims.
It should be noted that in other embodiments, the steps of the corresponding method are not necessarily performed in the order shown and described in this specification. In some other embodiments, the method may include more or fewer steps than described in this specification. Furthermore, a single step described in this specification may be described as being split into multiple steps in other embodiments, while multiple steps described in this specification may be described as being combined into a single step in other embodiments.
The term "plurality" as used herein means two or more.
Retrieval-augmented generation techniques typically include two phases: retrieval and generation. In the retrieval stage, a plurality of data related to query information input by a user need to be found from a large amount of data; however, when the data size is very large and a single machine cannot accommodate all the data, the data retrieval needs to be realized based on a distributed system.
The distributed system comprises a plurality of nodes, the plurality of nodes can store a large-scale data set in a distributed mode, for example, the large-scale data set can be divided into a plurality of data fragments, and the plurality of nodes can respectively store different data fragments. In an illustrated embodiment, a plurality of nodes may each retrieve data related to query information entered by a user from a respective stored data slice by running a corresponding search algorithm. Further, a plurality of data respectively retrieved by a plurality of nodes can be summarized to one node (for example, a master node) through a distributed merging (Union) operation, and then the node can assist the pre-trained LLM model to execute corresponding reasoning tasks based on the summarized plurality of data, and output a query result corresponding to query information input by a user.
In an embodiment shown, the search algorithm may be an approximate nearest neighbor search (Approximate Nearest Neighbor Search, ANNS) algorithm. The plurality of nodes may construct a corresponding index structure for each stored data slice by running the ANNS algorithm, and then quickly search a preset number of nearest neighbor data, namely top-k nearest neighbor data, from each stored data slice based on the index structure, where k is the preset number and k is an integer greater than 1. It should be appreciated that conventional exact nearest neighbor search (Exact Nearest Neighbor Search, ENNS) may not meet the actual needs of a user when faced with large-scale data sets, as its computational cost may rise dramatically with increasing data volume and dimensions. ANNS, by allowing a degree of approximation, significantly increases the retrieval speed while still maintaining high accuracy in most cases, making it an effective tool for processing large and high-dimensional data.
As described above, the problem of data storage and retrieval efficiency in the face of a large-scale data set can be effectively solved by the distributed technology. However, when the amount of data retrieved by each node is large (for example, k equal to several millions or even several tens of millions), summarizing the data retrieved by each node requires transmitting and subsequently computing over a large amount of data, which generates a large amount of calculation and communication overhead and cannot satisfy practical use requirements.
Based on the above, the present disclosure provides a technical solution, where partial data with higher matching degree is further screened from the data set matched with the query information retrieved by each of the plurality of nodes, and the screened partial data is summarized, so as to assist the LLM model in performing the reasoning task, thereby greatly reducing the data transmission amount and the calculation amount in the distributed system, and further reducing the overall calculation and communication overhead.
In practice, the present description may be applied to a distributed system comprising a plurality of nodes each storing a different data slice. In an illustrated embodiment, target data selected by the plurality of nodes from respective retrieved data sets may be obtained. The data set retrieved by any node contains a plurality of data, in the data fragment stored by that node, that match query information input by a user. Further, the matching degree between each of the plurality of target data selected by the plurality of nodes and the query information can be determined, and a matching degree threshold can be determined from the plurality of matching degrees corresponding to the plurality of target data. The threshold is determined such that the quantity of data, contained in the plurality of data sets retrieved by the plurality of nodes, whose matching degree with the query information is higher than the matching degree threshold is larger than or equal to a preset screening quantity. Further, the inference data, screened by the nodes from the data sets, whose matching degree with the query information is higher than the matching degree threshold can be obtained, and the obtained inference data are summarized into an inference data set. Further, the present specification may assist the LLM model in performing an inference task based on the inference data set, generating a query result corresponding to the query information.
In the above technical solution, the plurality of nodes of the distributed system may retrieve, from the respective stored data fragments, a data set matching the query information input by the user, and may further select the target data from the respective retrieved data sets. Then, the matching degree between each of the plurality of target data selected by the plurality of nodes and the query information is determined, and a matching degree threshold for further screening partial data from each data set is determined from the plurality of matching degrees corresponding to the plurality of target data. On this basis, each node can further screen out a small amount of data whose matching degree is higher than the matching degree threshold from the large amount of data contained in the data set it retrieved; only this small amount of screened data needs to be acquired and summarized later, and the LLM model is assisted to execute the reasoning task based on it. Therefore, on the premise of ensuring the data retrieval efficiency and the LLM model generation quality, unnecessary transmission and calculation of data with lower matching degree are avoided, the data transmission amount and calculation amount in the distributed system are greatly reduced, and the calculation and communication overhead is further reduced.
Referring to fig. 1, fig. 1 is a schematic diagram of a system architecture according to an exemplary embodiment. One or more embodiments provided herein may be embodied in the system architecture shown in fig. 1 or a similar system architecture. As shown in fig. 1, the distributed system 10 may include a plurality of nodes, such as a node 100a, a node 100b, a node 100c, and a node 100d, among which communication connections may be established by any possible manner, which is not specifically limited in this specification.
In an illustrated embodiment, nodes 100a, 100b, 100c, and 100d in distributed system 10 may be used to store large-scale data sets in a distributed manner. For example, a large-scale data set may be first divided into a plurality of data slices, and different data slices may be stored in nodes 100a, 100b, 100c, and 100d, respectively. In one example, a data set may be evenly divided into a plurality of data slices and the plurality of data slices may be distributed to a plurality of nodes for storage. Alternatively, if the hardware conditions of the plurality of nodes are different, the data set may be divided into a plurality of data slices with different scales, and the data slices with larger scales (i.e. the data slices with larger data quantity) may be allocated to the nodes with better hardware conditions for storage, which is not particularly limited in the specification.
The data set may be a data set related to a specific field, or may be a data set related to a plurality of fields, such as a medical field, a financial field, a video entertainment field, and the like, which are not particularly limited in this specification.
In addition, the present specification is not particularly limited to the data types in the data set. In an embodiment, the data included in the data set may be documents (documents), image data, audio/video data, etc., which is not specifically limited in this specification.
In an illustrated embodiment, nodes 100a, 100b, 100c, and 100d in distributed system 10 may obtain query information entered by a user. Illustratively, the master node (e.g., node 100a) of the plurality of nodes may receive the query information input by the user and transmit the received query information to the slave nodes (e.g., nodes 100b, 100c, and 100d), which is not particularly limited in this specification.
In an illustrated embodiment, the query information may be query text entered by a user via an input device (e.g., keyboard or touch screen, etc.) on the user terminal. By way of example, the query text may be "Who won the Nobel Prize in Physics in 1943?" The query text may also be, for example, "What drugs can be taken for a cold and fever?" By way of example, the query text may also be "What major traffic accidents occurred in City A in the last two years?"
Further, after the node 100a, the node 100b, the node 100c, and the node 100d acquire the query information input by the user, the data sets matching the query information may be respectively retrieved from the data fragments stored in each. The dataset retrieved by any node may contain several data that match the query information in the data shards stored by that node.
Further, after each node respectively retrieves the data set matched with the query information, each node can further screen out partial data with higher matching degree with the query information from a large amount of data contained in the data set, and the partial data can be called as reasoning data and can be subsequently used for assisting the LLM model to execute the reasoning task.
In an illustrated embodiment, each node may first select target data from a respective retrieved data set. Then, determining the matching degree between the plurality of target data selected by the plurality of nodes and the query information, and determining a matching degree threshold value for further screening the reasoning data from the plurality of data sets retrieved by the plurality of nodes from the plurality of matching degrees corresponding to the plurality of target data, which is specifically referred to the following description of the corresponding embodiment of fig. 2 and will not be described herein.
Further, after determining the matching degree threshold, each node may further screen the inference data from the retrieved data set that has a matching degree with the query information higher than the matching degree threshold.
Further, by means of distributed merging operation, it is possible to obtain inference data further screened from the respective data sets by the plurality of nodes, and to aggregate all the obtained inference data into one inference data set. Illustratively, the slave nodes (e.g., the node 100b, the node 100c, and the node 100d) may send the inference data further screened from the data set retrieved by the slave nodes to the master node (e.g., the node 100a), and accordingly, the master node may obtain all the inference data sent by the slave nodes, and the master node may further screen the inference data from the data set retrieved by the master node itself, and finally, the master node may aggregate all the obtained inference data screened by the nodes to obtain the inference data set.
Further, the LLM model can be assisted in performing inference tasks based on the summarized inference data set to generate a query result corresponding to the query information. In an illustrated embodiment, a pre-trained LLM model may be carried in the master node, and after the master node gathers the inference data set, the master node may assist the LLM model to perform an inference task based on the inference data set, as described in detail below with reference to the corresponding embodiment of fig. 2, which will not be repeated herein.
Further, the distributed system 10 may return the query result generated by the LLM model to the user terminal, and accordingly, an output device (e.g., a display screen) on the user terminal may output and display the query result to the user, which is not limited in this specification.
As described above, in the present description, when retrieval-augmented generation is implemented, after the plurality of nodes in the distributed system 10 retrieve the data sets matching the query information input by the user from the respective stored data fragments, all the data sets are not summarized directly. Instead, each node may further screen, from the large amount of data included in the retrieved data set, a small amount of data whose matching degree with the query information is higher than the matching degree threshold. Only this small amount of screened data with higher matching degree needs to be acquired and summarized later, and the LLM model is assisted to execute the reasoning task based on it. Therefore, on the premise of ensuring the data retrieval efficiency and the LLM model generation quality, unnecessary transmission and calculation of data with lower matching degree are avoided, the data transmission amount and calculation amount in the distributed system are greatly reduced, and the calculation and communication overhead is further reduced.
Note that each node in the distributed system 10 may be a desktop computer, a server cluster including a plurality of servers, or the like, which have the above-described functions, and this description is not limited in detail. Furthermore, it should be noted that the system architecture shown in fig. 1 is merely illustrative, and in some possible embodiments, more or fewer devices than shown in the drawings may be further included in the distributed system 10, which is not specifically limited in this disclosure.
Referring to fig. 2, fig. 2 is a flowchart of a data reasoning method based on a distributed system according to an exemplary embodiment. The method may be applied to the distributed system shown in fig. 1, which includes a plurality of nodes each storing a different data slice. As shown in fig. 2, the method may specifically include the following steps S201 to S204.
Step S201, acquiring target data selected by a plurality of nodes from each retrieved data set, wherein the data set retrieved by any node comprises a plurality of data matched with query information input by a user in the data fragment stored by the node.
In one illustrated embodiment, after obtaining the query information entered by the user, the plurality of nodes may each retrieve a data set matching the query information from the respective stored data fragments. The data set retrieved by any node may include a number of data that matches the query information in the data fragments stored by the node.
The specific implementation of the data search is not particularly limited in this specification.
In an illustrated embodiment, multiple nodes may implement data retrieval by running an approximate nearest neighbor search algorithm, resulting in a nearest neighbor dataset that matches the query information. Correspondingly, the nearest neighbor data set retrieved by any node in the plurality of nodes through the approximate nearest neighbor search algorithm can contain the nearest neighbor data with the preset quantity, namely top-k nearest neighbor data, which has the highest matching degree with the query information in the data fragments stored by the node, wherein k is the preset quantity, and k is an integer larger than 1.
Wherein the approximate nearest neighbor search algorithm generally comprises two stages of index construction and data search. Correspondingly, when the data retrieval is realized through the approximate nearest neighbor search algorithm, each node can firstly construct a corresponding index structure for each stored data fragment, and then quickly search top-k nearest neighbor data which are most matched with query information input by a user from each stored data fragment based on the index structure, so that a nearest neighbor data set is obtained.
In an illustrated embodiment, the index construction method employed by the approximate nearest neighbor search algorithm may include any of a hash method (Locality-Sensitive Hashing, LSH), a neighbor graph method (Hierarchical Navigable Small World graphs, HNSW), a similarity search library (Facebook AI Similarity Search, FAISS) method, and the like, as the specification is not limited in detail.
By way of example, a data retrieval process of the approximate nearest neighbor search algorithm will be described below using a hash method as an example.
First, a series of hash function clusters needs to be defined, each cluster may be composed of a plurality of hash functions. The purpose of the design of these hash functions is to make it more likely that similar data in a dataset (e.g., movies of the same type or medical records of the same department, etc.) will be mapped into the same bucket, while dissimilar data will be mapped into different buckets.
Further, each data in the data set is subjected to hash processing through the hash function cluster, and a hash value combination corresponding to each data is obtained through calculation. Data whose calculated hash value combinations are the same may then be allocated to the same bucket.
Further, the hash function cluster is used for carrying out hash processing on the query information input by the user, and a hash value combination corresponding to the query information is obtained through calculation. Then, a bucket having the same hash value combination is queried among a plurality of buckets, and a relevant plurality of data is retrieved from the bucket. Further, the data with the highest matching degree and the preset number can be selected from the data to be used as top-k nearest neighbor data.
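The LSH retrieval flow described above can be sketched as follows. This is a minimal illustrative Python sketch, not the claimed implementation: the specification does not fix a concrete hash function, so a random-hyperplane family (one common LSH construction for vector data) is assumed here, and all function names are illustrative.

```python
import random


def make_hash_family(num_funcs, dim, seed=0):
    """Build a family of random-hyperplane hash functions (an assumed LSH
    construction); each function maps a vector to a sign bit, and the bits
    are concatenated into one hash value combination."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_funcs)]

    def signature(vec):
        # The hash value combination: one sign bit per hyperplane.
        return tuple(int(sum(p * v for p, v in zip(plane, vec)) >= 0)
                     for plane in planes)

    return signature


def build_buckets(dataset, signature):
    """Allocate data with the same hash value combination to the same bucket."""
    buckets = {}
    for idx, vec in enumerate(dataset):
        buckets.setdefault(signature(vec), []).append(idx)
    return buckets


def lsh_candidates(query, buckets, signature):
    """Hash the query and retrieve the indices sharing its bucket."""
    return buckets.get(signature(query), [])
```

The candidates returned for a query would then be ranked by matching degree to pick the top-k nearest neighbor data.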
In an illustrated embodiment, the matching degree between the query information and each data may be represented by a distance between a vector corresponding to the query information and a vector corresponding to each data, where the closer the distance between the vectors, the higher the matching degree. The distance may be, for example, a euclidean distance, a cosine distance, or the like, which is not particularly limited in this specification.
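As an illustration of the distance measures mentioned above, the following sketch computes the Euclidean distance and a cosine distance between two vectors; the function names are illustrative only, and either distance can stand in for the matching degree (smaller distance, higher matching degree).

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 minus cosine similarity: 0 for parallel vectors, up to 2 for
    opposite vectors (assumes neither vector is all-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)
```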
Or in some possible embodiments, each node may implement data retrieval by any other possible search algorithm besides the above-mentioned approximate nearest neighbor search algorithm, which is not specifically limited in this specification.
Further, after each node respectively retrieves the data set matched with the query information, each node can further screen out partial data with higher matching degree with the query information from the data contained in the data set, and the partial data can be called as reasoning data and can be subsequently used for assisting the LLM model to execute the reasoning task.
In an illustrative embodiment, a matching degree threshold for further screening inference data from the plurality of data sets retrieved by the plurality of nodes may be determined based on pre-configured screening rules.
In an embodiment, the screening rule may include a preset screening number. Accordingly, when determining the matching degree threshold, it is required that the quantity of data, contained in the plurality of data sets retrieved by the plurality of nodes, whose matching degree with the query information is higher than the matching degree threshold be greater than or equal to the screening number.
In the following, an implementation of determining the matching degree threshold will be explained taking an example that the screening rule includes a screening number.
First, each node may first select at least one target data from the respective retrieved data sets. In an embodiment, the target data may be data randomly selected by the node from all data included in the data set, or the target data may be data sequentially selected by the node from all data included in the data set according to a preset step length, or the like, which is not specifically limited in this specification.
Step S202, determining the matching degree between each of a plurality of target data selected by the plurality of nodes and the query information, and determining a matching degree threshold from a plurality of matching degrees corresponding to the plurality of target data, wherein the quantity of data, contained in the plurality of data sets retrieved by the plurality of nodes, whose matching degree with the query information is higher than the matching degree threshold is larger than or equal to a preset screening quantity.
Further, the matching degree between each of the plurality of target data selected by the plurality of nodes and the query information can be determined. Then, the matching degree threshold may be determined from the plurality of matching degrees corresponding to the plurality of target data according to the preset screening quantity, such that the quantity of data, contained in the plurality of data sets retrieved by the plurality of nodes, whose matching degree with the query information is higher than the matching degree threshold is larger than or equal to the preset screening quantity.
Specifically, when determining the matching degree threshold from the plurality of matching degrees corresponding to the plurality of target data, the plurality of matching degrees may first be sorted from high to low. Based on the sorted sequence, the traversal starts from the first matching degree. For the target matching degree currently traversed, the quantity of data, contained in the plurality of data sets, whose matching degree with the query information is higher than the target matching degree is counted, and it is determined whether the counted quantity is larger than or equal to the preset screening quantity. If so, the target matching degree is determined as the matching degree threshold and the traversal ends; if not, the traversal continues to the next matching degree, and so on.
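The traversal just described can be sketched as follows. This is a minimal Python illustration working directly on matching-degree scores (higher is better); the function name is a hypothetical choice, and returning `None` stands for the case where no sampled degree qualifies, which the specification handles by sampling new target data or falling back to minus infinity.

```python
def determine_threshold(sampled_scores, datasets, screening_count):
    """Traverse the sampled target matching degrees from high to low and
    return the first one for which at least `screening_count` data items
    across all retrieved data sets score strictly higher.

    `datasets` is a list of per-node score lists; returns None when even
    the lowest sampled degree does not qualify."""
    for threshold in sorted(sampled_scores, reverse=True):
        # Count data in all data sets whose matching degree exceeds the
        # currently traversed target matching degree.
        count = sum(1 for ds in datasets for score in ds if score > threshold)
        if count >= screening_count:
            return threshold
    return None
```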
In an embodiment, if the number of data, which is included in the plurality of data sets and has a matching degree with the query information higher than the last matching degree and smaller than the preset screening number, is counted when the last matching degree is traversed, the plurality of nodes may further select new target data from the data sets retrieved respectively. Wherein, the new matching degree corresponding to the new target data selected by the plurality of nodes is lower than the last matching degree. Correspondingly, a plurality of new target data further selected by the plurality of nodes can be obtained, the traversal process is executed for a plurality of new matching degrees corresponding to the plurality of new target data, and the like until the number of data, which are contained in the plurality of data sets and have matching degrees with query information higher than the new matching degrees, is counted, and the new matching degrees are determined to be a matching degree threshold value if the number of data, which are larger than or equal to a preset screening number.
Or in an embodiment, if the number of data, which is included in the plurality of data sets and has a matching degree with the query information higher than the last matching degree and smaller than the preset screening number, is counted when the last matching degree is traversed, the matching degree threshold value may be directly determined as minus infinity. When the matching degree threshold is minus infinity, the method is equivalent to the fact that data screening is not performed subsequently, and all data contained in the data set retrieved by each node meet the matching degree requirement, which is not particularly limited in the specification.
In an illustrated embodiment, the plurality of nodes includes a master node and a slave node, and the master node may select target data from a data set retrieved by itself and determine a degree of matching between the target data and the query information. In addition, each slave node may send the matching degree to the master node after selecting the target data from the respective data set and determining the matching degree between the target data and the query information. Correspondingly, the master node can receive the matching degrees sent by all the slave nodes, and then determines a matching degree threshold value from the acquired matching degrees according to the steps.
Step S203, obtaining the inference data, screened by the plurality of nodes from the respective data sets, whose matching degree with the query information is higher than the matching degree threshold, and summarizing the obtained inference data screened by the plurality of nodes into an inference data set.
Further, after determining the matching degree threshold, each node may screen the inference data with the matching degree higher than the matching degree threshold from the respective retrieved data set. For example, each node may delete data in the dataset having a matching degree below the matching degree threshold, thereby obtaining a new dataset, and accordingly, the new dataset contains only inferred data having a matching degree above the matching degree threshold.
In an embodiment shown, the master node may send the determined matching degree threshold to the slave node, and correspondingly, the slave node may receive the matching degree threshold, screen out the reasoning data with the matching degree higher than the matching degree threshold from the data set retrieved by the slave node, and send the screened reasoning data (i.e. the new data set) to the master node. Accordingly, the master node may obtain all of the inferred data sent by the slave nodes. And the main node can screen out the reasoning data with the matching degree higher than the matching degree threshold value from the data set retrieved by the main node.
Further, the master node can combine and summarize all the obtained reasoning data to obtain a reasoning data set.
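The per-node screening and master-node summarizing of step S203 can be sketched as follows; this is illustrative Python under assumed names and data shapes (each node holds (score, data) pairs), not the claimed implementation.

```python
def screen_inference_data(scores_and_data, threshold):
    """Per-node screening: keep only the data whose matching degree with
    the query information exceeds the broadcast matching degree threshold."""
    return [item for score, item in scores_and_data if score > threshold]


def aggregate(per_node_screened):
    """Master-node summarizing: merge the screened lists from all nodes
    into a single inference data set."""
    merged = []
    for part in per_node_screened:
        merged.extend(part)
    return merged
```

A slave node would run `screen_inference_data` on its own retrieved data set and send the result to the master node, which runs `aggregate` over all received lists plus its own screened data.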
Step S204, based on the reasoning data set, the LLM model is assisted to execute the reasoning task, and a query result corresponding to the query information is generated.
Further, based on the collected reasoning data set, the LLM model can be assisted to execute the reasoning task, and a query result corresponding to the query information input by the user can be generated. In an embodiment, a pre-trained LLM model may be carried in the master node, and after the master node gathers the inference data sets, the master node may assist the LLM model to perform an inference task based on the inference data sets, and generate a query result corresponding to query information input by a user.
It should be noted that, the specific implementation manner of the LLM model for assisting in performing the reasoning task based on the reasoning data set is not particularly limited in this specification.
In an illustrated embodiment, a prompt term (prompt) may be constructed based on query information input by a user and the inference data set, and the prompt term is input into a pre-trained LLM model, and the LLM model performs a corresponding inference task based on the prompt term to generate a query result corresponding to the query information.
In an embodiment, a preset number of candidate data, such as top-k candidate data, with the highest matching degree with the query information input by the user may be further selected from the multiple pieces of reasoning data contained in the reasoning data set. Further, a prompt word may be constructed based on the query information and top-k candidate data, and the prompt word is input into a pre-trained LLM model, and the LLM model performs a corresponding reasoning task based on the prompt word, generates a query result corresponding to the query information, and the like, which is not specifically limited in this specification.
In an embodiment, the preset screening quantity may be greater than or equal to the preset number of candidate data, i.e., the screening quantity may be greater than or equal to k, so as to guarantee that the inference data set contains at least as many data as are finally provided to the LLM model.
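A possible prompt construction along the lines described above is sketched below. The template wording, the function name, and the (score, text) data shape are illustrative assumptions, not mandated by the specification.

```python
def build_prompt(query, inference_data, k):
    """Select the top-k candidate data with the highest matching degree
    from the inference data set, then splice them with the user's query
    into a prompt for the LLM model."""
    top_k = sorted(inference_data, key=lambda pair: pair[0], reverse=True)[:k]
    context = "\n".join(text for _, text in top_k)
    return ("Answer the question using the context below.\n"
            f"Context:\n{context}\n"
            f"Question: {query}")
```

The returned string would be fed to the pre-trained LLM model, whose output is the query result returned to the user.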
The distributed system-based data reasoning method provided in this specification will be described below by way of several examples in connection with the method flow shown in fig. 2. In particular, the data reasoning method may comprise the following steps.
Step 1, data slicing.
A large-scale dataset is partitioned into a plurality of data slices. For example, a large-scale data set (denoted as X) may be uniformly divided into P data slices, i.e., X = {X1, X2, ..., XP}, where the p-th data slice Xp may be allocated to a node p for processing, and a process p running in the node p may perform data retrieval for the data slice Xp. Wherein P is an integer greater than 1, and p is an integer greater than or equal to 1 and less than or equal to P.
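Step 1 can be sketched as follows; this illustrative Python uses a round-robin split, which is one possible way of dividing the data set evenly (a contiguous-block or size-weighted split would serve equally well).

```python
def shard_dataset(dataset, num_shards):
    """Evenly partition a data set X into P slices X1..XP by dealing
    items round-robin to the shards."""
    shards = [[] for _ in range(num_shards)]
    for idx, item in enumerate(dataset):
        shards[idx % num_shards].append(item)
    return shards
```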
Step 2, index construction
Each node builds a corresponding ANNS index structure for each stored data slice. Illustratively, a process p running in node p may build an ANNS index structure for the data slice Xp.
Step 3, data searching
Each node performs data search in its stored data fragment based on the built ANNS index structure to obtain the top-k nearest neighbor data matched with the query information input by the user. For example, the process p in the node p may convert the obtained query information into a d-dimensional query vector q, and perform data search in the data slice Xp based on the constructed index structure, to obtain the top-k nearest neighbor data closest to the query vector q (i.e., with the highest matching degree; the following steps all use the distance to represent the matching degree). For example, the process p may sort the searched top-k nearest neighbor data from small to large according to the distance between each nearest neighbor data and the query vector q, so as to obtain an array Ap. That is, the array Ap contains k elements, each element being one nearest-neighbor data, and the first element (e.g., denoted as Ap1) is the nearest-neighbor data with the smallest distance from the query vector q among the k nearest-neighbor data.
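Step 3 can be sketched as follows. For illustration, a brute-force scan stands in for the ANNS index lookup that a real node would use; the function name is an assumption.

```python
import math


def top_k_nearest(shard, query, k):
    """Per-node search sketch: rank a shard's vectors by Euclidean distance
    to the query vector q and keep the k closest, sorted small-to-large
    by distance (the array Ap)."""
    def dist(vec):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(vec, query)))

    return sorted(shard, key=dist)[:k]
```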
Step 4, data sampling
Each node randomly selects a plurality of target nearest neighbor data from the top-k nearest neighbor data it searched, and obtains the matching degree between the plurality of target nearest neighbor data and the query information. Illustratively, the process p in node p randomly selects s target nearest neighbor data from the k nearest neighbor data contained in its array Ap, where s may be much smaller than k (e.g., s equal to 2 when k is equal to 20, and s equal to 2 when k is equal to 100). Then, the process p obtains the distances between the s target nearest neighbor data and the query vector q, respectively, obtaining s distances corresponding to the s target nearest neighbor data. Then, the s distances are sorted from small to large to obtain an array Sp. That is, the array Sp includes s elements, each element is the distance corresponding to one target nearest neighbor data, wherein the first element (e.g., Sp[1]) is the smallest of the s distances.
The processes on all nodes may then perform a distributed merge operation on all arrays Sp (i.e., the P arrays Sp) to form a new array S containing |S| elements, also sorted by distance from small to large. If there is no repeated value across the arrays Sp of the processes, the array S contains P×s elements in total, i.e., |S| = P×s.
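The sampling and distributed merge of step 4 can be simulated on one machine as follows (the merge is modeled as concatenate-and-sort; the random data is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
P, k, s = 2, 20, 2

# Each node's sorted top-k distance array Ap (simulated here).
A = [np.sort(rng.uniform(0.0, 100.0, k)) for _ in range(P)]

# Each node samples s target distances without replacement and sorts them -> Sp.
Sp = [np.sort(rng.choice(Ap, size=s, replace=False)) for Ap in A]

# Distributed merge (Union): gather all Sp and re-sort -> S.
S = np.sort(np.concatenate(Sp))
```

With no repeated values, |S| = P×s, matching the text above.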
Step 5, data bucketing
It will be understood that the s elements (i.e., s distances) included in each array Sp may divide the array Ap into s+1 distance intervals, and correspondingly, the |S| elements included in the merged array S may divide the array {Ap} into |S|+1 distance intervals. Here the array {Ap} represents the union of all arrays Ap across all processes, containing P×k elements.
Assuming that the array Sp contains 2 elements, respectively Sp[1] = 10 and Sp[2] = 20, the s+1 distance intervals may include, in order, the 3 distance intervals (-∞, 10], (10, 20], (20, +∞).
Assuming that P is equal to 2, the array S contains 4 elements, S[1] = distance 3, S[2] = distance 10, S[3] = distance 12, S[4] = distance 20, and then the |S|+1 distance intervals may include, in order, the 5 distance intervals (-∞, 3], (3, 10], (10, 12], (12, 20], (20, +∞).
Based on this, the process p can determine which of the |S|+1 distance intervals the distance between each of the k nearest neighbor data in the array Ap and the query vector q belongs to, and obtain an array Cp through statistics. The array Cp includes |S|+1 elements corresponding to the |S|+1 distance intervals, and the element Cp[i] in the array Cp records the number of nearest neighbor data included in the array Ap that belong to the i-th distance interval. Wherein i is an integer greater than or equal to 1 and less than or equal to |S|+1.
The processes on all nodes may then perform a distributed accumulation over all arrays Cp (i.e., the P arrays Cp), i.e., add the elements at corresponding positions in all arrays Cp, resulting in a new array C. The new array C also includes |S|+1 elements corresponding to the |S|+1 distance intervals, and the element C[j] in the new array C records the number of nearest neighbor data contained in the array {Ap} that belong to the j-th distance interval. Wherein j is an integer greater than or equal to 1 and less than or equal to |S|+1.
For example, still assuming P is equal to 2, the array S contains 4 elements, S[1] = distance 3, S[2] = distance 10, S[3] = distance 12, S[4] = distance 20, and the |S|+1 distance intervals include, in order, (-∞, 3], (3, 10], (10, 12], (12, 20], (20, +∞). In addition, assuming k is equal to 20, i.e., {Ap} contains 40 nearest neighbor data in total, the final statistically derived array C may be [2, 8, 12, 15, 3] based on the distances between the 40 nearest neighbor data and the query vector q. That is, of the 40 nearest neighbor data, the number whose distance from the query vector q falls in the interval (-∞, 3] is 2, in the interval (3, 10] is 8, in the interval (10, 12] is 12, in the interval (12, 20] is 15, and in the interval (20, +∞) is 3.
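The bucketing of step 5 can be reproduced with the worked numbers above; the simulated distances below are an assumption chosen to yield the example counts, and the right-closed interval convention of the text is preserved:

```python
import numpy as np

S = np.array([3.0, 10.0, 12.0, 20.0])  # merged sample distances, |S| = 4

# Simulated distances of the 40 nearest neighbors in {Ap}, constructed so
# the per-interval counts match the example array C = [2, 8, 12, 15, 3].
dists = np.concatenate([
    np.full(2, 1.0),    # (-inf, 3]
    np.full(8, 5.0),    # (3, 10]
    np.full(12, 11.0),  # (10, 12]
    np.full(15, 15.0),  # (12, 20]
    np.full(3, 25.0),   # (20, +inf)
])

# searchsorted with side='left' maps a distance d <= S[i] into bucket i,
# i.e., the interval boundaries are right-closed, as in the text.
buckets = np.searchsorted(S, dists, side='left')
C = np.bincount(buckets, minlength=len(S) + 1)
```

A distributed run would compute one such histogram Cp per node and then element-wise sum them into C.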
Step 6, threshold calculation
According to the array C, the elements C[1], C[2], ... therein are accumulated until the accumulated value is not less than a preset number (e.g., k) for the first time. Assuming that this condition is first reached at C[j], the distance threshold (corresponding to the matching degree threshold) is determined to be S[j].
Taking the example of step 5 above, the preset number k = 20; when the accumulation reaches C[3], 2+8+12 = 22, and the accumulated value is greater than 20, i.e., the number of nearest neighbor data contained in the array {Ap} whose distance is less than or equal to S[3] = 12 is greater than 20, so S[3] = 12 may be determined as the distance threshold.
Illustratively, assuming the final statistically derived array C is [1, 2, 1, 2, 34], since the values of C[1], C[2], C[3], and C[4] are all very small, the accumulated value does not reach 20 until C[5] is added, where 1+2+1+2+34 = 40.
In this case, the distance threshold may be determined to be positive infinity (i.e., the matching degree threshold is negative infinity), which is equivalent to performing no data screening subsequently; the k nearest neighbor data obtained by each node's query all meet the requirements.
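The threshold calculation of step 6, including the positive-infinity fallback just described, can be sketched as a small helper (the function name is an assumption):

```python
def distance_threshold(C, S, k):
    """Accumulate C[1], C[2], ... (0-based here) until the running total
    first reaches k; the boundary S[j] at that point is the distance
    threshold. Returns +inf if only the final, unbounded bucket makes
    the total reach k (equivalent to no screening)."""
    total = 0
    for j, count in enumerate(C[:-1]):  # the last bucket has no upper bound
        total += count
        if total >= k:
            return S[j]
    return float('inf')

# Worked example from step 5: threshold is S[3] = 12 (1-based indexing).
t1 = distance_threshold([2, 8, 12, 15, 3], [3, 10, 12, 20], 20)

# Degenerate example: the mass sits in the last bucket -> +inf.
t2 = distance_threshold([1, 2, 1, 2, 34], [3, 10, 12, 20], 20)
```

The alternative described next, re-sampling new, larger boundaries, would simply call this helper again with S' and C'.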
Alternatively, in this case, each node may select 2 new target nearest neighbor data whose distances exceed S[4] = distance 20 from its array Ap, and the selections may be merged to obtain a new array S'. For example, the new array S' may also include 4 elements, e.g., S'[1] = distance 21, S'[2] = distance 25, S'[3] = distance 30, and S'[4] = distance 35, respectively. Accordingly, the |S'|+1 distance intervals may include, in order, the 5 distance intervals (-∞, 21], (21, 25], (25, 30], (30, 35], (35, +∞). Accordingly, assuming the final statistics yield a new array C' of [1, 24, 8, 2, 5], then based on the new array C', when the element C'[1] is added to C'[2], 1+24 = 25, and the accumulated value is greater than 20; that is, the number of nearest neighbor data contained in the array {Ap} whose distance is less than or equal to S'[2] = 25 is greater than 20, so S'[2] = 25 can be determined as the distance threshold.
For example, assuming that the new array C' obtained by the final statistics is [8, 2, 1, 4, 25], since the values of C'[1], C'[2], C'[3], and C'[4] in the new array C' are again small and the accumulated value does not reach 20 until C'[5] is added, each node may select 2 new target nearest neighbor data whose distances exceed S'[4] = distance 35 from its array Ap, and then merge to obtain a new array S''. The new array S'' may include, for example, S''[1] = distance 37, S''[2] = distance 45, S''[3] = distance 50, S''[4] = distance 55, and so on; this is not specifically limited in the present specification.
Step 7, data screening
The process p deletes the nearest neighbor data whose distance is greater than the distance threshold from the array Ap (i.e., retains the data whose matching degree is not lower than the matching degree threshold) to generate a new array Ap'.
Step 8, data merging
And the new arrays Ap' obtained by all the processes are combined through a distributed merge operation to obtain an array {Ap'}. The final top-k candidate nearest neighbor data, i.e., those closest to the query vector q, may then be further selected from the array {Ap'}. Then, a prompt word is constructed based on the query vector q and the top-k candidate nearest neighbor data and input into the pre-trained LLM model, and the LLM model executes the corresponding reasoning task based on the prompt word to generate the query result corresponding to the query information.
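Steps 7 and 8 can be sketched end-to-end on the distance arrays (the concrete distances are illustrative assumptions; the merge is again modeled as concatenate-and-sort):

```python
import numpy as np

threshold = 12.0  # distance threshold from step 6
A = [np.array([1.0, 5.0, 11.0, 15.0, 25.0]),
     np.array([2.0, 6.0, 12.0, 16.0, 30.0])]  # sorted Ap per node, P = 2

# Step 7: each node keeps only data with distance <= threshold
# (i.e., matching degree not lower than the matching degree threshold).
A_filtered = [Ap[Ap <= threshold] for Ap in A]

# Step 8: distributed merge of the filtered arrays, then final top-k.
merged = np.sort(np.concatenate(A_filtered))
k = 4
topk = merged[:k]
```

Only the filtered arrays cross the network, which is the source of the transmission savings summarized below.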
In summary, the nodes of the distributed system may retrieve data sets matching the query information input by the user from their respectively stored data slices, and may further select target data from the respectively retrieved data sets. Then, the matching degrees between the plurality of target data selected by the plurality of nodes and the query information are determined, and a matching degree threshold for further screening partial data from each data set is determined from the plurality of matching degrees corresponding to the plurality of target data. Based on this, each node can further screen out, from the large amount of data contained in its retrieved data set, a small amount of data whose matching degree is higher than the matching degree threshold; only this small amount of screened data needs to be acquired and summarized subsequently, and the LLM model is assisted in executing the reasoning task based on it. Therefore, on the premise of ensuring data retrieval efficiency and LLM model generation quality, unnecessary transmission and calculation of data with lower matching degree are avoided, the data transmission amount and calculation amount in the distributed system are greatly reduced, and the calculation and communication overhead is thereby reduced.
Corresponding to the implementation of the method flow, the embodiment of the specification also provides a data reasoning device based on the distributed system. Referring to fig. 3, fig. 3 is a schematic structural diagram of a data inference apparatus based on a distributed system according to an exemplary embodiment. The apparatus 30 may be applied to the distributed system shown in fig. 1, which includes a plurality of nodes each storing a different data slice. As shown in fig. 3, the apparatus 30 includes:
An obtaining unit 301, configured to obtain target data selected by the plurality of nodes from the data sets respectively retrieved by the plurality of nodes, where the data set retrieved by any node includes a plurality of data, in the data slice stored by the node, that match query information input by a user;
A determining unit 302, configured to determine a matching degree between a plurality of target data selected by the plurality of nodes and the query information, and determine a matching degree threshold from a plurality of matching degrees corresponding to the plurality of target data, where a number of data, which are contained in a plurality of data sets retrieved by the plurality of nodes and have matching degrees with the query information higher than the matching degree threshold, is greater than or equal to a preset screening number;
A data summarizing unit 303, configured to obtain inference data, which is screened by the plurality of nodes from respective data sets, and has a matching degree with the query information higher than the matching degree threshold, and summarize the obtained inference data screened by the plurality of nodes into an inference data set;
And the model reasoning unit 304 is used for assisting the LLM model to execute a reasoning task based on the reasoning data set and generating a query result corresponding to the query information.
In an illustrated embodiment, the data set retrieved by any node is a nearest neighbor data set obtained by the node running an approximate nearest neighbor search (ANNS) algorithm to retrieve, from the data slice stored by the node, data matching the query information, where the nearest neighbor data set includes a preset number of nearest neighbor data, in the data slice stored by the node, with the highest matching degree with the query information.
In an embodiment, the target data is selected randomly from all data contained in the data set, or the target data is selected sequentially from all data contained in the data set according to a preset step size.
In an illustrated embodiment, the determining unit 302 is specifically configured to:
sorting a plurality of matching degrees corresponding to the plurality of target data from high to low, and traversing from the first matching degree based on the sorted order;
counting, for the currently traversed target matching degree, the number of data contained in the plurality of data sets whose matching degree with the query information is higher than the target matching degree; and
determining whether the counted data number is greater than or equal to the screening number; if so, determining the target matching degree as the matching degree threshold and ending the traversal, and if not, continuing the traversal to the next matching degree.
In an illustrated embodiment, the determining unit 302 is specifically configured to:
If, when the last matching degree is traversed, the counted number of data contained in the plurality of data sets whose matching degree with the query information is higher than the last matching degree is still smaller than the screening number, new target data selected by the plurality of nodes from their respectively retrieved data sets are further acquired, wherein the new matching degrees corresponding to the new target data selected by the plurality of nodes are lower than the last matching degree;
And so on, until the counted number of data contained in the plurality of data sets whose matching degree with the query information is higher than a new matching degree is greater than or equal to the screening number, at which point that new matching degree is determined as the matching degree threshold.
In an illustrated embodiment, the determining unit 302 is specifically configured to:
and if, when the last matching degree is traversed, the counted number of data contained in the plurality of data sets whose matching degree with the query information is higher than the last matching degree is smaller than the screening number, determining the matching degree threshold as negative infinity.
In an embodiment shown, the plurality of nodes include a master node and a slave node, and the data summarizing unit 303 is specifically configured to:
The master node sends the determined matching degree threshold value to the slave node so that the slave node screens out the reasoning data with the matching degree higher than the matching degree threshold value from the searched data set and sends the reasoning data to the master node, and
And the master node screens out the reasoning data with the matching degree higher than the matching degree threshold value from the data set retrieved by the master node.
In an illustrated embodiment, the model inference unit 304 is specifically configured to:
selecting a preset number of candidate data with highest matching degree with the query information from a plurality of inference data contained in the inference data set;
And constructing a prompt word based on the query information and the candidate data, inputting the prompt word into the LLM model, and executing a reasoning task by the LLM model based on the prompt word to generate a query result corresponding to the query information.
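The prompt-word construction of the model inference unit can be sketched as follows; the template format and function name are assumptions for illustration, since the specification does not fix a particular prompt layout:

```python
def build_prompt(query: str, candidates: list) -> str:
    """Assemble a prompt word from the query information and the top-k
    candidate data (retrieved context). The template is illustrative."""
    context = "\n".join(f"- {c}" for c in candidates)
    return (
        "Answer the question using the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}"
    )

prompt = build_prompt("What is X?", ["candidate data a", "candidate data b"])
```

The resulting string would then be passed as input to the pre-trained LLM model to generate the query result.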
In an embodiment, the matching degree between the query information and each data is represented by the distance between the vector corresponding to the query information and the vector corresponding to the data, wherein the smaller the distance between the vectors, the higher the matching degree.
The implementation process of the functions and roles of the units in the above device 30 is specifically described in the above embodiments, and will not be described in detail herein. It should be understood that the above-mentioned apparatus 30 may be implemented by software, or may be implemented by hardware or a combination of hardware and software. Taking software implementation as an example, the device in a logic sense is formed by reading corresponding computer program instructions into a memory by a processor (CPU) of the device. In addition to the CPU and the memory, the device in which the above apparatus is located generally includes other hardware such as a chip for performing wireless signal transmission and reception, and/or other hardware such as a board for implementing a network communication function.
The apparatus embodiments described above are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, and the components shown as units may or may not be physical modules, i.e., they may be located in one place or distributed over a plurality of network modules. Some or all of the units or modules may be selected according to actual needs to achieve the purposes of the present description. Those of ordinary skill in the art can understand and implement this without undue burden.
The apparatus, units, modules illustrated in the above embodiments may be implemented in particular by a computer chip or entity or by a product having a certain function. A typical implementation device is a computer, which may be in the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email device, game console, tablet computer, wearable device, vehicle-mounted computer, or a combination of any of these devices.
Corresponding to the method embodiments described above, embodiments of the present specification also provide a computing device. Referring to fig. 4, fig. 4 is a schematic structural diagram of a computing device according to an exemplary embodiment. The computing device shown in fig. 4 may be a computing device in the distributed system shown in fig. 1 described above, which includes a plurality of nodes that each store a different piece of data. As shown in fig. 4, the computing device includes a processor 1001 and memory 1002, and may further include an input device 1004 (e.g., keyboard, etc.) and an output device 1005 (e.g., display, etc.). The processor 1001, memory 1002, input devices 1004, and output devices 1005 may be connected by a bus or other means. As shown in fig. 4, the memory 1002 includes a computer-readable storage medium 1003, which computer-readable storage medium 1003 stores a computer program executable by the processor 1001. The processor 1001 may be a CPU, microprocessor, or integrated circuit for controlling the execution of the above method embodiments. 
When running the stored computer program, the processor 1001 may execute the steps of the distributed system-based data reasoning method in this embodiment, which include: obtaining target data selected by a plurality of nodes from the data sets respectively retrieved by the plurality of nodes, where the data set retrieved by any node includes a plurality of data, in the data slice stored by the node, that match query information input by a user; determining matching degrees between the plurality of target data selected by the plurality of nodes and the query information, and determining a matching degree threshold from the plurality of matching degrees corresponding to the plurality of target data, where the number of data contained in the plurality of data sets retrieved by the plurality of nodes whose matching degree with the query information is higher than the matching degree threshold is greater than or equal to a preset screening number; obtaining inference data, screened by the plurality of nodes from the respective data sets, whose matching degree with the query information is higher than the matching degree threshold, and summarizing the obtained inference data screened by the plurality of nodes into an inference data set; and assisting the LLM model to execute a reasoning task based on the inference data set and generate a query result corresponding to the query information; and so on. For a detailed description of each step of the distributed system-based data reasoning method, please refer to the preceding contents, which are not repeated here.
Corresponding to the above-described method embodiments, embodiments of the present description also provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the distributed system-based data reasoning method in the embodiments of the present description. Please refer to the description of the above embodiments, and the details are not repeated here.
The foregoing description of the preferred embodiments is provided for the purpose of illustration only, and is not intended to limit the scope of the disclosure, since any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the disclosure are intended to be included within the scope of the disclosure.
In a typical configuration, the terminal device includes one or more CPUs, input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random access memory (RAM) and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data.
Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
It will be appreciated by those skilled in the art that embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, embodiments of the present specification may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Moreover, embodiments of the present description may take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.