CN118820407A

CN118820407A - Hybrid retrieval method and device for lifecycle stream data based on large language model

Info

Publication number: CN118820407A
Application number: CN202411305238.5A
Authority: CN
Inventors: 徐�明; 李楠; 郭静; 周逸航; 齐剑川
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2024-09-19
Filing date: 2024-09-19
Publication date: 2024-10-22
Anticipated expiration: 2044-09-19
Also published as: CN118820407B

Abstract

The invention relates to the technical field of electric digital data processing and data stream retrieval, in particular to a life cycle stream data mixed retrieval method and device based on a large language model, wherein the method comprises the following steps: converting the query information into semantic retrieval English information, full text retrieval Chinese information and full text retrieval English information by using a large language model; carrying out semantic retrieval and full text retrieval in a preset database by utilizing the semantic retrieval character string, the full text retrieval character string and the filtering condition; obtaining a preliminary ranking score according to the weighted sum of the semantic retrieval score and the full text retrieval score; calculating a comprehensive ranking index according to consistency indexes of semantic retrieval and full text retrieval and ranking confidence coefficient attenuation factors; the preliminary ranking results are reordered based on the deep learning model to generate a final result list containing unique identifiers, rankings, and content. Therefore, the problems that the related technology database has incomplete data, non-uniform standard and non-unique stream names, and is difficult to return accurate streams are solved.

Description

Life cycle stream data mixed retrieval method and device based on large language model

Technical Field

The invention relates to the technical field of electric digital data processing and data stream retrieval, in particular to a life cycle stream data mixed retrieval method and device based on a large language model.

Background

With the advent of the large data age, various types of data are generated and accumulated at extremely high speeds, including large amounts of streaming data. In a life cycle analysis flow database, accurate searching of flow information according to requirements is an important link of life cycle analysis, and how to effectively search, analyze and utilize the flow data, especially in combination with a large language model for mixed searching, becomes a hot spot and a difficulty of current research.

In the related technology, the capability of introducing a large language model can analyze the search intention of a user, output unified search key information and realize the efficient and accurate search of the text-multi-mode data by combining the structural search and the vector search.

However, in the related art, the existing databases including GaBi, encoinvent, ELCD and the like generally have the problems of incomplete data, non-uniform standards, non-unique stream names and the like, so that the conventional full text search and semantic search are difficult to return accurate streams, and improvement is needed.

Disclosure of Invention

The invention provides a life cycle stream data mixed retrieval method and device based on a large language model, which are used for solving the problems that databases in related technologies are incomplete in data, nonuniform in standard, non-unique in stream name, difficult to return accurate streams in traditional full text retrieval and semantic retrieval and the like.

An embodiment of a first aspect of the present invention provides a lifecycle stream data hybrid retrieval method based on a large language model, including the steps of: receiving search requests containing stream query information and filtering conditions input by a user; responding to the search request, converting the query information into semantic search English information, full text search Chinese information and full text search English information by using a large language model, and respectively converting the semantic search English information, the full text search Chinese information and the full text search English information into a semantic search character string and a full text search character string; carrying out semantic retrieval and full text retrieval in a preset database by utilizing the semantic retrieval character string, the full text retrieval character string and the filtering condition to respectively obtain a semantic retrieval result list and a full text retrieval result list which are ordered according to the similarity; combining the semantic search result list and the full text search result list, and calculating corresponding semantic search scores and full text search scores according to each result of the combined result list so as to obtain a preliminary ranking score according to the weighted sum of the semantic search scores and the full text search scores; based on the preliminary ranking score, calculating a comprehensive ranking index according to consistency indexes of semantic retrieval and full-text retrieval and ranking confidence decay factors, and based on the comprehensive ranking index, sorting the combined result list in a descending order, returning to a comprehensive sorting result list, and reserving results meeting preset number conditions to generate a preliminary sorting result; and reordering the preliminary ranking results based on a deep learning model to generate a final result list containing unique identifiers, ranks and content.

Optionally, in an embodiment of the present invention, the converting into the semantic search string and the full text search string respectively includes: and converting the full text search Chinese information and the full text search English information into character string types, and connecting the character strings by using preset character strings to obtain the full text search character strings.

Optionally, in an embodiment of the present invention, the converting into the semantic search string and the full text search string respectively further includes: converting the semantic search English information into an embedded vector by using a preset model; and converting the embedded vector into the semantic retrieval character string.

Optionally, in one embodiment of the present invention, a calculation formula of a similarity score between a sentence and a document queried in the full text search result list is:

，

wherein, For statementsComprises the firstThe number of the keywords to be used for the production of the key words,In the case of a document being a document,For the number of documents to be the number,For the reverse document frequency,For the word frequency of the document,For each result in the merged results list,Word frequency for query sentence;

Vectors in the semantic search result list Sum vectorThe calculation formula of the similarity degree is as follows:

，

wherein, For each result in the merged results list,For the number of documents to be mentioned,Respectively represent the vectorsAnd the vectorIs the first of (2)A component.

Optionally, in one embodiment of the present invention, the calculation formula of the semantic search score and the full text search score is:

，

wherein, For each result in the merged results list,In order to avoid calculating a constant with an excessively large value,AndRanking in full text retrieval and semantic retrieval for each result in the merged results list, respectively.

Optionally, in one embodiment of the present invention, a calculation formula of the initial score in the combined result list is:

，

wherein, AndThe weights for full text retrieval and semantic retrieval respectively,For the full-text search of the score,For the semantic search score to be used,For each result in the merged results list,Representing a full-text search,Representing semantic retrieval;

the calculation formula of the consistency index of the semantic retrieval and the full text retrieval is as follows:

，

wherein, In order to retrieve the total number of results,AndRanking each result in the combined result list in full text retrieval and semantic retrieval respectively;

the calculation formula of the consistency score of the semantic retrieval and the full text retrieval is as follows:

，

wherein, Is a controlAn adjustment parameter for the degree of influence on the final score,For the initial score to be a function of the initial score,Consistency indexes of the semantic retrieval and the full text retrieval are obtained;

The calculation formula of the ranking confidence coefficient attenuation factor is as follows:

，

wherein, In order for the decay rate to be a decay rate,For the initial score;

The calculation formula of the comprehensive ranking index is as follows:

，

wherein, For consistency scores of the semantic and full text searches,And (5) a confidence attenuation factor for the ranking.

An embodiment of a second aspect of the present invention provides a lifecycle stream data hybrid retrieval apparatus based on a large language model, including: the receiving module is used for receiving search requests containing stream query information and filtering conditions input by a user; the conversion module is used for responding to the search request, converting the query information into semantic search English information, full text search Chinese information and full text search English information by using a large language model, and respectively converting the semantic search English information, the full text search Chinese information and the full text search English information into a semantic search character string and a full text search character string; the acquisition module is used for carrying out semantic retrieval and full text retrieval in a preset database by utilizing the semantic retrieval character string, the full text retrieval character string and the filtering condition to respectively obtain a semantic retrieval result list and a full text retrieval result list which are ordered according to the similarity; the calculation module is used for merging the semantic search result list and the full text search result list, and calculating corresponding semantic search scores and full text search scores according to each result of the merged result list so as to obtain a preliminary ranking score according to the weighted sum of the semantic search scores and the full text search scores; the descending order sorting module is used for calculating a comprehensive ranking index according to consistency indexes of semantic retrieval and full text retrieval and ranking confidence attenuation factors based on the preliminary ranking score, descending order sorting is carried out on the combined result list based on the comprehensive ranking index, the comprehensive sorting result list is returned, and the results meeting the preset number conditions are reserved to generate a preliminary sorting result; and the reordering module is used for reordering the preliminary ordering results based on the deep learning model and generating a final result list containing unique identifiers, ranks and contents.

Optionally, in one embodiment of the present invention, the conversion module includes: the first conversion unit is used for converting the full text search Chinese information and the full text search English information into character string types and connecting the character string types by using a preset character string to obtain the full text search character string.

Optionally, in one embodiment of the present invention, the conversion module further includes: the second conversion unit is used for converting the semantic search English information into an embedded vector by using a preset model; and the conversion unit is used for converting the embedded vector into the semantic search character string.

，

wherein, In order for the decay rate to be a decay rate,For the initial score;

The calculation formula of the comprehensive ranking index is as follows:

，

An embodiment of a third aspect of the present invention provides an electronic device, including: the system comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor executes the program to realize the lifecycle stream data mixed retrieval method based on the large language model.

A fourth aspect of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above large language model-based lifecycle stream data hybrid retrieval method.

A fifth aspect of the present invention provides a computer program product storing a computer program which, when executed by a processor, implements a large language model based lifecycle stream data hybrid retrieval method as above.

The embodiment of the invention can summarize and extract the key information of the user input content by utilizing the large language model, avoid the influence of invalid information on semantic retrieval, comprehensively sort the full text retrieval and the semantic retrieval result by using a reciprocal sorting fusion algorithm, realize that all the streams meeting the requirements are accurately retrieved in a database according to the related information of the streams input by the user, improve the quality of the query content, improve the accuracy of the retrieval result and support multi-language query. Therefore, the problems that the database in the related technology is incomplete in data, nonuniform in standard and non-unique in stream name, and the traditional full text retrieval and semantic retrieval are difficult to return to accurate streams are solved.

Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.

Drawings

The foregoing and/or additional aspects and advantages of the invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:

FIG. 1 is a flowchart of a lifecycle stream data hybrid retrieval method based on a large language model, according to an embodiment of the present invention;

FIG. 2 is a flow chart of a large language model based lifecycle stream data hybrid retrieval method, according to an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a life cycle stream data hybrid retrieval device based on a large language model according to an embodiment of the present invention;

fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present invention and should not be construed as limiting the invention.

The following describes a lifecycle stream data mixed retrieval method and apparatus based on a large language model according to an embodiment of the present invention with reference to the accompanying drawings. Aiming at the problems that the databases in the related art mentioned in the background art are incomplete in data, nonuniform in standard and non-unique in stream name, and the conventional full text retrieval and semantic retrieval are difficult to return accurate streams, the invention provides a life cycle stream data mixed retrieval method based on a large language model, in the method, the method can utilize a large language model to summarize and extract key information of user input content, avoid influence of invalid information on semantic retrieval, comprehensively sort full text retrieval and semantic retrieval results by using a reciprocal sorting fusion algorithm, accurately retrieve all streams meeting requirements in a database according to related information of streams input by a user, improve query content quality, improve retrieval result accuracy and support multilingual query. Therefore, the problems that the database in the related technology is incomplete in data, nonuniform in standard and non-unique in stream name, and the traditional full text retrieval and semantic retrieval are difficult to return to accurate streams are solved.

Specifically, fig. 1 is a flow chart of a mixed retrieval method of life cycle stream data based on a large language model according to an embodiment of the present invention.

As shown in fig. 1, the lifecycle stream data hybrid search method based on the large language model includes the following steps:

in step S101, a search request including stream query information and filter conditions, which are input by a user, is received.

It is understood that in the embodiment of the present invention, the query information of the stream includes, but is not limited to: the name of the flow, the chemical and physical properties of the flow, the molecular formula of the flow, the environmental impact of the flow, etc.

In the actual implementation process, the user in the embodiment of the invention can input the 'query information' of the input stream according to the self requirement, such as the Chinese and English names of the input stream, the chemical and physical properties of the stream, the molecular formula of the stream, the environmental influence of the stream and the like. Further, the user in the embodiment of the invention can also select whether to input the 'filtering condition' of the flow, and if not, the filtering condition defaults to null. According to the embodiment of the invention, the quality of the query content is improved by receiving the search request which is input by the user and contains the query information and the filtering condition, so that the key information of the user input content can be summarized and extracted by using a large language model, the influence of invalid information on semantic search is avoided, synonyms are expanded according to the names of the streams, and the accurate and comprehensive full-text query content can be generated.

For example, according to the needs of the user, the English name of the input stream is: hexafluorotopane, chinese name: hexafluoropropylene, the chemical physical properties are: halogenated olefins having the formula: the influence of C3H2F6 on the environment is as follows: may have specific toxic and environmental effects.

In step S102, in response to the search request, the query information is converted into semantic search english information, full-text search chinese information, and full-text search english information using the large language model, and is converted into a semantic search string and a full-text search string, respectively.

It can be understood that the large language model in the embodiment of the invention has strong natural language processing capability and wide application prospect.

As one possible implementation manner, the embodiment of the invention can respond to the search request, preprocess the query information by using a large language model, summarize the content of the condensed query information, and convert the query information into three parts of semantic search English information, full text search Chinese information and full text search English information. The semantic search information comprises English keywords for semantic search, the full-text search information comprises original names and synonyms of streams for full-text search, and the original names and synonyms are converted into semantic search character strings and full-text search character strings respectively, so that multi-language query is supported.

The embodiment of the invention can convert the query information input by the user into the Chinese and English query character strings by utilizing the large language model, solves the problem of retrieval difficulty caused by inconsistent database languages, and can also automatically expand other languages to realize cross-language stream information retrieval.

Optionally, in one embodiment of the present invention, the conversion into the semantic search string and the full text search string respectively includes: and converting the full-text retrieval Chinese information and the full-text retrieval English information into character string types, and connecting the character strings by using preset character strings to obtain full-text retrieval character strings.

It can be understood that the preset string connection in the embodiment of the present invention may be a string "OR" connection.

Specifically, the embodiment of the invention can convert the full text retrieval Chinese information and the full text retrieval English information into character string types, and connect the character strings with the character string OR to obtain the full text retrieval character stringTherefore, the search strings of different languages are connected, the user operation is simplified, the search efficiency is improved, support is provided for meeting the requirements of users of different languages, and the accuracy of search results is further improved.

It should be noted that the preset string connection may be set by those skilled in the art according to the actual situation, and is not particularly limited herein.

Optionally, in one embodiment of the present invention, the conversion into the semantic search string and the full text search string respectively includes: converting the semantic search English information into an embedded vector by using a preset model; the embedded vector is converted into a semantic search string.

It can be appreciated that the preset model in the embodiment of the present invention may be an embedded model, for example, openAI Embeddings model.

In the actual execution process, the embodiment of the invention can utilize OpenAI Embeddings model to convert 'semantic search English information' into an embedded vector and then convert the embedded vector into a semantic search character stringThereby improving the accuracy of the search result and supporting the complex query of multiple languages.

It should be noted that the preset model may be set by those skilled in the art according to actual situations, and is not limited herein.

In step S103, semantic search and full text search are performed in a preset database by using the semantic search string, the full text search string and the filtering condition, so as to obtain a semantic search result list and a full text search result list which are ordered according to the similarity, respectively.

It will be appreciated that embodiments of the present invention may use the search string and filtering conditions to develop a hybrid search within a database.

As a possible implementation manner, the embodiment of the invention can perform full text retrieval in the database by using the full text retrieval character string and the filtering condition, return the full text retrieval result list ordered according to the similarity, perform semantic retrieval in the database by using the semantic retrieval character string and the filtering condition, and return the semantic retrieval result list ordered according to the similarity, thereby obtaining the semantic retrieval result list and the full text retrieval result list ordered according to the similarity.

In one embodiment of the present invention, the calculation formula of the similarity score between the sentence and the document of the query in the full-text search result list is:

，

wherein, For sentences requiring inquiryAnd documentsA similarity score between the two,For statementsComprises the firstThe number of the keywords to be used for the production of the key words,In the case of a document being a document,For the number of documents to be the number,For the reverse document frequency,For the word frequency of the document,To merge each result in the result list,Is the word frequency of the query sentence.

To reverse document frequency, one can measureThe rareness in all documents indirectly reflects how much information it contains, calculated as follows:

，

wherein, Representative comprisesIs a function of the number of documents in the (c),Representing the total number of documents,For statementsComprises the firstA keyword;

for the word frequency of the document, the representative word In a documentThe relative frequencies of (2) are calculated as follows;

，

wherein, Representative ofIn a documentIs used to determine the number of occurrences of the picture,Representative documentIs a function of the number of words of (a),Representing the average article word number; parameters (parameters)For the adjustment factor, a larger value represents a larger influence of the word frequency of the document,Representing the influence of neglecting the word frequency of the document, and is generally set to be 1.2; parameters (parameters)Is a normalization factor that is used to normalize the data,Representing that the normalization is not performed at all,Representing complete normalization, typically 0.75.

Query word frequency, representative wordIn the sentenceThe relative frequencies of (a) are calculated as follows:

，

wherein, Representative ofIn a query statementIs used to determine the number of occurrences of the picture,For the adjustment factor, a larger value represents a larger impact on the query term frequency,Representing the effect of ignoring the frequency of the query word, is typically set to 0.

Further, in the semantic search result list in the embodiment of the invention, vectors are in the semantic search result listSum vectorThe calculation formula of the similarity degree is as follows:

，

wherein, Is vector quantitySum vectorThe similarity between the two is 0,1, the closer to 1 the similarity is higher,For the number of documents to be the number,Respectively represent vectorsSum vectorIs the first of (2)A component.

The embodiment of the invention can utilize semantic search character strings, full text search character strings and filtering conditions to carry out semantic search and full text search in a preset database, can widely capture streams with similar names through full text search, can accurately position streams with similar contents through semantic search, solves the problems of non-unique stream names and incomplete data, and can automatically set weights of two search methods, thereby further improving the accuracy of search results.

It should be noted that the preset database may be set by those skilled in the art according to actual situations, and is not limited herein.

In step S104, the semantic search result list and the full-text search result list are combined, and the corresponding semantic search score and full-text search score are calculated according to each result of the combined result list, so as to obtain a preliminary ranking score according to the weighted sum of the semantic search score and the full-text search score.

It can be appreciated that the embodiment of the invention can comprehensively sort the results returned by the full text search and the semantic search based on a reciprocal sorting fusion (Reciprocal Rank Fusion, RRF) algorithm.

In the actual execution process, the embodiment of the invention can combine the result list returned by the semantic search with the result list returned by the full text search, for example: the full text search returns { result 1, result 2}, the semantic search returns { result 2, result 3}, the combined list is { result 1, result 2, result 3}, and the corresponding semantic search score and full text search score are calculated according to each result of the combined result list, so that the preliminary ranking score is obtained according to the weighted sum of the semantic search score and the full text search score, and the accuracy of the search result can be further improved.

In one embodiment of the present invention, the calculation formulas of the semantic search score and the full text search score are as follows:

，

wherein, To merge each result in the result list,In order to avoid calculating a constant with an excessively large value,AndRanking of each result in the merged results list in the full text search and the semantic search, respectively.

Specifically, for each result in the merge listThe full text search score can be calculated according to the formulaAnd semantic retrieval scores，To avoid calculating a constant with an excessively large value, it can be set by itself,AndRanking of each result in the merged results list in the full text search and the semantic search, respectively. If a result originates from only one search pattern, its score for the other search pattern is 0, for example, if a certain result is retrieved only by the full text retrieval pattern, its semantic retrieval score is 0.

In one embodiment of the present invention, the calculation formula of the initial score in the combined result list is:

，

wherein, AndWeights for full text retrieval and semantic retrieval can be set by themselves,For the full-text search of the score,For the semantic retrieval of the score,To merge each result in the result list,Representing a full-text search,Representing semantic retrieval.

For the resultsCalculating two search method result consistency indexes (Consistency Index, CI):

，

wherein, For the total number of search results, if the document does not appear in the full text search,=Total_results+1; if the document does not appear in the semantic search,= total_results + 1；AndRanking the result in full text retrieval and semantic retrieval respectively;

Will be AndIn combination, the calculation formula for calculating the consistency score considering consistency, namely the consistency score of the semantic search and the full text search is as follows:

，

wherein, Is an adjustment parameter for controllingThe extent of impact on the final score, the range may be set to 0, 1,For the initial score to be a score of,Results of the two search methods are consistent indexes;

Further, considering that the reliability of the search result may decrease with the decrease of the rank, a confidence decay factor (Confidence Decay Factor, CDF) is introduced, and the calculation formula of the rank confidence decay factor is:

，

wherein, In order for the decay rate to be a decay rate,Is an initial score;

further, the calculation formula of the final obtained comprehensive ranking index is as follows:

，

wherein, For consistency scores for semantic and full text searches,Is a rank confidence decay factor.

Specifically, for each result in the merged list, its comprehensive ranking may be calculated according to the formula above, thereby providing support for subsequent generation of a final result list containing unique identifiers, rankings, and content, and further improving the accuracy of the retrieved results.

In step S105, based on the preliminary ranking score, a comprehensive ranking index is calculated according to the consistency index of the semantic search and the full text search and the ranking confidence decay factor, and based on the comprehensive ranking index, the comprehensive ranking result list is ranked in descending order, the comprehensive ranking result list is returned, and the results satisfying the preset number condition are retained, so as to generate a preliminary ranking result.

It can be understood that the results satisfying the preset number of conditions in the embodiment of the present invention may be top-k results.

As a possible implementation manner, the embodiment of the invention can calculate the comprehensive ranking index according to the consistency index and the ranking confidence coefficient attenuation factor of the two retrieval methods of semantic retrieval and full-text retrieval based on the preliminary ranking score, sort the comprehensive result list in a descending order based on the comprehensive ranking index, return the comprehensive ranking result list and the corresponding score, reserve top-k results, provide the preliminary ranking result of the retrieval, and filter out the obviously irrelevant results.

The embodiment of the invention improves the accuracy of the search result and integrates the advantages of two search methods: the full text search can be used for widely capturing streams with similar names, and the semantic search can be used for accurately positioning streams with similar contents, so that the problems of non-unique stream names, incomplete data and the like are solved.

It should be noted that the preset number of conditions may be set by those skilled in the art according to actual situations, and is not particularly limited herein.

In step S106, the preliminary ranking result list is reordered based on the deep learning model, generating a final result list containing unique identifiers, ranks, and content.

In the actual execution process, the embodiment of the invention can reorder the preliminary retrieval result based on the deep learning model, return a final result list containing unique identifiers (uuid), ranks and contents, and further improve the accuracy of the retrieval result by a reordering method based on the deep learning model. The deep learning network structure comprises an input layer, a coding layer, a relevance scoring layer and a sequencing layer. The specific operation method is as follows:

(1) Input layer: based on ordered document list For each documentSum it with query statement、、Form an input pairAnd uses special delimiter connections:

，

wherein, A flag indicating the start of a task,For separators between tasks, text is converted into a token sequence that the model can understand.

(2) Coding layer: for each input pair using a pre-trained transducer modelRespectively carrying out full text and semantic feature coding to obtain vector representation:

，

due to the presence of special separators, a transducer model acts on the input pair Multiple vectors are available above, reserving for coding to represent the entire inputThe corresponding vector, i.e. the first vector, is marked. And carrying out weighted fusion on the full text features and the semantic features:

，

wherein, AndA full text vector representation and a semantic vector representation respectively,AndRespectively the weights of full text retrieval and semantic retrieval.

(3) Correlation scoring layer: using linear layersActivation function(E.g., softmax) to calculate a relevance score for each input pair:

，

wherein, For a weighted fusion of full text features and semantic features,Is a linear layer.

(4) Sequencing layer: sorting according to the similarity score from high to low to obtain a rearranged document list:

，

wherein, In order to order the documents,A relevance score is provided for each input pair. At the same time, the unique identifier (uuid), ranking, and content of the document are returned, resulting in a final result list.

Specifically, with reference to fig. 2, the working principle of the lifecycle stream data hybrid search method based on a large language model in the embodiment of the present invention may be described in detail in a specific embodiment.

As shown in fig. 2, an embodiment of the present invention may include the steps of:

step S201: query content and filtering conditions are input.

According to the query information of the input stream, the user in the embodiment of the invention can select whether the filtering condition of the input stream is selected according to the self requirement.

Step S202: and extracting, embedding and converting the large language model into a query character string.

The embodiment of the invention can convert the query information and the filtering condition into the search character string by using a large language model, namely, the search character string.

Step S203: and (5) performing full text retrieval.

The embodiment of the invention can perform full text retrieval in a database.

Step S204: and carrying out semantic retrieval.

The embodiment of the invention can carry out semantic retrieval in the database.

Step S205: and returning the full text retrieval result.

The embodiment of the invention can return full-text retrieval results ranked according to the similarity.

Step S206: and returning a semantic retrieval result.

The embodiment of the invention can return the semantic retrieval results ranked according to the similarity.

Step S207: and integrating the search result.

The embodiment of the invention can combine the result list returned by the full text search and the semantic search and integrate the search result.

Step S208: a weighted score of the search result is calculated.

The method and the device can calculate the weighted sum of the semantic retrieval score and the full text retrieval score as the preliminary ranking score.

Step S209: and returning a final sorting result list.

The method and the device can reorder the preliminary ranking results based on the deep learning model and return a final result list containing unique identifiers, ranks and contents.

According to the life cycle stream data mixed retrieval method based on the large language model, which is provided by the embodiment of the invention, key information of user input content can be summarized and extracted by using the large language model, the influence of invalid information on semantic retrieval is avoided, and the comprehensive sorting of full-text retrieval and semantic retrieval results is performed by using a reciprocal sorting fusion algorithm, so that all streams meeting requirements are accurately retrieved in a database according to related information of the streams input by a user, the quality of query content is improved, the accuracy of the retrieval results is improved, and multi-language query is supported. Therefore, the problems that the database in the related technology is incomplete in data, nonuniform in standard and non-unique in stream name, and the accurate stream is difficult to return in the traditional full text retrieval and semantic retrieval are solved.

Next, a life cycle stream data mixed retrieval device based on a large language model according to an embodiment of the present invention will be described with reference to the accompanying drawings.

Fig. 3 is a schematic structural diagram of a life cycle stream data mixed retrieval device based on a large language model according to an embodiment of the present invention.

As shown in fig. 3, the large language model-based life cycle stream data hybrid retrieval device 10 includes: the system comprises a receiving module 100, a converting module 200, an acquiring module 300, a calculating module 400, a descending order ordering module 500 and a reordering module 600.

Specifically, the receiving module 100 is configured to receive a search request including query information of a stream and filtering conditions, which are input by a user.

The conversion module 200 is configured to convert query information into semantic search english information, full-text search chinese information, and full-text search english information using a large language model in response to a search request, and to convert the query information into a semantic search string and a full-text search string, respectively.

The obtaining module 300 is configured to perform semantic search and full text search in a preset database by using the semantic search string, the full text search string and the filtering condition, so as to obtain a semantic search result list and a full text search result list that are ordered according to the similarity, respectively.

The calculation module 400 is configured to combine the semantic search result list and the full-text search result list, and calculate a corresponding semantic search score and full-text search score according to each result of the combined result list, so as to obtain a preliminary ranking score according to a weighted sum of the semantic search score and the full-text search score.

The descending order sorting module 500 is configured to calculate a comprehensive ranking index according to the consistency index of the semantic search and the full text search and the ranking confidence decay factor based on the preliminary ranking score, sort the combined result list in descending order based on the comprehensive ranking index, return to the comprehensive sorting result list, and reserve the results satisfying the preset number condition to generate a preliminary sorting result.

The reordering module 600 is configured to reorder the preliminary ranking results based on the deep learning model, and generate a final result list including unique identifiers, ranks, and content.

Optionally, in one embodiment of the present invention, the conversion module 200 includes: a first conversion unit.

The first conversion unit is used for converting the full text search Chinese information and the full text search English information into character string types and connecting the character string types by using a preset character string to obtain a full text search character string.

Optionally, in one embodiment of the present invention, the conversion module 200 further includes: a second conversion unit and a conversion unit.

The second conversion unit is used for converting the semantic search English information into an embedded vector by using a preset model.

And the conversion unit is used for converting the embedded vector into a semantic retrieval character string.

Optionally, in one embodiment of the present invention, a calculation formula of a similarity score between a sentence and a document of a query in the full-text search result list is:

，

wherein, For statementsComprises the firstThe number of the keywords to be used for the production of the key words,In the case of a document being a document,For the number of documents to be the number,For the reverse document frequency,For the word frequency of the document,To merge each result in the result list,Word frequency for query sentence;

vector in semantic search result list Sum vectorThe calculation formula of the similarity degree is as follows:

，

wherein, To merge each result in the result list,For the number of documents to be the number,Respectively represent vectorsSum vectorIs the first of (2)A component.

Optionally, in one embodiment of the present invention, the calculation formulas of the semantic search score and the full text search score are:

，

Optionally, in one embodiment of the present invention, the calculation formula of the initial score in the combined result list is:

，

wherein, AndThe weights for full text retrieval and semantic retrieval respectively,For the full-text search of the score,For the semantic retrieval of the score,To merge each result in the result list,Representing a full-text search,Representing semantic retrieval;

The calculation formula of the consistency index of semantic retrieval and full text retrieval is as follows:

，

The calculation formula of the consistency score of the semantic search and the full text search is as follows:

，

wherein, Is a controlAn adjustment parameter for the degree of influence on the final score,For the initial score to be a score of,Consistency indexes of semantic retrieval and full text retrieval are obtained;

The calculation formula of the ranking confidence decay factor is:

，

wherein, In order for the decay rate to be a decay rate,Is an initial score;

the calculation formula of the comprehensive ranking index is as follows:

，

It should be noted that the explanation of the foregoing embodiment of the method for mixed searching of lifecycle stream data based on a large language model is also applicable to the apparatus for mixed searching of lifecycle stream data based on a large language model of this embodiment, and will not be repeated here.

According to the life cycle stream data mixed retrieval device based on the large language model, which is provided by the embodiment of the invention, key information of user input content can be summarized and extracted by using the large language model, the influence of invalid information on semantic retrieval is avoided, and the comprehensive sorting of full-text retrieval and semantic retrieval results is performed by using a reciprocal sorting fusion algorithm, so that all streams meeting requirements are accurately retrieved in a database according to related information of the streams input by a user, the quality of query content is improved, the accuracy of retrieval results is improved, and multi-language query is supported. Therefore, the problems that the database in the related technology is incomplete in data, nonuniform in standard and non-unique in stream name, and the accurate stream is difficult to return in the traditional full text retrieval and semantic retrieval are solved.

Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. The electronic device may include:

memory 401, processor 402, and a computer program stored on memory 401 and executable on processor 402.

The processor 402 implements the large language model-based lifecycle stream data hybrid retrieval method provided in the above embodiments when executing a program.

Further, the electronic device further includes:

A communication interface 403 for communication between the memory 401 and the processor 402.

A memory 401 for storing a computer program executable on the processor 402.

Memory 401 may comprise high-speed RAM memory or may also include non-volatile memory (non-volatile memory), such as at least one disk memory.

If the memory 401, the processor 402, and the communication interface 403 are implemented independently, the communication interface 403, the memory 401, and the processor 402 may be connected to each other by a bus and perform communication with each other. The bus may be an industry standard architecture (Industry Standard Architecture, abbreviated ISA) bus, an external device interconnect (PERIPHERAL COMPONENT, abbreviated PCI) bus, or an extended industry standard architecture (Extended Industry Standard Architecture, abbreviated EISA) bus, among others. The buses may be divided into address buses, data buses, control buses, etc. For ease of illustration, only one thick line is shown in fig. 4, but not only one bus or one type of bus.

Alternatively, in a specific implementation, if the memory 401, the processor 402, and the communication interface 403 are integrated on a chip, the memory 401, the processor 402, and the communication interface 403 may complete communication with each other through internal interfaces.

Processor 402 may be a central processing unit (Central Processing Unit, abbreviated as CPU), or an Application SPECIFIC INTEGRATED Circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the invention.

The present embodiment also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the lifecycle stream data hybrid retrieval method based on a large language model as above.

The embodiment of the invention also provides a computer program product, wherein the computer program product is stored with a computer program, and the computer program is executed by a processor to realize the life cycle stream data mixed retrieval method based on the large language model.

In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or N embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.

Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "N" means at least two, for example, two, three, etc., unless specifically defined otherwise.

Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and additional implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order from that shown or discussed, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the embodiments of the present invention.

Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., a ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or N wires, a portable computer cartridge (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium may even be paper or other suitable medium upon which the program is printed, as the program may be electronically captured, via optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.

It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the N steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. As with the other embodiments, if implemented in hardware, may be implemented using any one or combination of the following techniques, as is well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application specific integrated circuits having suitable combinational logic gates, programmable Gate Arrays (PGAs), field Programmable Gate Arrays (FPGAs), and the like.

Those of ordinary skill in the art will appreciate that all or part of the steps carried out in the method of the above-described embodiments may be implemented by a program to instruct related hardware, and the program may be stored in a computer readable storage medium, where the program when executed includes one or a combination of the steps of the method embodiments.

In addition, each functional unit in the embodiments of the present invention may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a computer readable storage medium if implemented as software functional modules and sold or used as a stand-alone product.

The above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, or the like. While embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the invention, and that variations, modifications, alternatives and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the invention.

Claims

1. A life cycle stream data mixed retrieval method based on a large language model is characterized by comprising the following steps:

receiving search requests containing stream query information and filtering conditions input by a user;

Responding to the search request, converting the query information into semantic search English information, full text search Chinese information and full text search English information by using a large language model, and respectively converting the semantic search English information, the full text search Chinese information and the full text search English information into a semantic search character string and a full text search character string;

Carrying out semantic retrieval and full text retrieval in a preset database by utilizing the semantic retrieval character string, the full text retrieval character string and the filtering condition to respectively obtain a semantic retrieval result list and a full text retrieval result list which are ordered according to the similarity;

Combining the semantic search result list and the full text search result list, and calculating corresponding semantic search scores and full text search scores according to each result of the combined result list so as to obtain a preliminary ranking score according to the weighted sum of the semantic search scores and the full text search scores;

Based on the preliminary ranking score, calculating a comprehensive ranking index according to consistency indexes of semantic retrieval and full-text retrieval and ranking confidence decay factors, and based on the comprehensive ranking index, sorting the combined result list in a descending order, returning to a comprehensive sorting result list, and reserving results meeting preset number conditions to generate a preliminary sorting result;

and reordering the preliminary ranking results based on a deep learning model to generate a final result list containing unique identifiers, ranks and content.

2. The large language model based life cycle stream data mixed search method according to claim 1, wherein the converting into semantic search strings and full text search strings, respectively, comprises:

And converting the full text search Chinese information and the full text search English information into character string types, and connecting the character strings by using preset character strings to obtain the full text search character strings.

3. The large language model based life cycle stream data mixed search method according to claim 1, wherein the converting into semantic search strings and full text search strings, respectively, further comprises:

converting the semantic search English information into an embedded vector by using a preset model;

and converting the embedded vector into the semantic retrieval character string.

4. The large language model-based life cycle stream data mixed retrieval method according to claim 1, wherein the calculation formula of the similarity score between the sentence and the document queried in the full text retrieval result list is:

，

5. The large language model based lifecycle stream data hybrid retrieval method according to claim 4, wherein the semantic retrieval score and the full text retrieval score are calculated as:

，

6. The mixed retrieval method of life cycle stream data based on a large language model according to claim 5, wherein the calculation formula of the initial score in the combined result list is:

，

wherein, In order for the decay rate to be a decay rate,For the initial score;

The calculation formula of the comprehensive ranking index is as follows:

，

7. A lifecycle stream data hybrid retrieval apparatus based on a large language model, comprising:

The receiving module is used for receiving search requests containing stream query information and filtering conditions input by a user;

The conversion module is used for responding to the search request, converting the query information into semantic search English information, full text search Chinese information and full text search English information by using a large language model, and respectively converting the semantic search English information, the full text search Chinese information and the full text search English information into a semantic search character string and a full text search character string;

the acquisition module is used for carrying out semantic retrieval and full text retrieval in a preset database by utilizing the semantic retrieval character string, the full text retrieval character string and the filtering condition to respectively obtain a semantic retrieval result list and a full text retrieval result list which are ordered according to the similarity;

The calculation module is used for merging the semantic search result list and the full text search result list, and calculating corresponding semantic search scores and full text search scores according to each result of the merged result list so as to obtain a preliminary ranking score according to the weighted sum of the semantic search scores and the full text search scores;

The descending order sorting module is used for calculating a comprehensive ranking index according to consistency indexes of semantic retrieval and full text retrieval and ranking confidence attenuation factors based on the preliminary ranking score, descending order sorting is carried out on the combined result list based on the comprehensive ranking index, the comprehensive sorting result list is returned, and the results meeting the preset number conditions are reserved to generate a preliminary sorting result;

And the reordering module is used for reordering the preliminary ordering results based on the deep learning model and generating a final result list containing unique identifiers, ranks and contents.

8. The large language model based lifecycle stream data hybrid retrieval apparatus as described in claim 7, wherein the conversion module comprises:

the first conversion unit is used for converting the full text search Chinese information and the full text search English information into character string types and connecting the character string types by using a preset character string to obtain the full text search character string.

9. The large language model based lifecycle stream data hybrid retrieval apparatus as described in claim 7, wherein the conversion module further comprises:

the second conversion unit is used for converting the semantic search English information into an embedded vector by using a preset model;

and the conversion unit is used for converting the embedded vector into the semantic search character string.

10. The large language model based life cycle stream data mixed retrieval device according to claim 7, wherein the calculation formula of similarity score between the sentence and the document queried in the full text retrieval result list is:

，

11. The large language model based life cycle stream data mixed retrieval device according to claim 10, wherein the calculation formulas of the semantic retrieval score and the full text retrieval score are:

，

12. The large language model based life cycle stream data mixed retrieval device according to claim 11, wherein the calculation formula of the initial score in the combined result list is:

，

wherein, In order for the decay rate to be a decay rate,For the initial score;

The calculation formula of the comprehensive ranking index is as follows:

，

13. An electronic device, comprising: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor executing the program to implement the large language model based lifecycle stream data hybrid retrieval method as recited in any one of claims 1-6.

14. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program is executed by a processor for implementing the large language model based lifecycle stream data hybrid retrieval method as claimed in any one of claims 1-6.

15. A computer program product comprising a computer program, characterized in that the computer program is executed for implementing the large language model based lifecycle stream data hybrid retrieval method as claimed in any one of claims 1-6.