CN118349621A

CN118349621A - Index establishment method, index retrieval method and electronic equipment

Info

Publication number: CN118349621A
Application number: CN202410465580.5A
Authority: CN
Inventors: 刘鑫滨; 李聪; 张旭东
Original assignee: Lenovo Beijing Ltd
Current assignee: Lenovo Beijing Ltd
Priority date: 2024-04-17
Filing date: 2024-04-17
Publication date: 2024-07-16

Abstract

The application discloses an index establishing method, a retrieval method and electronic equipment, wherein the retrieval method comprises the following steps: obtaining first query information; performing word matching processing on the first query information and content information of the data blocks in the data source to obtain first data blocks meeting first matching conditions with the first query information; carrying out semantic matching processing on the first query information and the content information of the data blocks in the data source to obtain second data blocks which meet second matching conditions with the first query information; the data blocks in the data source are the results obtained by partitioning the data files in the data source; determining a retrieval result corresponding to the first query information according to the first data block and the second data block; the search result comprises data pages corresponding to at least part of data blocks in the first data block and the second data block in the data file corresponding to the data source.

Description

Index establishment method, index retrieval method and electronic equipment

Technical Field

The application belongs to the technical field of information retrieval, and particularly relates to an index establishing method, a retrieval method and electronic equipment.

Background

The knowledge base is used for providing data organization, storage and management functions for users such as individuals or enterprises, and the users can find specific information from the knowledge base through information retrieval to perform reading, copying or editing and other applications. However, the current information retrieval technology has various defects such as low retrieval accuracy and fineness, and how to overcome at least some of the defects becomes a technical problem to be solved in the field.

Disclosure of Invention

Therefore, the application discloses the following technical scheme:

An index building method, comprising:

Partitioning a data file in a data source to obtain each data block of the data file; the data source comprises: a knowledge base constructed on an electronic device integrated with a large language model and applicable at least to the large language model;

Constructing first index information and second index information of the data block;

The first index information is used for mapping word information of the data blocks to the data blocks where the word information is located in the data file; the second index information is used for mapping the semantic information of the data blocks to the data blocks corresponding to the semantic information in the corresponding data files.

Optionally, constructing the first index information of the data block includes:

Word segmentation processing is carried out on the data blocks to obtain words contained in the data blocks;

constructing the first index information containing at least part of words of a data block and first corresponding relation information; the first correspondence information includes: correspondence information between each word in the at least part of words and the data file and the data block in the data file;

constructing second index information of the data block, including:

Carrying out semantic extraction on the data block to obtain semantic information of the data block;

Constructing the second index information containing semantic information and second corresponding relation information of the data block; the second correspondence information includes: semantic information of each data block and corresponding relation information between corresponding data files and corresponding data blocks in the corresponding data files;

The first index information and the second index information are used for executing mixed search based on word matching and semantic matching on query information.

Optionally, after the data file in the data source is subjected to the blocking processing, the method further includes:

If the number of the corresponding original data blocks after the corresponding data files in the data source are segmented exceeds a threshold value, clustering is carried out based on semantic information of the corresponding original data blocks of the corresponding data files, and content summarization is carried out on each cluster obtained by clustering through a large language model, so that summarized data blocks containing summarized content are obtained;

Wherein one cluster corresponds to one summary data block; in the case that a summary data block exists, the first index information and the second index information of the data block include the first index information and the second index information of the original data block, and the first index information and the second index information of the summary data block.

A retrieval method, comprising:

Obtaining first query information;

performing word matching processing on the first query information and content information of the data blocks in the data source to obtain first data blocks meeting first matching conditions with the first query information; carrying out semantic matching processing on the first query information and the content information of the data blocks in the data source to obtain second data blocks which meet second matching conditions with the first query information; the data blocks in the data source are the results obtained by partitioning the data files in the data source;

Determining a retrieval result corresponding to the first query information according to the first data block and the second data block; the search result comprises data pages corresponding to at least part of data blocks in the first data block and the second data block in the data file corresponding to the data source.

Optionally, the word matching processing is performed on the first query information and the content information of the data block in the data source to obtain a first data block that meets a first matching condition with the first query information, where the word matching processing includes:

extracting keywords from the first query information to obtain first keywords;

Performing keyword matching on the first keywords and each second keyword contained in the first index information to obtain target keywords which meet keyword matching conditions with the first keywords;

And determining a data block where the target keyword is located in a data file belonging to the data source according to the first index information, and taking the data block as the first data block.

Optionally, the performing semantic matching processing on the first query information and the content information of the data block in the data source to obtain a second data block that meets a second matching condition with the first query information includes:

Carrying out semantic extraction on the first query information to obtain first semantic information;

Performing semantic matching on the first semantic information and each piece of second semantic information contained in the second index information to obtain target semantic information meeting semantic matching conditions with the first semantic information;

and determining a data block where the target semantic information is located in the data file belonging to the data source according to the second index information as the second data block.

Optionally, the determining, according to the first data block and the second data block, a search result that meets a target matching condition with the first query information includes:

Determining a target data block meeting a target matching condition with the first query information according to the first data block and the second data block;

and determining a retrieval result corresponding to the first query information according to the target data block.

Optionally, the data blocks of each data file in the data source include at least one of: original data blocks obtained by partitioning the data file; and summary data blocks containing summarized content, obtained by clustering the original data blocks of the data file based on their semantic information in the case that the number of original data blocks of the data file exceeds a threshold value, and summarizing the content of each cluster obtained by the clustering; wherein one cluster corresponds to one summary data block;

the determining, according to the target data block, a search result corresponding to the first query information includes:

If the corresponding target data block in the target data blocks is an original data block, determining a first data page where the corresponding target data block is located in the data file;

If the corresponding target data block in the target data blocks is a summary data block, performing word matching processing and semantic matching processing on each original data block corresponding to the corresponding target data block on the first query information, so as to determine a target original data block meeting the target matching condition with the first query information from each original data block, and determine a second data page of the target original data block in the data file; the search result includes at least one of the first data page and the second data page.

Optionally, after obtaining the first query information, the method further includes:

Entity extraction is carried out on the first query information through a large language model, so that an entity contained in the first query information is obtained;

Determining a target subgraph associated with the entity from the knowledge graph;

Generating at least one piece of second query information through a large language model according to the target subgraph, so as to respectively carry out word matching processing or semantic matching processing on the first query information and the at least one piece of second query information and the content information of the data block in the data source;

The knowledge graph is constructed based on the extracted triples by performing triples extraction on each data file in the data source.

An electronic device, comprising:

a memory for storing at least one set of computer instructions;

a processor for implementing a method as claimed in any one of the preceding claims by executing the set of instructions stored in the memory.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the related art, the drawings that are required to be used in the embodiments or the related technical descriptions will be briefly described, and it is apparent that the drawings in the following description are only embodiments of the present application, and other drawings may be obtained according to the provided drawings without inventive effort for those skilled in the art.

FIG. 1 is a flowchart of an index building method applied to a data source index building side provided by the present application;

FIG. 2 is another flow chart of the index building method applied to the data source index building side provided by the application;

FIG. 3 is a further flowchart of the index building method provided by the present application applied to the data source index building side;

FIG. 4 is a flow chart of a search method applied to an information search side provided by the application;

FIG. 5 is a flow chart of a word matching process provided by the present application;

FIG. 6 is a flow chart of a semantic matching process provided by the present application;

FIG. 7 is another flow chart of the retrieval method applied to the information retrieval side provided by the present application;

FIG. 8 is a flow chart of the construction of a hybrid index and knowledge graph in an application example provided by the application;

FIG. 9 is a flow chart of information retrieval in an application example provided by the present application;

FIG. 10 is a block diagram showing the construction of a processing device applied to the data source index construction side according to the present application;

FIG. 11 is a block diagram showing the construction of a processing device applied to the information retrieval side according to the present application;

FIG. 12 is a component configuration diagram of an electronic device provided by the present application.

Detailed Description

The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

The application provides an index establishing method, a retrieval method and electronic equipment, which are at least used for solving the problems of low retrieval accuracy and low retrieval fineness in the prior art. The provided method comprises an index establishing method applied to a data source index establishing side and a retrieval method applied to an information retrieval side.

The index establishment method applied to the data source index establishment side may be executed by a first electronic device, and the retrieval method applied to the information retrieval side may be executed by a second electronic device. The first electronic device and the second electronic device may be the same electronic device or different electronic devices; this is not limited and depends on actual requirements. The first electronic device and the second electronic device may be electronic devices in any of numerous general-purpose or special-purpose computing environments or configurations, such as: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, and the like.

Referring to a flowchart of a method shown in fig. 1, an index building method applied to a data source index building side according to an embodiment of the present application at least includes the following processing steps:

Step 101, performing block processing on a data file in a data source to obtain each data block of the data file.

The data source may be a database, such as a knowledge base that includes any one or more types of files, including text documents, PDFs, slides, and spreadsheets.

Further optionally, the data source may include: a knowledge base that is constructed on an electronic device integrated with a large language model and that is applicable at least to the large language model.

Currently, with the advent of large models such as ChatGPT (Chat Generative Pre-trained Transformer), a range of products based on these models is emerging. PC (Personal Computer) manufacturers are also following the trend and bringing out devices that integrate large models on the end side in order to attract users. One of the core applications of an end-side large model is the construction and optimization of a local personal knowledge base. With the current rapid growth in the volume of information, constructing and maintaining a personal knowledge base on the user's own device is important for improving the user's work and learning efficiency. The user's personal knowledge base typically contains various types of files, such as text documents, PDFs, slides, spreadsheets, and the like, which carry the valuable information and knowledge that the user has accumulated during learning and work.

In the above-mentioned contexts, users often need to quickly find specific data information from their personal knowledge base in order to perform operations such as reading, copying, or editing. The index establishing method provided by the application can be used for but not limited to providing rapid, accurate and fine data retrieval support for the application scene, so that a user can rapidly and efficiently extract valuable information from huge data information of a personal knowledge base thereof for use.

According to the embodiment of the application, the data index construction is carried out on the data source by carrying out block processing on the data file in the data source.

Optionally, each data file in the data source may be divided according to a set data size to obtain data blocks whose data size equals the set data size. During the division, any remaining data at the tail of the data file that is smaller than the set data size is divided into one data block; that is, the data size of each data block is smaller than or equal to the set data size. The number of data blocks obtained from each data file depends on the amount of data contained in the data file.
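The fixed-size division described above can be illustrated with a minimal Python sketch. The function name `split_into_blocks` is hypothetical, and measuring the "set data size" in characters is an assumption made only for illustration; the application does not fix the unit.

```python
def split_into_blocks(text: str, block_size: int) -> list[str]:
    """Split text into consecutive blocks of at most block_size characters.

    Every block except possibly the last has exactly block_size characters;
    a shorter tail becomes its own final block.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [text[i:i + block_size] for i in range(0, len(text), block_size)]


# A 10-character file with a set size of 4 yields two full blocks and one tail block.
blocks = split_into_blocks("abcdefghij", 4)
```

Here the block count follows directly from the file length, which matches the statement that the number of data blocks depends on the amount of data in the file.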

Step 102, constructing first index information and second index information of the data block.

After the data blocks of each data file are obtained after the data files are segmented, the first index information and the second index information of the data blocks can be further constructed. The first index information and the second index information are used for executing mixed search based on word matching and semantic matching on query information.

The process of constructing the first index information may be implemented as:

11) Word segmentation processing is performed on the data block to obtain the words contained in the data block.

The word segmentation processing may be performed on each data block using a word segmenter.

12) Constructing first index information containing at least part of the words of the data block and first correspondence information; the first correspondence information includes: correspondence information between each word in the at least part of the words and the data file to which it belongs and the data block in which it is located in that data file.

Optionally, the at least part of the words of the data block may contain keywords of the data block.

The first index information may be, but is not limited to, an inverted index. Taking the case in which the at least part of the words of the data block include the keywords of the data block as an example, after word segmentation is performed on the data block and the keywords in the segmentation result are extracted, the inverted index may be constructed for the extracted keywords. The constructed inverted index may specifically include: the keywords of each data file in the data source, and correspondence information between each keyword, the ID (identifier) of the data file to which the keyword belongs, and the data block ID of the data block in which the keyword is located in that data file.
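As a rough illustration of such an inverted index, the following Python sketch maps each word to the (data file ID, data block ID) pairs in which it occurs. The names `tokenize` and `build_inverted_index` are hypothetical, and the whitespace tokenizer is only a stand-in for a real word segmenter and keyword extractor.

```python
def tokenize(text: str) -> list[str]:
    # Hypothetical stand-in for a word segmenter: lowercase and split on whitespace.
    return text.lower().split()


def build_inverted_index(files: dict[str, list[str]]) -> dict[str, set[tuple[str, int]]]:
    """Map each word to the set of (file ID, block ID) pairs containing it.

    `files` maps a data file ID to the ordered list of its data blocks;
    the position of a block in that list is used as its block ID.
    """
    index: dict[str, set[tuple[str, int]]] = {}
    for file_id, blocks in files.items():
        for block_id, block in enumerate(blocks):
            for word in tokenize(block):
                index.setdefault(word, set()).add((file_id, block_id))
    return index


files = {
    "doc1": ["hybrid retrieval index", "semantic vector index"],
    "doc2": ["keyword retrieval"],
}
inverted = build_inverted_index(files)
```

Looking up `inverted["retrieval"]` then returns every block, across files, in which that word appears, which is the mapping from word information to data blocks that the first index information is meant to provide.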

The process of constructing the second index information may be implemented as:

21) Semantic extraction is performed on the data block to obtain semantic information of the data block.

Specifically, an encoder (Encoder) may be used, without limitation, to extract an embedding vector (embedding) of the data block, and the semantic information of the data block is characterized by the embedding vector of the data block.

22) Constructing the second index information containing the semantic information of the data block and second correspondence information; the second correspondence information includes: correspondence information between the semantic information of each data block and the corresponding data file and the corresponding data block in that data file.

After the semantic information of the data block is obtained, the embodiment further constructs second index information including the semantic information of the data block and the second correspondence information, for example, constructs second index information including the embedded vector of the data block and the second correspondence information.

The second index information may be, but is not limited to, an ANN (Approximate Nearest Neighbor) search index, such as HNSW (Hierarchical Navigable Small World graphs).
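The shape of the second index information can be sketched in Python as a list of (embedding vector, data file ID, data block ID) entries. This is only an illustration under stated assumptions: `toy_encode` is a hypothetical letter-frequency encoder standing in for a real encoder model, and the flat list stands in for an ANN structure such as HNSW, which would store the same vectors and correspondence information but answer nearest-neighbor queries approximately and faster.

```python
import math


def toy_encode(text: str) -> list[float]:
    # Hypothetical stand-in for a real encoder: a unit-length letter-frequency vector.
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]


def build_vector_index(files: dict[str, list[str]]) -> list[tuple[list[float], str, int]]:
    """Return one (embedding, file ID, block ID) entry per data block."""
    entries = []
    for file_id, blocks in files.items():
        for block_id, block in enumerate(blocks):
            entries.append((toy_encode(block), file_id, block_id))
    return entries


files = {
    "doc1": ["hybrid retrieval index", "semantic vector index"],
    "doc2": ["keyword retrieval"],
}
entries = build_vector_index(files)
```

Each entry keeps the vector together with its file and block IDs, so a vector found by similarity search can be mapped back to the data block it came from.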

According to the embodiment, the data files in the data source are subjected to blocking processing, and the first index information and the second index information of the data blocks are constructed, so that the method can be used for performing mixed retrieval based on word matching and semantic matching on user query information, and the mixed retrieval based on the word matching and the semantic matching is beneficial to accurately searching required data information from the data source, so that the accuracy of data retrieval can be improved. In addition, the embodiment can make the information retrieval granularity fine to the data block level by carrying out the blocking processing on the data file and constructing the first index information/the second index information which can be used for mapping the word information/the semantic information of the data block to the data block corresponding to the word information/the semantic information in the corresponding data file, thereby being convenient for positioning according to the retrieved data block and returning the retrieval result with fine granularity to the user, such as positioning and returning the data page where the data block is located to the user.

In an alternative embodiment, referring to the flowchart of the index creating method shown in fig. 2, the index creating method applied to the data source index creating side provided in the embodiment of the present application may further include, after step 101, the following processing:

Step 201, if the number of corresponding original data blocks after the corresponding data files in the data source are segmented exceeds a threshold value, clustering is performed based on semantic information of the corresponding original data blocks of the corresponding data files, and content summarization is performed on each cluster obtained by clustering through a large language model, so that summarized data blocks containing summarized content are obtained.

Wherein one cluster corresponds to one summary data block.

In the case that a summary data block exists, the first index information and the second index information of the data block include the first index information and the second index information of the original data block, and the first index information and the second index information of the summary data block.

It is easy to understand that the first index information of the summary data block is used for mapping word information in the summary data block to the summary data block where the word information is located in the affiliated data file, and specifically may include at least part of words of the summary data block, and correspondence information between each word in the at least part of words and the affiliated data file and the summary data block where the word information is located in the affiliated data file.

Similarly, the second index information of the summary data blocks is used for mapping the semantic information of the summary data blocks to the summary data blocks corresponding to the semantic information in the corresponding data files, and specifically may include the semantic information of the summary data blocks and the corresponding relation information between the semantic information of each summary data block and the corresponding data file and between the semantic information of each summary data block and the corresponding summary data block in the corresponding data file.

The completion of the index construction of the data blocks contained in each data file can be regarded as the completion of registration of the data file.

In the case that the summary data block exists, the index information of the summary data block and the index information of the original data block form a hierarchical (two-layer) index architecture, and can support the hierarchical search of the data block in the subsequent information search.

In this embodiment, under the condition that the number of corresponding original data blocks after the corresponding data file is segmented exceeds a threshold value, the clustering and content summarizing processing are performed on the original data blocks of the corresponding data file, and corresponding first index information and second index information are constructed for the summarized data blocks, so as to form the hierarchical index architecture, so that the problem that the search workload is increased due to the excessive number of the data blocks of the data file can be avoided, and the search workload can be reduced and the search efficiency can be improved by performing hierarchical search on the data blocks in the data file.
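A minimal Python sketch of this threshold-triggered clustering and summarization follows. The application does not specify a clustering algorithm, so the greedy seed-based clustering, the `min_sim` parameter, and all function names are assumptions; `summarize` is a hypothetical placeholder for the large language model call.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def greedy_cluster(vectors: list[list[float]], min_sim: float) -> list[list[int]]:
    """Put each vector into the first cluster whose seed (first member) is similar enough."""
    clusters: list[list[int]] = []
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine(vectors[cluster[0]], vec) >= min_sim:
                cluster.append(i)
                break
        else:
            clusters.append([i])  # no similar seed found: start a new cluster
    return clusters


def summarize(texts: list[str]) -> str:
    # Hypothetical placeholder for the large-language-model summarization step.
    return "SUMMARY: " + " | ".join(texts)


def build_summary_blocks(blocks, vectors, max_blocks, min_sim=0.8):
    """Return one summary block per cluster, or nothing if the file is small enough."""
    if len(blocks) <= max_blocks:
        return []
    return [
        {"summary": summarize([blocks[i] for i in members]), "members": members}
        for members in greedy_cluster(vectors, min_sim)
    ]


blocks = ["a1", "a2", "b1", "b2"]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
summary_blocks = build_summary_blocks(blocks, vectors, max_blocks=3)
```

Each summary block records which original data blocks it covers, which is what allows a later search to descend from a matched summary data block to its original data blocks.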

In an alternative embodiment, referring to the flowchart of the method shown in fig. 3, the method for establishing an index applied to a data source index establishment side according to the embodiment of the present application may further include the following processing:

Step 301, extracting triples of all data files in the data source through a large language model.

Step 302, constructing a knowledge graph based on the extracted triples.

The execution sequence of steps 301-302, 101, and 102 is not limited, and steps 301-302 may be executed between steps 101 and 102, steps 301-302 may be executed before step 101 or after step 102, or steps 301-302 may be executed alternately with steps 101 and 102, depending on the actual application.

The triples extracted from each data file by the large language model are triples formed by entities and relationships between entities, i.e. "entity-relationship-entity" triples.

After the triples of each data file in the data source are extracted, the "entity-relationship-entity" triples are further assembled into a corresponding knowledge graph. In the constructed knowledge graph, nodes represent entities, and an edge between two nodes represents the relationship between the entities corresponding to those nodes.
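The graph construction can be sketched in Python as an adjacency mapping built from triples. The function name `build_knowledge_graph` and the example triples are hypothetical and used only for illustration; in the described method the triples would come from the large language model.

```python
def build_knowledge_graph(triples):
    """Build an adjacency mapping: entity -> list of (relationship, neighbor entity).

    Every entity appearing in a triple becomes a node, including entities
    that only ever appear as the tail of a relationship.
    """
    graph = {}
    for head, relation, tail in triples:
        graph.setdefault(head, []).append((relation, tail))
        graph.setdefault(tail, [])
    return graph


# Illustrative "entity-relationship-entity" triples.
triples = [
    ("Alice", "authored", "Report"),
    ("Report", "mentions", "Budget"),
]
graph = build_knowledge_graph(triples)
```

With this representation, the neighbors of an entity can be read directly from `graph[entity]`, which is the kind of lookup needed when selecting a subgraph associated with an entity found in a query.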

In this embodiment, the triples of each data file in the data source are extracted and a knowledge graph is constructed from them. This facilitates subsequent expansion of the user's query information based on the knowledge graph, and helps avoid missed results, or results that deviate from the user's needs, caused by inaccurate query information input by the user.

Once index construction is complete, hybrid retrieval based on word matching and semantic matching can be performed on the data in the data source using the constructed hybrid index (the first index information and the second index information). This hybrid retrieval is implemented by the retrieval method applied to the information retrieval side provided by the embodiment of the present application. Referring to the method flowchart shown in fig. 4, the retrieval method at least includes the following processing steps:

Step 401, obtaining first query information.

The first query information is query information input by a user for searching according to requirements, and specifically may include one or more words, such as one or more keywords, input by the user for searching, or may also include a piece of content description information input by the user for reflecting the requirements.

Step 402, performing word matching processing on the first query information and content information of the data blocks in the data source to obtain first data blocks meeting first matching conditions with the first query information; carrying out semantic matching processing on the first query information and the content information of the data blocks in the data source to obtain second data blocks which meet second matching conditions with the first query information; the data blocks in the data source are the results obtained by partitioning the data files in the data source.

The sequence of execution of the word matching process and the semantic matching process is not limited, and in practical application, either process may be executed first based on a serial manner, and then the other process may be executed, or both processes may be executed simultaneously based on a parallel manner.
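The parallel option can be illustrated with a small Python sketch using the standard-library thread pool. The functions `word_match` and `semantic_match` are hypothetical stubs that return fixed (data file ID, data block ID) sets; real implementations would consult the first and second index information.

```python
from concurrent.futures import ThreadPoolExecutor


def word_match(query: str) -> set[tuple[str, int]]:
    # Hypothetical stub for word matching against the first index information.
    return {("doc1", 0)}


def semantic_match(query: str) -> set[tuple[str, int]]:
    # Hypothetical stub for semantic matching against the second index information.
    return {("doc1", 1)}


def hybrid_match(query: str):
    """Run both matching processes concurrently and return (first blocks, second blocks)."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        word_future = pool.submit(word_match, query)
        semantic_future = pool.submit(semantic_match, query)
        return word_future.result(), semantic_future.result()


first_blocks, second_blocks = hybrid_match("example query")
```

Running the two calls one after the other would give the same result; the thread pool only changes when the work happens, not what is returned.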

Alternatively, the first matching condition may include a keyword matching condition, and the second matching condition may include a semantic matching condition.

Referring to an exemplary word matching process flow diagram provided in FIG. 5, the process of word matching may be implemented as:

Step 501, extracting keywords from the first query information to obtain a first keyword.

Wherein, for the case that the first query information includes at least one word, each word included in the first query information may be directly used as the first keyword for retrieval, or a part of words may be extracted from each word included in the first query information as the first keyword for retrieval.

For the case that the first query information includes a piece of content description information, a word segmenter may be used, without limitation, to segment the content description information. Words in the segmentation result that are meaningless for retrieval (such as prepositions, interjections, etc.) are filtered out, and then all or part of the remaining words are used as the first keywords for retrieval.

Without being limited thereto, a term and/or a phrase in the piece of content description information may also be extracted, and the extracted term and/or phrase may be used as a first keyword for retrieval.

That is, the keyword used for retrieval in the embodiment of the present application is a keyword in a broad sense: a word, a term, a phrase, or the like may be used as a keyword for retrieval.
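The filtering described above can be sketched in Python as follows. The stop-word list `STOP_WORDS`, the whitespace-based segmentation, and the function name `extract_keywords` are assumptions for illustration only; a real system would use a proper word segmenter and a fuller stop-word list.

```python
# Illustrative stop words: articles, prepositions and an interjection.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "for", "to", "oh"}


def extract_keywords(query: str) -> list[str]:
    """Segment the query, drop stop words and duplicates, and keep the original order."""
    seen: set[str] = set()
    keywords: list[str] = []
    for raw in query.lower().split():
        word = raw.strip(".,!?;:")  # remove surrounding punctuation
        if word and word not in STOP_WORDS and word not in seen:
            seen.add(word)
            keywords.append(word)
    return keywords


first_keywords = extract_keywords("The budget of the project, for 2024!")
```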

Step 502, performing keyword matching on the first keywords and each second keyword contained in the first index information to obtain target keywords which meet keyword matching conditions with the first keywords.

The respective second keywords included in the first index information may refer to respective words included in the first index information.

The matching method used in keyword matching may include at least one of exact matching and fuzzy matching.

The keyword matching condition may mean that the similarity between the keywords reaches a set similarity threshold.

Optionally, where the first index information is an inverted index, the BM25 algorithm may be used, without limitation, to perform word retrieval on the inverted index content based on the first keyword, so as to implement keyword matching between the first keyword and each second keyword contained in the inverted index and obtain the target keyword that meets the keyword matching condition with the first keyword.
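For reference, a minimal Python sketch of Okapi BM25 scoring over data blocks is given below. It uses the common non-negative IDF variant and the conventional default parameters k1 = 1.5 and b = 0.75; these choices, the whitespace tokenization, and the function name `bm25_scores` are assumptions, since the application names only the BM25 algorithm without further detail.

```python
import math
from collections import Counter


def bm25_scores(query_terms, blocks, k1=1.5, b=0.75):
    """Return one BM25 score per block for the given query terms."""
    docs = [block.lower().split() for block in blocks]
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Number of blocks in which each term occurs at least once.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


blocks = ["hybrid retrieval index", "semantic vector index", "keyword retrieval"]
scores = bm25_scores(["retrieval", "keyword"], blocks)
```

One way to relate this to the keyword matching condition would be to keep only blocks whose score exceeds a set threshold, although the application leaves that choice open.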

Step 503, determining, according to the first index information, a data block in which the target keyword is located in a data file belonging to the data source, as the first data block.

As can be seen from the above description, the first index information includes, in addition to at least part of the words of the data block, the first correspondence information. The first correspondence information specifically includes: correspondence information between each word in the at least part of the words and the data file to which it belongs and the data block in which it is located in that data file.

On the basis of obtaining a target keyword which meets the keyword matching condition with the first keyword, the data block of the target keyword in the data file belonging to the data source can be determined according to the first corresponding relation information contained in the first index information, and the determined data block is used as a first data block which meets the first matching condition with the first query information of the user.
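Steps 502 and 503 can be sketched as follows, under stated assumptions: the inverted index maps each word to the (data file, data block) pairs containing it, only exact matching is shown (fuzzy matching is omitted), and scoring uses one common BM25 variant with the customary defaults `k1=1.5` and `b=0.75`. The function names and the index layout are hypothetical and not taken from the application.

```python
import math
from collections import Counter, defaultdict


def build_inverted_index(chunks):
    """chunks: {(file_id, chunk_id): [words]} -> (postings, chunk lengths).

    postings[word][(file_id, chunk_id)] = term frequency, which plays the role
    of the first correspondence information (word -> data file and data block).
    """
    postings = defaultdict(dict)
    lengths = {}
    for key, words in chunks.items():
        lengths[key] = len(words)
        for word, tf in Counter(words).items():
            postings[word][key] = tf
    return postings, lengths


def bm25_search(keywords, postings, lengths, k1=1.5, b=0.75):
    """Return [((file_id, chunk_id), score)] sorted by descending BM25 score."""
    n = len(lengths)
    if n == 0:
        return []
    avg_len = sum(lengths.values()) / n
    scores = defaultdict(float)
    for word in keywords:
        hits = postings.get(word, {})  # exact match against indexed words
        if not hits:
            continue
        idf = math.log(1 + (n - len(hits) + 0.5) / (len(hits) + 0.5))
        for key, tf in hits.items():
            norm = tf + k1 * (1 - b + b * lengths[key] / avg_len)
            scores[key] += idf * tf * (k1 + 1) / norm
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


chunks = {
    ("manual.pdf", 0): ["warranty", "period", "two", "years"],
    ("manual.pdf", 1): ["battery", "care", "tips"],
    ("faq.pdf", 0): ["battery", "warranty", "claims", "battery"],
}
postings, lengths = build_inverted_index(chunks)
results = bm25_search(["warranty", "period"], postings, lengths)
print(results)
```

Each returned key already identifies the data file and the data block, so a data block meeting the first matching condition could then be selected by a score threshold or a top-k cut.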

Referring to an exemplary semantic matching process flow chart provided in fig. 6, the process of semantic matching may be implemented as:

Step 601, performing semantic extraction on the first query information to obtain first semantic information.

Specifically, an encoder may be used, without limitation, to extract an embedded vector (embedding) of the first query information, and the semantic information of the first query information is represented by the extracted embedded vector.

Step 602, performing semantic matching on the first semantic information and each piece of second semantic information contained in the second index information to obtain target semantic information meeting a semantic matching condition with the first semantic information.

Alternatively, the semantic matching between the first semantic information and each second semantic information contained in the second index information may be achieved by calculating a semantic distance between the first semantic information and each second semantic information contained in the second index information.

The semantic matching condition may, but is not limited to, mean that the semantic similarity between different semantic information reaches a set similarity threshold. Based on the semantic matching condition, the second semantic information, of which the semantic similarity with the first semantic information reaches a similarity threshold, in each piece of second semantic information contained in the second index information can be specifically determined to be the target semantic information.

Semantic similarity may be measured based on semantic distance between different semantic information. The larger the semantic distance between different semantic information, the lower the semantic similarity.

The semantic distance may be, but is not limited to, a cosine distance.

Where the second index information is an ANN index, for example, a semantic search may be performed in the ANN index using the cosine metric based on the first semantic information, thereby implementing semantic matching between the first semantic information and each piece of second semantic information included in the second index information.

Here, the cosine metric measures the degree of similarity of two objects by calculating the cosine distance between them.

Step 603, determining, according to the second index information, a data block where the target semantic information is located in a data file belonging to the data source as the second data block.

As can be seen from the above description, the second index information includes, in addition to the semantic information of the data blocks, second correspondence information. The second correspondence information specifically includes correspondence information between the semantic information of each data block and the corresponding data file, and the corresponding data block within that data file.

On the basis of obtaining target semantic information meeting the semantic matching condition with the first semantic information, the data block corresponding to the target semantic information in the data file corresponding to the data source can be determined according to the second corresponding relation information contained in the second index information, and the determined data block is used as a second data block meeting the second matching condition with the first query information of the user.
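Steps 602 and 603 can be sketched as follows. This is only an illustration: a brute-force scan stands in for the ANN index, the embeddings are toy two-dimensional vectors rather than encoder outputs, and the names `cosine_similarity` and `semantic_match` as well as the threshold value are assumptions made for the example.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two vectors; cosine distance is 1 minus this value."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_match(query_vec, index, threshold=0.7):
    """index: {(file_id, chunk_id): embedding}.

    Returns the data blocks whose similarity to the query reaches the
    threshold, most similar first. The dictionary keys act as the second
    correspondence information (semantic information -> file and block).
    """
    hits = []
    for key, vec in index.items():
        sim = cosine_similarity(query_vec, vec)
        if sim >= threshold:
            hits.append((key, sim))
    hits.sort(key=lambda kv: kv[1], reverse=True)
    return hits


index = {
    ("a.pdf", 0): [1.0, 0.0],
    ("a.pdf", 1): [0.0, 1.0],
    ("b.pdf", 0): [0.8, 0.6],
}
matches = semantic_match([1.0, 0.0], index, threshold=0.7)
print(matches)
```

A real ANN index such as HNSW would replace the linear scan to avoid comparing the query against every stored embedding.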

Step 403, determining a search result corresponding to the first query information according to the first data block and the second data block; the search result comprises data pages corresponding to at least part of data blocks in the first data block and the second data block in the data file corresponding to the data source.

This step may be further implemented as:

31) Determining, according to the first data block and the second data block, a target data block that meets a target matching condition with the first query information.

Optionally, the union of a first data block set formed by the first data blocks and a second data block set formed by the second data blocks may be determined, and each data block in the union is scored. On this basis, the data blocks in the union that meet a scoring condition are determined and used as the target data blocks that meet the target matching condition with the first query information.

The scoring condition may be, but is not limited to, set to any one of the following:

a. the score corresponding to the data block reaches a score threshold;

b. the score corresponding to the data block belongs to the top_k in the descending score sequence.

The score descending sequence is a sequence obtained by sorting the data blocks in the union according to the descending order of the corresponding scores.

The higher the score of a data block, the higher the correlation between the data block and the first query information.

In implementation, a weight may be assigned to each of the word matching mode and the semantic matching mode. When a data block in the union is scored, its score may be calculated from the keyword matching degree between the data block and the first query information, the first weight corresponding to the word matching mode, the semantic matching degree between the data block and the first query information, and the second weight corresponding to the semantic matching mode. An exemplary score calculation formula is as follows:

Score = (Md₁ × w₁ + Md₂ × w₂) × 100%

where Score represents the score of a data block, Md₁ represents the keyword matching degree between the data block and the first query information, Md₂ represents the semantic matching degree between the data block and the first query information, w₁ represents the first weight corresponding to the word matching mode, and w₂ represents the second weight corresponding to the semantic matching mode.
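The scoring of the union and the two scoring conditions can be sketched as follows. The sketch assumes that both matching degrees are normalized to the range 0 to 1 and that a data block found by only one matching mode has a matching degree of 0 for the other mode; neither assumption is stated in the text. The function names and weight values are illustrative.

```python
def hybrid_scores(keyword_md, semantic_md, w1=0.5, w2=0.5):
    """Score every data block in the union of the two result sets.

    keyword_md / semantic_md: {chunk_id: matching degree in [0, 1]}.
    Implements Score = (Md1 * w1 + Md2 * w2) * 100%, expressed in percent.
    """
    union = set(keyword_md) | set(semantic_md)
    return {
        chunk: (keyword_md.get(chunk, 0.0) * w1
                + semantic_md.get(chunk, 0.0) * w2) * 100
        for chunk in union
    }


def select_targets(scores, top_k=None, threshold=None):
    """Apply scoring condition a (score threshold) and/or b (top_k)."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if threshold is not None:
        ranked = [(c, s) for c, s in ranked if s >= threshold]
    if top_k is not None:
        ranked = ranked[:top_k]
    return ranked


keyword_md = {"c1": 0.9, "c2": 0.5}
semantic_md = {"c2": 0.8, "c3": 0.7}
scores = hybrid_scores(keyword_md, semantic_md)
print(select_targets(scores, top_k=2))
```

In this example c2 is found by both matching modes and therefore ranks first, ahead of c1, which matched on keywords only.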

32) Determining, according to the target data block, the retrieval result corresponding to the first query information.

After determining the target data block meeting the target matching condition with the first query information, the data page where the target data block is located in the data file can be further determined, and the related information of the determined data page can be used as a search result corresponding to the first query information to be returned to the user interface for the user to use.

The related information of the data page may include, but is not limited to, the number, the name, the content digest of the data file to which the data page belongs, the page number corresponding to the data page in the data file to which the data page belongs, the content digest of the data page, and the like.

In this embodiment, mixed retrieval based on word matching and semantic matching is performed on the data in the data source using the constructed first index information and second index information, so that the user query information is matched against the content in the data source along two different dimensions, literal and semantic. The required data information can thus be found accurately in the data source, which improves the accuracy of data retrieval. In addition, because retrieval is performed at the granularity of data blocks, the data pages in the data file that are related to the user query information can be located precisely.

In an alternative embodiment, the data blocks of each data file in the data source include at least one of: original data blocks obtained by partitioning the data file; and summary data blocks containing summarized content, obtained, when the number of original data blocks of the data file exceeds a threshold, by clustering the semantic information of the original data blocks of the data file and summarizing the content of each cluster obtained by clustering; wherein one cluster corresponds to one summary data block.

Accordingly, in the retrieval method on the information retrieval side, step 32), that is, determining the retrieval result corresponding to the first query information according to the target data block, may be implemented as follows:

41) If a given target data block among the target data blocks is an original data block, determining a first data page in which that target data block is located in the data file.

42) If a given target data block among the target data blocks is a summary data block, performing word matching processing and semantic matching processing between the first query information and each original data block corresponding to that target data block, so as to determine, from those original data blocks, a target original data block that meets the target matching condition with the first query information, and determining a second data page in which the target original data block is located in the data file.

Where a given target data block is a summary data block, this embodiment takes the original data blocks corresponding to that summary data block as the search range and continues the mixed retrieval based on word matching and semantic matching within that range, so as to find the original data block related to the first query information of the user and then locate the second data page where that original data block resides. This process is essentially a hierarchical search based on a hierarchical index architecture. In the first-stage search, the summary data blocks related to the first query information of the user are found among the different summary data blocks. In the second-stage search, the search range is narrowed to the original data blocks contained in the summary data blocks found in the first stage, and the original data blocks related to the first query information of the user are found among them, so that the data pages of those original data blocks in the data file can be located.

On this basis, the search result corresponding to the first query information may be obtained according to the processing results of steps 41) to 42). The search result corresponding to the first query information includes at least one of the first data page and the second data page.
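Steps 41) and 42) can be sketched as a two-level lookup. The `Chunk` class, the `locate_pages` function, and the `search_within` callback are hypothetical names: the callback stands in for the mixed word and semantic retrieval restricted to a given set of original data blocks, which is not re-implemented here.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Chunk:
    chunk_id: str
    page: Optional[int] = None          # None for summary data blocks
    children: List["Chunk"] = field(default_factory=list)

    @property
    def is_summary(self) -> bool:
        # Assumption: a summary data block is one that has child blocks.
        return bool(self.children)


def locate_pages(targets: List[Chunk],
                 search_within: Callable[[List[Chunk]], List[Chunk]]) -> List[int]:
    """Return the data pages for the target data blocks, without duplicates."""
    pages = []
    for chunk in targets:
        if not chunk.is_summary:
            pages.append(chunk.page)                 # step 41): first data page
        else:
            # step 42): second-stage search limited to the child blocks
            for original in search_within(chunk.children):
                pages.append(original.page)          # second data page
    return list(dict.fromkeys(pages))                # keep first occurrences


o1 = Chunk("o1", page=3)
o2 = Chunk("o2", page=7)
o3 = Chunk("o3", page=9)
summary = Chunk("s1", children=[o2, o3])

# Toy second-stage search: pretend only the block on page 9 matches the query.
pages = locate_pages([o1, summary], lambda blocks: [b for b in blocks if b.page == 9])
print(pages)
```

Because the second-stage search only sees the children of the matched summary data block, its cost depends on the cluster size rather than on the total number of data blocks in the file.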

In this embodiment, the data in the data source is retrieved hierarchically based on the pre-built hierarchical index architecture, which avoids the increase in retrieval workload caused by a data file having too many data blocks, thereby reducing the retrieval workload and improving retrieval efficiency.

In an alternative embodiment, referring to the flowchart of the method shown in fig. 7, the searching method applied to the information searching side provided by the present application may further include the following processing after step 401:

Step 701, performing entity extraction on the first query information through a large language model to obtain the entity contained in the first query information.

The large language model may implement entity extraction of the first query information by performing Named Entity Recognition (NER) processing on the first query information.

Step 702, determining a target subgraph associated with the entity from the knowledge graph.

Specifically, an association depth value when the knowledge graph is subjected to sub-graph retrieval based on the entity can be preset, and on the basis, the entity extracted from the first query information is utilized to perform sub-graph retrieval on the knowledge graph according to the set association depth value, so that a target sub-graph associated with the entity is obtained.

And the maximum association depth between the entity in the target subgraph and the entity in the first query information does not exceed the set association depth value.

For example, assume that the entity in the first query information is entity o₁ and the preset association depth value is 2; then the association depth value between any entity in the target subgraph and entity o₁ is less than or equal to 2. The entities in the target subgraph then include entity o₂ of the knowledge graph, which is directly connected to o₁ (o₂ and o₁ are directly associated, so the association depth value between them is 1), and entities o₃ and o₄, which are directly connected to o₂ but not directly connected to o₁ (the association depth values between o₃ and o₁ and between o₄ and o₁ are both 2).
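The depth-limited subgraph retrieval of step 702 can be sketched as a breadth-first traversal. The sketch assumes the knowledge graph is given as an undirected adjacency mapping and that association depth means the number of edges on the shortest path to the query entity; the function name `subgraph_within_depth` and the extra entity o5 are introduced only for illustration.

```python
from collections import deque


def subgraph_within_depth(graph, seed, max_depth):
    """Return {entity: association depth} for entities within max_depth of seed.

    graph: {entity: set of directly connected entities}.
    """
    depths = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if depths[node] == max_depth:
            continue  # do not expand beyond the set association depth
        for neighbour in graph.get(node, ()):
            if neighbour not in depths:
                depths[neighbour] = depths[node] + 1
                queue.append(neighbour)
    return depths


# o1 - o2, o2 - o3, o2 - o4, o3 - o5 (o5 lies at depth 3 and is excluded)
graph = {
    "o1": {"o2"},
    "o2": {"o1", "o3", "o4"},
    "o3": {"o2", "o5"},
    "o4": {"o2"},
    "o5": {"o3"},
}
depths = subgraph_within_depth(graph, "o1", max_depth=2)
print(depths)
```

The returned entities, together with the edges among them, would form the target subgraph that is passed to the large language model in the next step.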

Step 703, generating the at least one piece of second query information through a large language model according to the target subgraph, so as to perform word matching processing or semantic matching processing on the first query information and the at least one piece of second query information and the content information of the data block in the data source respectively.

Specifically, the target subgraph and the first query information can be input into a large language model, and the large language model expands the first query information according to the target subgraph to obtain at least one piece of expanded second query information.

For example, for the input first query information "How many components does country xxx include?", the large language model may, according to the input target subgraph, expand it into "How many geographic regions does country xxx include?", "How many provinces does country xxx include?", "How many cities does country xxx include?", and the like.

Correspondingly, the step 402 of the information retrieval side retrieval method according to the present application may be implemented as:

Step 704, performing word matching processing on the first query information and the at least one piece of second query information and content information of the data blocks in the data source to obtain first data blocks meeting first matching conditions with the first query information or the corresponding second query information; and carrying out semantic matching processing on the first query information and the at least one piece of second query information and the content information of the data blocks in the data source to obtain second data blocks which meet second matching conditions with the first query information or the corresponding second query information.

The process of performing word matching processing or semantic matching processing on the second query information and the content information of the data blocks in the data source is essentially the same as that for the first query information; the only difference is the query information used. For details, refer to the above description of performing word matching processing or semantic matching processing on the first query information and the content information of the data blocks in the data source, which is not repeated here.

According to the method, the first query information input by the user is expanded by utilizing the large language model, the first query information and at least one piece of expanded second query information are used as query basis, the data in the data source are subjected to mixed retrieval based on word matching and semantic matching, the problem of missing retrieval caused by insufficient comprehensiveness of the first query information input by the user can be effectively avoided, the problem that a retrieval result deviates from the user requirement caused by insufficient accuracy (namely, the requirement cannot be accurately embodied) of the first query information input by the user is avoided, and the accuracy and comprehensiveness of information retrieval are further improved.

An example of an application of the method of the application is provided below.

In this example, the data source is a user personal knowledge base, abbreviated as user knowledge base, built on the electronic device integrated with the large language model.

The method mainly comprises a preparation stage and a retrieval stage.

1. Preparation stage

The task of the preparation stage mainly comprises the steps of constructing a mixed index and a knowledge graph for a user knowledge base, referring to a construction flow chart of the mixed index and the knowledge graph shown in fig. 8, and the task flow of the preparation stage comprises:

(1) Cut the data files in the user knowledge base into data blocks (chunks).

(2) Construct a mixed index and a knowledge graph based on the chunks, specifically comprising the following steps a) to c):

a) After extracting the embedded vector (embedding) of each chunk using an encoder, construct an ANN index (e.g., HNSW) based on the embeddings;

b) After segmenting each chunk into words using a word segmenter, construct an inverted index based on the segmentation results;

c) Extract triples from the chunks using a large language model (LLM), and construct a knowledge graph based on the triples.

(3) Judge whether the number of original chunks of the data file exceeds a threshold; if the threshold is not exceeded, index construction, that is, data file registration, is complete.

(4) If the threshold is exceeded, cluster the embeddings of the chunks.

(5) Summarize all chunks in each cluster obtained by clustering using the LLM to obtain summary chunks containing the summarized content, and construct a mixed index for the summary chunks through step (2).
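Steps (3) to (5) can be sketched as follows. The text does not name a clustering algorithm, so the sketch swaps in a simple greedy grouping by cosine similarity to each cluster's first member; the `embed` and `summarize` callbacks are hypothetical stand-ins for the encoder and the LLM, and `max_chunks` and `sim_threshold` are illustrative parameters.

```python
import math


def _cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def build_summary_chunks(chunks, embed, summarize, max_chunks=4, sim_threshold=0.8):
    """Return [(summary_text, member_indices)], one entry per cluster.

    Returns an empty list when the number of original chunks does not exceed
    max_chunks, in which case no summary chunks are needed (step (3)).
    """
    if len(chunks) <= max_chunks:
        return []
    clusters = []  # each entry: (representative embedding, member indices)
    for i, text in enumerate(chunks):
        vec = embed(text)
        for rep, members in clusters:
            if _cos(rep, vec) >= sim_threshold:
                members.append(i)       # join the first sufficiently close cluster
                break
        else:
            clusters.append((vec, [i]))  # start a new cluster
    # One cluster corresponds to one summary chunk (step (5)).
    return [(summarize([chunks[i] for i in members]), members)
            for _, members in clusters]


def toy_embed(text):
    # Toy encoder: one axis for battery-related text, another for the rest.
    return [1.0, 0.0] if "battery" in text else [0.0, 1.0]


def toy_summarize(texts):
    # Toy summarizer: concatenation in place of an LLM summary.
    return " | ".join(texts)


docs = ["battery life", "screen size", "battery care"]
summaries = build_summary_chunks(docs, toy_embed, toy_summarize, max_chunks=2)
print(summaries)
```

The member indices record which original chunks each summary chunk covers, which is what the later second-stage search over a summary chunk's children relies on.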

2. Retrieval phase

Referring to the information retrieval flow diagram shown in fig. 9, the information retrieval flow at this stage may be implemented as:

(1) Obtain a user query, and extract the entities in the user query using the LLM based on NER.

It is easy to understand that the user query is the first query information described above.

(2) Perform subgraph retrieval in the knowledge graph for the entities extracted by the LLM, according to the set association depth (such as the maximum depth, max depth, of the knowledge graph).

(3) Input the retrieved target subgraph and the user query into the LLM for query expansion (that is, query information expansion) to obtain n extended queries.

The n extended queries are the at least one piece of second query information described above.

Wherein n is an integer greater than or equal to 1.

(4) Perform mixed search based on the user query and the n extended queries respectively.

Optionally, word retrieval based on word matching may be performed on the inverted index content using the BM25 algorithm, based on the keywords in the user query and the n extended queries; and semantic retrieval based on semantic matching may be performed in the ANN index using the cosine metric, based on the embeddings of the user query and the n extended queries. This yields a mixed search result from the two retrieval modes, which comprises the union of the set of chunks obtained by word retrieval and the set of chunks obtained by semantic retrieval.

(5) Score each chunk in the mixed search result, rerank (rearrange) the chunks based on their scores, and select at least one target chunk whose score is in the top_k.

Here, k is an integer greater than or equal to 1.

(6) Detect whether the target chunk is a summary chunk (that is, a chunk not obtained by the original segmentation); if not, locate the data page where the target chunk resides.

(7) If the target chunk is a summary chunk, the chunk content does not exist in the source file and is more abstract, summarized content. Accordingly, the mixed search based on word matching and semantic matching continues within the original chunks contained in the summary chunk, and the data page where each retrieved original chunk resides is located. Note that the search range is reduced in this round: only the child nodes of the retrieved summary chunk, that is, the original chunks it contains, are searched.

Finally, the relevant information of the data page positioned in the steps (6) - (7) is returned to the user interface so as to be used for the user to view, copy or edit the content of the positioned data page.

Corresponding to the above-mentioned index establishing method applied to the data source index establishing side, the embodiment of the present application further provides a processing device applied to the data source index establishing side, where the composition structure of the processing device is shown in fig. 10, and the processing device includes:

The block module 1001 is configured to perform a block processing on a data file in a data source, so as to obtain each data block of the data file; the data source comprises: a knowledge base constructed on an electronic device integrated with a large language model and applicable at least to the large language model;

a construction module 1002, configured to construct first index information and second index information of a data block;

In an alternative embodiment, the construction module 1002, when constructing the first index information of the data block, is specifically configured to:

The construction module 1002, when constructing the second index information of the data block, is specifically configured to:

In an alternative embodiment, the apparatus further comprises: an additional processing module for:

After the data files in the data sources are subjected to blocking processing, if the number of the corresponding original data blocks after the corresponding data files in the data sources are blocked exceeds a threshold value, clustering is performed on the basis of semantic information of the corresponding original data blocks of the corresponding data files, and content summarization is performed on each cluster obtained through clustering through a large language model, so that summarized data blocks containing summarized content are obtained;

In an alternative embodiment, the construction module 1002 is further configured to:

extracting triples of all data files in the data source through a large language model;

and constructing a knowledge graph based on the extracted triples.

Corresponding to the above-mentioned searching method applied to the information searching side, the embodiment of the present application further provides a processing device applied to the information searching side, where the composition structure is as shown in fig. 11, and the processing device includes:

An obtaining module 1101, configured to obtain first query information;

The retrieval module 1102 is configured to obtain a first data block that meets a first matching condition with the first query information by performing word matching processing on the first query information and content information of the data block in the data source; carrying out semantic matching processing on the first query information and the content information of the data blocks in the data source to obtain second data blocks which meet second matching conditions with the first query information; the data blocks in the data source are the results obtained by partitioning the data files in the data source;

A determining module 1103, configured to determine a search result corresponding to the first query information according to the first data block and the second data block; the search result comprises data pages corresponding to at least part of data blocks in the first data block and the second data block in the data file corresponding to the data source.

In an alternative embodiment, the retrieving module 1102 is specifically configured to, when performing word matching processing on the first query information and content information of a data block in the data source to obtain a first data block that meets a first matching condition with the first query information:

extracting keywords from the first query information to obtain first keywords;

In an alternative embodiment, the retrieving module 1102 is specifically configured to, when performing semantic matching processing on the first query information and content information of a data block in the data source to obtain a second data block that meets a second matching condition with the first query information:

In an alternative embodiment, the determining module 1103 is specifically configured to:

In an alternative embodiment, the data blocks of each data file in the data source include at least one of: original data blocks obtained by partitioning the data file; and summary data blocks containing summarized content, obtained, when the number of original data blocks of the data file exceeds a threshold, by clustering the semantic information of the original data blocks of the data file and summarizing the content of each cluster obtained by clustering; wherein one cluster corresponds to one summary data block;

The determining module 1103 is specifically configured to, when determining, according to the target data block, a search result corresponding to the first query information:

In an optional embodiment, the apparatus further includes an information expansion module configured to:

Generating at least one piece of second query information through a large language model according to the target subgraph, so that the retrieval module carries out word matching processing or semantic matching processing on the first query information and the at least one piece of second query information and the content information of the data blocks in the data source respectively;

The embodiment of the application also provides an electronic device, and the composition structure of the electronic device, as shown in fig. 12, at least includes:

a memory 10 for storing a set of computer instructions;

the set of computer instructions may be implemented in the form of a computer program.

The processor 20 is configured to implement an index building method applied to a data source index building side, or a retrieval method applied to an information retrieval side, as provided in any of the above method embodiments, by executing a set of computer instructions.

The processor 20 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), a digital signal processor (DSP), a field-programmable gate array (FPGA), a neural network processor (NPU), a deep learning processor (DPU), or another programmable logic device, etc.

The electronic device is provided with a display device and/or a display interface through which an external display device can be connected.

Optionally, the electronic device further includes a camera assembly, and/or an external camera assembly is connected thereto.

In addition, the electronic device may include communication interfaces, communication buses, and the like. The memory, processor and communication interface communicate with each other via a communication bus.

The communication interface is used for communication between the electronic device and other devices. The communication bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like, and may be classified as an address bus, a data bus, a control bus, etc.

It should be noted that the embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another.

For convenience of description, the above system or apparatus is described as being functionally divided into various modules or units, respectively. Of course, the functions of each element may be implemented in the same piece or pieces of software and/or hardware when implementing the present application.

From the above description of embodiments, it will be apparent to those skilled in the art that the present application may be implemented in software plus a necessary general hardware platform. Based on such understanding, the technical solution of the present application, in essence or in the part that makes the inventive contribution, may be embodied in the form of a software product. The software product may be stored in a storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the method described in the embodiments, or in portions of the embodiments, of the present application.

Finally, it is further noted that relational terms such as first, second, third, fourth, and the like are used herein to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

The foregoing is merely a preferred embodiment of the present application. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present application, and such modifications and adaptations are also intended to fall within the scope of the present application.

Claims

1. An index building method, comprising:

2. The index creating method according to claim 1, building first index information of the data block, comprising:

constructing second index information of the data block, including:

3. The index creating method according to claim 1, further comprising, after the data file in the data source is subjected to the blocking process:

4. A retrieval method, comprising:

Obtaining first query information;

5. The retrieval method according to claim 4, wherein the step of obtaining the first data block satisfying the first matching condition with the first query information by performing word matching processing on the first query information and content information of the data block in the data source includes:

extracting keywords from the first query information to obtain first keywords;

6. The retrieval method according to claim 4, wherein the obtaining the second data block satisfying the second matching condition with the first query information by performing semantic matching processing on the first query information and content information of the data block in the data source includes:

7. The retrieval method according to claim 4, wherein the determining, according to the first data block and the second data block, a retrieval result satisfying a target matching condition with the first query information includes:

8. The retrieval method of claim 7, the data blocks of each data file in the data source comprising at least one of: original data blocks obtained by partitioning the data file; and summary data blocks containing summarized content, obtained, when the number of original data blocks of the data file exceeds a threshold, by clustering the semantic information of the original data blocks of the data file and summarizing the content of each cluster obtained by clustering; wherein one cluster corresponds to one summary data block;

9. The retrieval method according to any one of claim 4, further comprising, after obtaining the first query information:

10. An electronic device, comprising:

a memory for storing at least one set of computer instructions;

A processor for implementing the index building method according to any one of claims 1-3 or the retrieval method according to any one of claims 4-9 by executing the set of instructions stored in the memory.