CN112925912A - Text processing method, and synonymous text recall method and device - Google Patents
- Publication number
- CN112925912A (application number CN202110220258.2A)
- Authority
- CN
- China
- Prior art keywords
- text
- synonymous
- texts
- synonymy
- search
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Creation or modification of classes or clusters
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/374—Thesaurus
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/247—Thesauruses; Synonyms
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present disclosure provides a text processing method for a text database, relating to computer technical fields such as search, natural language processing, and deep learning. The text processing method comprises the following steps: acquiring the feature vectors of all texts in a text database; classifying and clustering all the texts according to their feature vectors to obtain a plurality of synonymous text clusters, wherein each synonymous text cluster comprises a plurality of texts having synonymy relations; for each synonymous text cluster, determining one text from the cluster to serve as the representative text corresponding to that cluster; and creating a target query index of the text database according to the feature vectors of all the representative texts. The disclosure also provides a synonymous text recall method and apparatus, an electronic device, and a computer-readable medium.
Description
Technical Field
The present disclosure relates to the field of computer technologies such as search, natural language processing, and deep learning, and in particular, to a text processing method for a text database, a method and an apparatus for recalling synonymous text, an electronic device, and a computer-readable medium.
Background
In application scenarios in the search field, a search engine can provide three key text matching services for advertisers to meet different advertisement promotion requirements: exact matching, phrase matching, and broad matching. Exact matching means that the user's search requirement (query) is fully consistent with the keyword text (also called a keyword or auction word) or a synonymous variant thereof; owing to its precise traffic-targeting capability, it remains an extremely important matching mode in search engines.
In the advertisement mechanism of a search engine, the number of keyword texts in the trigger system is extremely large, which poses great challenges for recall and matching: the recall efficiency of the system is negatively correlated with the number of keyword texts to be retrieved. When the trigger system is subject to preset recall-efficiency and storage constraints (which aim to reduce the system's average response time and its storage and computing resource costs), the limited set of keyword texts reduces the keyword coverage of the trigger system, which in turn leads to a decline in revenue.
Currently, in order to implement search recall in a search engine, synonymous text recall is generally implemented by directly retrieving the full set of keyword texts in the trigger system with the search requirement (query).
Disclosure of Invention
The disclosure provides a text processing method for a text database, a synonymous text recall method and apparatus, an electronic device, a computer-readable medium, and a computer program product.
According to a first aspect of the present disclosure, there is provided a text processing method for a text database, comprising: acquiring the feature vectors of all texts in the text database; classifying and clustering all the texts according to their feature vectors to obtain a plurality of synonymous text clusters, wherein each synonymous text cluster comprises a plurality of texts having synonymy relations; for each synonymous text cluster, determining one text from the cluster to serve as the representative text corresponding to that cluster; and creating a target query index of the text database according to the feature vectors of all the representative texts.
According to a second aspect of the present disclosure, there is provided a method for recalling synonymous text, the recall method being implemented based on a target query index of a text database created by the above-mentioned text processing method, the recall method comprising: acquiring a search request, the search request comprising a search text; acquiring the feature vector corresponding to the search text; inputting the feature vector of the search text into the target query index to query a representative text matching the search text; and taking the representative text and all texts in the synonymous text cluster corresponding to the representative text as the synonymous texts of the search text for search recall.
According to a third aspect of the present disclosure, there is provided a text processing apparatus, comprising: a first vector acquisition module configured to acquire the feature vectors of all texts in a text database; a text classification module configured to classify and cluster all the texts according to their feature vectors to obtain a plurality of synonymous text clusters, each synonymous text cluster comprising a plurality of texts having synonymy relations; a screening module configured to determine, for each synonymous text cluster, one text from the cluster to serve as the representative text corresponding to that cluster; and a building module configured to create a target query index of the text database according to the feature vectors of all the representative texts.
According to a fourth aspect of the present disclosure, there is provided a synonymous text recall apparatus, comprising: a request acquisition module configured to acquire a search request, the search request comprising a search text; a second vector acquisition module configured to acquire the feature vector corresponding to the search text; a query module configured to input the feature vector of the search text into a target query index corresponding to a text database so as to query a representative text matching the search text, the target query index being created by the above-mentioned text processing method; and a text recall module configured to take the representative text and all texts in the synonymous text cluster corresponding to the representative text as the synonymous texts of the search text for search recall.
According to a fifth aspect of the present disclosure, there is provided an electronic device comprising: at least one processor, and a memory communicatively coupled to the at least one processor; wherein the memory stores one or more computer programs executable by the at least one processor, the one or more computer programs being executable by the at least one processor to enable the at least one processor to perform the method provided by any of the above aspects.
According to a sixth aspect of the present disclosure, there is provided a computer readable medium having a computer program stored thereon, wherein the computer program when executed implements the method provided in any of the above aspects.
According to a seventh aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method provided by any of the above aspects.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The accompanying drawings are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the principles of the disclosure and not to limit the disclosure. The above and other features and advantages will become more apparent to those skilled in the art by describing in detail exemplary embodiments thereof with reference to the attached drawings, in which:
fig. 1 is a flowchart of a text processing method for a text database according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a model structure of a metric representation model;
FIG. 3 is a flowchart of one specific implementation of step S2 in FIG. 1;
FIG. 4 is a schematic diagram of a neighbor search technique;
FIG. 5 is a diagram illustrating a model structure of a synonymous discriminant model;
FIG. 6 is a flowchart of one embodiment of step S24 of FIG. 3;
FIG. 7 is a schematic illustration of a connectivity sub-diagram;
FIG. 8 is a flowchart of one embodiment of step S3 of FIG. 1;
FIG. 9 is a flow chart of another text processing method provided by the embodiments of the present disclosure;
FIG. 10 is a flowchart of a method for recalling synonymous text according to an embodiment of the present disclosure;
FIG. 11 is a flowchart of another method for recalling synonymous text according to the embodiment of the present disclosure;
fig. 12 is a block diagram illustrating a text processing apparatus according to an embodiment of the present disclosure;
FIG. 13 is a block diagram illustrating a synonymous text recall device according to an exemplary embodiment of the present disclosure;
fig. 14 is a block diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
To facilitate a better understanding of the technical aspects of the present disclosure, exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, wherein various details of the embodiments of the present disclosure are included to facilitate an understanding, and they should be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Embodiments of the present disclosure and features of embodiments may be combined with each other without conflict.
As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and the present disclosure, and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
At present, in order to implement search recall in a search engine, the full set of keyword texts is generally retrieved in the trigger system directly with the search requirement (query), thereby implementing synonymous text recall. However, this recall manner of directly retrieving the full set of keyword texts in the trigger system is time-consuming and has low recall efficiency, which limits recall capability; moreover, the trigger system needs to construct a query index based on the full set of keyword texts, which consumes a large amount of storage resources.
Therefore, the disclosed embodiments provide a text processing method for a text database, a method and an apparatus for recalling synonymous text, an electronic device, a computer readable medium, and a computer program product, so as to effectively reduce the text space of query indexes of the text database, save storage resources, improve retrieval and recall efficiency, and improve recall capability.
It should be noted that in the search field, the term "recall" refers to acquiring a matched text or document related to a search text or document input by a user.
Fig. 1 is a flowchart of a text processing method of a text database according to an embodiment of the present disclosure.
Referring to fig. 1, an embodiment of the present disclosure provides a text processing method of a text database, where the text processing method may be performed by a text processing apparatus, the text processing apparatus may be implemented in software and/or hardware, and the text processing apparatus may be integrated in an electronic device such as a server, and the text processing method includes:
and step S1, acquiring the feature vectors of all texts in the text database.
In the embodiment of the present disclosure, the text database may be a query database constructed for implementing an accurate search in a search engine system, or may be a database of a trigger system in an advertisement mechanism of a search engine, where the text in the database of the trigger system is a key text (also referred to as a keyword or an auction word).
And step S2, classifying and clustering all texts according to the feature vectors of all texts to obtain a plurality of synonymous text clusters, wherein each synonymous text cluster comprises a plurality of texts with synonymous relations.
Step S3, for each synonymous text cluster, determining a text from the synonymous text cluster as a representative text corresponding to the synonymous text cluster.
And step S4, creating a target query index of the text database according to the feature vectors of all the representative texts.
In the embodiment of the present disclosure, the query index of the text database may be created in any suitable index creation manner, and the embodiment of the present disclosure does not specifically limit the creation manner of the index. For example, the Hierarchical Navigable Small World (HNSW) algorithm, a graph-based approximate nearest-neighbor vector index algorithm, may be used to create the target query index of the text database based on the feature vectors of all the representative texts.
The text processing method for a text database provided by the embodiments of the present disclosure classifies and clusters the texts in the text database according to their synonymy relations and selects a representative text of each cluster to construct the query index of the text database. On the one hand, this effectively reduces the text space of the query index of the text database and saves storage resources. On the other hand, when the text database is used for retrieval, the full set of texts in the text database does not need to be searched; only the representative texts of the clusters need to be retrieved through the target query index, which effectively improves retrieval efficiency and search-recall efficiency. Furthermore, each representative text and all texts in its cluster can be recalled together, which improves search recall capability and effectively avoids missed recalls.
In some embodiments, in step S1, a feature vector of each text in the text database is obtained using a preset metric representation model. The metric representation model can be a language model obtained by deep learning algorithm training, the input of the metric representation model is a text, and the output of the metric representation model is vector representation of the text, namely a feature vector of the text.
In some embodiments, the metric representation model is a model constructed based on a multi-layer Transformer structure; for example, the metric representation model is implemented by using a BERT (Bidirectional Encoder Representations from Transformers) model, which converts a text input into a feature vector output. In some embodiments, the metric representation model is a BERT pre-trained model, i.e., a BERT base model obtained by pre-training on a massive text corpus.
FIG. 2 is a schematic diagram of the model structure of the metric representation model. As shown in FIG. 2, "Single Sentence" is the input single-sentence text; Tok i denotes the i-th symbol (Token) in the input text, i = 1, 2, 3, …, N, where N is a positive integer; E denotes an embedding vector, and Ei denotes the embedding vector of the i-th symbol (Token); the output C is the feature vector, and Ti denotes the feature vector obtained after the i-th symbol (Token) is processed by BERT.
It should be noted that, the embodiment of the present disclosure does not specifically limit the specific implementation manner of the metric representation model, as long as the feature vector of the text can be obtained.
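To illustrate the role the metric representation model plays — text in, fixed-length feature vector out — the sketch below substitutes a trivial character-bigram hashing encoder for BERT. The function name `embed`, the 64-dimension vector size, and the hashing scheme are illustrative assumptions, not part of the disclosure; any encoder that maps a text to a fixed-length vector could stand in the same position.

```python
import hashlib
import math

DIM = 64  # illustrative vector size; a BERT encoder would output e.g. 768 dims


def embed(text: str) -> list:
    """Toy stand-in for the metric representation model: hash character
    bigrams of the text into a fixed-length vector, then L2-normalize it."""
    vec = [0.0] * DIM
    for i in range(len(text) - 1):
        bigram = text[i:i + 2]
        h = int(hashlib.md5(bigram.encode("utf-8")).hexdigest(), 16)
        vec[h % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine(u, v):
    """Cosine similarity; the vectors produced by embed() are unit-length,
    so a plain dot product suffices."""
    return sum(a * b for a, b in zip(u, v))
```

Texts with overlapping wording land in overlapping hash buckets, so near-duplicate queries score higher than unrelated ones — mirroring, very coarsely, what the learned feature vectors provide.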
Fig. 3 is a flowchart of a specific implementation manner of step S2 in fig. 1, and as shown in fig. 3, step S2 may further include steps S21 to S24.
And step S21, creating an initial query index of the text database according to the feature vectors of all the texts in the text database.
In the embodiment of the present disclosure, the query index of the text database may be created in any suitable index creation manner, and the embodiment of the present disclosure does not specifically limit the creation manner of the index. The initial query index may be created, for example, using the HNSW algorithm.
Step S22, for each text in the text database, querying out a text matching the text through the initial query index, and generating initial synonymy relationship information, where the initial synonymy relationship information includes the text and the text matching the text.
It will be appreciated that the initial query index is created based on feature vectors of text, with the input being feature vectors of text and the output being feature vectors of text that match the input feature vectors.
Specifically, in step S22, in the initial query index, a text matching the text is queried by using a neighbor search technique to generate initial synonym relationship information.
In some embodiments, the neighbor retrieval technique employs the HNSW algorithm. Fig. 4 is a schematic diagram of a neighbor search technique, and as shown in fig. 4, "I" represents a feature vector of an input text, "M" represents a feature vector of a text stored in each layer structure of an initial query index, and "O" represents a feature vector of a text searched by using the neighbor search technique. As shown in fig. 4, in some embodiments, in step S22, in the initial query index, a search is started from the top Layer (e.g., Layer2 shown in fig. 4) by using a neighbor search technique, and after finding the node closest to the input feature vector in the current Layer, the next Layer is entered, and the starting node of the next Layer search is the closest node of the previous Layer, and the loop is repeated until the query result (e.g., "O" shown in fig. 4) is found.
In some embodiments, in step S22, a text matching a given text refers to a text whose feature vector is within a predetermined distance of the given text's feature vector, or the text whose feature-vector distance to the given text is smallest among all texts. The matched text may be a single text or a plurality of texts, as determined by the actual retrieval conditions.
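Steps S21–S22 can be sketched with a brute-force nearest-neighbor scan standing in for the HNSW initial query index. The helper name `candidate_pairs` and the similarity threshold are illustrative assumptions; a production system would use an approximate index such as HNSW rather than this O(n²) scan:

```python
def dot(u, v):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def candidate_pairs(vectors: dict, min_sim: float):
    """vectors: {text: unit-length feature vector}. Returns the initial
    synonymy-relation pairs (text, matched_text) whose similarity is at
    least min_sim, i.e. whose vector distance is within the preset bound."""
    texts = list(vectors)
    pairs = []
    for i, t in enumerate(texts):
        for u in texts[i + 1:]:
            if dot(vectors[t], vectors[u]) >= min_sim:
                pairs.append((t, u))
    return pairs
```

Each returned pair corresponds to one piece of initial synonymy relation information: a text together with a text matched to it by the index.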
And step S23, carrying out synonymy discrimination on the text in each piece of initial synonymy relation information by using a preset synonymy discrimination model, and removing the initial synonymy relation information which does not meet the synonymy relation in all pieces of initial synonymy relation information.
In this disclosure, a minimum distance between feature vectors, or a distance smaller than the predetermined distance, indicates only that a synonymy relation between the two texts may be preliminarily presumed; it does not guarantee that the two texts truly have a synonymy relation. Therefore, to further improve the accuracy of identifying synonymy relations between texts, in some embodiments, in step S23, a preset synonymy discriminant model is used to perform synonymy discrimination on the texts in each piece of initial synonymy relation information: if the texts in a piece of initial synonymy relation information do not actually satisfy the synonymy relation, that piece is removed, while the pieces that actually satisfy the synonymy relation are retained.
Specifically, in step S23, for each piece of initial synonymy relation information, the feature vectors of any two texts in the information are input into the preset synonymy discriminant model, which calculates the similarity (degree of synonymy) of the two feature vectors and identifies whether the two texts satisfy (have) the synonymy relation — for example, by judging whether the similarity is greater than or equal to a preset similarity threshold: if so, the synonymy relation is judged to be satisfied; if not, it is judged not to be satisfied. If the two texts satisfy the synonymy relation, both are retained; otherwise, a text that does not satisfy the synonymy relation with any other text is removed.
In some embodiments, the synonymy discriminant model may be a language model trained by a deep learning algorithm, which takes the two texts to be discriminated as input and outputs the discrimination result of their synonymy relation.
In some embodiments, the synonymy discriminant model is a classification model constructed based on a multi-layer Transformer structure. In some embodiments, the synonymy discriminant model is a BERT classification model obtained by fine-tuning the BERT base model (i.e., the BERT pre-trained model described above), so as to realize classification prediction of whether a synonymous-variant relation between two texts is satisfied.
FIG. 5 is a diagram illustrating the model structure of the synonymy discriminant model. As shown in FIG. 5, "Sentence 1" is the first input text and "Sentence 2" is the second input text; Tok i denotes the i-th symbol (Token) in the input text, i = 1, 2, 3, …, N (or M), where N and M are positive integers; E[CLS] denotes an embedding vector corresponding to the input Sentence 1, E[SEP] denotes an embedding vector corresponding to the input Sentence 2, and Ei denotes the embedding vector of the i-th symbol (Token); C is the feature vector of Sentence 1, T[SEP] is the feature vector of Sentence 2, and Ti denotes the feature vector obtained after the i-th symbol (Token) is processed by BERT. The Class Label denotes the final output, taking the value 0 or 1, where 1 indicates that the two input texts satisfy the synonymy relation and 0 indicates that they do not.
For example, if the input text pair is "how much money does double-eyelid surgery cost" and "price of cutting double eyelids", the synonymy discriminant model predicts that the pair satisfies the synonymy relation; if the input text pair is "how much money does double-eyelid surgery cost" and "does cutting double eyelids hurt", the model predicts that the pair does not satisfy the synonymy relation.
In the model training process of the synonymy discriminant model, for the feature vectors of the two input texts, the similarity of the two texts can be calculated through a preset measurement function, so that whether the two texts meet the synonymy relation or not is discriminated. The metric function may be, for example, a Cosine (COS) function or a dot product function.
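A minimal sketch of the metric-function step described above, assuming a cosine measure and an illustrative 0.8 threshold. The actual synonymy discriminant model is a trained BERT classifier; the `is_synonymous` helper below only mimics its thresholded 0/1 Class Label output:

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (assumed nonzero)."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def is_synonymous(vec1, vec2, threshold: float = 0.8) -> int:
    """Return 1 (synonymy relation satisfied) if the similarity of the two
    feature vectors reaches the threshold, else 0; the threshold value is
    an illustrative assumption."""
    return 1 if cosine(vec1, vec2) >= threshold else 0
```

A dot-product metric would behave identically here for unit-length vectors, which is why the disclosure names either function as a candidate measure.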
In the embodiments of the present disclosure, it is understood that "synonymous" means that the meanings of the texts are the same or substantially the same.
It should be noted that, in the embodiment of the present disclosure, a specific implementation manner of the synonymy discriminant model is not particularly limited, as long as whether the text pairs have the synonymy relationship can be identified.
And step S24, dividing the initial synonymous relation information with intersection in the rest initial synonymous relation information into a type of initial synonymous relation information, and taking each type of initial synonymous relation information as a synonymous text cluster.
For example, text a has a synonymous relationship with text B, text B has a synonymous relationship with text C, and text B has a synonymous relationship with text D, then text A, B, C, D may be categorized as being in a synonymous text cluster.
It is understood that a synonymous text cluster refers to a collection of texts having a synonymous relationship, and there is no intersection between different synonymous text clusters.
Fig. 6 is a flowchart of an embodiment of step S24 in fig. 3, and as shown in fig. 6, in some embodiments, step S24 may further include steps S241 to S243.
And S241, regarding the remaining initial synonymous relationship information, taking each text in the initial synonymous relationship information as a node, taking a matching relationship between the texts in the initial synonymous relationship information as an edge, and constructing an Euler graph.
Specifically, for the remaining initial synonymous relationship information, each text in the initial synonymous relationship information is used as a node, and the node corresponding to each text is connected with the node corresponding to the text matched with the text to form an edge, so that an euler graph is constructed.
Step S242, determining all connected subgraphs in the Euler graph by using a preset connected subgraph discovery algorithm.
In some embodiments, the preset connected-subgraph discovery algorithm comprises a union-find (disjoint-set) algorithm, through which all connected subgraphs in the Euler graph can be found. In the process of discovering connected subgraphs, the scale of a single connected subgraph is limited: for example, if the number of connected nodes exceeds a preset threshold, further connectivity expansion is stopped, avoiding the possibility of error amplification caused by over-long path depth between nodes.
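The union-find discovery of connected subgraphs with a cap on component size, as described for step S242, can be sketched as follows; the class and function names and the default cap are illustrative assumptions:

```python
class UnionFind:
    """Disjoint-set structure for discovering connected subgraphs, with an
    optional cap on component size to stop over-long synonymy chains."""

    def __init__(self, max_size: int = 1000):
        self.parent = {}
        self.size = {}
        self.max_size = max_size

    def find(self, x):
        self.parent.setdefault(x, x)
        self.size.setdefault(x, 1)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return True
        # Stop expanding once the merged component would exceed the cap,
        # limiting error amplification along long synonymy paths.
        if self.size[ra] + self.size[rb] > self.max_size:
            return False
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return True


def clusters(pairs, max_size=1000):
    """Group synonymy pairs into connected subgraphs (synonymous text clusters)."""
    uf = UnionFind(max_size)
    for a, b in pairs:
        uf.union(a, b)
    groups = {}
    for node in uf.parent:
        groups.setdefault(uf.find(node), set()).add(node)
    return list(groups.values())
```

Feeding in the pairs retained after step S23 yields one set per connected subgraph, matching the example in FIG. 7 where A–B, B–C, B–D merge into a single cluster.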
And S243, determining the initial synonymy relation information with intersection according to the connected subgraph to generate a synonymy text cluster.
It is understood that each connected subgraph mined by the step S242 corresponds to a synonymous text cluster.
Fig. 7 is a schematic diagram of connected subgraphs. As shown in Fig. 7, by way of example, after the processing in step S23, in step S241 text A matches both text B and text C in the remaining initial synonymy relation information, so A is connected with B and C; text B matches both text D and text E, so B is connected with D and E. Finally, it is determined in steps S242 and S243 that texts A, B, C, D, E are connected to each other and form one connected subgraph; similarly, texts F, G, H form a connected subgraph, and texts J, K, L, P form a connected subgraph.
For example, the plurality of synonym text clusters finally determined through the steps S21 to S24 are, for example: [ how much money is spent on double eyelid surgery, price for double eyelid surgery, how much money is spent on double eyelid cutting surgery, how much money is spent on double eyelid surgery ], [ things to be noticed for eyebrow tattooing, things to be noticed for eyebrow tattooing ], [ how much money is spent on eyebrow tattooing once, and price for eyebrow tattooing ].
Fig. 8 is a flowchart illustrating an embodiment of step S3 in fig. 1, and as shown in fig. 8, in some embodiments, step S3 may further include steps S31 to S32.
Step S31, determining, for each text in each synonymous text cluster, the number of synonymy relationships corresponding to the text in the synonymous text cluster.
It is understood that the corresponding synonymy relation number of the text in the synonymy text cluster is the number of texts having synonymy relations with the text in the synonymy text cluster.
In some embodiments, a preset synonymy discriminant model may be used to identify a text having a synonymy relationship with the text in the synonymy text cluster, so as to determine a corresponding synonymy relationship number of the text in the synonymy text cluster. For the description of the preset synonymous discriminant model, reference may be made to the description of the synonymous discriminant model, which is not repeated herein.
And step S32, taking any text with the largest number of synonymy relationships in the synonymous text cluster as the representative text corresponding to the synonymous text cluster.
In some embodiments, if multiple texts in the synonymous text cluster tie for the largest number of synonymy relationships, then for each of these texts the synonymy discriminant model is first used to calculate the degree of synonymy (similarity) of each synonymy relationship of the text; next, the average degree of synonymy of the text is calculated over all of its synonymy relationships; finally, the averages of the candidate texts are compared, and the text with the highest average degree of synonymy is selected as the representative text of the synonymous text cluster.
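Steps S31 and S32 together with the tie-breaking rule above can be sketched as follows. The helper `pick_representative`, the `similarity` callable (a stand-in for the synonymy discriminant model) and the `threshold` value are hypothetical names introduced for illustration, not part of the disclosure.

```python
def pick_representative(cluster, similarity, threshold=0.8):
    """Pick a representative text for a synonymous text cluster (sketch).

    `similarity` returns a degree of synonymy in [0, 1] for a text pair;
    pairs scoring at or above `threshold` count as a synonymy relationship.
    """
    best_text, best_key = None, None
    for text in cluster:
        scores = [similarity(text, other) for other in cluster if other != text]
        relations = [s for s in scores if s >= threshold]
        # Primary key: number of synonymy relationships (steps S31-S32);
        # tie-break: average degree of synonymy over those relationships.
        mean = sum(relations) / len(relations) if relations else 0.0
        key = (len(relations), mean)
        if best_key is None or key > best_key:
            best_text, best_key = text, key
    return best_text
```

With a toy word-overlap similarity, the text sharing the most words with the rest of its cluster wins; in production the scores would come from the discriminant model instead.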
In some embodiments, instead of determining the representative text through steps S31 and S32 above, one text may be randomly selected from each synonymous text cluster as the representative text of that cluster.
Fig. 9 is a flowchart of another text processing method provided in an embodiment of the present disclosure, and as shown in fig. 9, in some embodiments, to ensure a synonymous relationship between a representative text and each text in a synonymous text cluster and reduce a usage error of the synonymous text cluster in an actual retrieval scenario, before step S4, the text processing method may further include steps S33 to S34.
Step S33, for each text in each synonymous text cluster, identifying whether the text and the representative text corresponding to the synonymous text cluster have a synonymy relationship by using a preset synonymy discriminant model.
Specifically, the synonymy discriminant model calculates the similarity between the feature vector of the text and the feature vector of the representative text, and identifies whether a synonymy relationship exists between the two according to the similarity, for example by determining whether the similarity is greater than or equal to a preset similarity threshold: if so, a synonymy relationship is determined to exist; if not, no synonymy relationship is determined to exist. For the description of the preset synonymy discriminant model, reference may be made to the earlier description, which is not repeated here.
And step S34, under the condition that the text is identified not to have the synonymy relation with the representative text corresponding to the synonymy text cluster, the text is removed from the synonymy text cluster.
And in the case that the text is identified to have the synonymy relation with the representative text corresponding to the synonymy text cluster, the text is retained, and in the case that the text is identified to have no synonymy relation with the representative text corresponding to the synonymy text cluster, the text is removed from the synonymy text cluster. Therefore, the synonymy relation between the representative text and each text in the synonymy text cluster is ensured, and the use error of the synonymy text cluster in an actual retrieval scene is reduced.
In the embodiment of the disclosure, the target query index of the text database is created based on the representative text, so that in the retrieval process, the whole amount of text in the text database does not need to be retrieved, and only the representative text needs to be retrieved, so that on one hand, the text space of the index is reduced, the storage resource is saved, and on the other hand, the retrieval and recall efficiency is greatly improved.
Fig. 10 is a flowchart of a method for recalling synonymous text according to an embodiment of the present disclosure.
The embodiment of the disclosure provides a synonymous text recalling method, which may be executed by a synonymous text recalling device, which may be implemented in a software and/or hardware manner, and which may be integrated in an electronic device such as a server.
Referring to fig. 10, the method for recalling synonymous text may be implemented based on a target query index of a text database, where the target query index is created by using the text processing method described above, and the method for recalling synonymous text includes:
step S5, obtaining a search request, where the search request includes a search text.
In some embodiments, a user's search request (query) is received in real-time through an online environment.
In some embodiments, in step S5, a search request (query) input by the user on the interactive system is obtained. The interactive system may be an intelligent interactive system such as an intelligent terminal, a platform, an application, a client and the like capable of providing an intelligent interactive service for a user, for example, an intelligent sound box, an intelligent video sound box, an intelligent story machine, an intelligent interactive platform, an intelligent interactive application, a search engine, a question and answer system. The embodiment of the present disclosure does not particularly limit the implementation manner of the interactive system as long as the interactive system can interact with the user.
In the embodiment of the present disclosure, the "interaction" may include voice interaction and text interaction, where the voice interaction is implemented based on technologies such as voice recognition, voice synthesis, and natural language understanding, and in various practical application scenarios, an interactive system is given an intelligent human-computer interaction experience of "being able to listen, speak, and understand you", and the voice interaction is applicable to a plurality of application scenarios, including scenarios such as intelligent question answering, intelligent playing, and intelligent searching. The character interaction is realized based on the technologies of character recognition, extraction, natural language understanding and the like, and can be also suitable for a plurality of application scenes.
In some embodiments, in step S5, the user may input the search request through a voice interactive manner, and after obtaining the voice information input by the user, the voice information may be subjected to operations such as voice recognition, voice conversion of words, and the like, so as to obtain the corresponding search text.
In some embodiments, in step S5, the user may also input the search request in a text interaction manner, and when the user inputs text information, the text information input by the user may be directly obtained, where the text information is the search text. The text information refers to natural language type text.
And step S6, acquiring the feature vector corresponding to the search text.
In the embodiment of the present disclosure, after the search text of the user is obtained, in step S6, a feature vector corresponding to the search text may be obtained by using a preset metric representation model. For a specific description of the metric representation model, reference may be made to the above description of the metric representation model, and details are not repeated here.
And step S7, inputting the feature vector of the search text into a target query index to query out a representative text matched with the search text.
In the embodiment of the present disclosure, after the feature vector of the search text is obtained, in step S7, the feature vector of the search text is input into a pre-created target query index, and a representative text matching the search text is queried in the target query index by using a neighbor search technique. For a detailed description of the neighbor search technique, reference may be made to the above description of the neighbor search technique, and details are not repeated here.
And step S8, taking all texts in the representative text and the synonymy text cluster corresponding to the representative text as the synonymy texts of the search text for searching and recalling.
As an example, assume that there is a synonym text cluster in the text database as: [ how much money is required for the double-eyelid surgery, the price for the double-eyelid surgery, how much money is required for the double-eyelid surgery, and how much money is required for the double-eyelid surgery ], wherein the "price for the double-eyelid surgery" is a representative text corresponding to the synonymous text cluster, and the search text is "how much money is required for the double-eyelid surgery", and the representative text matched with the "how much money is required for the double-eyelid surgery" is searched in the step S7, and the "price for the double-eyelid surgery", "how much money is required for the double-eyelid surgery", and "how much money is required for the double-eyelid surgery" in the corresponding synonymous text cluster are all used as the synonymous texts of the search text.
Fig. 11 is a flowchart of another method for recalling synonymous text according to an embodiment of the present disclosure, and as shown in fig. 11, in some embodiments, before step S8, the method for recalling synonymous text may further include step S71.
And S71, identifying whether the search text and the representative text matched with the search text have a synonymy relation by using a preset synonymy discriminant model, if so, executing the step S8, otherwise, not performing further processing.
For specific description of the synonymous discriminant model, reference may be made to the description of the synonymous discriminant model, which is not repeated herein.
In step S71, in a case where it is recognized that the search text and the representative text matching the search text have a synonymous relationship, step S8 is executed; if the search text and the representative text matched with the search text are identified not to have the synonymous relationship, the representative text database does not have the text synonymous with the search text, so that the synonymous text recall is not carried out and the further processing is not carried out.
In some embodiments, the text processing method may be performed in an offline environment, while the synonymous text recall method may be performed in real-time in an online environment.
In the embodiment of the present disclosure, in the method for recalling the synonymous text, only the representative text matched with the search text of the query needs to be retrieved, and the entire amount of texts in the database does not need to be retrieved, so that the retrieval efficiency and the recall efficiency can be effectively improved, and the synonymy judgment is performed on the search text of the query and the retrieved representative text by using the synonymy judgment model, so that the recall quality can be effectively improved. When the representative text is searched, the representative text and all texts in the corresponding synonymous text cluster can be effectively recalled, and the phenomenon of missed recall is effectively avoided.
Fig. 12 is a block diagram of a text processing apparatus according to an embodiment of the present disclosure.
Referring to fig. 12, an embodiment of the present disclosure provides a text processing apparatus 300, where the text processing apparatus 300 includes: a first vector acquisition module 301, a text classification module 302, a filtering module 303, and a construction module 304.
The first vector acquisition module 301 is configured to acquire feature vectors of all texts in the text database; the text classification module 302 is configured to classify and cluster all the texts according to the feature vectors of all the texts to obtain a plurality of synonymous text clusters, wherein the synonymous text clusters include a plurality of texts with a synonymous relationship; the screening module 303 is configured to determine, for each synonymous text cluster, one text from the synonymous text cluster as a representative text corresponding to the synonymous text cluster; the building module 304 is configured to create a target query index of the text database from all feature vectors representing text.
In some embodiments, the apparatus 300 further comprises an in-cluster text filtering module (not shown) configured to: aiming at each text in each synonymous text cluster, identifying whether the text and a representative text corresponding to the synonymous text cluster have a synonymous relationship by using a preset synonymous judgment model; and under the condition that the text is identified not to have the synonymy relation with the representative text corresponding to the synonymy text cluster, removing the text from the synonymy text cluster.
It should be noted that, the text processing apparatus provided in the embodiment of the present disclosure is used to implement the text processing method provided in any one of the embodiments, and for specific description of the text processing apparatus, reference may be made to the description in the embodiment above, and details are not repeated here.
Fig. 13 is a block diagram illustrating a synonymous text recall device according to an embodiment of the present disclosure.
Referring to fig. 13, an embodiment of the present disclosure provides a synonymous text recall device 400, where the synonymous text recall device 400 includes: a request acquisition module 401, a second vector acquisition module 402, a query module 403, and a text recall module 404.
The request obtaining module 401 is configured to obtain a search request, where the search request includes search text; the second vector obtaining module 402 is configured to obtain a feature vector corresponding to the search text; the query module 403 is configured to input the feature vector of the search text into a target query index corresponding to the text database to query out a representative text matching the search text; wherein, the target query index is created by adopting the text processing method; the text recall module 404 is configured to recall all texts in the representative text and the synonymous text cluster corresponding to the representative text as the synonymous text of the search text.
In some embodiments, the apparatus 400 further includes a synonym identification module (not shown in the figure), which is configured to identify whether the search text and the representative text matching with the search text have a synonymy relationship by using a preset synonym discriminant model after the query module 403 queries the representative text matching with the search text, and in a case that the search text and the representative text matching with the search text are identified to have a synonymy relationship, trigger the text recall module 404 to perform a step of recalling all texts in a synonym text cluster corresponding to the representative text and the representative text as the synonym text of the search text.
It should be noted that the synonymous text recall device provided in the embodiment of the present disclosure is used for implementing the synonymous text recall method provided in any one of the embodiments, and specific descriptions of the synonymous text recall device may refer to the descriptions in the embodiments, and are not described herein again.
The present disclosure also provides an electronic device, a computer readable medium and a computer program product according to embodiments of the present disclosure.
Fig. 14 is a block diagram of an electronic device according to an embodiment of the present disclosure.
FIG. 14 shows a schematic block diagram of an electronic device 800 that may be used to implement embodiments of the present disclosure. The electronic device 800 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
Referring to fig. 14, the electronic apparatus includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM)802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The calculation unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to bus 804.
A number of components in the electronic device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard, a mouse, or the like; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, or the like; and a communication unit 809 such as a network card, modem, wireless communication transceiver, etc. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), system on a chip (SOCs), load programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs or instructions that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine (computer) readable medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
The present disclosure also provides a computer program product comprising a computer program which, when executed by a processor, implements the above-described text processing method and/or the above-described synonymous text recall method.
According to the technical scheme of the embodiment of the disclosure, the texts in the text database are classified and clustered according to the synonymous relationship, and the representative texts of each cluster are selected to construct the query index of the text database, so that on one hand, the text space of the query index of the text database can be effectively reduced, the storage resources are saved, on the other hand, when the text database is used for retrieval, the whole amount of texts in the text database do not need to be retrieved, and only the representative texts of each cluster in the text database need to be retrieved through the target query index, so that the retrieval efficiency and the search recall efficiency can be effectively improved, and the representative texts and all texts in the cluster where the representative texts are located can be effectively recalled, the search recall capability is improved, and the occurrence of a recall missing phenomenon is effectively avoided.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
It is to be understood that the above-described embodiments are merely exemplary embodiments that have been employed to illustrate the principles of the present disclosure, and that the above-described specific embodiments are not to be construed as limiting the scope of the present disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.
Claims (20)
1. A text processing method of a text database comprises the following steps:
acquiring feature vectors of all texts in a text database;
classifying and clustering all texts according to the feature vectors of all the texts to obtain a plurality of synonymous text clusters, wherein the synonymous text clusters comprise a plurality of texts with synonymous relations;
determining a text from the synonymous text cluster aiming at each synonymous text cluster to serve as a representative text corresponding to the synonymous text cluster;
and creating a target query index of the text database according to all the feature vectors representing the texts.
2. The text processing method of claim 1, wherein the obtaining feature vectors of all texts in the text database comprises:
and acquiring a feature vector of each text in the text database by using a preset measurement representation model.
3. The method of text processing according to claim 2, wherein the metric representation model is a BERT pre-trained model.
4. The text processing method according to claim 1, wherein the classifying and clustering all the texts according to the feature vectors of all the texts to obtain a plurality of synonymous text clusters comprises:
creating an initial query index of the text database according to the feature vectors of all texts in the text database;
for each text in the text database, querying a text matched with the text through an initial query index to generate initial synonymy relationship information, wherein the initial synonymy relationship information comprises the text and the text matched with the text;
carrying out synonymy discrimination on the text in each piece of initial synonymy relationship information by using a preset synonymy discrimination model, and removing initial synonymy relationship information which does not meet the synonymy relationship in all pieces of initial synonymy relationship information;
and dividing the initial synonymous relation information with intersection in the rest initial synonymous relation information into a type of initial synonymous relation information, wherein each type of initial synonymous relation information is used as one synonymous text cluster.
5. The text processing method of claim 4, wherein the generating initial synonymity information by dividing the text matching the text by the initial query index query comprises:
and in the initial query index, querying out a text matched with the text by utilizing a neighbor retrieval technology to generate the initial synonymy relationship information.
6. The text processing method of claim 5, wherein the neighbor retrieval technique employs an HNSW algorithm.
7. The method of claim 4, wherein the synonymous discriminant model is a BERT classification model.
8. The text processing method according to claim 4, wherein the dividing of the initial synonymous relationship information having an intersection in the remaining initial synonymous relationship information into a type of initial synonymous relationship information, each type of initial synonymous relationship information being one of the synonymous text clusters, comprises:
regarding the remaining initial synonymy relationship information, taking each text in the initial synonymy relationship information as a node, taking the matching relationship between the texts in the initial synonymy relationship information as an edge, and constructing an Euler graph;
determining all connected subgraphs in the Euler graph by using a preset connected subgraph discovery algorithm;
and determining the initial synonymy relationship information with intersection according to the connected subgraph to generate the synonymy text cluster.
9. The text processing method of claim 8, wherein the connected subgraph discovery algorithm comprises a union set algorithm.
10. The text processing method according to claim 1, wherein the determining a text from the synonymous text cluster as a representative text corresponding to the synonymous text cluster comprises:
determining the corresponding synonymy relation quantity of the texts in the synonymy text cluster aiming at each text in the synonymy text cluster;
and taking any text with the maximum number of synonymy relations in the synonymy text cluster as a representative text corresponding to the synonymy text cluster.
11. The text processing method of claim 1, wherein before creating the target query index of the text database based on all the feature vectors representing the text, further comprising:
aiming at each text in each synonymous text cluster, identifying whether the text and the representative text corresponding to the synonymous text cluster have a synonymous relationship by using a preset synonymous judgment model;
and under the condition that the text is identified not to have the synonymy relation with the representative text corresponding to the synonymy text cluster, removing the text from the synonymy text cluster.
12. A method for recalling synonymous text, the recall method being implemented based on a target query index of a text database, the target query index being created by the text processing method of any one of claims 1 to 11, the recall method comprising:
acquiring a search request, wherein the search request comprises a search text;
acquiring a feature vector corresponding to the search text;
inputting the feature vector of the search text into the target query index to query a representative text matched with the search text;
and taking all texts in the representative text and the synonymous text cluster corresponding to the representative text as the synonymous texts of the search text for searching and recalling.
13. The method for recalling synonymous text according to claim 12, wherein the obtaining the feature vector corresponding to the search text includes:
and acquiring a feature vector corresponding to the search text by using a preset measurement representation model.
14. The method for recalling synonymous texts according to claim 12, wherein the step of inputting the feature vector of the search text into the target query index to query out the representative text matching with the search text comprises:
and inputting the feature vector of the search text into the target query index, and querying a representative text matched with the search text in the target query index by utilizing a neighbor retrieval technology.
15. The method for recalling synonymous text according to claim 12, wherein before the step of recalling all texts in the representative text and the synonymous text cluster corresponding to the representative text as the synonymous text of the search text, the method further comprises: identifying whether the search text and the representative text matched with the search text have a synonymy relation or not by using a preset synonymy discrimination model;
and under the condition that the search text and the representative text matched with the search text are identified to have the synonymous relationship, executing the step of taking all texts in the synonymous text cluster corresponding to the representative text and the representative text as the synonymous text of the search text for searching and recalling.
16. A text processing apparatus comprising:
the first vector acquisition module is configured to acquire feature vectors of all texts in the text database;
the text classification module is configured to classify and cluster all texts according to the feature vectors of all the texts to obtain a plurality of synonymous text clusters, and the synonymous text clusters comprise a plurality of texts with synonymous relations;
the screening module is configured to, for each synonymous text cluster, determine one text from the cluster as the representative text corresponding to the cluster;
and a building module configured to create a target query index for the text database according to the feature vectors of all the representative texts.
17. A synonymous text recall device, comprising:
a request acquisition module configured to acquire a search request, the search request including a search text;
the second vector acquisition module is configured to acquire a feature vector corresponding to the search text;
the query module is configured to input the feature vector of the search text into a target query index corresponding to a text database so as to query a representative text matching the search text; the target query index is created by using the text processing method of any one of claims 1 to 11;
and the text recall module is configured to take the representative text and all texts in the synonymous text cluster corresponding to the representative text as synonymous texts of the search text for search recall.
18. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores one or more computer programs executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-15.
19. A computer-readable medium having a computer program stored thereon, wherein the computer program, when executed, implements the method of any one of claims 1-15.
20. A computer program product comprising a computer program which, when executed by a processor, implements the method of any one of claims 1-15.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110220258.2A CN112925912B (en) | 2021-02-26 | 2021-02-26 | Text processing method, synonymous text recall method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112925912A true CN112925912A (en) | 2021-06-08 |
CN112925912B CN112925912B (en) | 2024-01-12 |
Family
ID=76172428
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110220258.2A Active CN112925912B (en) | 2021-02-26 | 2021-02-26 | Text processing method, synonymous text recall method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112925912B (en) |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103235784A (en) * | 2013-03-28 | 2013-08-07 | 百度在线网络技术(北京)有限公司 | Method and equipment used for obtaining search results |
CN103246697A (en) * | 2013-03-28 | 2013-08-14 | 百度在线网络技术(北京)有限公司 | Method and equipment for determining near-synonymy sequence clusters |
CN106156272A (en) * | 2016-06-21 | 2016-11-23 | 北京工业大学 | A kind of information retrieval method based on multi-source semantic analysis |
CN107329964A (en) * | 2017-04-19 | 2017-11-07 | 阿里巴巴集团控股有限公司 | A kind of text handling method and device |
US20190057159A1 (en) * | 2017-08-15 | 2019-02-21 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Method, apparatus, server, and storage medium for recalling for search |
CN110413749A (en) * | 2019-07-03 | 2019-11-05 | 阿里巴巴集团控股有限公司 | Determine the method and device of typical problem |
CN110442718A (en) * | 2019-08-08 | 2019-11-12 | 腾讯科技(深圳)有限公司 | Sentence processing method, device and server and storage medium |
EP3575988A1 (en) * | 2018-05-31 | 2019-12-04 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Method and device for retelling text, server, and storage medium |
CN110543555A (en) * | 2019-08-15 | 2019-12-06 | 阿里巴巴集团控股有限公司 | method and device for question recall in intelligent customer service |
CN110674243A (en) * | 2019-07-02 | 2020-01-10 | 厦门耐特源码信息科技有限公司 | Corpus index construction method based on dynamic K-means algorithm |
CN111460808A (en) * | 2020-03-23 | 2020-07-28 | 腾讯科技(深圳)有限公司 | Synonymous text recognition and content recommendation method and device and electronic equipment |
CN111881255A (en) * | 2020-06-24 | 2020-11-03 | 百度在线网络技术(北京)有限公司 | Synonymy text acquisition method and device, electronic equipment and storage medium |
- 2021-02-26: Application CN202110220258.2A filed in China; granted as CN112925912B (status: Active)
Patent Citations (12): see the Citations table above.
Non-Patent Citations (3)
Title |
---|
BAI QIUCHAN et al.: "Text clustering algorithm based on concept vector", Computer Engineering and Applications * |
SUN Xia; DONG Lehong: "Automatic extraction of synonymy relations based on supervised learning", Journal of Northwest University (Natural Science Edition), no. 01 * |
MA Suqin; SHI Huaji; LI Xingyi: "Chinese text clustering algorithm based on semantic lists", Application Research of Computers, no. 05 * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113343047A (en) * | 2021-06-18 | 2021-09-03 | 北京百度网讯科技有限公司 | Data processing method, data retrieval method and device |
CN113343047B (en) * | 2021-06-18 | 2024-05-31 | 北京百度网讯科技有限公司 | Data processing method, data retrieval method and device |
CN113407700A (en) * | 2021-07-06 | 2021-09-17 | 中国工商银行股份有限公司 | Data query method, device and equipment |
CN114625858A (en) * | 2022-03-25 | 2022-06-14 | 中国电子产业工程有限公司 | A kind of intelligent reply method and device for government question and answer based on neural network |
CN118069784A (en) * | 2024-02-23 | 2024-05-24 | 合肥智能语音创新发展有限公司 | Text recall method and device, storage medium and electronic equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109635273B (en) | Text keyword extraction method, device, equipment and storage medium | |
CN111985228B (en) | Text keyword extraction method, text keyword extraction device, computer equipment and storage medium | |
CN112925912A (en) | Text processing method, and synonymous text recall method and device | |
CN106294618A (en) | Searching method and device | |
CN112506864B (en) | File retrieval method, device, electronic equipment and readable storage medium | |
CN112989170B (en) | Keyword matching method applied to information search, information search method and device | |
CN114328800B (en) | Text processing method, device, electronic device and computer-readable storage medium | |
US20220198358A1 (en) | Method for generating user interest profile, electronic device and storage medium | |
CN117609418B (en) | Document processing method, device, electronic device and storage medium | |
CN116401345A (en) | Intelligent question answering method, device, storage medium and equipment | |
CN113886535B (en) | Knowledge graph-based question and answer method and device, storage medium and electronic equipment | |
CN116644148A (en) | Keyword recognition method and device, electronic equipment and storage medium | |
CN118733633A (en) | Entity search method, large language model fine-tuning method, device and equipment | |
CN114328820B (en) | Information search method and related equipment | |
CN114818727A (en) | Key sentence extraction method and device | |
CN118551068A (en) | Expression package retrieval method, electronic device and computer-readable storage medium | |
CN116089586B (en) | Question generation method based on text and training method of question generation model | |
CN117271884A (en) | Method, device, electronic equipment and storage medium for determining recommended content | |
CN113128210B (en) | Webpage form information analysis method based on synonym discovery | |
CN116226533A (en) | Method, device and medium for news association recommendation based on association prediction model | |
CN114417862A (en) | Text matching method, training method and device for text matching model | |
CN114647739A (en) | Entity chain finger method, device, electronic equipment and storage medium | |
CN115795023B (en) | Document recommendation method, device, equipment and storage medium | |
CN118964623B (en) | Structured data classification method and device based on knowledge graph | |
CN116244413B (en) | New intention determining method, apparatus and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |