
CN112182145A - Text similarity determination method, device, equipment and storage medium - Google Patents

Text similarity determination method, device, equipment and storage medium

Info

Publication number
CN112182145A
CN112182145A
Authority
CN
China
Prior art keywords
word frequency
text
similarity
word
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910600981.6A
Other languages
Chinese (zh)
Other versions
CN112182145B (en)
Inventor
王艳花
邱龙泉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Jingdong Shangke Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN201910600981.6A
Publication of CN112182145A
Application granted
Publication of CN112182145B
Active legal status
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval of unstructured textual data
    • G06F 16/33 - Querying
    • G06F 16/3331 - Query processing
    • G06F 16/334 - Query execution
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval of unstructured textual data
    • G06F 16/33 - Querying
    • G06F 16/3331 - Query processing
    • G06F 16/334 - Query execution
    • G06F 16/3344 - Query execution using natural language analysis
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/22 - Matching criteria, e.g. proximity measures
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract



Embodiments of the present invention disclose a text similarity determination method, apparatus, device and storage medium. The method includes: acquiring a target text and a candidate text whose similarity is to be determined; determining the word sense similarity and the word frequency inverse word frequency similarity between the target text and the candidate text, wherein the word frequency inverse word frequency similarity includes a part-of-speech-free word frequency inverse word frequency similarity and/or a part-of-speech word frequency inverse word frequency similarity; and determining the text similarity between the target text and the candidate text according to a preset word sense weight, a preset word frequency inverse word frequency weight, the word sense similarity and the word frequency inverse word frequency similarity, wherein the preset word frequency inverse word frequency weight includes a preset part-of-speech-free word frequency inverse word frequency weight and/or a preset part-of-speech word frequency inverse word frequency weight. Through the above technical solution, text similarity is determined more accurately.


Description

Text similarity determination method, device, equipment and storage medium
Technical Field
The embodiments of the present invention relate to big data mining technology, and in particular to a text similarity determination method, device, equipment and storage medium.
Background
With the development of the internet, more and more data appear on the internet in the form of text, such as microblog messages, news headlines, forum posts, commodity comments on e-commerce platforms, commodity questions, and buyers' answers to those questions. Applying machine learning technology to this internet text data to mine valuable information from it, so as to bring convenience to people's lives and meet demands of different kinds, has become a very hot topic in current big data application technology.
Take the questions and answers for goods on an e-commerce platform as an example. When making a purchase decision, a shopper will typically ask a question of people who have purchased the item, browse the question and answer data under the item, or ask customer service, in order to fully understand the actual information about the item. From the user's perspective, the user needs to browse historical question and answer data based on the question they want to ask, check whether others have asked similar questions, and may struggle to find a satisfactory answer when the number of questions is large. From the customer service perspective, similarity calculation needs to be performed between the question posed by the user and the existing questions in the question bank, and the several most similar questions found, so that the question asked by the user can be answered with the help of the answers to those similar questions. Thus, it is necessary to calculate the similarity between the user's question and the existing questions in the question bank.
At present, text similarity determination schemes for such problems mainly adopt a vector space model: each word in a text is mapped into a vector space and the cosine distance between vectors is calculated; the smaller the distance, the greater the similarity of the words. There are two main similarity determination schemes based on the vector space model. One is a similarity determination scheme based on word importance, such as the Term Frequency-Inverse Document Frequency (TF-IDF) model. TF-IDF is used to evaluate the importance of a word to a document in a text corpus: the importance of the word increases in direct proportion to the number of its occurrences in the document, but decreases in inverse proportion to its frequency of occurrence in the corpus. The flow of the TF-IDF based similarity determination (i.e., question matching) scheme is shown in fig. 1. The typical application of this scheme is in a search engine: the user inputs a search query, the query is segmented into words, a mapping relation between the segmented words and documents is established, and after all matching documents are found, the similarity scores and rankings of the user's query against the existing documents in the knowledge base are calculated according to a similarity algorithm (i.e., the TF-IDF algorithm), and the results are returned according to the document ranking. The other is a similarity determination scheme based on Natural Language Processing (NLP), for which the common model is the Word2vec model. The Word2vec model is trained on the basis of Harris's distributional hypothesis (i.e., words with similar contexts have similar semantics); the neural network model obtained by training can map each word into a K-dimensional space to represent the word by a vector, and the semantic similarity of words is then measured by the similarity between vectors in that vector space. The flow of the NLP-based semantic similarity scheme is shown in fig. 2.
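As a concrete illustration of the TF-IDF flow just described, the following sketch computes TF-IDF vectors for a toy question corpus and compares them by cosine similarity. The corpus, the tokenization, and the smoothed IDF formula are illustrative assumptions, not the patent's implementation.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute a sparse TF-IDF vector (dict) per tokenized document."""
    df = Counter()                      # document frequency of each word
    for doc in docs:
        df.update(set(doc))
    n = len(docs)
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        # term frequency times a smoothed inverse document frequency;
        # the exact IDF formula is an illustrative choice, not the patent's
        vecs.append({w: (c / len(doc)) * math.log((1 + n) / (1 + df[w]))
                     for w, c in tf.items()})
    return vecs

def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Toy "question bank": the query (docs[0]) should match docs[1], not docs[2]
docs = [["does", "this", "phone", "support", "fast", "charging"],
        ["does", "the", "phone", "charge", "fast"],
        ["what", "color", "is", "the", "case"]]
vecs = tfidf_vectors(docs)
```

Ranking candidates by `cosine(vecs[0], vecs[i])` reproduces the retrieval step of the fig. 1 flow; note that no word position or semantic information enters the score.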
In the process of implementing the invention, the inventors found that the prior art has at least the following problems: 1) the TF-IDF based similarity determination scheme only uses word frequency and inverse word frequency to measure the importance of a word to a document; it considers neither the position information nor the semantic information of the word, so the accuracy of the determined inter-text similarity is low and cannot meet the practical requirement of finding the most similar text in a question answering system. 2) The NLP-based similarity determination scheme works well only for similarity calculation at the word level; when it is extended to the sentence level, due to the complexity of syntactic structure, simply accumulating or concatenating the word vectors of all words in a sentence to represent a sentence vector cannot achieve an ideal effect in practical applications.
Disclosure of Invention
The embodiment of the invention provides a text similarity determination method, a text similarity determination device, text similarity determination equipment and a storage medium, so that the text similarity can be determined more accurately.
In a first aspect, an embodiment of the present invention provides a method for determining text similarity, including:
acquiring a target text and an alternative text with similarity to be determined;
determining a word sense similarity and a word frequency inverse word frequency similarity of the target text and the alternative text, wherein the word frequency inverse word frequency similarity comprises a part-of-speech-free word frequency inverse word frequency similarity and/or a part-of-speech word frequency inverse word frequency similarity;
determining the text similarity between the target text and the alternative text according to a preset word sense weight, a preset word frequency inverse word frequency weight, the word sense similarity and the word frequency inverse word frequency similarity, wherein the preset word frequency inverse word frequency weight comprises a preset part-of-speech-free word frequency inverse word frequency weight and/or a preset part-of-speech word frequency inverse word frequency weight.
In a second aspect, an embodiment of the present invention further provides a text similarity determining apparatus, where the apparatus includes:
the target text acquisition module is used for acquiring a target text and an alternative text with similarity to be determined;
the first similarity determining module is used for determining a word sense similarity and a word frequency inverse word frequency similarity of the target text and the alternative text, wherein the word frequency inverse word frequency similarity comprises a part-of-speech-free word frequency inverse word frequency similarity and/or a part-of-speech word frequency inverse word frequency similarity;
and the second similarity determining module is used for determining the text similarity between the target text and the alternative text according to a preset word sense weight, a preset word frequency inverse word frequency weight, the word sense similarity and the word frequency inverse word frequency similarity, wherein the preset word frequency inverse word frequency weight comprises a preset part-of-speech-free word frequency inverse word frequency weight and/or a preset part-of-speech word frequency inverse word frequency weight.
In a third aspect, an embodiment of the present invention further provides an apparatus, where the apparatus includes:
one or more processors;
a storage device for storing one or more programs,
when the one or more programs are executed by the one or more processors, the one or more processors implement the text similarity determination method provided in any embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the text similarity determining method provided in any embodiment of the present invention.
With the above technical solutions, the word sense similarity between the target text and the alternative text is generated, together with the word frequency inverse word frequency similarity, which includes the part-of-speech-free word frequency inverse word frequency similarity and/or the part-of-speech word frequency inverse word frequency similarity; the text similarity between the target text and the alternative text is then determined according to the preset word sense weight, the preset word frequency inverse word frequency weight (which includes the preset part-of-speech-free word frequency inverse word frequency weight and/or the preset part-of-speech word frequency inverse word frequency weight), the word sense similarity and the word frequency inverse word frequency similarity. The advantage of this arrangement is that the whole-sentence semantics of the text can be grasped by using word senses together with the multiple dimensional characteristics of word frequency and/or part of speech, so that the text similarity is represented from different dimensions, a comprehensive similarity of the text is obtained, and the accuracy of the text similarity is improved to a great extent.
Drawings
FIG. 1 is a flow chart of a similarity determination method based on a word-frequency inverse word-frequency model in the prior art;
FIG. 2 is a flow chart of a similarity determination method based on a semantic similarity model of NLP in the prior art;
fig. 3a is a flowchart of a text similarity determining method according to a first embodiment of the present invention;
fig. 3b is a logic framework diagram of a text similarity determination method in the first embodiment of the present invention;
fig. 4a is a flowchart of a text similarity determining method in the second embodiment of the present invention;
FIG. 4b is a schematic diagram of the structure of the CBOW model according to the second embodiment of the present invention;
FIG. 4c is a schematic diagram of the Skip-Gram model structure in the second embodiment of the present invention;
fig. 5 is a flowchart of a text similarity determination method in the third embodiment of the present invention;
fig. 6a is a flowchart of a text similarity determining method in the fourth embodiment of the present invention;
fig. 6b is a logic framework diagram of a text similarity determination method in the fourth embodiment of the present invention;
fig. 7 is a schematic structural diagram of a text similarity determination apparatus according to a fifth embodiment of the present invention;
fig. 8 is a schematic structural diagram of an apparatus in the sixth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Example one
The text similarity determination method provided by this embodiment is applicable to text similarity calculation over topics, messages and replies, consultations, suggestions and opinion feedback in internet forums, as well as over online intelligent question answering, instant chat records, and the like. The method may be performed by a text similarity determination apparatus, which may be implemented in software and/or hardware and may be integrated in a device with large-scale data computation capability, such as a personal computer or a server. In this embodiment of the present invention, intelligent question answering is taken as an example for explanation. Referring to fig. 3a, the method of this embodiment specifically includes the following steps:
and S110, acquiring a target text and an alternative text with the similarity to be determined.
The target text is a text for which similarity needs to be calculated, and may be an existing text or a new text obtained from the outside. The alternative text is text for calculating the similarity with the target text. The text may be short text or long text. Short text refers to short length text such as a sentence or a few short sentences (small paragraphs).
Specifically, content input by the user can be received as the target text. Meanwhile, one or more alternative texts are obtained from the existing texts that can be collected. It should be understood that a text database may be constructed in advance from those collectable existing texts, with the alternative texts then obtained from the text database.
And S120, determining the word sense similarity and the word frequency inverse word frequency similarity of the target text and the alternative text.
The word sense similarity is a similarity determined from the semantic dimension of words. The word frequency inverse word frequency similarity is a similarity determined from the importance dimension of words. Word frequency measures whether two sentences contain the same words with similar frequencies; if so, the sentences are considered similar. Inverse word frequency measures the importance of each word to a sentence, based on the intuition that the more a word appears in the sentence and the less it appears in other sentences, the more important the word is to that sentence. Word frequency and inverse word frequency jointly characterize each word in a sentence, and the similarity of sentences is calculated accordingly.
Illustratively, the word frequency inverse word frequency similarity includes a part-of-speech-free word frequency inverse word frequency similarity and/or a part-of-speech word frequency inverse word frequency similarity. The part-of-speech-free word frequency inverse word frequency similarity is a word frequency inverse word frequency similarity that does not distinguish the part of speech of each word in the target text, and the part-of-speech word frequency inverse word frequency similarity is one that does distinguish the part of speech of each word in the target text. Because the part of speech can reflect the semantics of a word to a certain extent, the part-of-speech word frequency inverse word frequency similarity can be understood as a similarity determined from two dimensions, namely word semantics and word importance.
In the related art, similarity can be determined from only one dimension of a word, and when calculating the similarity of texts with more words and richer sentence semantics, such a method cannot determine text similarity accurately. Therefore, in the embodiment of the present invention, similarities of at least two dimensions are used to determine the similarity of texts: specifically, the word sense dimension and the word importance dimension, where the word importance dimension can be divided into the part-of-speech-free word importance dimension and the part-of-speech word importance dimension. Although the text similarity measurement is still performed at word granularity, fusing the similarities of multiple dimensions represents the similarity of texts more completely.
In a specific implementation, the target text and the alternative text can be used as the input of a word sense similarity model, and the word sense similarity between the target text and the alternative text is obtained through that model's similarity calculation. Similarly, the target text and the alternative text can be used as the input of a word frequency inverse word frequency similarity model, which yields the word frequency inverse word frequency similarity between the target text and the alternative text. The word sense similarity model may be an NLP-based similarity model, for example a word vector learning model such as Word2vec, GloVe or BERT. Since both the part-of-speech-free and the part-of-speech word frequency inverse word frequency similarities may need to be calculated, the word frequency inverse word frequency similarity model can be divided into a part-of-speech-free model and a part-of-speech model. The two may be the same model (such as a TF-IDF model) differing only in input data, with one input not distinguishing parts of speech among the word groups and the other distinguishing them; alternatively, two different models may be adopted, in which case the part-of-speech model may need to perform specific processing for parts of speech.
In the above process, similarity calculation is performed for a single candidate text; if there are multiple candidate texts, the above process must be looped once per candidate text to complete the similarity calculation between all candidate texts and the target text. To increase the calculation speed, and thus the efficiency of determining which of multiple candidate texts are most similar to the target text, all candidate texts may be calculated in parallel, for example, by representing each candidate text in advance as vectors of different dimensions and then representing all candidate texts in different matrix forms. In a specific implementation (see fig. 3b), all candidate texts are characterized in advance as word sense feature matrices and word frequency inverse word frequency feature matrices (including part-of-speech-free word frequency inverse word frequency feature matrices and/or part-of-speech word frequency inverse word frequency feature matrices); this is called stock calculation. After the target text is determined, the target text is characterized as a word sense feature vector and word frequency inverse word frequency feature vectors (including a part-of-speech-free word frequency inverse word frequency feature vector and/or a part-of-speech word frequency inverse word frequency feature vector) using the same models as the stock calculation; this process is called real-time incremental calculation.
Finally, word sense similarity calculation is performed using the word sense feature vector and the word sense feature matrix, and word frequency inverse word frequency similarity calculation is performed using the word frequency inverse word frequency feature vector and the word frequency inverse word frequency feature matrix, so as to determine the word sense similarity and the word frequency inverse word frequency similarity between the target text and each candidate text.
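The stock/incremental split described above can be sketched as follows. The 3-dimensional feature vectors are illustrative stand-ins for whatever featurization (word sense or word frequency inverse word frequency) is used; rows of the candidate matrix are normalized offline so that scoring a new target text is a single pass of dot products.

```python
import math

def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else v

# Stock calculation: candidate texts featurized offline, one row per candidate.
# The 3-dimensional vectors here are illustrative placeholders.
candidate_matrix = [normalize(v) for v in [[1.0, 2.0, 0.0],
                                           [0.0, 1.0, 1.0],
                                           [2.0, 0.0, 1.0]]]

# Real-time incremental calculation: only the target text is featurized at
# query time, then scored against every candidate row in one pass.
target = normalize([1.0, 1.0, 0.0])
scores = [sum(t * c for t, c in zip(target, row)) for row in candidate_matrix]
best = max(range(len(scores)), key=scores.__getitem__)  # index of most similar
```

With a numerical library the per-row loop becomes one matrix-vector product, which is what makes evaluating all candidates in parallel cheap.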
It should be noted that the part-of-speech-free word frequency inverse word frequency similarity and the part-of-speech word frequency inverse word frequency similarity need not both be calculated; only one of them may be selected. That is, one or both of the part-of-speech-free and the part-of-speech word frequency inverse word frequency similarity models may be used in this operation. Accordingly, the similarities determined in this operation may be: the word sense similarity and the part-of-speech-free word frequency inverse word frequency similarity; the word sense similarity and the part-of-speech word frequency inverse word frequency similarity; or the word sense similarity, the part-of-speech-free word frequency inverse word frequency similarity and the part-of-speech word frequency inverse word frequency similarity.
S130, determining the text similarity between the target text and the alternative text according to the preset word semantic weight, the preset word frequency inverse word frequency weight, the word semantic similarity and the word frequency inverse word frequency similarity.
The preset word semantic weight and the preset word frequency inverse word frequency weight are preset weight values and are respectively used for determining the proportion of the word semantic similarity and the word frequency inverse word frequency similarity in the similarity fusion process, and the preset word semantic weight and the preset word frequency inverse word frequency weight can be set manually in advance. The text similarity refers to the similarity obtained after the similarities of different dimensions are fused, and can represent the comprehensive similarity of the target text and the alternative text.
Illustratively, the preset word frequency inverse word frequency weight includes a preset part-of-speech-free word frequency inverse word frequency weight and/or a preset part-of-speech word frequency inverse word frequency weight. It should be noted that the preset word frequency inverse word frequency weight corresponds to the word frequency inverse word frequency similarity. If the word frequency inverse word frequency similarity is the part-of-speech-free one, the preset weight is the preset part-of-speech-free word frequency inverse word frequency weight; if it is the part-of-speech one, the preset weight is the preset part-of-speech word frequency inverse word frequency weight; and if both the part-of-speech-free and the part-of-speech similarities are used, both preset weights are used.
After obtaining the similarity between the target text and the candidate text in different dimensions, the similarity of each dimension needs to be fused so as to obtain the multi-dimensional text similarity. In specific implementation, according to the weight of the similarity under each dimension, the obtained similarities of different dimensions are subjected to weighted summation to determine the text similarity, and a weighted summation formula for determining the text similarity is as shown in formula (1):
score = w1·score1 + w2·score2 (1)
where score represents the text similarity between the target text and the alternative text, w1 and w2 respectively represent the preset word sense weight and the preset word frequency inverse word frequency weight, and score1 and score2 respectively represent the word sense similarity and the word frequency inverse word frequency similarity.
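Formula (1) is a plain weighted sum, with one more term when both word frequency inverse word frequency variants are used. A minimal sketch (the weights and similarity scores below are illustrative values, not recommendations):

```python
def fuse_similarities(scores, weights):
    """Weighted sum of per-dimension similarities, as in formula (1)."""
    if len(scores) != len(weights):
        raise ValueError("need one weight per similarity dimension")
    return sum(w * s for w, s in zip(weights, scores))

# word sense similarity 0.8, word frequency inverse word frequency similarity 0.6
text_similarity = fuse_similarities([0.8, 0.6], [0.5, 0.5])
```

When both part-of-speech-free and part-of-speech similarities are computed, the same call simply takes three scores and three weights.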
According to the technical scheme of this embodiment, the word sense similarity between the target text and the alternative text is generated, together with the word frequency inverse word frequency similarity, which includes the part-of-speech-free word frequency inverse word frequency similarity and/or the part-of-speech word frequency inverse word frequency similarity; the text similarity between the target text and the alternative text is then determined according to the preset word sense weight, the preset word frequency inverse word frequency weight (which includes the preset part-of-speech-free word frequency inverse word frequency weight and/or the preset part-of-speech word frequency inverse word frequency weight), the word sense similarity and the word frequency inverse word frequency similarity. The advantage of this arrangement is that the whole-sentence semantics of the text can be grasped by using word senses together with the multiple dimensional characteristics of word frequency and/or part of speech, so that the text similarity is represented from different dimensions, a comprehensive similarity of the text is obtained, and the accuracy of the text similarity is improved to a great extent.
Example two
In this embodiment, based on the first embodiment, further optimization is performed on "determining the word sense similarity and the word frequency inverse word frequency similarity of the target text and the candidate text", and optimization is performed on "determining the text similarity of the target text and the candidate text according to the preset word semantic weight, the preset word frequency inverse word frequency weight, the word sense similarity and the word frequency inverse word frequency similarity". Wherein explanations of the same or corresponding terms as those of the above embodiments are omitted. Referring to fig. 4a, the text similarity determining method provided in this embodiment includes:
and S210, acquiring a target text and an alternative text with similarity to be determined.
S220, performing word segmentation on the target text to obtain each target non-part-of-speech word corresponding to the target text.
Since the target text contains at least one sentence and the similarity is determined on the granularity of words, the word segmentation processing needs to be performed on the target text first so as to split the target text into a plurality of words. Because each word after word segmentation has no part of speech, each word obtained by word segmentation is each target part of speech-free word corresponding to the target text.
And S230, determining word sense similarity of the target text and the alternative text based on a pre-trained word sense similarity model according to each target non-part-of-speech word and the alternative text.
According to the description of the above embodiment, the word sense similarity model may be a Word2vec model. The model uses a shallow two-layer neural network either to predict a target word from its context as input (corresponding to the CBOW model structure) or to predict the context of a word from the target word as input (corresponding to the Skip-Gram model structure), so as to train on text and learn the hidden-layer parameters of the network, thereby obtaining the trained Word2vec model. The trained Word2vec model can map each word to a vector space and characterize words as corresponding feature vectors.
Referring to FIG. 4b, the CBOW model has three layers: an input layer, a hidden layer and an output layer. It predicts $P(w_t \mid w_{t-k}, w_{t-(k-1)}, \ldots, w_{t-1}, w_{t+1}, w_{t+2}, \ldots, w_{t+k})$, where $w_t$ is the target word to be predicted and $w_{t-k}, w_{t-(k-1)}, \ldots, w_{t-1}, w_{t+1}, w_{t+2}, \ldots, w_{t+k}$ (with $k = 2$) is the context of the target word, i.e. the two words before and the two words after the target word are selected as the context. The operation from the input layer to the hidden layer is the addition of the context vectors, and from the hidden layer to the output layer hierarchical Softmax or negative sampling is adopted. Referring to FIG. 4c, the Skip-Gram model also has three layers (an input layer, a hidden layer and an output layer), but in contrast to the CBOW model, the Skip-Gram model predicts $P(w_i \mid w_t)$, where $t-c \le i \le t+c$ and $i \ne t$, and $c$ is the window size (a constant representing the context size). Given a word sequence $w_1, w_2, w_3, \ldots, w_T$, the objective of Skip-Gram is to maximize:
$$\frac{1}{T}\sum_{t=1}^{T}\ \sum_{-c \le j \le c,\ j \ne 0} \log p\left(w_{t+j} \mid w_t\right) \qquad (2)$$

where the conditional probability $p(w_{t+j} \mid w_t)$ is given by the softmax over the vocabulary:

$$p\left(w_O \mid w_I\right) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{W} \exp\left({v'_{w}}^{\top} v_{w_I}\right)} \qquad (3)$$

where $v_w$ and $v'_w$ are the input and output vector representations of the word $w$, and $W$ is the vocabulary size.
After the Word2vec model structure is determined, the model needs to be trained; the model training process can refer to the stock (full) calculation flow of fig. 3b.
Illustratively, the Word2vec model is trained in advance based on a short text database and a long text database, wherein the long text database is constructed in advance based on service data corresponding to a service scene, and the short text database is constructed in advance based on service data corresponding to a service requirement under the service scene.
The short text database is a database composed of a large number of short texts, and the collection source of the short text data is related to the specific service requirement. For example, if the service requirement is only to calculate similarity, the data source of the short text database can be short texts from any network platform, and the more sources the better. If the service requirement is to determine reference answers according to similar questions, then because the emphasis of answers differs between intelligent question-answering systems, the data source of the short text database can only be the question texts in the intelligent question-answering system corresponding to the target text. The long text database is composed of texts with more sentences, such as articles or product specifications, and the collection source of its data is likewise related to the specific service requirement. For example, if the service scenario is intelligent question answering, the long text database can be constructed by collecting long texts with stronger sentence logic, such as recommendation articles, product introductions or product specifications, from any intelligent question-answering system.
Since the target text may be a short text or a long text, in order to enhance the text compatibility of the Word2vec model, this embodiment uses the short text database and the long text database simultaneously for model training; and in order to improve the semantic expressiveness of the Word2vec model, the more complete the word coverage in the long text database, the better the training effect. In specific implementation, the training data is acquired first, i.e. the long text database and the short text database are obtained according to the service scenario and the service requirement. The data is then preprocessed by word segmentation and data cleaning, which removes stop words and punctuation marks and retains only valid data. Finally, the preprocessed long text database and short text database are input into the Word2vec model for model training to obtain the trained Word2vec model.
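The preprocessing step described above can be sketched minimally as follows; the stop-word set, punctuation set and sample texts are illustrative assumptions, not part of the original scheme:

```python
# Assumed stop-word and punctuation sets for illustration only.
STOP_WORDS = {"the", "a", "is", "of"}
PUNCTUATION = {",", ".", "?", "!"}

def clean(tokens):
    """Keep only tokens that are neither stop words nor punctuation."""
    return [t for t in tokens if t not in STOP_WORDS and t not in PUNCTUATION]

# One segmented short text and one segmented long text (toy examples).
short_corpus = [["what", "is", "the", "delivery", "time", "?"]]
long_corpus = [["the", "product", "ships", "within", "two", "days", "."]]

# Both databases are cleaned and merged into one training corpus for Word2vec.
training_data = [clean(doc) for doc in short_corpus + long_corpus]
```

The merged, cleaned token lists would then be fed to whatever Word2vec implementation is used for training.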
In the incremental calculation part, each target non-part-of-speech word is input into the trained word sense similarity model (the Word2vec model) to obtain the row vector representation of each target non-part-of-speech word. Then, the mean of the corresponding columns of these row vectors is computed, and the resulting row vector of column means serves as the word sense feature vector of the target text. Likewise, the word sense feature vector corresponding to the alternative text can be obtained. Finally, the vector cosine of the word sense feature vector of the target text and the word sense feature vector corresponding to the alternative text is calculated, so as to obtain the word sense similarity of the target text and the alternative text.
Exemplarily, S230 includes: inputting each target non-part-of-speech word into the word sense similarity model to generate a word sense feature vector corresponding to the target text; and determining the word sense similarity between the target text and the alternative text corresponding to each row vector according to the word sense feature vector and the row vectors in a word sense feature matrix, where the word sense feature matrix is generated according to the non-part-of-speech word segmentation results of the text database and the word sense similarity model.
When each text in the text database is an alternative text, in order to improve the operation efficiency, each alternative text in the text database may be preprocessed in advance by word segmentation and data cleaning to obtain the corresponding non-part-of-speech word segmentation results. These results are then input into the Word2vec model to obtain the word sense feature matrix of the text database, where each row vector represents the feature vector of one alternative text. Finally, the vector cosines between the word sense feature vector and each row vector of the word sense feature matrix are respectively calculated to obtain the word sense similarity between the target text and each alternative text in the text database.
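The column-mean pooling and vector-cosine computation described above can be sketched as follows; the toy 3-dimensional vectors stand in for real Word2vec output and are assumed values:

```python
import math

def sentence_vector(word_vectors):
    """Column-wise mean of the word row vectors -> one sentence feature vector."""
    dim = len(word_vectors[0])
    return [sum(v[i] for v in word_vectors) / len(word_vectors) for i in range(dim)]

def cosine(a, b):
    """Vector cosine between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Toy 3-dimensional row vectors standing in for Word2vec output (assumed values).
target_word_vecs = [[1.0, 0.0, 1.0], [0.0, 2.0, 1.0]]   # two target words
candidate_vec = [1.0, 0.0, 0.0]                          # one alternative text's vector

sim = cosine(sentence_vector(target_word_vecs), candidate_vec)
```

In the full scheme this cosine would be computed against every row of the word sense feature matrix, yielding one similarity per alternative text.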
S240, determining the word frequency inverse word frequency similarity of the target text and the alternative text based on the word frequency inverse word frequency similarity model according to each target word without word frequency, the target text and the alternative text.
The word frequency inverse word frequency similarity model in this embodiment adopts the word frequency inverse word frequency (TF-IDF) model, which is constructed from the term frequency (TF) value and the inverse document frequency (IDF) value of a word. The Term Frequency (TF) is the frequency of occurrence of a given word in a text; the TF value normalizes the raw occurrence count to prevent a bias toward longer texts. The Inverse Document Frequency (IDF) is a measure of the general importance of a word. The specific calculation formulas are as follows:
$$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}} \qquad (4)$$

$$\mathrm{idf}_i = \log \frac{|D|}{\left|\{\, j : t_i \in d_j \,\}\right|} \qquad (5)$$

$$\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_i \qquad (6)$$
where $n_{i,j}$ denotes the number of occurrences of the word $t_i$ in the text $d_j$, $\sum_k n_{k,j}$ denotes the sum of the occurrence counts of all words in the text $d_j$, $|D|$ is the total number of texts in the text database, and $|\{\, j : t_i \in d_j \,\}|$ denotes the number of texts in the text database that contain the word $t_i$. To ensure that the denominator is not 0, $|\{\, j : t_i \in d_j \,\}| + 1$ is typically used.
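A from-scratch sketch of the TF, IDF and TF-IDF calculations above, including the $+1$ denominator smoothing; the toy word-segmented documents are assumptions for illustration:

```python
import math

def tf(term, doc):
    """n_ij / sum_k n_kj: occurrence count normalized by document length."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """log(|D| / (|{j : t_i in d_j}| + 1)); the +1 keeps the denominator non-zero."""
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / (containing + 1))

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# Toy text database: three word-segmented documents (assumed contents).
docs = [["price", "discount", "price"],
        ["shipping", "time"],
        ["price", "shipping"]]

score = tf_idf("discount", docs[0], docs)   # tf = 1/3, idf = log(3/2)
```

A word appearing in fewer documents gets a larger IDF, so the product rewards words that are frequent in one text but rare across the database.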
According to the calculation formula of the TF-IDF value, the word frequency inverse word frequency model needs to be trained in advance on the text database to determine $|D|$; meanwhile, the vector dimension of a word's vector representation is determined by the number of non-repeated words in the text database. After the trained TF-IDF model is obtained, each target non-part-of-speech word, the target text and the text database are input into the TF-IDF model, and the TF-IDF value of each target non-part-of-speech word is obtained to form the non-part-of-speech word frequency inverse word frequency feature vector of the target text: the number of columns of the vector equals the determined vector dimension, the element positions corresponding to the target non-part-of-speech words are filled with their TF-IDF values, and the remaining element positions are filled with 0. Similarly, the non-part-of-speech word frequency inverse word frequency feature vector corresponding to the alternative text can be obtained. Finally, the vector cosine of the two feature vectors is calculated, so as to obtain the non-part-of-speech word frequency inverse word frequency similarity of the target text and the alternative text.
Exemplarily, S240 includes: inputting each target non-part-of-speech word and the target text into the word frequency inverse word frequency similarity model to generate a non-part-of-speech word frequency inverse word frequency feature vector corresponding to the target text; and determining the non-part-of-speech word frequency inverse word frequency similarity between the target text and the alternative text corresponding to each row vector according to the non-part-of-speech word frequency inverse word frequency feature vector and the row vectors in a non-part-of-speech word frequency inverse word frequency feature matrix, where the non-part-of-speech word frequency inverse word frequency feature matrix is generated according to the non-part-of-speech word segmentation results of the text database and the word frequency inverse word frequency similarity model.
When each text in the text database is an alternative text, in order to improve the operation efficiency, each alternative text in the text database may be preprocessed in advance by word segmentation and data cleaning to obtain the corresponding non-part-of-speech word segmentation results. These results are then input into the TF-IDF model to obtain the non-part-of-speech word frequency inverse word frequency feature matrix of the text database, where the number of columns equals the determined vector dimension, element positions without a corresponding non-part-of-speech word are filled with 0, and each row vector in the matrix represents the feature vector of one alternative text. Finally, the vector cosines between the non-part-of-speech word frequency inverse word frequency feature vector and each row vector of the feature matrix are respectively calculated, so as to obtain the non-part-of-speech word frequency inverse word frequency similarity between the target text and each alternative text.
And S250, determining the text similarity between the target text and the alternative text according to the preset word sense weight, the preset word frequency inverse weight without word property, the word sense similarity and the word frequency inverse similarity without word property.
The preset word frequency inverse word frequency weight and the word frequency inverse word frequency similarity in formula (1) are respectively taken as the preset non-part-of-speech word frequency inverse word frequency weight $w_{21}$ and the non-part-of-speech word frequency inverse word frequency similarity $score_{21}$; the text similarity between the target text and the alternative text can then be obtained according to the following formula:
$$score = w_1 \cdot score_1 + w_{21} \cdot score_{21} \qquad (7)$$
According to the technical scheme of this embodiment, the word sense similarity and the non-part-of-speech word frequency inverse word frequency similarity of the target text and the alternative text are determined, and the text similarity between them is then determined according to the preset word sense weight, the preset non-part-of-speech word frequency inverse word frequency weight, the word sense similarity and the non-part-of-speech word frequency inverse word frequency similarity. The text similarity of the target text is thus determined from the two dimensions of word sense and non-part-of-speech word importance, which improves the determination accuracy of the text similarity to a certain extent.
EXAMPLE III
In this embodiment, based on the first embodiment, further optimization is performed on "determining the word sense similarity and the word frequency inverse word frequency similarity of the target text and the candidate text", and optimization is performed on "determining the text similarity of the target text and the candidate text according to the preset word semantic weight, the preset word frequency inverse word frequency weight, the word sense similarity and the word frequency inverse word frequency similarity". Wherein explanations of the same or corresponding terms as those of the above embodiments are omitted. Referring to fig. 5, the text similarity determining method provided in this embodiment includes:
S310, acquiring a target text and an alternative text with similarity to be determined.
And S320, performing word segmentation and part-of-speech tagging on the target text to obtain target non-part-of-speech words and target part-of-speech tagged words corresponding to the target text.
In this embodiment, the part of speech of each word needs to be distinguished, so after the target text is segmented, part-of-speech tagging needs to be performed on each obtained target non-part-of-speech word to obtain the target part-of-speech tagged words. If a target non-part-of-speech word has two or more parts of speech, that word generates two or more target part-of-speech tagged words.
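The rule that a word with two or more parts of speech yields multiple tagged words can be illustrated as follows; the "word/pos" token form is an assumed tagging convention, not mandated by the text:

```python
def expand_pos_tags(word, pos_tags):
    """A word with several parts of speech yields one tagged token per part of speech."""
    return [f"{word}/{pos}" for pos in pos_tags]

# "record" can be a noun or a verb, so one non-part-of-speech word
# produces two part-of-speech tagged words.
tagged = expand_pos_tags("record", ["n", "v"])
```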
S330, determining word meaning similarity of the target text and the alternative text based on a pre-trained word meaning similarity model according to each target non-part-of-speech word and the alternative text.
S340, determining word frequency inverse word frequency similarity of the target text and the alternative text based on the word frequency inverse word frequency similarity model according to each target part of speech tagging word, the target text and the alternative text.
In this embodiment, the model for determining the part-of-speech-containing word frequency inverse word frequency similarity also adopts the word frequency inverse word frequency (TF-IDF) model, except that the model parameters $\sum_k n_{k,j}$ and $|\{\, j : t_i \in d_j \,\}|$ change due to the introduction of parts of speech. The process of determining the part-of-speech-containing word frequency inverse word frequency similarity is then as follows: each target part-of-speech tagged word, the target text and the text database are input into the TF-IDF model, and the TF-IDF value of each target part-of-speech tagged word is obtained to form the part-of-speech-containing word frequency inverse word frequency feature vector of the target text; the number of columns of the vector equals the determined vector dimension, the element positions corresponding to the target part-of-speech tagged words are filled with their TF-IDF values, and the remaining element positions are filled with 0. Similarly, word segmentation and part-of-speech tagging can be performed on the alternative text to obtain the part-of-speech-containing word frequency inverse word frequency feature vector corresponding to the alternative text. Finally, the vector cosine of the two feature vectors is calculated, so as to obtain the part-of-speech-containing word frequency inverse word frequency similarity of the target text and the alternative text.
Exemplarily, S340 includes: inputting each target part-of-speech tagging word and the target text into a word frequency inverse word frequency similarity model, and generating a word frequency inverse word frequency characteristic vector containing the part-of-speech corresponding to the target text; determining the part-of-speech-containing word frequency inverse word frequency similarity of the target text and the alternative text corresponding to the row vector according to the part-of-speech-containing word frequency inverse word frequency feature vector and the row vector in the part-of-speech-containing word frequency inverse word frequency feature matrix; and generating a word frequency-inverse word frequency feature matrix containing parts of speech according to the part of speech tagging word segmentation result of the text database and the word frequency-inverse word frequency similarity model.
When each text in the text database is an alternative text, in order to improve the operation efficiency, each alternative text in the text database may be preprocessed in advance by word segmentation, part-of-speech tagging and data cleaning to obtain the corresponding part-of-speech tagged word segmentation results. These results are then input into the TF-IDF model to obtain the part-of-speech-containing word frequency inverse word frequency feature matrix of the text database, where the number of columns equals the number of non-repeated part-of-speech tagged words in the text database, element positions without a corresponding part-of-speech tagged word are filled with 0, and each row vector in the matrix represents the feature vector of one alternative text. Finally, the vector cosines between the part-of-speech-containing word frequency inverse word frequency feature vector and each row vector of the feature matrix are respectively calculated, so as to obtain the part-of-speech-containing word frequency inverse word frequency similarity between the target text and each alternative text.
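Why the feature dimension grows once parts of speech are introduced can be seen from a small vocabulary comparison; the "word/pos" token form and the sample texts are assumptions for illustration:

```python
# The same toy texts, once as plain words and once as "word/pos" tokens.
plain_docs = [["book", "flight"], ["book", "review"]]
tagged_docs = [["book/v", "flight/n"], ["book/n", "review/n"]]

plain_vocab = sorted({w for d in plain_docs for w in d})
tagged_vocab = sorted({w for d in tagged_docs for w in d})

# "book" is one column without part of speech, but two columns
# ("book/n" and "book/v") once parts of speech are introduced,
# so the feature-matrix column count grows from 3 to 4.
```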
S350, determining the text similarity between the target text and the alternative text according to the preset word semantic weight, the preset word frequency inverse weight containing the part of speech, the word semantic similarity and the word frequency inverse similarity containing the part of speech.
The preset word frequency inverse word frequency weight and the word frequency inverse word frequency similarity in formula (1) are respectively taken as the preset part-of-speech-containing word frequency inverse word frequency weight $w_{22}$ and the part-of-speech-containing word frequency inverse word frequency similarity $score_{22}$; the text similarity between the target text and the alternative text can then be obtained according to the following formula:
$$score = w_1 \cdot score_1 + w_{22} \cdot score_{22} \qquad (8)$$
According to the technical scheme of this embodiment, the word sense similarity and the part-of-speech-containing word frequency inverse word frequency similarity of the target text and the alternative text are determined, and the text similarity between them is then determined according to the preset word sense weight, the preset part-of-speech-containing word frequency inverse word frequency weight, the word sense similarity and the part-of-speech-containing word frequency inverse word frequency similarity. The text similarity of the target text is thus determined from the two dimensions of word sense and the importance of part-of-speech-tagged words, which improves the determination accuracy of the text similarity to a certain extent.
Example four
In this embodiment, based on the first embodiment, further optimization is performed on "determining the word sense similarity and the word frequency inverse word frequency similarity of the target text and the candidate text", and optimization is performed on "determining the text similarity of the target text and the candidate text according to the preset word semantic weight, the preset word frequency inverse word frequency weight, the word sense similarity and the word frequency inverse word frequency similarity". Wherein explanations of the same or corresponding terms as those of the above embodiments are omitted. Referring to fig. 6a, the text similarity determining method provided in this embodiment includes:
and S410, acquiring a target text and an alternative text with similarity to be determined.
And S420, performing word segmentation and part-of-speech tagging on the target text to obtain target non-part-of-speech words and target part-of-speech tagged words corresponding to the target text.
Referring to fig. 6b, the target text is subjected to word segmentation processing to obtain each target non-part-of-speech word, and is subjected to word segmentation and part-of-speech tagging processing to obtain each target part-of-speech tagged word.
S430, determining word sense similarity of the target text and the alternative text based on a pre-trained word sense similarity model according to each target non-part-of-speech word and the alternative text.
The word sense feature vector of the target text can be obtained from each target non-part-of-speech word and the trained Word2vec model; computing the word sense similarity between this feature vector and each row vector of the word sense feature matrix corresponding to the text database yields the word sense similarity $score_1$ between the target text and an alternative text in the text database. There are n alternative texts in the text database, so n values of $score_1$ can be obtained.
S440, determining the word frequency inverse word frequency similarity of the target text and the alternative text based on the word frequency inverse word frequency similarity model according to each target word without word frequency, the target text and the alternative text.
The non-part-of-speech word frequency inverse word frequency feature vector of the target text can be obtained from each target non-part-of-speech word and the trained TF-IDF model; computing the non-part-of-speech word frequency inverse word frequency similarity between this feature vector and each row vector of the non-part-of-speech word frequency inverse word frequency feature matrix corresponding to the text database yields the non-part-of-speech word frequency inverse word frequency similarity $score_{21}$ between the target text and an alternative text in the text database. There are n alternative texts in the text database, so n values of $score_{21}$ can be obtained.
S450, determining the word frequency inverse word frequency similarity of the target text and the alternative text based on the word frequency inverse word frequency similarity model according to each target part of speech tagging word, the target text and the alternative text.
The part-of-speech-containing word frequency inverse word frequency feature vector of the target text can be obtained from each target part-of-speech tagged word and the trained TF-IDF model; computing the part-of-speech-containing word frequency inverse word frequency similarity between this feature vector and each row vector of the part-of-speech-containing word frequency inverse word frequency feature matrix corresponding to the text database yields the part-of-speech-containing word frequency inverse word frequency similarity $score_{22}$ between the target text and an alternative text in the text database. There are n alternative texts in the text database, so n values of $score_{22}$ can be obtained.
And S460, determining the text similarity between the target text and the alternative text according to the preset word sense weight, the preset word frequency inverse weight without word property, the preset word frequency inverse weight with word property, the word sense similarity, the word frequency inverse similarity without word property and the word frequency inverse similarity with word property.
The preset word frequency inverse word frequency weight in formula (1) is taken as the preset non-part-of-speech word frequency inverse word frequency weight $w_{21}$ and the preset part-of-speech-containing word frequency inverse word frequency weight $w_{22}$, and the word frequency inverse word frequency similarity is taken as the non-part-of-speech word frequency inverse word frequency similarity $score_{21}$ and the part-of-speech-containing word frequency inverse word frequency similarity $score_{22}$; the text similarity between the target text and the alternative text can then be obtained according to the following formula:
$$score = w_1 \cdot score_1 + w_{21} \cdot score_{21} + w_{22} \cdot score_{22} \qquad (9)$$
there are n candidate texts in the text database, so that n scores can be obtained through formula (9), and each score represents the text similarity between the target text and the corresponding candidate text.
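A minimal sketch of the weighted combination in formula (9); the weight values and similarity scores are assumed examples (in practice the weights would be preset per business scenario):

```python
def text_similarity(score1, score21, score22, w1=0.5, w21=0.25, w22=0.25):
    """Weighted sum of word sense, non-POS TF-IDF and POS TF-IDF similarities."""
    return w1 * score1 + w21 * score21 + w22 * score22

# One target text against three alternative texts (similarities are assumed values).
scores = [text_similarity(0.8, 0.6, 0.4),
          text_similarity(0.3, 0.2, 0.1),
          text_similarity(0.9, 0.9, 0.8)]
```

Each entry of `scores` is the text similarity between the target text and one alternative text, ready for the descending sort of the next step.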
And S470, performing descending order arrangement on the alternative texts according to the similarity of the texts, and generating an ordering result.
After determining the similarity of each text, a plurality of candidate texts with the similarity satisfying the service requirement can be determined from all the candidate texts. At this time, in order to further improve the service operation efficiency, all the candidate texts may be sorted in a descending order according to the text similarity, and a sorting result is generated.
And S480, extracting a preset number of alternative texts from the sequencing result to be used as similar texts of the target text.
And determining the number of the alternative texts to be selected, namely the preset number according to the service requirement. And then, extracting a preset number of alternative texts ranked at the top from the ranking result as similar texts which are similar to the target text.
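The descending sort of S470 and the top-k extraction of S480 can be sketched as follows; the text names, scores and preset number are assumed examples:

```python
# Alternative texts paired with their text similarity to the target text.
candidates = [("text_a", 0.65), ("text_b", 0.875), ("text_c", 0.225)]

# S470: descending sort by similarity; S480: take the preset number (here 2).
ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
similar_texts = [name for name, _ in ranked[:2]]
```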
And S490, when the service scene is an intelligent question-answering scene and the service requirement is to determine an alternative answer of the target text, extracting the answer corresponding to each similar text from the short text database to serve as the alternative answer of the target text.
In the service scenario of intelligent question answering, both questions and answers are usually short texts, so the target text is a target short text. If the service requirement is simply to determine similar short texts, the method may end at S480. If, however, the service requirement is to determine alternative answers for the target text, the alternative texts need to be short texts in a short text database. In that case, the answer corresponding to each similar text is extracted from the short text database to serve as an alternative answer to the target text, so that in intelligent customer service a more accurate answer is provided to the user faster and more conveniently, or the answers to similar questions are provided to a human customer service agent to assist in answering.
According to the technical scheme of this embodiment, the word sense similarity, the non-part-of-speech word frequency inverse word frequency similarity and the part-of-speech-containing word frequency inverse word frequency similarity of the target text and the alternative text are determined, and the text similarity between them is then determined according to the preset word sense weight, the preset non-part-of-speech word frequency inverse word frequency weight, the preset part-of-speech-containing word frequency inverse word frequency weight and the three similarities. The text similarity of the target text is thus determined from the three dimensions of word sense, non-part-of-speech word importance and part-of-speech-containing word importance, which improves the determination accuracy of the text similarity to a greater extent. By sorting the alternative texts according to text similarity and taking the answers of the top-ranked similar texts as alternative answers to the target text, the determination accuracy and efficiency of alternative answers in the intelligent question-answering system can also be improved.
EXAMPLE five
The present embodiment provides a text similarity determining apparatus, and referring to fig. 7, the apparatus specifically includes:
the target text determination module 710 is configured to obtain a target text and an alternative text with similarity to be determined;
a first similarity determining module 720, configured to determine word sense similarity and word frequency inverse word frequency similarity between the target text and the candidate text, where the word frequency inverse word frequency similarity includes word frequency inverse word frequency similarity without word property and/or word frequency inverse word frequency similarity with word property;
the second similarity determining module 730 is configured to determine the text similarity between the target text and the candidate text according to the preset word sense weight, the preset word frequency inverse word frequency weight, and the word sense similarity and the word frequency inverse word frequency similarity, where the preset word frequency inverse word frequency weight includes a preset word frequency inverse word frequency weight without word property and/or a preset word frequency inverse word frequency weight with word property.
Optionally, the first similarity determining module 720 is specifically configured to:
performing word segmentation on the target text to obtain target non-part-of-speech words corresponding to the target text;
determining word meaning similarity of the target text and the alternative text based on a pre-trained word meaning similarity model according to each target non-part-of-speech word and the alternative text;
determining the part-of-speech-free word frequency inverse word frequency similarity of the target text and the alternative text based on the word frequency inverse word frequency similarity model according to each target part-of-speech-free word, the target text and the alternative text;
accordingly, the second similarity determination module 730 is specifically configured to:
determining the text similarity between the target text and the alternative text according to the preset word sense weight, the preset part-of-speech-free word frequency inverse word frequency weight, the word sense similarity, and the part-of-speech-free word frequency inverse word frequency similarity.
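The part-of-speech-free branch above can be illustrated with a small sketch: TF-IDF ("word frequency inverse word frequency") vectors are built from the segmented words and compared by cosine similarity. The toy corpus, tokens, and smoothed IDF formula below are illustrative assumptions, not the embodiment's exact computation.

```python
import math
from collections import Counter

def tfidf_vector(tokens, corpus):
    """Build a TF-IDF weight per token; `corpus` is a list of token lists."""
    tf = Counter(tokens)
    n_docs = len(corpus)
    vec = {}
    for term, count in tf.items():
        df = sum(1 for doc in corpus if term in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF (an assumption)
        vec[term] = (count / len(tokens)) * idf
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse dict vectors."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Segmented (part-of-speech-free) words of the target and alternative texts.
target = ["how", "to", "return", "goods"]
candidate = ["how", "to", "return", "an", "order"]
corpus = [target, candidate, ["shipping", "time", "query"]]

sim = cosine(tfidf_vector(target, corpus), tfidf_vector(candidate, corpus))
```

Shared words such as "return" contribute positive weight, so `sim` lands strictly between 0 and 1 for partially overlapping texts.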
Optionally, the first similarity determining module 720 is specifically configured to:
performing word segmentation and part-of-speech tagging on a target text to obtain target non-part-of-speech words and target part-of-speech tagged words corresponding to the target text;
determining word meaning similarity of the target text and the alternative text based on a pre-trained word meaning similarity model according to each target non-part-of-speech word and the alternative text;
determining part-of-speech-containing word frequency inverse word frequency similarity between the target text and the alternative text based on the word frequency inverse word frequency similarity model according to each target part-of-speech tagging word, the target text and the alternative text;
accordingly, the second similarity determination module 730 is specifically configured to:
determining the text similarity between the target text and the alternative text according to the preset word sense weight, the preset part-of-speech-containing word frequency inverse word frequency weight, the word sense similarity, and the part-of-speech-containing word frequency inverse word frequency similarity.
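One way to read the part-of-speech-containing variant is to compute the same TF-IDF cosine similarity over (word, part-of-speech) pairs, so that the same surface word under different tags is weighted separately. This reading, and the toy tags below, are assumptions for illustration only.

```python
import math
from collections import Counter

def tfidf_cosine(a_tokens, b_tokens, corpus):
    """TF-IDF cosine similarity where tokens may be (word, pos) tuples."""
    def vec(tokens):
        tf = Counter(tokens)
        out = {}
        for term, count in tf.items():
            df = sum(1 for doc in corpus if term in doc)
            idf = math.log((1 + len(corpus)) / (1 + df)) + 1  # smoothed IDF
            out[term] = (count / len(tokens)) * idf
        return out
    u, v = vec(a_tokens), vec(b_tokens)
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Part-of-speech tagged words: (word, tag) pairs keep "return" as a verb
# distinct from "return" as a noun.
target = [("return", "v"), ("goods", "n")]
candidate = [("return", "v"), ("order", "n")]
corpus = [target, candidate, [("query", "v"), ("shipping", "n")]]

sim = tfidf_cosine(target, candidate, corpus)
```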
Optionally, the first similarity determining module 720 is specifically configured to:
performing word segmentation and part-of-speech tagging on a target text to obtain target non-part-of-speech words and target part-of-speech tagged words corresponding to the target text;
determining word meaning similarity of the target text and the alternative text based on a pre-trained word meaning similarity model according to each target non-part-of-speech word and the alternative text;
determining the part-of-speech-free word frequency inverse word frequency similarity of the target text and the alternative text based on the word frequency inverse word frequency similarity model according to each target part-of-speech-free word, the target text and the alternative text;
determining part-of-speech-containing word frequency inverse word frequency similarity between the target text and the alternative text based on the word frequency inverse word frequency similarity model according to each target part-of-speech tagging word, the target text and the alternative text;
accordingly, the second similarity determination module 730 is specifically configured to:
determining the text similarity between the target text and the alternative text according to the preset word sense weight, the preset part-of-speech-free word frequency inverse word frequency weight, the preset part-of-speech-containing word frequency inverse word frequency weight, the word sense similarity, the part-of-speech-free word frequency inverse word frequency similarity, and the part-of-speech-containing word frequency inverse word frequency similarity.
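The three-way weighted fusion can be sketched as a convex combination of the per-dimension similarities; the specific weight values below are illustrative assumptions, not values prescribed by the embodiment.

```python
def fuse_similarity(sense_sim, no_pos_sim, pos_sim,
                    w_sense=0.5, w_no_pos=0.25, w_pos=0.25):
    """Combine word sense, part-of-speech-free TF-IDF, and
    part-of-speech-containing TF-IDF similarities with preset weights."""
    assert abs(w_sense + w_no_pos + w_pos - 1.0) < 1e-9  # weights sum to 1
    return w_sense * sense_sim + w_no_pos * no_pos_sim + w_pos * pos_sim

# Example per-dimension similarities (illustrative values).
text_sim = fuse_similarity(0.82, 0.64, 0.70)
```

Dropping one weight to zero recovers the two-dimension variants described in the earlier optional configurations.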
Optionally, the Word frequency inverse Word frequency similarity model is a Word frequency inverse Word frequency model, and the Word sense similarity model is a Word2vec model;
the Word2vec model is pre-trained on the basis of a short text database and a long text database, wherein the long text database is pre-constructed on the basis of service data corresponding to a service scene, and the short text database is pre-constructed on the basis of service data corresponding to service requirements in the service scene.
Optionally, on the basis of the foregoing apparatus, the apparatus further includes a similar text determination module, configured to:
when there are a plurality of alternative texts, after the text similarity between the target text and each alternative text is determined, arranging the alternative texts in descending order according to the text similarities to generate a sorting result;
extracting a preset number of alternative texts from the sorting result as similar texts of the target text.
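The descending sort and top-k extraction performed by this module can be sketched as:

```python
def top_similar(candidates, k):
    """candidates: list of (alternative_text, text_similarity) pairs.
    Sort in descending order of similarity and keep the top k texts."""
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in ranked[:k]]

# Illustrative similarities for three alternative texts.
similar = top_similar([("how to return goods", 0.91),
                       ("shipping time query", 0.35),
                       ("how to return an order", 0.84)], k=2)
# similar == ["how to return goods", "how to return an order"]
```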
Further, on the basis of the above device, the device further includes an alternative answer determining module, configured to:
when the service scene is an intelligent question-answering scene and the service requirement is to determine alternative answers for the target text, the target text is a target short text and the alternative texts are short texts in a short text database; after a preset number of alternative texts are extracted from the sorting result as similar texts of the target text, the answer corresponding to each similar text is extracted from the short text database as an alternative answer for the target text.
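For the intelligent question-answering case, the answer lookup can be sketched against a hypothetical short text database mapping each short text to its answer; the dictionary layout below is an assumption, not the patented storage format.

```python
# Hypothetical short text database: short text -> stored answer.
short_text_db = {
    "how to return goods": "Open your order page and click Return.",
    "how to return an order": "Returns are accepted within 7 days.",
    "shipping time query": "Orders ship within 2 business days.",
}

def alternative_answers(similar_texts, db):
    """Extract the answer corresponding to each similar text."""
    return [db[text] for text in similar_texts if text in db]

answers = alternative_answers(["how to return goods",
                               "how to return an order"], short_text_db)
```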
With the text similarity determining device of the fifth embodiment of the present invention, the overall sentence-level semantics of a text can be captured by combining multiple dimensional features, namely word sense together with word frequency and/or part of speech, so that the text similarity is characterized along different dimensions, a comprehensive text similarity is obtained, and the accuracy of the text similarity is improved to a great extent.
The text similarity determining device provided by the embodiment of the invention can execute the text similarity determining method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the executing method.
It should be noted that, in the embodiment of the text similarity determining apparatus, the included units and modules are only divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.
EXAMPLE six
Referring to fig. 8, the present embodiment provides a device, which includes: one or more processors 820; and a storage device 810 configured to store one or more programs. When the one or more programs are executed by the one or more processors 820, the one or more processors 820 implement the text similarity determination method provided in the embodiments of the present invention, including:
acquiring a target text and an alternative text with similarity to be determined;
determining word sense similarity and word frequency inverse word frequency similarity of the target text and the alternative text, wherein the word frequency inverse word frequency similarity comprises word frequency inverse word frequency similarity without word property and/or word frequency inverse word frequency similarity with word property;
determining the text similarity between the target text and the alternative text according to the preset word semantic weight, the preset word frequency inverse word frequency weight, the word semantic similarity and the word frequency inverse word frequency similarity, wherein the preset word frequency inverse word frequency weight comprises a preset word frequency inverse word frequency weight without word property and/or a preset word frequency inverse word frequency weight with word property.
Of course, those skilled in the art will understand that the processor 820 may also implement the technical solution of the text similarity determination method provided in any embodiment of the present invention.
The device shown in fig. 8 is only an example and should not bring any limitation to the function and the scope of use of the embodiments of the present invention. As shown in fig. 8, the apparatus includes a processor 820, a storage device 810, an input device 830, and an output device 840; the number of the processors 820 in the device may be one or more, and one processor 820 is taken as an example in fig. 8; the processor 820, storage 810, input 830, and output 840 of the apparatus may be connected by a bus or other means, such as by bus 850 in fig. 8.
The storage device 810, which is a computer-readable storage medium, may be used to store software programs, computer-executable programs, and modules, such as program instructions/modules corresponding to the text similarity determination method in the embodiment of the present invention (for example, a target text acquisition module, a first similarity determination module, and a second similarity determination module in the text similarity determination device).
The storage device 810 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Further, the storage 810 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, storage 810 may further include memory located remotely from processor 820, which may be connected to devices over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 830 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function controls of the apparatus. The output device 840 may include a display device such as a display screen.
EXAMPLE seven
The present embodiments provide a storage medium containing computer-executable instructions which, when executed by a computer processor, are operable to perform a method of text similarity determination, the method comprising:
acquiring a target text and an alternative text with similarity to be determined;
determining word sense similarity and word frequency inverse word frequency similarity of the target text and the alternative text, wherein the word frequency inverse word frequency similarity comprises word frequency inverse word frequency similarity without word property and/or word frequency inverse word frequency similarity with word property;
determining the text similarity between the target text and the alternative text according to the preset word semantic weight, the preset word frequency inverse word frequency weight, the word semantic similarity and the word frequency inverse word frequency similarity, wherein the preset word frequency inverse word frequency weight comprises a preset word frequency inverse word frequency weight without word property and/or a preset word frequency inverse word frequency weight with word property.
Of course, the storage medium provided by the embodiment of the present invention contains computer-executable instructions, and the computer-executable instructions are not limited to the above method operations, and may also perform related operations in the text similarity determination method provided by any embodiment of the present invention.
From the above description of the embodiments, it is obvious for those skilled in the art that the present invention can be implemented by software and necessary general hardware, and certainly, can also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes several instructions to enable a device (which may be a personal computer, a server, or a network device) to execute the text similarity determining method provided in the embodiments of the present invention.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (10)

1. A text similarity determination method, comprising:
acquiring a target text and a candidate text whose similarity is to be determined;
determining a word sense similarity and a word frequency inverse word frequency similarity between the target text and the candidate text, wherein the word frequency inverse word frequency similarity comprises a part-of-speech-free word frequency inverse word frequency similarity and/or a part-of-speech-containing word frequency inverse word frequency similarity; and
determining a text similarity between the target text and the candidate text according to a preset word sense weight, a preset word frequency inverse word frequency weight, the word sense similarity, and the word frequency inverse word frequency similarity, wherein the preset word frequency inverse word frequency weight comprises a preset part-of-speech-free word frequency inverse word frequency weight and/or a preset part-of-speech-containing word frequency inverse word frequency weight.

2. The method according to claim 1, wherein determining the word sense similarity and the word frequency inverse word frequency similarity between the target text and the candidate text comprises:
performing word segmentation on the target text to obtain target part-of-speech-free words corresponding to the target text;
determining the word sense similarity between the target text and the candidate text based on a pre-trained word sense similarity model according to each target part-of-speech-free word and the candidate text; and
determining the part-of-speech-free word frequency inverse word frequency similarity between the target text and the candidate text based on a word frequency inverse word frequency similarity model according to each target part-of-speech-free word, the target text, and the candidate text;
correspondingly, determining the text similarity between the target text and the candidate text according to the preset word sense weight, the preset word frequency inverse word frequency weight, the word sense similarity, and the word frequency inverse word frequency similarity comprises:
determining the text similarity between the target text and the candidate text according to the preset word sense weight, the preset part-of-speech-free word frequency inverse word frequency weight, the word sense similarity, and the part-of-speech-free word frequency inverse word frequency similarity.

3. The method according to claim 1, wherein determining the word sense similarity and the word frequency inverse word frequency similarity between the target text and the candidate text comprises:
performing word segmentation and part-of-speech tagging on the target text to obtain target part-of-speech-free words and target part-of-speech tagged words corresponding to the target text;
determining the word sense similarity between the target text and the candidate text based on a pre-trained word sense similarity model according to each target part-of-speech-free word and the candidate text; and
determining the part-of-speech-containing word frequency inverse word frequency similarity between the target text and the candidate text based on a word frequency inverse word frequency similarity model according to each target part-of-speech tagged word, the target text, and the candidate text;
correspondingly, determining the text similarity between the target text and the candidate text according to the preset word sense weight, the preset word frequency inverse word frequency weight, the word sense similarity, and the word frequency inverse word frequency similarity comprises:
determining the text similarity between the target text and the candidate text according to the preset word sense weight, the preset part-of-speech-containing word frequency inverse word frequency weight, the word sense similarity, and the part-of-speech-containing word frequency inverse word frequency similarity.

4. The method according to claim 1, wherein determining the word sense similarity and the word frequency inverse word frequency similarity between the target text and the candidate text comprises:
performing word segmentation and part-of-speech tagging on the target text to obtain target part-of-speech-free words and target part-of-speech tagged words corresponding to the target text;
determining the word sense similarity between the target text and the candidate text based on a pre-trained word sense similarity model according to each target part-of-speech-free word and the candidate text;
determining the part-of-speech-free word frequency inverse word frequency similarity between the target text and the candidate text based on a word frequency inverse word frequency similarity model according to each target part-of-speech-free word, the target text, and the candidate text; and
determining the part-of-speech-containing word frequency inverse word frequency similarity between the target text and the candidate text based on the word frequency inverse word frequency similarity model according to each target part-of-speech tagged word, the target text, and the candidate text;
correspondingly, determining the text similarity between the target text and the candidate text according to the preset word sense weight, the preset word frequency inverse word frequency weight, the word sense similarity, and the word frequency inverse word frequency similarity comprises:
determining the text similarity between the target text and the candidate text according to the preset word sense weight, the preset part-of-speech-free word frequency inverse word frequency weight, the preset part-of-speech-containing word frequency inverse word frequency weight, the word sense similarity, the part-of-speech-free word frequency inverse word frequency similarity, and the part-of-speech-containing word frequency inverse word frequency similarity.

5. The method according to any one of claims 2-4, wherein the word frequency inverse word frequency similarity model is a word frequency inverse word frequency model, and the word sense similarity model is a Word2vec model;
the Word2vec model is pre-trained based on a short text database and a long text database, wherein the long text database is pre-constructed based on service data corresponding to a service scene, and the short text database is pre-constructed based on service data corresponding to a service requirement in the service scene.

6. The method according to claim 1, wherein when there are a plurality of candidate texts, after determining the text similarity between the target text and each candidate text, the method further comprises:
arranging the candidate texts in descending order according to the text similarities to generate a sorting result; and
extracting a preset number of candidate texts from the sorting result as similar texts of the target text.

7. The method according to claim 6, wherein when the service scene is an intelligent question-answering scene and the service requirement is to determine candidate answers for the target text, the target text is a target short text and the candidate texts are short texts in a short text database; after extracting the preset number of candidate texts from the sorting result as the similar texts of the target text, the method further comprises:
extracting an answer corresponding to each similar text from the short text database as a candidate answer for the target text.

8. A text similarity determination apparatus, comprising:
a target text acquisition module configured to acquire a target text and a candidate text whose similarity is to be determined;
a first similarity determination module configured to determine a word sense similarity and a word frequency inverse word frequency similarity between the target text and the candidate text, wherein the word frequency inverse word frequency similarity comprises a part-of-speech-free word frequency inverse word frequency similarity and/or a part-of-speech-containing word frequency inverse word frequency similarity; and
a second similarity determination module configured to determine a text similarity between the target text and the candidate text according to a preset word sense weight, a preset word frequency inverse word frequency weight, the word sense similarity, and the word frequency inverse word frequency similarity, wherein the preset word frequency inverse word frequency weight comprises a preset part-of-speech-free word frequency inverse word frequency weight and/or a preset part-of-speech-containing word frequency inverse word frequency weight.

9. A device, comprising:
one or more processors; and
a storage device configured to store one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the text similarity determination method according to any one of claims 1-7.

10. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the text similarity determination method according to any one of claims 1-7.
CN201910600981.6A 2019-07-04 2019-07-04 Text similarity determination method, device, equipment and storage medium Active CN112182145B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910600981.6A CN112182145B (en) 2019-07-04 2019-07-04 Text similarity determination method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910600981.6A CN112182145B (en) 2019-07-04 2019-07-04 Text similarity determination method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112182145A true CN112182145A (en) 2021-01-05
CN112182145B CN112182145B (en) 2025-01-17

Family

ID=73915404

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910600981.6A Active CN112182145B (en) 2019-07-04 2019-07-04 Text similarity determination method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112182145B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113505196A (en) * 2021-06-30 2021-10-15 和美(深圳)信息技术股份有限公司 Part-of-speech-based text retrieval method and device, electronic equipment and storage medium
CN113837594A (en) * 2021-09-18 2021-12-24 深圳壹账通智能科技有限公司 Quality evaluation method, system, device and medium for customer service in multiple scenes
CN114282967A (en) * 2021-12-21 2022-04-05 中国农业银行股份有限公司 Method and device for determining target product, electronic equipment and storage medium
CN115130454A (en) * 2022-07-29 2022-09-30 北京明略昭辉科技有限公司 Method, apparatus, electronic device and storage medium for calculating text similarity
CN115329742A (en) * 2022-10-13 2022-11-11 深圳市大数据研究院 Method and system for evaluation and acceptance of scientific research project output based on text analysis
CN116167352A (en) * 2023-04-03 2023-05-26 联仁健康医疗大数据科技股份有限公司 Data processing method, device, electronic equipment and storage medium
CN116228249A (en) * 2023-05-08 2023-06-06 陕西拓方信息技术有限公司 Customer service system based on information technology

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130138696A1 (en) * 2011-11-30 2013-05-30 The Institute for System Programming of the Russian Academy of Sciences Method to build a document semantic model
WO2017084267A1 (en) * 2015-11-18 2017-05-26 乐视控股(北京)有限公司 Method and device for keyphrase extraction
CN108595425A (en) * 2018-04-20 2018-09-28 昆明理工大学 Based on theme and semantic dialogue language material keyword abstraction method
CN109271626A (en) * 2018-08-31 2019-01-25 北京工业大学 Text semantic analysis method
CN109284490A (en) * 2018-09-13 2019-01-29 武汉斗鱼网络科技有限公司 A kind of Text similarity computing method, apparatus, electronic equipment and storage medium
CN109344236A (en) * 2018-09-07 2019-02-15 暨南大学 A problem similarity calculation method based on multiple features

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130138696A1 (en) * 2011-11-30 2013-05-30 The Institute for System Programming of the Russian Academy of Sciences Method to build a document semantic model
WO2017084267A1 (en) * 2015-11-18 2017-05-26 乐视控股(北京)有限公司 Method and device for keyphrase extraction
CN108595425A (en) * 2018-04-20 2018-09-28 昆明理工大学 Based on theme and semantic dialogue language material keyword abstraction method
CN109271626A (en) * 2018-08-31 2019-01-25 北京工业大学 Text semantic analysis method
CN109344236A (en) * 2018-09-07 2019-02-15 暨南大学 A problem similarity calculation method based on multiple features
CN109284490A (en) * 2018-09-13 2019-01-29 武汉斗鱼网络科技有限公司 A kind of Text similarity computing method, apparatus, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG, JUNFEI: "Improved TF-IDF Combined with the Cosine Theorem for Computing Chinese Sentence Similarity", Modern Computer (《现代计算机》), no. 32, 30 November 2017 (2017-11-30), pages 1 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113505196A (en) * 2021-06-30 2021-10-15 和美(深圳)信息技术股份有限公司 Part-of-speech-based text retrieval method and device, electronic equipment and storage medium
CN113505196B (en) * 2021-06-30 2024-01-30 和美(深圳)信息技术股份有限公司 Text retrieval method, device, electronic equipment and storage medium based on part of speech
CN113837594A (en) * 2021-09-18 2021-12-24 深圳壹账通智能科技有限公司 Quality evaluation method, system, device and medium for customer service in multiple scenes
CN114282967A (en) * 2021-12-21 2022-04-05 中国农业银行股份有限公司 Method and device for determining target product, electronic equipment and storage medium
CN115130454A (en) * 2022-07-29 2022-09-30 北京明略昭辉科技有限公司 Method, apparatus, electronic device and storage medium for calculating text similarity
CN115329742A (en) * 2022-10-13 2022-11-11 深圳市大数据研究院 Method and system for evaluation and acceptance of scientific research project output based on text analysis
CN116167352A (en) * 2023-04-03 2023-05-26 联仁健康医疗大数据科技股份有限公司 Data processing method, device, electronic equipment and storage medium
CN116228249A (en) * 2023-05-08 2023-06-06 陕西拓方信息技术有限公司 Customer service system based on information technology

Also Published As

Publication number Publication date
CN112182145B (en) 2025-01-17

Similar Documents

Publication Publication Date Title
CN109284357B (en) Man-machine conversation method, device, electronic equipment and computer readable medium
CN108363790B (en) Method, device, equipment and storage medium for evaluating comments
CN112182145B (en) Text similarity determination method, device, equipment and storage medium
CN110019732B (en) A kind of intelligent question answering method and related device
CN112667794A (en) Intelligent question-answer matching method and system based on twin network BERT model
CN102831184B (en) According to the method and system text description of social event being predicted to social affection
CN109960756B (en) News event information induction method
CN105095204B (en) The acquisition methods and device of synonym
CN107578292B (en) User portrait construction system
CN114238573A (en) Information pushing method and device based on text countermeasure sample
CN108038725A (en) A kind of electric business Customer Satisfaction for Product analysis method based on machine learning
CN112559684A (en) Keyword extraction and information retrieval method
Yang et al. A decision method for online purchases considering dynamic information preference based on sentiment orientation classification and discrete DIFWA operators
US20200192921A1 (en) Suggesting text in an electronic document
CN111260437A (en) A product recommendation method based on commodity aspect-level sentiment mining and fuzzy decision-making
CN112966091A (en) Knowledge graph recommendation system fusing entity information and heat
CN101833560A (en) Internet-based automatic ranking system for manufacturers' word-of-mouth
CN111353044B (en) Comment-based emotion analysis method and system
Mozafari et al. Emotion detection by using similarity techniques
CN113761125B (en) Dynamic summary determination method and device, computing device and computer storage medium
CN114579705B (en) Learning auxiliary method and system for sustainable development education
CN114239828A (en) Supply chain affair map construction method based on causal relationship
CN111061939A (en) Scientific research academic news keyword matching recommendation method based on deep learning
US20140365494A1 (en) Search term clustering
CN112989208A (en) Information recommendation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant