
CN112182145A - Text similarity determination method, device, equipment and storage medium - Google Patents

Text similarity determination method, device, equipment and storage medium

Info

Publication number
CN112182145A
CN112182145A
Authority
CN
China
Prior art keywords
word frequency
text
similarity
word
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910600981.6A
Other languages
Chinese (zh)
Other versions
CN112182145B (en)
Inventor
王艳花
邱龙泉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Jingdong Shangke Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN201910600981.6A
Publication of CN112182145A
Application granted
Publication of CN112182145B
Active legal status
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval of unstructured textual data
    • G06F 16/33 - Querying
    • G06F 16/3331 - Query processing
    • G06F 16/334 - Query execution
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval of unstructured textual data
    • G06F 16/33 - Querying
    • G06F 16/3331 - Query processing
    • G06F 16/334 - Query execution
    • G06F 16/3344 - Query execution using natural language analysis
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/22 - Matching criteria, e.g. proximity measures
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract



Embodiments of the present invention disclose a text similarity determination method, apparatus, device and storage medium. The method includes: acquiring a target text and a candidate text whose similarity is to be determined; determining the word sense similarity and the word frequency inverse word frequency similarity between the target text and the candidate text, wherein the word frequency inverse word frequency similarity includes a part-of-speech-free word frequency inverse word frequency similarity and/or a part-of-speech word frequency inverse word frequency similarity; and determining the text similarity between the target text and the candidate text according to a preset word sense weight, a preset word frequency inverse word frequency weight, the word sense similarity and the word frequency inverse word frequency similarity, wherein the preset word frequency inverse word frequency weight includes a preset part-of-speech-free word frequency inverse word frequency weight and/or a preset part-of-speech word frequency inverse word frequency weight. Through the above technical solution, text similarity is determined more accurately.


Description

Text similarity determination method, device, equipment and storage medium
Technical Field
The embodiments of the present invention relate to big data mining technology, and in particular to a text similarity determination method, device, equipment and storage medium.
Background
With the development of the internet, more and more data appear on the internet in the form of text, such as microblog messages, news headlines, forum posts, commodity comments on e-commerce platforms, commodity questions, and buyers' answers to those questions. Applying machine learning technology to this internet text data to mine valuable information from it, so as to bring convenience to people's lives and meet demands of different kinds, has become a very hot topic in current big data application technology.
Take the questions and answers for goods on an e-commerce platform as an example. When making a purchase decision, a shopper will typically ask a question of people who have purchased the item, browse the question and answer data under the item, or ask customer service, in order to fully understand the actual information about the item. From the user's perspective, the user needs to browse historical question and answer data based on the question they want to ask, check whether others have asked similar questions, and may struggle to find a satisfactory answer when the number of questions is large. From the customer service perspective, similarity calculation needs to be performed between the question posed by the user and the existing questions in the question bank, and the several most similar questions found, so that the question asked by the user can be answered with the help of the answers to those similar questions. Thus, it is necessary to calculate the similarity between the user's question and the existing questions in the question bank.
At present, text similarity determination schemes for such problems mainly adopt a vector space model: each word in a text is mapped into a vector space and the cosine distance between vectors is calculated; the smaller the distance, the greater the similarity of the words. There are two main similarity determination schemes based on the vector space model. One is a similarity determination scheme based on word importance, such as the Term Frequency-Inverse Document Frequency (TF-IDF) model. TF-IDF is used to evaluate the importance of a word to a document in a text corpus: the importance of the word increases in direct proportion to the number of its occurrences in the document, but decreases in inverse proportion to its frequency of occurrence in the corpus. The flow of the TF-IDF based similarity determination (i.e., question matching) scheme is shown in fig. 1. The typical application of this scheme is in a search engine: the user inputs a search query, the query is segmented into words, a mapping relation between the segmented words and documents is established, and after all matching documents are found, the similarity scores and rankings of the user's query against the existing documents in the knowledge base are calculated according to a similarity algorithm (i.e., the TF-IDF algorithm), and the results are returned according to the document ranking. The other is a similarity determination scheme based on Natural Language Processing (NLP), for which the common model is the Word2vec model. The Word2vec model is trained on the basis of Harris's distributional hypothesis (i.e., words with similar contexts have similar semantics); the neural network model obtained by training can map each word into a K-dimensional space to represent the word by a vector, and the semantic similarity of words is then measured by the similarity between vectors in that vector space. The flow of the NLP-based semantic similarity scheme is shown in fig. 2.
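As a concrete illustration of the TF-IDF flow just described, the following sketch computes TF-IDF vectors for a toy question corpus and compares them by cosine similarity. The corpus, the tokenization, and the smoothed IDF formula are illustrative assumptions, not the patent's implementation.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute a sparse TF-IDF vector (dict) per tokenized document."""
    df = Counter()                      # document frequency of each word
    for doc in docs:
        df.update(set(doc))
    n = len(docs)
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        # term frequency times a smoothed inverse document frequency;
        # the exact IDF formula is an illustrative choice, not the patent's
        vecs.append({w: (c / len(doc)) * math.log((1 + n) / (1 + df[w]))
                     for w, c in tf.items()})
    return vecs

def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Toy "question bank": the query (docs[0]) should match docs[1], not docs[2]
docs = [["does", "this", "phone", "support", "fast", "charging"],
        ["does", "the", "phone", "charge", "fast"],
        ["what", "color", "is", "the", "case"]]
vecs = tfidf_vectors(docs)
```

Ranking candidates by `cosine(vecs[0], vecs[i])` reproduces the retrieval step of the fig. 1 flow; note that no word position or semantic information enters the score.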
In the process of implementing the invention, the inventors found that the prior art has at least the following problems: 1) the TF-IDF based similarity determination scheme only uses word frequency and inverse word frequency to measure the importance of a word to a document; it considers neither the position information nor the semantic information of the word, so the accuracy of the determined inter-text similarity is low and cannot meet the practical requirement of finding the most similar text in a question answering system. 2) The NLP-based similarity determination scheme works well only for similarity calculation at the word level; when it is extended to the sentence level, due to the complexity of syntactic structure, simply accumulating or concatenating the word vectors of all words in a sentence to represent a sentence vector cannot achieve an ideal effect in practical applications.
Disclosure of Invention
The embodiment of the invention provides a text similarity determination method, a text similarity determination device, text similarity determination equipment and a storage medium, so that the text similarity can be determined more accurately.
In a first aspect, an embodiment of the present invention provides a method for determining text similarity, including:
acquiring a target text and an alternative text with similarity to be determined;
determining a word sense similarity and a word frequency inverse word frequency similarity of the target text and the alternative text, wherein the word frequency inverse word frequency similarity comprises a part-of-speech-free word frequency inverse word frequency similarity and/or a part-of-speech word frequency inverse word frequency similarity;
determining the text similarity between the target text and the alternative text according to a preset word sense weight, a preset word frequency inverse word frequency weight, the word sense similarity and the word frequency inverse word frequency similarity, wherein the preset word frequency inverse word frequency weight comprises a preset part-of-speech-free word frequency inverse word frequency weight and/or a preset part-of-speech word frequency inverse word frequency weight.
In a second aspect, an embodiment of the present invention further provides a text similarity determining apparatus, where the apparatus includes:
the target text acquisition module is used for acquiring a target text and an alternative text with similarity to be determined;
the first similarity determining module is used for determining a word sense similarity and a word frequency inverse word frequency similarity of the target text and the alternative text, wherein the word frequency inverse word frequency similarity comprises a part-of-speech-free word frequency inverse word frequency similarity and/or a part-of-speech word frequency inverse word frequency similarity;
and the second similarity determining module is used for determining the text similarity between the target text and the alternative text according to a preset word sense weight, a preset word frequency inverse word frequency weight, the word sense similarity and the word frequency inverse word frequency similarity, wherein the preset word frequency inverse word frequency weight comprises a preset part-of-speech-free word frequency inverse word frequency weight and/or a preset part-of-speech word frequency inverse word frequency weight.
In a third aspect, an embodiment of the present invention further provides an apparatus, where the apparatus includes:
one or more processors;
a storage device for storing one or more programs,
when the one or more programs are executed by the one or more processors, the one or more processors implement the text similarity determination method provided in any embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the text similarity determining method provided in any embodiment of the present invention.
With the above technical solutions, the word sense similarity between the target text and the alternative text is generated, together with the word frequency inverse word frequency similarity, which includes the part-of-speech-free word frequency inverse word frequency similarity and/or the part-of-speech word frequency inverse word frequency similarity; the text similarity between the target text and the alternative text is then determined according to the preset word sense weight, the preset word frequency inverse word frequency weight (which includes the preset part-of-speech-free word frequency inverse word frequency weight and/or the preset part-of-speech word frequency inverse word frequency weight), the word sense similarity and the word frequency inverse word frequency similarity. The advantage of this arrangement is that the whole-sentence semantics of the text can be grasped by using word senses together with the multiple dimensional characteristics of word frequency and/or part of speech, so that the text similarity is represented from different dimensions, a comprehensive similarity of the text is obtained, and the accuracy of the text similarity is improved to a great extent.
Drawings
FIG. 1 is a flow chart of a similarity determination method based on a word-frequency inverse word-frequency model in the prior art;
FIG. 2 is a flow chart of a similarity determination method based on a semantic similarity model of NLP in the prior art;
fig. 3a is a flowchart of a text similarity determining method according to a first embodiment of the present invention;
fig. 3b is a logic framework diagram of a text similarity determination method in the first embodiment of the present invention;
fig. 4a is a flowchart of a text similarity determining method in the second embodiment of the present invention;
FIG. 4b is a schematic diagram of the structure of the CBOW model according to the second embodiment of the present invention;
FIG. 4c is a schematic diagram of the Skip-Gram model structure in the second embodiment of the present invention;
fig. 5 is a flowchart of a text similarity determination method in the third embodiment of the present invention;
fig. 6a is a flowchart of a text similarity determining method in the fourth embodiment of the present invention;
fig. 6b is a logic framework diagram of a text similarity determination method in the fourth embodiment of the present invention;
fig. 7 is a schematic structural diagram of a text similarity determination apparatus according to a fifth embodiment of the present invention;
fig. 8 is a schematic structural diagram of an apparatus in the sixth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Example one
The text similarity determination method provided by this embodiment is applicable to text similarity calculation over topics, messages and replies, consultations, suggestions and opinion feedback in internet forums, as well as over online intelligent question answering, instant chat records, and the like. The method may be performed by a text similarity determination apparatus, which may be implemented in software and/or hardware and may be integrated in a device with large-scale data computation capability, such as a personal computer or a server. In this embodiment of the present invention, intelligent question answering is taken as an example for explanation. Referring to fig. 3a, the method of this embodiment specifically includes the following steps:
and S110, acquiring a target text and an alternative text with the similarity to be determined.
The target text is a text for which similarity needs to be calculated, and may be an existing text or a new text obtained from the outside. The alternative text is text for calculating the similarity with the target text. The text may be short text or long text. Short text refers to short length text such as a sentence or a few short sentences (small paragraphs).
Specifically, content input by the user can be received as the target text. Meanwhile, one or more alternative texts are obtained from the existing texts that can be collected. It should be understood that a text database may be constructed in advance from those collectable existing texts, with the alternative texts then obtained from the text database.
And S120, determining the word sense similarity and the word frequency inverse word frequency similarity of the target text and the alternative text.
The word sense similarity is a similarity determined from the semantic dimension of words. The word frequency inverse word frequency similarity is a similarity determined from the importance dimension of words. Word frequency measures whether two sentences contain the same words with similar frequencies; if so, the sentences are considered similar. Inverse word frequency measures the importance of each word to a sentence, based on the intuition that the more a word appears in the sentence and the less it appears in other sentences, the more important the word is to that sentence. Word frequency and inverse word frequency jointly characterize each word in a sentence, and the similarity of sentences is calculated accordingly.
Illustratively, the word frequency inverse word frequency similarity includes a part-of-speech-free word frequency inverse word frequency similarity and/or a part-of-speech word frequency inverse word frequency similarity. The part-of-speech-free word frequency inverse word frequency similarity is a word frequency inverse word frequency similarity that does not distinguish the part of speech of each word in the target text, and the part-of-speech word frequency inverse word frequency similarity is one that does distinguish the part of speech of each word in the target text. Because the part of speech can reflect the semantics of a word to a certain extent, the part-of-speech word frequency inverse word frequency similarity can be understood as a similarity determined from two dimensions, namely word semantics and word importance.
In the related art, similarity can be determined from only one dimension of a word, and when calculating the similarity of texts with more words and richer sentence semantics, such a method cannot determine text similarity accurately. Therefore, in the embodiment of the present invention, similarities of at least two dimensions are used to determine the similarity of texts: specifically, the word sense dimension and the word importance dimension, where the word importance dimension can be divided into the part-of-speech-free word importance dimension and the part-of-speech word importance dimension. Although the text similarity measurement is still performed at word granularity, fusing the similarities of multiple dimensions represents the similarity of texts more completely.
In a specific implementation, the target text and the alternative text can be used as the input of a word sense similarity model, and the word sense similarity between the target text and the alternative text is obtained through that model's similarity calculation. Similarly, the target text and the alternative text can be used as the input of a word frequency inverse word frequency similarity model, which yields the word frequency inverse word frequency similarity between the target text and the alternative text. The word sense similarity model may be an NLP-based similarity model, for example a word vector learning model such as Word2vec, GloVe or BERT. Since both the part-of-speech-free and the part-of-speech word frequency inverse word frequency similarities may need to be calculated, the word frequency inverse word frequency similarity model can be divided into a part-of-speech-free model and a part-of-speech model. The two may be the same model (such as a TF-IDF model) differing only in input data, with one input not distinguishing parts of speech among the word groups and the other distinguishing them; alternatively, two different models may be adopted, in which case the part-of-speech model may need to perform specific processing for parts of speech.
In the above process, similarity calculation is performed for a single candidate text; if there are multiple candidate texts, the above process must be looped once per candidate text to complete the similarity calculation between all candidate texts and the target text. To increase the calculation speed, and thus the efficiency of determining which of multiple candidate texts are most similar to the target text, all candidate texts may be calculated in parallel, for example, by representing each candidate text in advance as vectors of different dimensions and then representing all candidate texts in different matrix forms. In a specific implementation (see fig. 3b), all candidate texts are characterized in advance as word sense feature matrices and word frequency inverse word frequency feature matrices (including part-of-speech-free word frequency inverse word frequency feature matrices and/or part-of-speech word frequency inverse word frequency feature matrices); this is called stock calculation. After the target text is determined, the target text is characterized as a word sense feature vector and word frequency inverse word frequency feature vectors (including a part-of-speech-free word frequency inverse word frequency feature vector and/or a part-of-speech word frequency inverse word frequency feature vector) using the same models as the stock calculation; this process is called real-time incremental calculation.
Finally, word sense similarity calculation is performed using the word sense feature vector and the word sense feature matrix, and word frequency inverse word frequency similarity calculation is performed using the word frequency inverse word frequency feature vector and the word frequency inverse word frequency feature matrix, so as to determine the word sense similarity and the word frequency inverse word frequency similarity between the target text and each candidate text.
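The stock/incremental split described above can be sketched as follows. The 3-dimensional feature vectors are illustrative stand-ins for whatever featurization (word sense or word frequency inverse word frequency) is used; rows of the candidate matrix are normalized offline so that scoring a new target text is a single pass of dot products.

```python
import math

def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else v

# Stock calculation: candidate texts featurized offline, one row per candidate.
# The 3-dimensional vectors here are illustrative placeholders.
candidate_matrix = [normalize(v) for v in [[1.0, 2.0, 0.0],
                                           [0.0, 1.0, 1.0],
                                           [2.0, 0.0, 1.0]]]

# Real-time incremental calculation: only the target text is featurized at
# query time, then scored against every candidate row in one pass.
target = normalize([1.0, 1.0, 0.0])
scores = [sum(t * c for t, c in zip(target, row)) for row in candidate_matrix]
best = max(range(len(scores)), key=scores.__getitem__)  # index of most similar
```

With a numerical library the per-row loop becomes one matrix-vector product, which is what makes evaluating all candidates in parallel cheap.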
It should be noted that the part-of-speech-free word frequency inverse word frequency similarity and the part-of-speech word frequency inverse word frequency similarity need not both be calculated; only one of them may be selected. That is, one or both of the part-of-speech-free and the part-of-speech word frequency inverse word frequency similarity models may be used in this operation. Accordingly, the similarities determined in this operation may be: the word sense similarity and the part-of-speech-free word frequency inverse word frequency similarity; the word sense similarity and the part-of-speech word frequency inverse word frequency similarity; or the word sense similarity, the part-of-speech-free word frequency inverse word frequency similarity and the part-of-speech word frequency inverse word frequency similarity.
S130, determining the text similarity between the target text and the alternative text according to the preset word semantic weight, the preset word frequency inverse word frequency weight, the word semantic similarity and the word frequency inverse word frequency similarity.
The preset word semantic weight and the preset word frequency inverse word frequency weight are preset weight values and are respectively used for determining the proportion of the word semantic similarity and the word frequency inverse word frequency similarity in the similarity fusion process, and the preset word semantic weight and the preset word frequency inverse word frequency weight can be set manually in advance. The text similarity refers to the similarity obtained after the similarities of different dimensions are fused, and can represent the comprehensive similarity of the target text and the alternative text.
Illustratively, the preset word frequency inverse word frequency weight includes a preset part-of-speech-free word frequency inverse word frequency weight and/or a preset part-of-speech word frequency inverse word frequency weight. It should be noted that the preset word frequency inverse word frequency weight corresponds to the word frequency inverse word frequency similarity. If the word frequency inverse word frequency similarity is the part-of-speech-free one, the preset weight is the preset part-of-speech-free word frequency inverse word frequency weight; if it is the part-of-speech one, the preset weight is the preset part-of-speech word frequency inverse word frequency weight; and if both the part-of-speech-free and the part-of-speech similarities are used, both preset weights are used.
After obtaining the similarity between the target text and the candidate text in different dimensions, the similarity of each dimension needs to be fused so as to obtain the multi-dimensional text similarity. In specific implementation, according to the weight of the similarity under each dimension, the obtained similarities of different dimensions are subjected to weighted summation to determine the text similarity, and a weighted summation formula for determining the text similarity is as shown in formula (1):
score = w1·score1 + w2·score2 (1)
where score represents the text similarity between the target text and the alternative text, w1 and w2 respectively represent the preset word sense weight and the preset word frequency inverse word frequency weight, and score1 and score2 respectively represent the word sense similarity and the word frequency inverse word frequency similarity.
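Formula (1) is a plain weighted sum, with one more term when both word frequency inverse word frequency variants are used. A minimal sketch (the weights and similarity scores below are illustrative values, not recommendations):

```python
def fuse_similarities(scores, weights):
    """Weighted sum of per-dimension similarities, as in formula (1)."""
    if len(scores) != len(weights):
        raise ValueError("need one weight per similarity dimension")
    return sum(w * s for w, s in zip(weights, scores))

# word sense similarity 0.8, word frequency inverse word frequency similarity 0.6
text_similarity = fuse_similarities([0.8, 0.6], [0.5, 0.5])
```

When both part-of-speech-free and part-of-speech similarities are computed, the same call simply takes three scores and three weights.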
According to the technical scheme of this embodiment, the word sense similarity between the target text and the alternative text is generated, together with the word frequency inverse word frequency similarity, which includes the part-of-speech-free word frequency inverse word frequency similarity and/or the part-of-speech word frequency inverse word frequency similarity; the text similarity between the target text and the alternative text is then determined according to the preset word sense weight, the preset word frequency inverse word frequency weight (which includes the preset part-of-speech-free word frequency inverse word frequency weight and/or the preset part-of-speech word frequency inverse word frequency weight), the word sense similarity and the word frequency inverse word frequency similarity. The advantage of this arrangement is that the whole-sentence semantics of the text can be grasped by using word senses together with the multiple dimensional characteristics of word frequency and/or part of speech, so that the text similarity is represented from different dimensions, a comprehensive similarity of the text is obtained, and the accuracy of the text similarity is improved to a great extent.
Example two
In this embodiment, based on the first embodiment, further optimization is performed on "determining the word sense similarity and the word frequency inverse word frequency similarity of the target text and the candidate text", and optimization is performed on "determining the text similarity of the target text and the candidate text according to the preset word semantic weight, the preset word frequency inverse word frequency weight, the word sense similarity and the word frequency inverse word frequency similarity". Wherein explanations of the same or corresponding terms as those of the above embodiments are omitted. Referring to fig. 4a, the text similarity determining method provided in this embodiment includes:
and S210, acquiring a target text and an alternative text with similarity to be determined.
S220, performing word segmentation on the target text to obtain each target non-part-of-speech word corresponding to the target text.
Since the target text contains at least one sentence and the similarity is determined on the granularity of words, the word segmentation processing needs to be performed on the target text first so as to split the target text into a plurality of words. Because each word after word segmentation has no part of speech, each word obtained by word segmentation is each target part of speech-free word corresponding to the target text.
And S230, determining word sense similarity of the target text and the alternative text based on a pre-trained word sense similarity model according to each target non-part-of-speech word and the alternative text.
According to the description of the above embodiment, the word sense similarity model may be a Word2vec model. The model uses a shallow two-layer neural network either to predict a target word from its context as input (corresponding to the CBOW model structure) or to predict the context of a word from the target word as input (corresponding to the Skip-Gram model structure), so as to train on text and learn the hidden-layer parameters of the network, thereby obtaining the trained Word2vec model. The trained Word2vec model can map each word to a vector space and characterize words as corresponding feature vectors.
Referring to FIG. 4b, the CBOW model has three layers: an input layer, a hidden layer and an output layer. It predicts $P(w_t \mid w_{t-k}, w_{t-(k-1)}, \ldots, w_{t-1}, w_{t+1}, w_{t+2}, \ldots, w_{t+k})$, where $w_t$ is the target word to be predicted and $w_{t-k}, w_{t-(k-1)}, \ldots, w_{t-1}, w_{t+1}, w_{t+2}, \ldots, w_{t+k}$ (with $k = 2$) is the context of the target word, i.e. the two words before and the two words after the target word are selected as the context. The operation from the input layer to the hidden layer is the addition of the context vectors, and from the hidden layer to the output layer hierarchical Softmax or negative sampling is adopted. Referring to FIG. 4c, the Skip-Gram model also has three layers (an input layer, a hidden layer and an output layer), but in contrast to the CBOW model, the Skip-Gram model predicts $P(w_i \mid w_t)$, where $t-c \le i \le t+c$ and $i \ne t$, and $c$ is the window size (a constant representing the context size). Given a word sequence $w_1, w_2, w_3, \ldots, w_T$, the objective of Skip-Gram is to maximize:
$$\frac{1}{T}\sum_{t=1}^{T}\ \sum_{-c \le j \le c,\ j \ne 0} \log p\left(w_{t+j} \mid w_t\right) \qquad (2)$$

where the conditional probability $p(w_{t+j} \mid w_t)$ is given by the softmax over the vocabulary:

$$p\left(w_O \mid w_I\right) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{W} \exp\left({v'_{w}}^{\top} v_{w_I}\right)} \qquad (3)$$

where $v_w$ and $v'_w$ are the input and output vector representations of the word $w$, and $W$ is the vocabulary size.
After the Word2vec model structure is determined, the model needs to be trained; the model training process can refer to the stock (full) calculation flow of fig. 3b.
Illustratively, the Word2vec model is trained in advance based on a short text database and a long text database, wherein the long text database is constructed in advance based on service data corresponding to a service scene, and the short text database is constructed in advance based on service data corresponding to a service requirement under the service scene.
The short text database is a database composed of a large number of short texts, and the collection source of the short text data is related to the specific service requirement. For example, if the service requirement is only to calculate similarity, the data source of the short text database can be short texts from any network platform, and the more sources the better. If the service requirement is to determine reference answers according to similar questions, then because the emphasis of answers differs between intelligent question-answering systems, the data source of the short text database can only be the question texts in the intelligent question-answering system corresponding to the target text. The long text database is composed of texts with more sentences, such as articles or product specifications, and the collection source of its data is likewise related to the specific service requirement. For example, if the service scenario is intelligent question answering, the long text database can be constructed by collecting long texts with stronger sentence logic, such as recommendation articles, product introductions or product specifications, from any intelligent question-answering system.
Since the target text may be a short text or a long text, in order to enhance the text compatibility of the Word2vec model, this embodiment uses the short text database and the long text database simultaneously for model training; and in order to improve the semantic expressiveness of the Word2vec model, the more complete the word coverage in the long text database, the better the training effect. In specific implementation, the training data is acquired first, i.e. the long text database and the short text database are obtained according to the service scenario and the service requirement. The data is then preprocessed by word segmentation and data cleaning, which removes stop words and punctuation marks and retains only valid data. Finally, the preprocessed long text database and short text database are input into the Word2vec model for model training to obtain the trained Word2vec model.
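The preprocessing step described above can be sketched minimally as follows; the stop-word set, punctuation set and sample texts are illustrative assumptions, not part of the original scheme:

```python
# Assumed stop-word and punctuation sets for illustration only.
STOP_WORDS = {"the", "a", "is", "of"}
PUNCTUATION = {",", ".", "?", "!"}

def clean(tokens):
    """Keep only tokens that are neither stop words nor punctuation."""
    return [t for t in tokens if t not in STOP_WORDS and t not in PUNCTUATION]

# One segmented short text and one segmented long text (toy examples).
short_corpus = [["what", "is", "the", "delivery", "time", "?"]]
long_corpus = [["the", "product", "ships", "within", "two", "days", "."]]

# Both databases are cleaned and merged into one training corpus for Word2vec.
training_data = [clean(doc) for doc in short_corpus + long_corpus]
```

The merged, cleaned token lists would then be fed to whatever Word2vec implementation is used for training.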
In the incremental calculation part, each target non-part-of-speech word is input into the trained word sense similarity model (the Word2vec model) to obtain the row vector representation of each target non-part-of-speech word. Then, the mean of the corresponding columns of these row vectors is computed, and the resulting row vector of column means serves as the word sense feature vector of the target text. Likewise, the word sense feature vector corresponding to the alternative text can be obtained. Finally, the vector cosine of the word sense feature vector of the target text and the word sense feature vector corresponding to the alternative text is calculated, so as to obtain the word sense similarity of the target text and the alternative text.
Exemplarily, S230 includes: inputting each target non-part-of-speech word into the word sense similarity model to generate a word sense feature vector corresponding to the target text; and determining the word sense similarity between the target text and the alternative text corresponding to each row vector according to the word sense feature vector and the row vectors in a word sense feature matrix, where the word sense feature matrix is generated according to the non-part-of-speech word segmentation results of the text database and the word sense similarity model.
When each text in the text database is an alternative text, in order to improve the operation efficiency, each alternative text in the text database may be preprocessed in advance by word segmentation and data cleaning to obtain the corresponding non-part-of-speech word segmentation results. These results are then input into the Word2vec model to obtain the word sense feature matrix of the text database, where each row vector represents the feature vector of one alternative text. Finally, the vector cosines between the word sense feature vector and each row vector of the word sense feature matrix are respectively calculated to obtain the word sense similarity between the target text and each alternative text in the text database.
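The column-mean pooling and vector-cosine computation described above can be sketched as follows; the toy 3-dimensional vectors stand in for real Word2vec output and are assumed values:

```python
import math

def sentence_vector(word_vectors):
    """Column-wise mean of the word row vectors -> one sentence feature vector."""
    dim = len(word_vectors[0])
    return [sum(v[i] for v in word_vectors) / len(word_vectors) for i in range(dim)]

def cosine(a, b):
    """Vector cosine between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Toy 3-dimensional row vectors standing in for Word2vec output (assumed values).
target_word_vecs = [[1.0, 0.0, 1.0], [0.0, 2.0, 1.0]]   # two target words
candidate_vec = [1.0, 0.0, 0.0]                          # one alternative text's vector

sim = cosine(sentence_vector(target_word_vecs), candidate_vec)
```

In the full scheme this cosine would be computed against every row of the word sense feature matrix, yielding one similarity per alternative text.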
S240, determining the word frequency inverse word frequency similarity of the target text and the alternative text based on the word frequency inverse word frequency similarity model according to each target word without word frequency, the target text and the alternative text.
The word frequency inverse word frequency similarity model in this embodiment adopts the word frequency inverse word frequency (TF-IDF) model, which is constructed from the term frequency (TF) value and the inverse document frequency (IDF) value of a word. The Term Frequency (TF) is the frequency of occurrence of a given word in a text; the TF value normalizes the raw occurrence count to prevent a bias toward longer texts. The Inverse Document Frequency (IDF) is a measure of the general importance of a word. The specific calculation formulas are as follows:
$$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}} \qquad (4)$$

$$\mathrm{idf}_i = \log \frac{|D|}{\left|\{\, j : t_i \in d_j \,\}\right|} \qquad (5)$$

$$\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_i \qquad (6)$$
where $n_{i,j}$ denotes the number of occurrences of the word $t_i$ in the text $d_j$, $\sum_k n_{k,j}$ denotes the sum of the occurrence counts of all words in the text $d_j$, $|D|$ is the total number of texts in the text database, and $|\{\, j : t_i \in d_j \,\}|$ denotes the number of texts in the text database that contain the word $t_i$. To ensure that the denominator is not 0, $|\{\, j : t_i \in d_j \,\}| + 1$ is typically used.
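A from-scratch sketch of the TF, IDF and TF-IDF calculations above, including the $+1$ denominator smoothing; the toy word-segmented documents are assumptions for illustration:

```python
import math

def tf(term, doc):
    """n_ij / sum_k n_kj: occurrence count normalized by document length."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """log(|D| / (|{j : t_i in d_j}| + 1)); the +1 keeps the denominator non-zero."""
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / (containing + 1))

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# Toy text database: three word-segmented documents (assumed contents).
docs = [["price", "discount", "price"],
        ["shipping", "time"],
        ["price", "shipping"]]

score = tf_idf("discount", docs[0], docs)   # tf = 1/3, idf = log(3/2)
```

A word appearing in fewer documents gets a larger IDF, so the product rewards words that are frequent in one text but rare across the database.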
According to the calculation formula of the TF-IDF value, the word frequency inverse word frequency model needs to be trained in advance on the text database to determine $|D|$; meanwhile, the vector dimension of a word's vector representation is determined by the number of non-repeated words in the text database. After the trained TF-IDF model is obtained, each target non-part-of-speech word, the target text and the text database are input into the TF-IDF model, and the TF-IDF value of each target non-part-of-speech word is obtained to form the non-part-of-speech word frequency inverse word frequency feature vector of the target text: the number of columns of the vector equals the determined vector dimension, the element positions corresponding to the target non-part-of-speech words are filled with their TF-IDF values, and the remaining element positions are filled with 0. Similarly, the non-part-of-speech word frequency inverse word frequency feature vector corresponding to the alternative text can be obtained. Finally, the vector cosine of the two feature vectors is calculated, so as to obtain the non-part-of-speech word frequency inverse word frequency similarity of the target text and the alternative text.
Exemplarily, S240 includes: inputting each target non-part-of-speech word and the target text into the word frequency inverse word frequency similarity model to generate a non-part-of-speech word frequency inverse word frequency feature vector corresponding to the target text; and determining the non-part-of-speech word frequency inverse word frequency similarity between the target text and the alternative text corresponding to each row vector according to the non-part-of-speech word frequency inverse word frequency feature vector and the row vectors in a non-part-of-speech word frequency inverse word frequency feature matrix, where the non-part-of-speech word frequency inverse word frequency feature matrix is generated according to the non-part-of-speech word segmentation results of the text database and the word frequency inverse word frequency similarity model.
When each text in the text database is an alternative text, in order to improve the operation efficiency, each alternative text in the text database may be preprocessed in advance by word segmentation and data cleaning to obtain the corresponding non-part-of-speech word segmentation results. These results are then input into the TF-IDF model to obtain the non-part-of-speech word frequency inverse word frequency feature matrix of the text database, where the number of columns equals the determined vector dimension, element positions without a corresponding non-part-of-speech word are filled with 0, and each row vector in the matrix represents the feature vector of one alternative text. Finally, the vector cosines between the non-part-of-speech word frequency inverse word frequency feature vector and each row vector of the feature matrix are respectively calculated, so as to obtain the non-part-of-speech word frequency inverse word frequency similarity between the target text and each alternative text.
And S250, determining the text similarity between the target text and the alternative text according to the preset word sense weight, the preset word frequency inverse weight without word property, the word sense similarity and the word frequency inverse similarity without word property.
The preset word frequency inverse word frequency weight and the word frequency inverse word frequency similarity in formula (1) are respectively taken as the preset non-part-of-speech word frequency inverse word frequency weight $w_{21}$ and the non-part-of-speech word frequency inverse word frequency similarity $score_{21}$; the text similarity between the target text and the alternative text can then be obtained according to the following formula:
$$score = w_1 \cdot score_1 + w_{21} \cdot score_{21} \qquad (7)$$
According to the technical scheme of this embodiment, the word sense similarity and the non-part-of-speech word frequency inverse word frequency similarity of the target text and the alternative text are determined, and the text similarity between them is then determined according to the preset word sense weight, the preset non-part-of-speech word frequency inverse word frequency weight, the word sense similarity and the non-part-of-speech word frequency inverse word frequency similarity. The text similarity of the target text is thus determined from the two dimensions of word sense and non-part-of-speech word importance, which improves the determination accuracy of the text similarity to a certain extent.
EXAMPLE III
In this embodiment, based on the first embodiment, further optimization is performed on "determining the word sense similarity and the word frequency inverse word frequency similarity of the target text and the candidate text", and optimization is performed on "determining the text similarity of the target text and the candidate text according to the preset word semantic weight, the preset word frequency inverse word frequency weight, the word sense similarity and the word frequency inverse word frequency similarity". Wherein explanations of the same or corresponding terms as those of the above embodiments are omitted. Referring to fig. 5, the text similarity determining method provided in this embodiment includes:
S310, acquiring a target text and an alternative text with similarity to be determined.
And S320, performing word segmentation and part-of-speech tagging on the target text to obtain target non-part-of-speech words and target part-of-speech tagged words corresponding to the target text.
In this embodiment, the part of speech of each word needs to be distinguished, so after the target text is segmented, part-of-speech tagging needs to be performed on each obtained target non-part-of-speech word to obtain the target part-of-speech tagged words. If a target non-part-of-speech word has two or more parts of speech, that word generates two or more target part-of-speech tagged words.
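The rule that a word with two or more parts of speech yields multiple tagged words can be illustrated as follows; the "word/pos" token form is an assumed tagging convention, not mandated by the text:

```python
def expand_pos_tags(word, pos_tags):
    """A word with several parts of speech yields one tagged token per part of speech."""
    return [f"{word}/{pos}" for pos in pos_tags]

# "record" can be a noun or a verb, so one non-part-of-speech word
# produces two part-of-speech tagged words.
tagged = expand_pos_tags("record", ["n", "v"])
```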
S330, determining word meaning similarity of the target text and the alternative text based on a pre-trained word meaning similarity model according to each target non-part-of-speech word and the alternative text.
S340, determining word frequency inverse word frequency similarity of the target text and the alternative text based on the word frequency inverse word frequency similarity model according to each target part of speech tagging word, the target text and the alternative text.
In this embodiment, the model for determining the part-of-speech-containing word frequency inverse word frequency similarity also adopts the word frequency inverse word frequency (TF-IDF) model, except that the model parameters $\sum_k n_{k,j}$ and $|\{\, j : t_i \in d_j \,\}|$ change due to the introduction of parts of speech. The process of determining the part-of-speech-containing word frequency inverse word frequency similarity is then as follows: each target part-of-speech tagged word, the target text and the text database are input into the TF-IDF model, and the TF-IDF value of each target part-of-speech tagged word is obtained to form the part-of-speech-containing word frequency inverse word frequency feature vector of the target text; the number of columns of the vector equals the determined vector dimension, the element positions corresponding to the target part-of-speech tagged words are filled with their TF-IDF values, and the remaining element positions are filled with 0. Similarly, word segmentation and part-of-speech tagging can be performed on the alternative text to obtain the part-of-speech-containing word frequency inverse word frequency feature vector corresponding to the alternative text. Finally, the vector cosine of the two feature vectors is calculated, so as to obtain the part-of-speech-containing word frequency inverse word frequency similarity of the target text and the alternative text.
Exemplarily, S340 includes: inputting each target part-of-speech tagging word and the target text into a word frequency inverse word frequency similarity model, and generating a word frequency inverse word frequency characteristic vector containing the part-of-speech corresponding to the target text; determining the part-of-speech-containing word frequency inverse word frequency similarity of the target text and the alternative text corresponding to the row vector according to the part-of-speech-containing word frequency inverse word frequency feature vector and the row vector in the part-of-speech-containing word frequency inverse word frequency feature matrix; and generating a word frequency-inverse word frequency feature matrix containing parts of speech according to the part of speech tagging word segmentation result of the text database and the word frequency-inverse word frequency similarity model.
When each text in the text database is an alternative text, in order to improve the operation efficiency, each alternative text in the text database may be preprocessed in advance by word segmentation, part-of-speech tagging and data cleaning to obtain the corresponding part-of-speech tagged word segmentation results. These results are then input into the TF-IDF model to obtain the part-of-speech-containing word frequency inverse word frequency feature matrix of the text database, where the number of columns equals the number of non-repeated part-of-speech tagged words in the text database, element positions without a corresponding part-of-speech tagged word are filled with 0, and each row vector in the matrix represents the feature vector of one alternative text. Finally, the vector cosines between the part-of-speech-containing word frequency inverse word frequency feature vector and each row vector of the feature matrix are respectively calculated, so as to obtain the part-of-speech-containing word frequency inverse word frequency similarity between the target text and each alternative text.
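Why the feature dimension grows once parts of speech are introduced can be seen from a small vocabulary comparison; the "word/pos" token form and the sample texts are assumptions for illustration:

```python
# The same toy texts, once as plain words and once as "word/pos" tokens.
plain_docs = [["book", "flight"], ["book", "review"]]
tagged_docs = [["book/v", "flight/n"], ["book/n", "review/n"]]

plain_vocab = sorted({w for d in plain_docs for w in d})
tagged_vocab = sorted({w for d in tagged_docs for w in d})

# "book" is one column without part of speech, but two columns
# ("book/n" and "book/v") once parts of speech are introduced,
# so the feature-matrix column count grows from 3 to 4.
```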
S350, determining the text similarity between the target text and the alternative text according to the preset word semantic weight, the preset word frequency inverse weight containing the part of speech, the word semantic similarity and the word frequency inverse similarity containing the part of speech.
The preset word frequency inverse word frequency weight and the word frequency inverse word frequency similarity in formula (1) are respectively taken as the preset part-of-speech-containing word frequency inverse word frequency weight $w_{22}$ and the part-of-speech-containing word frequency inverse word frequency similarity $score_{22}$; the text similarity between the target text and the alternative text can then be obtained according to the following formula:
$$score = w_1 \cdot score_1 + w_{22} \cdot score_{22} \qquad (8)$$
According to the technical scheme of this embodiment, the word sense similarity and the part-of-speech-containing word frequency inverse word frequency similarity of the target text and the alternative text are determined, and the text similarity between them is then determined according to the preset word sense weight, the preset part-of-speech-containing word frequency inverse word frequency weight, the word sense similarity and the part-of-speech-containing word frequency inverse word frequency similarity. The text similarity of the target text is thus determined from the two dimensions of word sense and the importance of part-of-speech-tagged words, which improves the determination accuracy of the text similarity to a certain extent.
Example four
In this embodiment, based on the first embodiment, further optimization is performed on "determining the word sense similarity and the word frequency inverse word frequency similarity of the target text and the candidate text", and optimization is performed on "determining the text similarity of the target text and the candidate text according to the preset word semantic weight, the preset word frequency inverse word frequency weight, the word sense similarity and the word frequency inverse word frequency similarity". Wherein explanations of the same or corresponding terms as those of the above embodiments are omitted. Referring to fig. 6a, the text similarity determining method provided in this embodiment includes:
and S410, acquiring a target text and an alternative text with similarity to be determined.
And S420, performing word segmentation and part-of-speech tagging on the target text to obtain target non-part-of-speech words and target part-of-speech tagged words corresponding to the target text.
Referring to fig. 6b, the target text is subjected to word segmentation processing to obtain each target non-part-of-speech word, and is subjected to word segmentation and part-of-speech tagging processing to obtain each target part-of-speech tagged word.
S430, determining word sense similarity of the target text and the alternative text based on a pre-trained word sense similarity model according to each target non-part-of-speech word and the alternative text.
The word sense feature vector of the target text can be obtained from each target non-part-of-speech word and the trained Word2vec model; computing the word sense similarity between this feature vector and each row vector of the word sense feature matrix corresponding to the text database yields the word sense similarity $score_1$ between the target text and an alternative text in the text database. There are n alternative texts in the text database, so n values of $score_1$ can be obtained.
S440, determining the word frequency inverse word frequency similarity of the target text and the alternative text based on the word frequency inverse word frequency similarity model according to each target word without word frequency, the target text and the alternative text.
The non-part-of-speech word frequency inverse word frequency feature vector of the target text can be obtained from each target non-part-of-speech word and the trained TF-IDF model; computing the non-part-of-speech word frequency inverse word frequency similarity between this feature vector and each row vector of the non-part-of-speech word frequency inverse word frequency feature matrix corresponding to the text database yields the non-part-of-speech word frequency inverse word frequency similarity $score_{21}$ between the target text and an alternative text in the text database. There are n alternative texts in the text database, so n values of $score_{21}$ can be obtained.
S450, determining the word frequency inverse word frequency similarity of the target text and the alternative text based on the word frequency inverse word frequency similarity model according to each target part of speech tagging word, the target text and the alternative text.
The part-of-speech-containing word frequency inverse word frequency feature vector of the target text can be obtained from each target part-of-speech tagged word and the trained TF-IDF model; computing the part-of-speech-containing word frequency inverse word frequency similarity between this feature vector and each row vector of the part-of-speech-containing word frequency inverse word frequency feature matrix corresponding to the text database yields the part-of-speech-containing word frequency inverse word frequency similarity $score_{22}$ between the target text and an alternative text in the text database. There are n alternative texts in the text database, so n values of $score_{22}$ can be obtained.
And S460, determining the text similarity between the target text and the alternative text according to the preset word sense weight, the preset word frequency inverse weight without word property, the preset word frequency inverse weight with word property, the word sense similarity, the word frequency inverse similarity without word property and the word frequency inverse similarity with word property.
The preset word frequency inverse word frequency weight in formula (1) is taken as the preset non-part-of-speech word frequency inverse word frequency weight $w_{21}$ and the preset part-of-speech-containing word frequency inverse word frequency weight $w_{22}$, and the word frequency inverse word frequency similarity is taken as the non-part-of-speech word frequency inverse word frequency similarity $score_{21}$ and the part-of-speech-containing word frequency inverse word frequency similarity $score_{22}$; the text similarity between the target text and the alternative text can then be obtained according to the following formula:
$$score = w_1 \cdot score_1 + w_{21} \cdot score_{21} + w_{22} \cdot score_{22} \qquad (9)$$
there are n candidate texts in the text database, so that n scores can be obtained through formula (9), and each score represents the text similarity between the target text and the corresponding candidate text.
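A minimal sketch of the weighted combination in formula (9); the weight values and similarity scores are assumed examples (in practice the weights would be preset per business scenario):

```python
def text_similarity(score1, score21, score22, w1=0.5, w21=0.25, w22=0.25):
    """Weighted sum of word sense, non-POS TF-IDF and POS TF-IDF similarities."""
    return w1 * score1 + w21 * score21 + w22 * score22

# One target text against three alternative texts (similarities are assumed values).
scores = [text_similarity(0.8, 0.6, 0.4),
          text_similarity(0.3, 0.2, 0.1),
          text_similarity(0.9, 0.9, 0.8)]
```

Each entry of `scores` is the text similarity between the target text and one alternative text, ready for the descending sort of the next step.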
And S470, performing descending order arrangement on the alternative texts according to the similarity of the texts, and generating an ordering result.
After determining the similarity of each text, a plurality of candidate texts with the similarity satisfying the service requirement can be determined from all the candidate texts. At this time, in order to further improve the service operation efficiency, all the candidate texts may be sorted in a descending order according to the text similarity, and a sorting result is generated.
And S480, extracting a preset number of alternative texts from the sequencing result to be used as similar texts of the target text.
And determining the number of the alternative texts to be selected, namely the preset number according to the service requirement. And then, extracting a preset number of alternative texts ranked at the top from the ranking result as similar texts which are similar to the target text.
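The descending sort of S470 and the top-k extraction of S480 can be sketched as follows; the text names, scores and preset number are assumed examples:

```python
# Alternative texts paired with their text similarity to the target text.
candidates = [("text_a", 0.65), ("text_b", 0.875), ("text_c", 0.225)]

# S470: descending sort by similarity; S480: take the preset number (here 2).
ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
similar_texts = [name for name, _ in ranked[:2]]
```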
And S490, when the service scene is an intelligent question-answering scene and the service requirement is to determine an alternative answer of the target text, extracting the answer corresponding to each similar text from the short text database to serve as the alternative answer of the target text.
In the service scenario of intelligent question answering, both questions and answers are usually short texts, so the target text is a target short text. If the service requirement is simply to determine similar short texts, the method may end at S480. If, however, the service requirement is to determine alternative answers for the target text, the alternative texts need to be short texts in a short text database. In that case, the answer corresponding to each similar text is extracted from the short text database to serve as an alternative answer to the target text, so that in intelligent customer service a more accurate answer is provided to the user faster and more conveniently, or the answers to similar questions are provided to a human customer service agent to assist in answering.
According to the technical scheme of this embodiment, the word sense similarity, the non-part-of-speech word frequency inverse word frequency similarity and the part-of-speech-containing word frequency inverse word frequency similarity of the target text and the alternative text are determined, and the text similarity between them is then determined according to the preset word sense weight, the preset non-part-of-speech word frequency inverse word frequency weight, the preset part-of-speech-containing word frequency inverse word frequency weight and the three similarities. The text similarity of the target text is thus determined from the three dimensions of word sense, non-part-of-speech word importance and part-of-speech-containing word importance, which improves the determination accuracy of the text similarity to a greater extent. By sorting the alternative texts according to text similarity and taking the answers of the top-ranked similar texts as alternative answers to the target text, the determination accuracy and efficiency of alternative answers in the intelligent question-answering system can also be improved.
EXAMPLE five
The present embodiment provides a text similarity determining apparatus, and referring to fig. 7, the apparatus specifically includes:
the target text determination module 710 is configured to obtain a target text and an alternative text with similarity to be determined;
a first similarity determining module 720, configured to determine word sense similarity and word frequency inverse word frequency similarity between the target text and the candidate text, where the word frequency inverse word frequency similarity includes word frequency inverse word frequency similarity without word property and/or word frequency inverse word frequency similarity with word property;
the second similarity determining module 730 is configured to determine the text similarity between the target text and the candidate text according to the preset word sense weight, the preset word frequency inverse word frequency weight, and the word sense similarity and the word frequency inverse word frequency similarity, where the preset word frequency inverse word frequency weight includes a preset word frequency inverse word frequency weight without word property and/or a preset word frequency inverse word frequency weight with word property.
Optionally, the first similarity determining module 720 is specifically configured to:
performing word segmentation on the target text to obtain target non-part-of-speech words corresponding to the target text;
determining word meaning similarity of the target text and the alternative text based on a pre-trained word meaning similarity model according to each target non-part-of-speech word and the alternative text;
determining the part-of-speech-free word frequency inverse word frequency similarity of the target text and the alternative text based on the word frequency inverse word frequency similarity model according to each target part-of-speech-free word, the target text and the alternative text;
accordingly, the second similarity determination module 730 is specifically configured to:
determining the text similarity between the target text and the alternative text according to the preset word sense weight, the preset part-of-speech-free word frequency inverse word frequency weight, the word sense similarity, and the part-of-speech-free word frequency inverse word frequency similarity.
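The part-of-speech-free branch above can be illustrated with a small sketch: TF-IDF ("word frequency inverse word frequency") vectors are built from the segmented words and compared by cosine similarity. The toy corpus, tokens, and smoothed IDF formula below are illustrative assumptions, not the embodiment's exact computation.

```python
import math
from collections import Counter

def tfidf_vector(tokens, corpus):
    """Build a TF-IDF weight per token; `corpus` is a list of token lists."""
    tf = Counter(tokens)
    n_docs = len(corpus)
    vec = {}
    for term, count in tf.items():
        df = sum(1 for doc in corpus if term in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF (an assumption)
        vec[term] = (count / len(tokens)) * idf
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse dict vectors."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Segmented (part-of-speech-free) words of the target and alternative texts.
target = ["how", "to", "return", "goods"]
candidate = ["how", "to", "return", "an", "order"]
corpus = [target, candidate, ["shipping", "time", "query"]]

sim = cosine(tfidf_vector(target, corpus), tfidf_vector(candidate, corpus))
```

Shared words such as "return" contribute positive weight, so `sim` lands strictly between 0 and 1 for partially overlapping texts.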
Optionally, the first similarity determining module 720 is specifically configured to:
performing word segmentation and part-of-speech tagging on a target text to obtain target non-part-of-speech words and target part-of-speech tagged words corresponding to the target text;
determining word meaning similarity of the target text and the alternative text based on a pre-trained word meaning similarity model according to each target non-part-of-speech word and the alternative text;
determining part-of-speech-containing word frequency inverse word frequency similarity between the target text and the alternative text based on the word frequency inverse word frequency similarity model according to each target part-of-speech tagging word, the target text and the alternative text;
accordingly, the second similarity determination module 730 is specifically configured to:
determining the text similarity between the target text and the alternative text according to the preset word sense weight, the preset part-of-speech-containing word frequency inverse word frequency weight, the word sense similarity, and the part-of-speech-containing word frequency inverse word frequency similarity.
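One way to read the part-of-speech-containing variant is to compute the same TF-IDF cosine similarity over (word, part-of-speech) pairs, so that the same surface word under different tags is weighted separately. This reading, and the toy tags below, are assumptions for illustration only.

```python
import math
from collections import Counter

def tfidf_cosine(a_tokens, b_tokens, corpus):
    """TF-IDF cosine similarity where tokens may be (word, pos) tuples."""
    def vec(tokens):
        tf = Counter(tokens)
        out = {}
        for term, count in tf.items():
            df = sum(1 for doc in corpus if term in doc)
            idf = math.log((1 + len(corpus)) / (1 + df)) + 1  # smoothed IDF
            out[term] = (count / len(tokens)) * idf
        return out
    u, v = vec(a_tokens), vec(b_tokens)
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Part-of-speech tagged words: (word, tag) pairs keep "return" as a verb
# distinct from "return" as a noun.
target = [("return", "v"), ("goods", "n")]
candidate = [("return", "v"), ("order", "n")]
corpus = [target, candidate, [("query", "v"), ("shipping", "n")]]

sim = tfidf_cosine(target, candidate, corpus)
```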
Optionally, the first similarity determining module 720 is specifically configured to:
performing word segmentation and part-of-speech tagging on a target text to obtain target non-part-of-speech words and target part-of-speech tagged words corresponding to the target text;
determining word meaning similarity of the target text and the alternative text based on a pre-trained word meaning similarity model according to each target non-part-of-speech word and the alternative text;
determining the part-of-speech-free word frequency inverse word frequency similarity of the target text and the alternative text based on the word frequency inverse word frequency similarity model according to each target part-of-speech-free word, the target text and the alternative text;
determining part-of-speech-containing word frequency inverse word frequency similarity between the target text and the alternative text based on the word frequency inverse word frequency similarity model according to each target part-of-speech tagging word, the target text and the alternative text;
accordingly, the second similarity determination module 730 is specifically configured to:
determining the text similarity between the target text and the alternative text according to the preset word sense weight, the preset part-of-speech-free word frequency inverse word frequency weight, the preset part-of-speech-containing word frequency inverse word frequency weight, the word sense similarity, the part-of-speech-free word frequency inverse word frequency similarity, and the part-of-speech-containing word frequency inverse word frequency similarity.
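The three-way weighted fusion can be sketched as a convex combination of the per-dimension similarities; the specific weight values below are illustrative assumptions, not values prescribed by the embodiment.

```python
def fuse_similarity(sense_sim, no_pos_sim, pos_sim,
                    w_sense=0.5, w_no_pos=0.25, w_pos=0.25):
    """Combine word sense, part-of-speech-free TF-IDF, and
    part-of-speech-containing TF-IDF similarities with preset weights."""
    assert abs(w_sense + w_no_pos + w_pos - 1.0) < 1e-9  # weights sum to 1
    return w_sense * sense_sim + w_no_pos * no_pos_sim + w_pos * pos_sim

# Example per-dimension similarities (illustrative values).
text_sim = fuse_similarity(0.82, 0.64, 0.70)
```

Dropping one weight to zero recovers the two-dimension variants described in the earlier optional configurations.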
Optionally, the Word frequency inverse Word frequency similarity model is a Word frequency inverse Word frequency model, and the Word sense similarity model is a Word2vec model;
the Word2vec model is pre-trained on the basis of a short text database and a long text database, wherein the long text database is pre-constructed on the basis of service data corresponding to a service scene, and the short text database is pre-constructed on the basis of service data corresponding to service requirements in the service scene.
Optionally, on the basis of the foregoing apparatus, the apparatus further includes a similar text determination module, configured to:
when there are a plurality of alternative texts, after the text similarity between the target text and each alternative text is determined, arranging the alternative texts in descending order according to the text similarities to generate a sorting result;
extracting a preset number of alternative texts from the sorting result as similar texts of the target text.
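The descending sort and top-k extraction performed by this module can be sketched as:

```python
def top_similar(candidates, k):
    """candidates: list of (alternative_text, text_similarity) pairs.
    Sort in descending order of similarity and keep the top k texts."""
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in ranked[:k]]

# Illustrative similarities for three alternative texts.
similar = top_similar([("how to return goods", 0.91),
                       ("shipping time query", 0.35),
                       ("how to return an order", 0.84)], k=2)
# similar == ["how to return goods", "how to return an order"]
```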
Further, on the basis of the above device, the device further includes an alternative answer determining module, configured to:
when the service scene is an intelligent question-answering scene and the service requirement is to determine alternative answers for the target text, the target text is a target short text and the alternative texts are short texts in a short text database; after a preset number of alternative texts are extracted from the sorting result as similar texts of the target text, the answer corresponding to each similar text is extracted from the short text database as an alternative answer for the target text.
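For the intelligent question-answering case, the answer lookup can be sketched against a hypothetical short text database mapping each short text to its answer; the dictionary layout below is an assumption, not the patented storage format.

```python
# Hypothetical short text database: short text -> stored answer.
short_text_db = {
    "how to return goods": "Open your order page and click Return.",
    "how to return an order": "Returns are accepted within 7 days.",
    "shipping time query": "Orders ship within 2 business days.",
}

def alternative_answers(similar_texts, db):
    """Extract the answer corresponding to each similar text."""
    return [db[text] for text in similar_texts if text in db]

answers = alternative_answers(["how to return goods",
                               "how to return an order"], short_text_db)
```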
With the text similarity determining device of the fifth embodiment of the present invention, the overall sentence-level semantics of a text can be captured by combining multiple dimensional features, namely word sense together with word frequency and/or part of speech, so that the text similarity is characterized along different dimensions, a comprehensive text similarity is obtained, and the accuracy of the text similarity is improved to a great extent.
The text similarity determining device provided by the embodiment of the invention can execute the text similarity determining method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the executing method.
It should be noted that, in the embodiment of the text similarity determining apparatus, the included units and modules are only divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.
EXAMPLE six
Referring to fig. 8, the present embodiment provides a device, which includes: one or more processors 820; and a storage device 810 configured to store one or more programs. When the one or more programs are executed by the one or more processors 820, the one or more processors 820 implement the text similarity determination method provided in the embodiments of the present invention, including:
acquiring a target text and an alternative text with similarity to be determined;
determining word sense similarity and word frequency inverse word frequency similarity of the target text and the alternative text, wherein the word frequency inverse word frequency similarity comprises word frequency inverse word frequency similarity without word property and/or word frequency inverse word frequency similarity with word property;
determining the text similarity between the target text and the alternative text according to the preset word semantic weight, the preset word frequency inverse word frequency weight, the word semantic similarity and the word frequency inverse word frequency similarity, wherein the preset word frequency inverse word frequency weight comprises a preset word frequency inverse word frequency weight without word property and/or a preset word frequency inverse word frequency weight with word property.
Of course, those skilled in the art will understand that the processor 820 may also implement the technical solution of the text similarity determination method provided in any embodiment of the present invention.
The device shown in fig. 8 is only an example and should not bring any limitation to the function and the scope of use of the embodiments of the present invention. As shown in fig. 8, the apparatus includes a processor 820, a storage device 810, an input device 830, and an output device 840; the number of the processors 820 in the device may be one or more, and one processor 820 is taken as an example in fig. 8; the processor 820, storage 810, input 830, and output 840 of the apparatus may be connected by a bus or other means, such as by bus 850 in fig. 8.
The storage device 810, which is a computer-readable storage medium, may be used to store software programs, computer-executable programs, and modules, such as program instructions/modules corresponding to the text similarity determination method in the embodiment of the present invention (for example, a target text acquisition module, a first similarity determination module, and a second similarity determination module in the text similarity determination device).
The storage device 810 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Further, the storage 810 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, storage 810 may further include memory located remotely from processor 820, which may be connected to devices over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 830 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function controls of the apparatus. The output device 840 may include a display device such as a display screen.
EXAMPLE seven
The present embodiments provide a storage medium containing computer-executable instructions which, when executed by a computer processor, are operable to perform a method of text similarity determination, the method comprising:
acquiring a target text and an alternative text with similarity to be determined;
determining word sense similarity and word frequency inverse word frequency similarity of the target text and the alternative text, wherein the word frequency inverse word frequency similarity comprises word frequency inverse word frequency similarity without word property and/or word frequency inverse word frequency similarity with word property;
determining the text similarity between the target text and the alternative text according to the preset word semantic weight, the preset word frequency inverse word frequency weight, the word semantic similarity and the word frequency inverse word frequency similarity, wherein the preset word frequency inverse word frequency weight comprises a preset word frequency inverse word frequency weight without word property and/or a preset word frequency inverse word frequency weight with word property.
Of course, the storage medium provided by the embodiment of the present invention contains computer-executable instructions, and the computer-executable instructions are not limited to the above method operations, and may also perform related operations in the text similarity determination method provided by any embodiment of the present invention.
From the above description of the embodiments, it is obvious for those skilled in the art that the present invention can be implemented by software and necessary general hardware, and certainly, can also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes several instructions to enable a device (which may be a personal computer, a server, or a network device) to execute the text similarity determining method provided in the embodiments of the present invention.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (10)

1. A text similarity determination method, comprising:
acquiring a target text and a candidate text whose similarity is to be determined;
determining a word sense similarity and a word frequency inverse word frequency similarity between the target text and the candidate text, wherein the word frequency inverse word frequency similarity comprises a part-of-speech-free word frequency inverse word frequency similarity and/or a part-of-speech-containing word frequency inverse word frequency similarity; and
determining a text similarity between the target text and the candidate text according to a preset word sense weight, a preset word frequency inverse word frequency weight, the word sense similarity, and the word frequency inverse word frequency similarity, wherein the preset word frequency inverse word frequency weight comprises a preset part-of-speech-free word frequency inverse word frequency weight and/or a preset part-of-speech-containing word frequency inverse word frequency weight.

2. The method according to claim 1, wherein determining the word sense similarity and the word frequency inverse word frequency similarity between the target text and the candidate text comprises:
performing word segmentation on the target text to obtain target part-of-speech-free words corresponding to the target text;
determining the word sense similarity between the target text and the candidate text based on a pre-trained word sense similarity model according to each target part-of-speech-free word and the candidate text; and
determining the part-of-speech-free word frequency inverse word frequency similarity between the target text and the candidate text based on a word frequency inverse word frequency similarity model according to each target part-of-speech-free word, the target text, and the candidate text;
correspondingly, determining the text similarity between the target text and the candidate text according to the preset word sense weight, the preset word frequency inverse word frequency weight, the word sense similarity, and the word frequency inverse word frequency similarity comprises:
determining the text similarity between the target text and the candidate text according to the preset word sense weight, the preset part-of-speech-free word frequency inverse word frequency weight, the word sense similarity, and the part-of-speech-free word frequency inverse word frequency similarity.

3. The method according to claim 1, wherein determining the word sense similarity and the word frequency inverse word frequency similarity between the target text and the candidate text comprises:
performing word segmentation and part-of-speech tagging on the target text to obtain target part-of-speech-free words and target part-of-speech tagged words corresponding to the target text;
determining the word sense similarity between the target text and the candidate text based on a pre-trained word sense similarity model according to each target part-of-speech-free word and the candidate text; and
determining the part-of-speech-containing word frequency inverse word frequency similarity between the target text and the candidate text based on a word frequency inverse word frequency similarity model according to each target part-of-speech tagged word, the target text, and the candidate text;
correspondingly, determining the text similarity between the target text and the candidate text according to the preset word sense weight, the preset word frequency inverse word frequency weight, the word sense similarity, and the word frequency inverse word frequency similarity comprises:
determining the text similarity between the target text and the candidate text according to the preset word sense weight, the preset part-of-speech-containing word frequency inverse word frequency weight, the word sense similarity, and the part-of-speech-containing word frequency inverse word frequency similarity.

4. The method according to claim 1, wherein determining the word sense similarity and the word frequency inverse word frequency similarity between the target text and the candidate text comprises:
performing word segmentation and part-of-speech tagging on the target text to obtain target part-of-speech-free words and target part-of-speech tagged words corresponding to the target text;
determining the word sense similarity between the target text and the candidate text based on a pre-trained word sense similarity model according to each target part-of-speech-free word and the candidate text;
determining the part-of-speech-free word frequency inverse word frequency similarity between the target text and the candidate text based on a word frequency inverse word frequency similarity model according to each target part-of-speech-free word, the target text, and the candidate text; and
determining the part-of-speech-containing word frequency inverse word frequency similarity between the target text and the candidate text based on the word frequency inverse word frequency similarity model according to each target part-of-speech tagged word, the target text, and the candidate text;
correspondingly, determining the text similarity between the target text and the candidate text according to the preset word sense weight, the preset word frequency inverse word frequency weight, the word sense similarity, and the word frequency inverse word frequency similarity comprises:
determining the text similarity between the target text and the candidate text according to the preset word sense weight, the preset part-of-speech-free word frequency inverse word frequency weight, the preset part-of-speech-containing word frequency inverse word frequency weight, the word sense similarity, the part-of-speech-free word frequency inverse word frequency similarity, and the part-of-speech-containing word frequency inverse word frequency similarity.

5. The method according to any one of claims 2-4, wherein the word frequency inverse word frequency similarity model is a word frequency inverse word frequency model, and the word sense similarity model is a Word2vec model;
the Word2vec model is pre-trained based on a short text database and a long text database, wherein the long text database is pre-constructed based on service data corresponding to a service scene, and the short text database is pre-constructed based on service data corresponding to a service requirement in the service scene.

6. The method according to claim 1, wherein when there are a plurality of candidate texts, after determining the text similarity between the target text and each candidate text, the method further comprises:
arranging the candidate texts in descending order according to the text similarities to generate a sorting result; and
extracting a preset number of candidate texts from the sorting result as similar texts of the target text.

7. The method according to claim 6, wherein when the service scene is an intelligent question-answering scene and the service requirement is to determine candidate answers for the target text, the target text is a target short text and the candidate texts are short texts in a short text database; after extracting the preset number of candidate texts from the sorting result as the similar texts of the target text, the method further comprises:
extracting an answer corresponding to each similar text from the short text database as a candidate answer for the target text.

8. A text similarity determination apparatus, comprising:
a target text acquisition module configured to acquire a target text and a candidate text whose similarity is to be determined;
a first similarity determination module configured to determine a word sense similarity and a word frequency inverse word frequency similarity between the target text and the candidate text, wherein the word frequency inverse word frequency similarity comprises a part-of-speech-free word frequency inverse word frequency similarity and/or a part-of-speech-containing word frequency inverse word frequency similarity; and
a second similarity determination module configured to determine a text similarity between the target text and the candidate text according to a preset word sense weight, a preset word frequency inverse word frequency weight, the word sense similarity, and the word frequency inverse word frequency similarity, wherein the preset word frequency inverse word frequency weight comprises a preset part-of-speech-free word frequency inverse word frequency weight and/or a preset part-of-speech-containing word frequency inverse word frequency weight.

9. A device, comprising:
one or more processors; and
a storage device configured to store one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the text similarity determination method according to any one of claims 1-7.

10. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the text similarity determination method according to any one of claims 1-7.
CN201910600981.6A 2019-07-04 2019-07-04 Text similarity determination method, device, equipment and storage medium Active CN112182145B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910600981.6A CN112182145B (en) 2019-07-04 2019-07-04 Text similarity determination method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910600981.6A CN112182145B (en) 2019-07-04 2019-07-04 Text similarity determination method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112182145A true CN112182145A (en) 2021-01-05
CN112182145B CN112182145B (en) 2025-01-17

Family

ID=73915404

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910600981.6A Active CN112182145B (en) 2019-07-04 2019-07-04 Text similarity determination method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112182145B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113505196A (en) * 2021-06-30 2021-10-15 和美(深圳)信息技术股份有限公司 Part-of-speech-based text retrieval method and device, electronic equipment and storage medium
CN113837594A (en) * 2021-09-18 2021-12-24 深圳壹账通智能科技有限公司 Quality evaluation method, system, device and medium for customer service in multiple scenes
CN114282967A (en) * 2021-12-21 2022-04-05 中国农业银行股份有限公司 Method and device for determining target product, electronic equipment and storage medium
CN115130454A (en) * 2022-07-29 2022-09-30 北京明略昭辉科技有限公司 Method, apparatus, electronic device and storage medium for calculating text similarity
CN115329742A (en) * 2022-10-13 2022-11-11 深圳市大数据研究院 Method and system for evaluation and acceptance of scientific research project output based on text analysis
CN116167352A (en) * 2023-04-03 2023-05-26 联仁健康医疗大数据科技股份有限公司 Data processing method, device, electronic equipment and storage medium
CN116228249A (en) * 2023-05-08 2023-06-06 陕西拓方信息技术有限公司 Customer service system based on information technology

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130138696A1 (en) * 2011-11-30 2013-05-30 The Institute for System Programming of the Russian Academy of Sciences Method to build a document semantic model
WO2017084267A1 (en) * 2015-11-18 2017-05-26 乐视控股(北京)有限公司 Method and device for keyphrase extraction
CN108595425A (en) * 2018-04-20 2018-09-28 昆明理工大学 Based on theme and semantic dialogue language material keyword abstraction method
CN109271626A (en) * 2018-08-31 2019-01-25 北京工业大学 Text semantic analysis method
CN109284490A (en) * 2018-09-13 2019-01-29 武汉斗鱼网络科技有限公司 A kind of Text similarity computing method, apparatus, electronic equipment and storage medium
CN109344236A (en) * 2018-09-07 2019-02-15 暨南大学 A problem similarity calculation method based on multiple features

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130138696A1 (en) * 2011-11-30 2013-05-30 The Institute for System Programming of the Russian Academy of Sciences Method to build a document semantic model
WO2017084267A1 (en) * 2015-11-18 2017-05-26 乐视控股(北京)有限公司 Method and device for keyphrase extraction
CN108595425A (en) * 2018-04-20 2018-09-28 昆明理工大学 Based on theme and semantic dialogue language material keyword abstraction method
CN109271626A (en) * 2018-08-31 2019-01-25 北京工业大学 Text semantic analysis method
CN109344236A (en) * 2018-09-07 2019-02-15 暨南大学 A problem similarity calculation method based on multiple features
CN109284490A (en) * 2018-09-13 2019-01-29 武汉斗鱼网络科技有限公司 A kind of Text similarity computing method, apparatus, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG, JUNFEI: "Improved TF-IDF Combined with the Cosine Theorem for Computing Chinese Sentence Similarity", Modern Computer (《现代计算机》), no. 32, 30 November 2017 (2017-11-30), pages 1 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113505196A (en) * 2021-06-30 2021-10-15 和美(深圳)信息技术股份有限公司 Part-of-speech-based text retrieval method and device, electronic equipment and storage medium
CN113505196B (en) * 2021-06-30 2024-01-30 和美(深圳)信息技术股份有限公司 Text retrieval method, device, electronic equipment and storage medium based on part of speech
CN113837594A (en) * 2021-09-18 2021-12-24 深圳壹账通智能科技有限公司 Quality evaluation method, system, device and medium for customer service in multiple scenes
CN114282967A (en) * 2021-12-21 2022-04-05 中国农业银行股份有限公司 Method and device for determining target product, electronic equipment and storage medium
CN115130454A (en) * 2022-07-29 2022-09-30 北京明略昭辉科技有限公司 Method, apparatus, electronic device and storage medium for calculating text similarity
CN115329742A (en) * 2022-10-13 2022-11-11 深圳市大数据研究院 Method and system for evaluation and acceptance of scientific research project output based on text analysis
CN116167352A (en) * 2023-04-03 2023-05-26 联仁健康医疗大数据科技股份有限公司 Data processing method, device, electronic equipment and storage medium
CN116228249A (en) * 2023-05-08 2023-06-06 陕西拓方信息技术有限公司 Customer service system based on information technology

Also Published As

Publication number Publication date
CN112182145B (en) 2025-01-17

Similar Documents

Publication Publication Date Title
CN109284357B (en) Man-machine conversation method, device, electronic equipment and computer readable medium
CN108363790B (en) Method, device, equipment and storage medium for evaluating comments
CN112182145B (en) Text similarity determination method, device, equipment and storage medium
CN110019732B (en) A kind of intelligent question answering method and related device
CN112667794A (en) Intelligent question-answer matching method and system based on twin network BERT model
CN102831184B (en) According to the method and system text description of social event being predicted to social affection
CN109960756B (en) News event information induction method
CN105095204B (en) The acquisition methods and device of synonym
CN107578292B (en) User portrait construction system
CN114238573A (en) Information pushing method and device based on text countermeasure sample
CN108038725A (en) A kind of electric business Customer Satisfaction for Product analysis method based on machine learning
CN112559684A (en) Keyword extraction and information retrieval method
Yang et al. A decision method for online purchases considering dynamic information preference based on sentiment orientation classification and discrete DIFWA operators
US20200192921A1 (en) Suggesting text in an electronic document
CN111260437A (en) A product recommendation method based on commodity aspect-level sentiment mining and fuzzy decision-making
CN112966091A (en) Knowledge graph recommendation system fusing entity information and heat
CN101833560A (en) Internet-based automatic ranking system for manufacturers' word-of-mouth
CN111353044B (en) Comment-based emotion analysis method and system
Mozafari et al. Emotion detection by using similarity techniques
CN113761125B (en) Dynamic summary determination method and device, computing device and computer storage medium
CN114579705B (en) Learning auxiliary method and system for sustainable development education
CN114239828A (en) Supply chain affair map construction method based on causal relationship
CN111061939A (en) Scientific research academic news keyword matching recommendation method based on deep learning
US20140365494A1 (en) Search term clustering
CN112989208A (en) Information recommendation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant