CN114154483A - A method, device, medium and equipment for measuring the similarity of sentences

Info

Publication number: CN114154483A
Application number: CN202111211101.XA
Authority: CN
Inventors: 王铎; 李晓雅; 卢辰鑫; 何豪杰; 王思宽
Original assignee: Zhejiang Xiangnong Huiyu Technology Co ltd
Current assignee: Zhejiang Xiangnong Huiyu Technology Co ltd
Priority date: 2021-10-18
Filing date: 2021-10-18
Publication date: 2022-03-08

Abstract

The invention discloses a method, an apparatus, a medium and a device for measuring sentence similarity. The method comprises the following steps: performing unsupervised learning, by using a language model tool, on the context matching relationship of each sentence in a predetermined unlabeled corpus to obtain a context matching model; obtaining, from the unlabeled corpus, the contexts related to a plurality of sentences whose similarity is to be calculated to obtain a shared context set, using the context matching model to calculate a context score between each such sentence and each context in the shared context set, and forming a context score vector for each sentence from all of its context scores; and calculating the cosine similarity between the context score vectors, thereby obtaining the sentence similarity between the sentences corresponding to those vectors. The method calculates sentence similarity without labeled data, reduces the dependence on labeled data, and has a simple calculation process.

Description

Sentence similarity measuring method, device, medium and equipment

Technical Field

The present application relates to the field of language processing technologies, and in particular, to a method and an apparatus for measuring sentence similarity, a storage medium, and a computer device.

Background

Sentence similarity is a measure of how close two sentences are semantically. For example, "apple is a fruit" and "pear is a fruit" are relatively close semantically, whereas "apple is a fruit" and "I love eating pears" have relatively low semantic similarity. The task of a sentence similarity model is to judge accurately how similar two sentences are in meaning.

Training a traditional sentence similarity model requires a data set consisting of sentence pairs and their similarity scores, and the model is trained on that data set. However, such labeled data is scarce: the similarity between two sentences has to be labeled manually, and because similarity can be judged from many aspects, manual scoring is inefficient and the existing labeled data sets are small. For example, the commonly used STS data set has only 8600 training samples and the SICK data set has only 9800, so neither reaches the ten-thousand scale, and models trained on them are not good enough.

Disclosure of Invention

The invention provides a method and an apparatus for measuring sentence similarity, a storage medium and a computer device, which can calculate sentence similarity without labeled data, reduce the dependence on labeled data, and have a simple calculation process.

In order to solve the above problems, the present invention adopts a technical solution that: a method for measuring sentence similarity is provided, which comprises the following steps:

performing unsupervised learning on the context matching relationship of each sentence in a predetermined unlabeled corpus by using a language model tool to obtain a context matching model;

obtaining, from the unlabeled corpus, the contexts related to a plurality of sentences whose similarity is to be calculated to obtain a shared context set, using the context matching model to calculate a context score between each sentence whose similarity is to be calculated and each context in the shared context set, and forming a context score vector for each such sentence from all of its context scores; and

calculating the cosine similarity between each context score vector and the remaining context score vectors, thereby obtaining the sentence similarity between the sentence corresponding to that context score vector and the remaining sentences whose similarity is to be calculated.

The invention adopts another technical scheme that: there is provided a sentence similarity measuring apparatus, including:

a module for performing unsupervised learning on the context matching relationship of each sentence in a predetermined unlabeled corpus by using a language model tool to obtain a context matching model;

a module for obtaining, from the unlabeled corpus, the contexts related to a plurality of sentences whose similarity is to be calculated to obtain a shared context set, using the context matching model to calculate a context score between each sentence whose similarity is to be calculated and each context in the shared context set, and forming a context score vector for each such sentence from all of its context scores; and

a module for calculating the cosine similarity between each context score vector and the remaining context score vectors, thereby obtaining the sentence similarity between the sentence corresponding to that context score vector and the remaining sentences whose similarity is to be calculated.

In another aspect of the present invention, a computer-readable storage medium is provided, which stores computer instructions, wherein the computer instructions are operable to perform the method for measuring sentence similarity in the above scheme.

In another aspect of the present invention, a computer device is provided, which includes a processor and a memory, the memory storing computer instructions, wherein the processor operates the computer instructions to perform the method for measuring sentence similarity in the above scheme.

The technical scheme of the invention can achieve the following beneficial effects: the invention provides a method and an apparatus for measuring sentence similarity, a storage medium and a computer device, which can calculate sentence similarity without labeled data, reduce the dependence on labeled data, and have a simple calculation process.

Drawings

In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings in the following description show some embodiments of the present application, and those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a schematic diagram of an embodiment of the sentence similarity measurement method according to the present invention;

FIG. 2 is a schematic diagram of a specific example of the sentence similarity measurement method according to the present invention;

FIG. 3 is a schematic diagram of an embodiment of the sentence similarity measurement apparatus according to the present invention.

With the above figures, there are shown specific embodiments of the present application, which will be described in more detail below. These drawings and written description are not intended to limit the scope of the inventive concepts in any manner, but rather to illustrate the inventive concepts to those skilled in the art by reference to specific embodiments.

Detailed Description

The preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings, so that those skilled in the art can more easily understand the advantages and features of the invention and so that the scope of the invention is defined more clearly.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Fig. 1 is a diagram illustrating an embodiment of a sentence similarity measurement method according to the present invention.

In this embodiment, the method for measuring sentence similarity mainly includes:

Process S101: performing unsupervised learning on the context matching relationship of each sentence in a predetermined unlabeled corpus by using a language model tool to obtain a context matching model;

Process S102: obtaining, from the unlabeled corpus, the contexts related to a plurality of sentences whose similarity is to be calculated to obtain a shared context set, using the context matching model to calculate a context score between each sentence whose similarity is to be calculated and each context in the shared context set, and forming a context score vector for each such sentence from all of its context scores;

Process S103: calculating the cosine similarity between each context score vector and the remaining context score vectors, thereby obtaining the sentence similarity between the sentence corresponding to that context score vector and the remaining sentences whose similarity is to be calculated.

With this sentence similarity measurement method, sentence similarity can be calculated without labeled data, the dependence on labeled data is reduced, and the calculation process is simple.
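The way processes S102 and S103 fit together can be sketched in Python as follows. This is only an illustrative sketch, not the patented implementation: the function name `measure_similarity` and the two callables `retrieve_contexts` and `match_score` are hypothetical stand-ins for the context retrieval step and for the context matching model obtained in process S101.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple


def measure_similarity(
    sentences: Sequence[str],
    retrieve_contexts: Callable[[str], List[str]],  # hypothetical retrieval step
    match_score: Callable[[str, str], float],       # hypothetical trained matcher
) -> Dict[Tuple[int, int], float]:
    """Return the cosine similarity for every pair of input sentences."""
    # S102, part 1: merge the contexts of all sentences into one shared set.
    shared: List[str] = []
    for sentence in sentences:
        for context in retrieve_contexts(sentence):
            if context not in shared:
                shared.append(context)

    # S102, part 2: score every sentence against every shared context.
    vectors = [[match_score(s, c) for c in shared] for s in sentences]

    # S103: cosine similarity between each pair of context score vectors.
    result: Dict[Tuple[int, int], float] = {}
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            norm_i = math.sqrt(sum(a * a for a in vectors[i]))
            norm_j = math.sqrt(sum(b * b for b in vectors[j]))
            result[(i, j)] = dot / (norm_i * norm_j) if norm_i and norm_j else 0.0
    return result
```

Passing the retrieval step and the matcher in as callables keeps the sketch independent of any particular language model or corpus index.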

In the embodiment shown in fig. 1, the sentence similarity measurement method of the present invention includes process S101, which uses a language model tool to perform unsupervised learning on the context matching relationship of each sentence in a predetermined unlabeled corpus to obtain a context matching model. Because the corpus is unlabeled, this process reduces the dependence on labeled data; the resulting context matching model is later used to obtain the context score vector of each sentence whose similarity is to be calculated, and from those vectors the sentence similarity.

Specifically, in practical application, the unlabeled corpus may be input into the language model tool, which performs unsupervised learning on the context matching relationship of each sentence in the corpus to obtain the context matching model. No data is labeled during this unsupervised learning; through it, the model acquires the ability to output a context matching score. The context matching model is then used to obtain the context score vector of each sentence whose similarity is to be calculated, and from those vectors the sentence similarity.

In one embodiment of the invention, the unlabeled corpus includes unlabeled data crawled from the Internet. This reduces the dependence on labeled data, since the context matching model can be trained on the unlabeled corpus alone.

Specifically, an unlabeled corpus can be formed by directly crawling a large amount of unlabeled data from the Internet, for example from encyclopedias, forums, news and social media. Here, unlabeled data means data that has not been annotated or otherwise processed.
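The text does not say which language model is used or how a "context" is delimited. As one possible illustration, the sketch below assumes that the contexts of a sentence are its neighbouring sentences in the same document, and builds (sentence, context) pairs that could serve as training material for a context matching model without any manual labels. The function name `build_context_pairs` and the `window` parameter are hypothetical.

```python
from typing import List, Tuple


def build_context_pairs(
    documents: List[List[str]], window: int = 1
) -> List[Tuple[str, str]]:
    """Pair each sentence with its neighbours (assumed to be its contexts).

    `documents` is a list of documents, each already split into sentences.
    No human annotation is involved: the pairing comes from position alone.
    """
    pairs: List[Tuple[str, str]] = []
    for sentences in documents:
        for i, sentence in enumerate(sentences):
            start = max(0, i - window)
            stop = min(len(sentences), i + window + 1)
            for j in range(start, stop):
                if j != i:  # a sentence is not its own context
                    pairs.append((sentence, sentences[j]))
    return pairs
```

A language model trained on such pairs could then estimate how likely a context is given a sentence, and vice versa, which is the kind of matching score the later processes rely on.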

In the embodiment shown in fig. 1, the sentence similarity measurement method of the present invention includes process S102: obtaining, from the unlabeled corpus, the contexts related to a plurality of sentences whose similarity is to be calculated to obtain a shared context set, using the context matching model to calculate a context score between each such sentence and each context in the shared context set, and forming a context score vector for each sentence from all of its context scores. This process produces the context score vector of each sentence, from which the sentence similarity is then obtained.

In an embodiment of the present invention, obtaining the shared context set includes: obtaining, from the unlabeled corpus, the contexts related to each of the sentences whose similarity is to be calculated; and merging all of these contexts to obtain the shared context set. The shared context set is then used to calculate the context score vectors.

In an embodiment of the invention, obtaining the contexts related to each of the sentences whose similarity is to be calculated includes retrieving them from the unlabeled corpus by using a term frequency-inverse document frequency (TF-IDF) algorithm. This reduces noise and improves the accuracy of the context score vectors obtained subsequently.

The TF-IDF algorithm is mature prior art; it is used here to find, relatively coarsely, the contexts related to the sentences whose similarity is to be calculated.
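A coarse retrieval step of this kind can be sketched in pure Python as shown below. The sketch is illustrative only: whitespace tokenization, the smoothed IDF variant, ranking by cosine similarity of TF-IDF vectors, and the names `retrieve_contexts` and `top_k` are all assumptions, since the text does not fix any of these details.

```python
import math
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    # Assumption: lower-cased whitespace tokens are good enough for a sketch.
    return text.lower().split()


def tfidf_vector(tokens: List[str], idf: Dict[str, float]) -> Dict[str, float]:
    counts = Counter(tokens)
    total = len(tokens) or 1
    # Terms unseen in the corpus get weight 0.
    return {t: (n / total) * idf.get(t, 0.0) for t, n in counts.items()}


def cosine_sparse(u: Dict[str, float], v: Dict[str, float]) -> float:
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve_contexts(query: str, corpus: List[str], top_k: int = 3) -> List[str]:
    """Return up to `top_k` corpus entries most related to `query`."""
    docs = [tokenize(d) for d in corpus]
    n_docs = len(docs)
    doc_freq = Counter(t for d in docs for t in set(d))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
    query_vec = tfidf_vector(tokenize(query), idf)
    scored = [
        (cosine_sparse(query_vec, tfidf_vector(d, idf)), i)
        for i, d in enumerate(docs)
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    # Entries sharing no term with the query score 0 and are dropped.
    return [corpus[i] for score, i in scored[:top_k] if score > 0.0]
```

Because the retrieval is only a coarse filter, the finer judgment of how well each retrieved context matches a sentence is left to the context matching model.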

Specifically, referring to the schematic diagram of a specific example of the sentence similarity measurement method shown in fig. 2, the contexts related to the two given sentences whose similarity is to be calculated are first obtained from the unlabeled corpus; that is, context C1 related to sentence S1 and context C2 related to sentence S2 are obtained. Suppose the given sentence S1 is "apple is a fruit" and the given sentence S2 is "I love eating pears". The context C1 related to sentence S1 is actually a context set, which may contain a plurality of contexts related to "apple is a fruit". If C1 contains 3 such contexts, they can be denoted by the labels a, b and c, so that C1 = {a, b, c}. Similarly, if C2 also contains 3 contexts related to "I love eating pears", they can be denoted by the labels d, e and f, so that C2 = {d, e, f}. The contexts C1 and C2 are then merged to obtain the shared context set C. The merge is a mathematical union, so the shared context set is C = C1 ∪ C2 = {a, b, c, d, e, f}.
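The union in this example can be reproduced with a few lines of Python. The helper name `merge_contexts` is hypothetical; the sketch keeps the contexts in first-seen order rather than using an unordered `set`, on the assumption that every sentence's score vector must list the contexts in the same order.

```python
from typing import Iterable, List


def merge_contexts(context_sets: Iterable[Iterable[str]]) -> List[str]:
    """Union of several context sets, deduplicated, in first-seen order."""
    shared: List[str] = []
    seen = set()
    for contexts in context_sets:
        for context in contexts:
            if context not in seen:
                seen.add(context)
                shared.append(context)
    return shared


C1 = ["a", "b", "c"]  # contexts related to sentence S1
C2 = ["d", "e", "f"]  # contexts related to sentence S2
C = merge_contexts([C1, C2])
print(C)  # ['a', 'b', 'c', 'd', 'e', 'f']
```

If the two sentences happened to share a context, it would appear only once in C, which is what makes the result a union rather than a concatenation.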

In a specific example of the present invention, the context matching model is used to calculate the context score between each sentence whose similarity is to be calculated and each context in the shared context set, and all the context scores of a sentence are then used to form its context score vector. Since the shared context set C = {a, b, c, d, e, f} of the given sentences S1 and S2 has already been obtained above, the context scores of sentence S1 against the shared context set C are calculated first: sentence S1 and the shared context set C are input into the context matching model, which produces a matching score between S1 and each context in C. For example, when sentence S1 is matched with context a in the shared context set C, the score of S1 and context a is [p(S1|a) + p(a|S1)] / 2. Assume that the score of S1 and context a is 2, and similarly that the score of S1 and context b is 3, the score of S1 and context c is 4, the score of S1 and context d is 2, the score of S1 and context e is 1, and the score of S1 and context f is 2. The context score vector of sentence S1 over the shared context set C is then v₁ = {2, 3, 4, 2, 1, 2}. The context scores of sentence S2 against the shared context set C are calculated in the same way as for S1; assume that the resulting context score vector of sentence S2 is v₂ = {1, 4, 3, 3, 2, 2}.
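The scoring step can be sketched as follows, reading the score formula in the text as the average of the two conditional probabilities p(S1|a) and p(a|S1). The callable `cond_prob` is a hypothetical interface to the context matching model; the text does not specify how the model exposes these probabilities.

```python
from typing import Callable, List, Sequence

# Hypothetical model interface: cond_prob(x, given) returns p(x | given).
CondProb = Callable[[str, str], float]


def match_score(sentence: str, context: str, cond_prob: CondProb) -> float:
    """Symmetric matching score: [p(sentence|context) + p(context|sentence)] / 2."""
    return (cond_prob(sentence, context) + cond_prob(context, sentence)) / 2.0


def score_vector(
    sentence: str, shared_contexts: Sequence[str], cond_prob: CondProb
) -> List[float]:
    """One score per shared context, in the order of `shared_contexts`."""
    return [match_score(sentence, c, cond_prob) for c in shared_contexts]
```

Averaging both directions makes the score symmetric in the sentence and the context, so neither a very generic sentence nor a very generic context dominates the result on its own.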

In the embodiment shown in fig. 1, the sentence similarity measurement method of the present invention includes process S103: calculating the cosine similarity between each context score vector and the remaining context score vectors, thereby obtaining the sentence similarity between the sentence corresponding to that context score vector and the remaining sentences whose similarity is to be calculated. The cosine similarity between the context score vectors serves as the measure of the sentence similarity.

In an embodiment of the present invention, referring to the schematic diagram of a specific example of the sentence similarity measurement method shown in fig. 2, the example of process S102 has already yielded the context score vector v₁ = {2, 3, 4, 2, 1, 2} of sentence S1 and the context score vector v₂ = {1, 4, 3, 3, 2, 2} of sentence S2 over the shared context set C. The cosine similarity Sim(v₁, v₂) between the context score vector v₁ and the context score vector v₂ is then calculated with the following formula:

Sim(v₁, v₂) = cos(v₁, v₂) = (v₁ · v₂) / (‖v₁‖ ‖v₂‖)

The cosine similarity Sim(v₁, v₂) calculated in this process reflects the size of the angle between the context score vector v₁ and the context score vector v₂, and it is positively correlated with the sentence similarity: the smaller the angle, the larger the cosine value and the more similar the sentences. Thus, through the processes summarized in the above example, the similarity between sentence S1 and sentence S2, i.e. between "apple is a fruit" and "I love eating pears", can be calculated.
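As a worked check of this step, the sketch below computes the cosine similarity for the two example vectors. For v₁ = {2, 3, 4, 2, 1, 2} and v₂ = {1, 4, 3, 3, 2, 2}, the dot product is 38, the squared norms are 38 and 43, and the cosine is 38 / √(38 × 43) ≈ 0.9401. The function name `cosine_similarity` is illustrative, and returning 0.0 for a zero vector is an assumption the text does not address.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length score vectors."""
    if len(u) != len(v):
        raise ValueError("score vectors must cover the same shared context set")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a zero vector is treated as dissimilar
    return dot / (norm_u * norm_v)


v1 = [2, 3, 4, 2, 1, 2]  # context score vector of sentence S1
v2 = [1, 4, 3, 3, 2, 2]  # context score vector of sentence S2
sim = cosine_similarity(v1, v2)
print(round(sim, 4))  # 0.9401
```

Note that the scores in the example are illustrative values assumed in the text, so the resulting figure says nothing about the real similarity of the two sentences.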

Fig. 3 is a schematic diagram illustrating an embodiment of a sentence similarity measuring apparatus according to the present invention.

In this embodiment, the sentence similarity measuring device mainly includes:

Module 301: a module for performing unsupervised learning on the context matching relationship of each sentence in a predetermined unlabeled corpus by using a language model tool to obtain a context matching model. Because the corpus is unlabeled, this module reduces the dependence on labeled data; the resulting context matching model is used to obtain the context score vector of each sentence whose similarity is to be calculated, and from those vectors the sentence similarity.

Module 302: a module for obtaining, from the unlabeled corpus, the contexts related to a plurality of sentences whose similarity is to be calculated to obtain a shared context set, using the context matching model to calculate a context score between each such sentence and each context in the shared context set, and forming a context score vector for each sentence from all of its context scores. This module produces the context score vector of each sentence, from which the sentence similarity is then obtained.

Module 303: a module for calculating the cosine similarity between each context score vector and the remaining context score vectors, thereby obtaining the sentence similarity between the sentence corresponding to that context score vector and the remaining sentences whose similarity is to be calculated. The cosine similarity between the context score vectors serves as the measure of the sentence similarity.

In an embodiment of the present invention, module 301 further includes a sub-module for obtaining the unlabeled corpus by crawling unlabeled data from the Internet. This sub-module reduces the dependence on labeled data, since the context matching model can be trained on the unlabeled corpus alone.

In an embodiment of the present invention, module 302 further includes a sub-module for retrieving the contexts related to each sentence whose similarity is to be calculated from the unlabeled corpus by using a term frequency-inverse document frequency (TF-IDF) algorithm. This sub-module reduces noise and improves the accuracy of the context score vectors obtained subsequently.

With this sentence similarity measurement apparatus, sentence similarity can be calculated without labeled data, the dependence on labeled data is reduced, and the calculation process is simple.

The sentence similarity measurement apparatus provided by the invention can be used to execute the sentence similarity measurement method described in any of the above embodiments; its implementation principle and technical effect are similar and are not repeated here.

In another embodiment of the present invention, a computer-readable storage medium stores computer instructions, wherein the computer instructions are operable to perform the sentence similarity measurement method described in any of the embodiments. The method may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two.

A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium.

The processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), another programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.

In one embodiment of the present application, a computer device includes a processor and a memory, the memory storing computer instructions, wherein the processor operates the computer instructions to perform the sentence similarity measurement method described in any of the embodiments.

In the embodiments provided in the present application, it should be understood that the disclosed system and method may be implemented in other ways. For example, the system embodiments described above are merely illustrative: the division into units is only a logical division, and an actual implementation may divide them differently; for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

The above embodiments are merely examples, which are not intended to limit the scope of the present disclosure, and all equivalent structural changes made by using the contents of the specification and the drawings, or any other related technical fields, are also included in the scope of the present disclosure.

Claims

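1. A method for measuring sentence similarity, characterized by comprising the following steps:

performing unsupervised learning on the context matching relationship of each sentence in a predetermined unlabeled corpus by using a language model tool to obtain a context matching model;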

obtaining, from the unlabeled corpus, the contexts related to a plurality of sentences whose similarity is to be calculated to obtain a shared context set, using the context matching model to calculate a context score between each sentence whose similarity is to be calculated and each context in the shared context set, and forming a context score vector for each such sentence from all of its context scores; and

calculating the cosine similarity between each context score vector and the remaining context score vectors, thereby obtaining the sentence similarity between the sentence corresponding to that context score vector and the remaining sentences whose similarity is to be calculated.

2. The method for measuring sentence similarity according to claim 1,

the unlabeled corpus includes unlabeled data crawled through the internet.

3. The method for measuring sentence similarity according to claim 1, wherein the process of obtaining, from the unlabeled corpus, the contexts related to the plurality of sentences whose similarity is to be calculated to obtain the shared context set comprises,

obtaining, from the unlabeled corpus, the contexts related to each of the plurality of sentences whose similarity is to be calculated;

and merging all the contexts to obtain the shared context set.

4. The method for measuring sentence similarity according to claim 3, wherein the process of obtaining, from the unlabeled corpus, the contexts related to each sentence whose similarity is to be calculated comprises,

obtaining the contexts related to each sentence whose similarity is to be calculated from the unlabeled corpus by using a term frequency-inverse document frequency algorithm.

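5. A sentence similarity measurement apparatus, characterized by comprising:

a module for performing unsupervised learning on the context matching relationship of each sentence in a predetermined unlabeled corpus by using a language model tool to obtain a context matching model;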

a module for obtaining, from the unlabeled corpus, the contexts related to a plurality of sentences whose similarity is to be calculated to obtain a shared context set, using the context matching model to calculate a context score between each sentence whose similarity is to be calculated and each context in the shared context set, and forming a context score vector for each such sentence from all of its context scores; and

a module for calculating the cosine similarity between each context score vector and the remaining context score vectors, thereby obtaining the sentence similarity between the sentence corresponding to that context score vector and the remaining sentences whose similarity is to be calculated.

6. The sentence similarity measurement apparatus according to claim 5, wherein the module for performing unsupervised learning on the context matching relationship of each sentence in the predetermined unlabeled corpus by using the language model tool to obtain the context matching model further comprises,

a sub-module for obtaining the unlabeled corpus by crawling unlabeled data from the Internet.

7. The sentence similarity measurement apparatus according to claim 5, wherein the module for obtaining, from the unlabeled corpus, the contexts related to the plurality of sentences whose similarity is to be calculated to obtain the shared context set, using the context matching model to calculate the context score between each sentence whose similarity is to be calculated and each context in the shared context set, and forming the context score vector for each such sentence from all of its context scores further comprises,

a sub-module for obtaining the contexts related to each sentence whose similarity is to be calculated from the unlabeled corpus by using a term frequency-inverse document frequency algorithm.

8. A computer-readable storage medium storing computer instructions, wherein the computer instructions are operable to perform the sentence similarity measurement method of any one of claims 1-4.

9. A computer device comprising a processor and a memory, the memory storing computer instructions, wherein the processor operates the computer instructions to perform the sentence similarity measurement method according to any one of claims 1-4.