
CN111401039B - Word retrieval method, device, equipment and storage medium based on binary mutual information - Google Patents


Info

Publication number
CN111401039B
CN111401039B (application CN202010146242.7A)
Authority
CN
China
Prior art keywords
target
word
words
mutual information
target word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010146242.7A
Other languages
Chinese (zh)
Other versions
CN111401039A (en)
Inventor
梁志成 (Liang Zhicheng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd filed Critical Ping An Life Insurance Company of China Ltd
Priority to CN202010146242.7A
Publication of CN111401039A
Application granted
Publication of CN111401039B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G06F 16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G06F 16/953 Querying, e.g. by the use of web search engines
    • G06F 16/9535 Search customisation based on user profiles and personalisation
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the field of artificial intelligence and discloses a word retrieval method, device, equipment and storage medium based on binary mutual information, used to increase the accuracy of keyword extraction and thereby retrieve an effective message response. The method comprises: obtaining a target question text sent by a target user; segmenting the target question text to obtain a plurality of candidate words; invoking a preset corpus to determine the term frequency TF and inverse document frequency IDF of a plurality of target words among the candidate words; calculating the binary mutual information of each target word according to a preset formula; obtaining an adjustment factor for each target word according to a preset algorithm; calculating a weight value for each target word from its adjustment factor and binary mutual information; determining the keywords of the target question text according to the weight values; and retrieving the corresponding answer according to the keywords.

Description

Word retrieval method, device, equipment and storage medium based on binary mutual information
Technical Field
The invention relates to the technical field of keyword matching, and in particular to a word retrieval method, device, equipment and storage medium based on binary mutual information.
Background
With the development of information technology, it is increasingly convenient to search for and obtain information from the network. The real challenge is quickly extracting the required information from this massive volume, which gave rise to information retrieval technology; one of its key supporting techniques is keyword extraction. Currently, the most widely used keyword extraction technique is the term frequency-inverse document frequency (TF-IDF) algorithm. Its basic principle is to score each word by its term frequency weighted by its inverse document frequency, and to select the top-scoring words as keywords.
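As a rough illustration of the TF-IDF principle just described, the following Python sketch scores and ranks the words of a toy document (the corpus, function names and smoothing choice are invented for the example; the `+1` smoothing mirrors the patent's IDF formula later in this document):

```python
import math
from collections import Counter

def tf_idf(term, doc, corpus):
    """Classic TF-IDF score: occurrences of `term` in `doc`, scaled by a
    smoothed log-inverse document frequency over `corpus`."""
    tf = Counter(doc)[term]
    df = sum(1 for d in corpus if term in d)   # documents containing the term
    idf = math.log2(len(corpus) / (df + 1))    # +1 smoothing
    return tf * idf

# Toy corpus of pre-segmented documents (hypothetical data).
corpus = [
    ["insurance", "claim", "rights", "insurance"],
    ["insurance", "policy", "premium"],
    ["weather", "news", "today"],
]
query_doc = corpus[0]
# Rank the query document's words by TF-IDF; the top words become keywords.
ranking = sorted(set(query_doc),
                 key=lambda w: tf_idf(w, query_doc, corpus),
                 reverse=True)
```

Note how "insurance" scores 0 despite appearing twice: it occurs in most documents, so its IDF vanishes, which is exactly the behaviour the patent's adjustment factor is later introduced to correct.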
In the existing scheme, the TF-IDF weight calculation depends too heavily on word frequency and ignores how a word is distributed across different documents. As a result, existing products extract information inaccurately and cannot retrieve an effective message response.
Disclosure of Invention
The invention provides a word retrieval method, device, equipment and storage medium based on binary mutual information. An adjustment factor is added to tune the word-frequency weight, and a mutual information factor is used to eliminate the influence of uneven document distribution on the weight. This reduces the TF-IDF algorithm's dependence on word frequency, improves the accuracy of the weight values and of keyword extraction, and thereby allows an effective message response to be retrieved.
The first aspect of the embodiment of the invention provides a word retrieval method based on binary mutual information, comprising: obtaining a target question text sent by a target user, the target question text being used to indicate that an answer corresponding to it should be obtained; segmenting the target question text to obtain a plurality of candidate words, each candidate word being unique; invoking a preset corpus to determine the term frequency TF and inverse document frequency IDF of a plurality of target words among the candidate words; calculating the binary mutual information of each of the target words according to a preset formula; obtaining an adjustment factor for each target word according to a preset algorithm; calculating a weight value for each target word from its adjustment factor and binary mutual information; determining a keyword of the target question text according to the weight values; and retrieving the corresponding answer according to the keyword.
Optionally, in a first implementation manner of the first aspect of the embodiment of the present invention, invoking the preset corpus to determine the term frequency TF and inverse document frequency IDF of the plurality of target words among the plurality of candidate words includes: performing stop-word filtering on the plurality of candidate words to obtain the plurality of target words; invoking the preset corpus to determine the term frequency TF of each target word; and invoking the preset corpus to determine the inverse document frequency IDF of each target word.
Optionally, in a second implementation manner of the first aspect of the embodiment of the present invention, invoking the preset corpus to determine the term frequency TF of each target word includes: obtaining the preset corpus; determining a target corpus document in the preset corpus; determining the number of occurrences T of each target word in the target corpus document; and generating the term frequency TF of each target word.
Optionally, in a third implementation manner of the first aspect of the embodiment of the present invention, invoking the preset corpus to determine the inverse document frequency IDF of each target word includes: obtaining the preset corpus and determining the total number M of corpus documents in it; determining the number Wi of documents among the M corpus documents that contain a first target word, where i is a positive integer and the first target word is any one of the plurality of target words; applying a first preset formula to the document count Wi and the total number M to generate the inverse document frequency of the first target word, where the first preset formula is IDF = log2(M / (Wi + 1)); and generating the inverse document frequency IDF of each target word in the same way.
Optionally, in a fourth implementation manner of the first aspect of the embodiment of the present invention, calculating the binary mutual information of each target word according to a preset formula includes: selecting any one target word from the plurality of target words as a candidate target word; determining the count of occurrences of the candidate target word in two consecutive corpus documents as a first count; determining the count of occurrences of the candidate target word in two sequential corpus documents as a second count; calculating the ratio of the first count to the second count to obtain a first ratio p(X | w_i, w_{i+k}, w_{i+1}); determining the binary mutual information mi(X, w_i, w_{i+k}, w_{i+1}) = log2 p(X | w_i, w_{i+k}, w_{i+1}) of the candidate target word from the first ratio; and generating the binary mutual information of the other target words to obtain the binary mutual information of each target word.
Optionally, in a fifth implementation manner of the first aspect of the embodiment of the present invention, the obtaining, according to a preset algorithm, an adjustment factor of each target word in the plurality of target words includes determining a current service scenario based on the target question text, dividing the plurality of target words into a spoken word and a key word based on the current service scenario, setting an adjustment factor corresponding to the spoken word as a negative number and setting an adjustment factor corresponding to the key word as a positive number, and generating an adjustment factor of each target word.
Optionally, in a sixth implementation manner of the first aspect of the embodiment of the present invention, the calculating, according to the adjustment factor of each target word and the binary mutual information of each target word, a weight value of each target word includes:
Selecting one target word from the plurality of target words as a second target word; determining the adjustment factor μ_x and binary mutual information mi(x) corresponding to the second target word; calculating the weight value f(x) of the second target word according to a preset formula, where the preset calculation formula is f(x) = mi(x) × TF + μ_x; and calculating the weight values of the other target words in the same way to obtain the weight value of each target word.
The second aspect of the embodiment of the invention provides a word retrieval device based on binary mutual information, comprising a first acquisition unit, a word segmentation unit, a calling determination unit, a first calculation unit, a second acquisition unit, a second calculation unit and a retrieval unit. The first acquisition unit is used to acquire a target question text sent by a target user, the target question text being used to indicate that an answer corresponding to it should be obtained; the word segmentation unit is used to segment the target question text into a plurality of candidate words, each candidate word being unique; the calling determination unit is used to invoke a preset corpus to determine the term frequency TF and inverse document frequency IDF of a plurality of target words among the candidate words; the first calculation unit is used to calculate the binary mutual information of each target word according to a preset formula; the second acquisition unit is used to acquire an adjustment factor for each target word according to a preset algorithm; the second calculation unit is used to calculate a weight value for each target word from its adjustment factor and binary mutual information; and the retrieval unit is used to determine a keyword of the target question text according to the weight values and to retrieve the answer corresponding to the keyword.
Optionally, in a first implementation manner of the second aspect of the embodiment of the present invention, the calling determination unit includes: a filtering module, configured to perform stop-word filtering on the plurality of candidate words to obtain the plurality of target words; a first determining module, configured to invoke the preset corpus to determine the term frequency TF of each target word; and a second determining module, configured to invoke the preset corpus to determine the inverse document frequency IDF of each target word.
Optionally, in a second implementation manner of the second aspect of the embodiment of the present invention, the first determining module is specifically configured to obtain a preset corpus, determine a target corpus document in the preset corpus, determine a number of times T of occurrence of each target word in the target corpus document, and generate a word frequency TF of each target word.
Optionally, in a third implementation manner of the second aspect of the embodiment of the present invention, the second determining module is specifically configured to: obtain the preset corpus and determine the total number M of corpus documents in it; determine the number Wi of documents among the M corpus documents that contain a first target word, where i is a positive integer and the first target word is any one of the plurality of target words; apply the first preset formula to the document count Wi and the total number M to generate the inverse document frequency of the first target word, where the first preset formula is IDF = log2(M / (Wi + 1)); and generate the inverse document frequency IDF of each target word in the same way.
Optionally, in a fourth implementation manner of the second aspect of the embodiment of the present invention, the first calculating unit is specifically configured to: select any one target word from the plurality of target words as a candidate target word; determine the count of occurrences of the candidate target word in two consecutive corpus documents as a first count; determine the count of occurrences of the candidate target word in two sequential corpus documents as a second count; calculate the ratio of the first count to the second count to obtain a first ratio p(X | w_i, w_{i+k}, w_{i+1}); determine the binary mutual information mi(X, w_i, w_{i+k}, w_{i+1}) = log2 p(X | w_i, w_{i+k}, w_{i+1}) of the candidate target word from the first ratio; and generate the binary mutual information of the other target words to obtain the binary mutual information of each target word.
Optionally, in a fifth implementation manner of the second aspect of the embodiment of the present invention, the second obtaining unit is specifically configured to determine a current service scenario based on the target question text, divide the plurality of target words into a spoken word and a keyword based on the current service scenario, set an adjustment factor corresponding to the spoken word to be a negative number, set an adjustment factor corresponding to the keyword to be a positive number, and generate an adjustment factor of each target word.
Optionally, in a sixth implementation manner of the second aspect of the embodiment of the present invention, the second calculating unit is specifically configured to:
Selecting one target word from the plurality of target words as a second target word; determining the adjustment factor μ_x and binary mutual information mi(x) corresponding to the second target word; calculating the weight value f(x) of the second target word according to a preset formula, where the preset calculation formula is f(x) = mi(x) × TF + μ_x; and calculating the weight values of the other target words in the same way to obtain the weight value of each target word.
A third aspect of the embodiment of the present invention provides a word retrieval device based on binary mutual information, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor, where the processor implements the word retrieval method based on binary mutual information according to any one of the foregoing embodiments when executing the computer program.
A fourth aspect of the embodiments of the present invention provides a computer-readable storage medium storing a computer program, which when executed by a processor, implements the steps of the word retrieval method based on binary mutual information according to any one of the above embodiments.
According to the technical scheme, a target question text sent by a target user is obtained, the target question text indicating that a corresponding answer should be obtained; the target question text is segmented into a plurality of candidate words, each unique; a preset corpus is invoked to determine the term frequency TF and inverse document frequency IDF of a plurality of target words among the candidate words; the binary mutual information of each target word is calculated according to a preset formula; an adjustment factor for each target word is obtained according to a preset algorithm; a weight value for each target word is calculated from its adjustment factor and binary mutual information; the keywords of the target question text are determined according to the weight values; and the corresponding answer is retrieved according to the keywords. In the embodiment of the invention, an adjustment factor is added to tune the word-frequency weight, a mutual information factor eliminates the influence of uneven document distribution on the weight, the TF-IDF algorithm's dependence on word frequency is reduced, the accuracy of the weight values and of keyword extraction is improved, and an effective message response is thereby retrieved.
Drawings
FIG. 1 is a diagram of one embodiment of a method for word retrieval based on binary mutual information in an embodiment of the present invention;
FIG. 2 is a diagram of another embodiment of a method for word retrieval based on binary mutual information according to an embodiment of the present invention;
FIG. 3 is a diagram of one embodiment of a word retrieval device based on binary mutual information in an embodiment of the present invention;
FIG. 4 is a diagram of another embodiment of a word retrieval device based on binary mutual information according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an embodiment of a word retrieval device based on binary mutual information in an embodiment of the present invention.
Detailed Description
The invention provides a word retrieval method, device, equipment and storage medium based on binary mutual information. An adjustment factor is added to tune the word-frequency weight, and a mutual information factor is used to eliminate the influence of uneven document distribution on the weight. This reduces the TF-IDF algorithm's dependence on word frequency, improves the accuracy of the weight values and of keyword extraction, and thereby allows an effective message response to be retrieved.
In order to enable those skilled in the art to better understand the present invention, embodiments of the present invention will be described below with reference to the accompanying drawings.
The terms "first," "second," "third," "fourth" and the like in the description and in the claims and in the above drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments described herein may be implemented in other sequences than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.
Referring to fig. 1, a flowchart of a word retrieval method based on binary mutual information provided in an embodiment of the present invention specifically includes:
101. And acquiring a target question text sent by a target user, wherein the target question text is used for indicating to acquire an answer corresponding to the target question text.
The server acquires a target question text sent by a target user, wherein the target question text is used for indicating to acquire an answer corresponding to the target question text.
For example, in a chat client robot, the target user asks the question "What rights does the Good Fortune insurance offer?" The server needs to retrieve the rights and interests associated with "Good Fortune insurance" and return them as the answer to the question.
It can be understood that the execution subject of the present invention may be a word retrieval device based on binary mutual information, and may also be a terminal or a server, which is not limited herein. The embodiment of the invention is described by taking a server as an execution main body as an example.
102. And segmenting the target question text to obtain a plurality of candidate words, wherein each candidate word has uniqueness.
The server segments the target question text to obtain a plurality of candidate words, each of which is unique. In this embodiment, the server uses a preset word segmentation algorithm to segment the target question text. Preset word segmentation algorithms include string-matching-based, understanding-based and statistics-based word segmentation algorithms.
It should be noted that: (1) A string-matching word segmentation algorithm matches the string to be analyzed against the entries of a "sufficiently large" machine dictionary according to a certain policy; if a string is found in the dictionary, the match succeeds (a word is identified). By scanning direction, string-matching segmentation divides into forward and reverse matching; by length preference, into maximum (longest) and minimum (shortest) matching; and by whether it is combined with part-of-speech tagging, into simple segmentation methods and integrated segmentation-and-tagging methods. (2) An understanding-based word segmentation algorithm has the computer simulate a person's understanding of the sentence in order to recognize words. The basic idea is to perform syntactic and semantic analysis during segmentation and use the syntactic and semantic information to resolve ambiguity. Such a system generally comprises three parts: a word segmentation subsystem, a syntactic-semantic subsystem and a general control part. Under the coordination of the general control part, the word segmentation subsystem obtains syntactic and semantic information about the relevant words and sentences to judge segmentation ambiguity, i.e., it simulates a person's process of understanding the sentence. (3) A statistics-based word segmentation algorithm rests on the observation that a word is a stable combination of characters: the more often adjacent characters appear together in context, the more likely they form a word.
Therefore, the frequency or probability with which adjacent characters co-occur reflects the credibility of their forming a word. The co-occurrence frequency of adjacent character combinations in the corpus can be counted and their mutual information calculated. The mutual information reflects how tightly Chinese characters combine; when the tightness exceeds a certain threshold, the character group is considered likely to constitute a word. Other word segmentation methods may of course be employed; no specific limitation is made here.
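Of the three families above, the string-matching approach is the simplest to sketch. The following forward maximum matching routine is an illustration only (the toy dictionary and the single-character fallback are assumptions; a real system would use a large dictionary and handle Chinese characters the same way):

```python
def forward_max_match(text, dictionary, max_len=4):
    """Forward maximum matching: at each position, greedily take the longest
    dictionary entry; fall back to a single character when nothing matches."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in dictionary or size == 1:
                words.append(chunk)
                i += size
                break
    return words
```

Reverse maximum matching is the mirror image, scanning from the end of the string; both rely entirely on dictionary coverage, which is why the understanding-based and statistics-based families exist.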
For example, in a chat client robot, the target user asks "What rights does the Good Fortune insurance offer?"; after word segmentation, words such as "good fortune", "insurance", "equity", "which" and "have" are obtained.
103. And calling a preset corpus to determine the term frequency TF and inverse document frequency IDF of a plurality of target words in the plurality of candidate words.
The server invokes a preset corpus to determine the term frequency TF and inverse document frequency IDF of a plurality of target words among the candidate words. Specifically, the server performs stop-word filtering on the candidate words to obtain the target words, then invokes the preset corpus to determine the term frequency TF of each target word, and invokes the preset corpus again to determine the inverse document frequency IDF of each target word.
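A minimal sketch of this step, assuming the patent's smoothed formula IDF = log2(M / (Wi + 1)) (the function names are invented for illustration):

```python
import math

def term_frequency(word, doc):
    """TF: number of times the word occurs in the target corpus document."""
    return doc.count(word)

def inverse_document_frequency(word, corpus):
    """IDF per the first preset formula: log2(M / (Wi + 1)), where M is the
    total number of corpus documents and Wi the number containing the word."""
    M = len(corpus)
    Wi = sum(1 for doc in corpus if word in doc)
    return math.log2(M / (Wi + 1))
```

The +1 in the denominator keeps the formula defined even for words that appear in no corpus document.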
104. And calculating the binary mutual information of each target word in the plurality of target words according to a preset formula.
And the server calculates the binary mutual information of each target word in the plurality of target words according to a preset formula. The method specifically comprises the following steps:
(1) The server selects any one target word from the plurality of target words as a candidate target word.
(2) The server determines the count C(X | w_i, w_{i+1}) of the candidate target word appearing in two consecutive corpus documents, obtaining a first count.
Here C(X | w_i, w_{i+1}) counts the occurrences of the target word X in pairs of consecutive documents w_i, w_{i+1}. For example, if documents w_1, w_2 and w_3 all contain word A, the first count is 2 (the pairs w_1,w_2 and w_2,w_3).
(3) The server determines the count C(X | w_i, w_{i+k}) of the candidate target word appearing in two sequential corpus documents, taking it as a second count.
Here C(X | w_i, w_{i+k}) counts the occurrences of the target word X in any ordered pair of documents w_i, w_{i+k}. For example, if documents w_1, w_2 and w_3 all contain word A, the second count is 3 (the pairs w_1,w_2; w_1,w_3; and w_2,w_3).
(4) The server calculates the ratio of the first count to the second count, obtaining a first ratio p(X | w_i, w_{i+k}, w_{i+1}) = C(X | w_i, w_{i+1}) / C(X | w_i, w_{i+k}).
Note that "consecutive" means adjacent and in order: w_1,w_2 is a consecutive pair while w_1,w_3 is not, and w_2,w_1 does not count because the order is reversed.
(5) The server determines the binary mutual information mi(X, w_i, w_{i+k}, w_{i+1}) = log2 p(X | w_i, w_{i+k}, w_{i+1}) of the candidate target word from the first ratio p(X | w_i, w_{i+k}, w_{i+1}).
For example, for keywords A and B there are 10 corpus documents w_1, w_2, w_3, ..., w_10, where corpus document w_1 contains x_1 occurrences of keyword A and y_1 of keyword B, corpus document w_2 contains x_2 occurrences of keyword A and y_2 of keyword B, and corpus document w_10 contains y_3 occurrences of keyword B. The binary mutual information factors of keywords A and B, mi(A) and mi(B), are then obtained by applying the above formula to these counts.
(6) The server generates binary mutual information of other target words in the target words to obtain the binary mutual information of each target word.
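The counting steps above can be sketched as follows. Since the original formulas were lost in extraction, this reconstruction assumes the first count runs over consecutive document pairs and the second over all ordered pairs, matching the worked example with w_1, w_2, w_3:

```python
import math

def binary_mutual_information(word, docs):
    """mi(X) = log2(C_consecutive / C_all_pairs): how concentrated the
    word's co-occurrences are in adjacent documents of the corpus."""
    consecutive = sum(1 for i in range(len(docs) - 1)
                      if word in docs[i] and word in docs[i + 1])
    all_pairs = sum(1 for i in range(len(docs))
                    for k in range(i + 1, len(docs))
                    if word in docs[i] and word in docs[k])
    if consecutive == 0 or all_pairs == 0:
        return float("-inf")  # never co-occurs: minimally informative
    return math.log2(consecutive / all_pairs)
```

For the text's example of word A in w_1, w_2 and w_3, this gives mi(A) = log2(2/3), negative because only two of the three co-occurring pairs are adjacent.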
105. And acquiring an adjustment factor of each target word in the plurality of target words according to a preset algorithm.
The server obtains the adjustment factor of each target word according to a preset algorithm: the server determines the current service scenario based on the target question text, divides the target words into spoken words and key words based on that scenario, sets the adjustment factor of each spoken word to a negative number and that of each key word to a positive number, and thereby generates the adjustment factor of each target word.
For example, the weight of the spoken (filler) word "hello" needs to be reduced, so "hello" is given a μ value smaller than 0 at initialization (default -1); the weight of the keyword "good fortune insurance" needs to be increased, so it is given a μ value larger than 0 at initialization (default 1). That is, the weighted word bag is initialized as {"hello": -1, "good fortune insurance": 1}.
It should be noted that μ may also be fine-tuned. For example, a keyword a extracted according to the word weights calculated by the BOOST algorithm is compared with a preset keyword b: if a is related to b, μa is fine-tuned to μa = μa + w, where w may be set to 1/100 of the initial value of μa; if a is not related to b, μa is fine-tuned to μa = μa - w.
106. And calculating the weight value of each target word according to the adjustment factor of each target word and the binary mutual information of each target word.
And the server calculates the weight value of each target word according to the adjustment factor of each target word and the binary mutual information of each target word. Specifically, the server selects one target word from the plurality of target words as a second target word, determines the adjustment factor μx and the binary mutual information mi(x) corresponding to the second target word, calculates the weight value f(x) of the second target word according to a preset calculation formula f(x) = mi(x) × TF + μx, and then calculates the weight values of the other target words among the plurality of target words to obtain the weight value of each target word.
For example, let "Fubao" be word A and "equity" be word B. If calculated according to the existing TF-IDF algorithm, TF×IDF of A is smaller than TF×IDF of B in corpus document w1, so the weight of "equity" would be greater than the weight of "Fubao", even though "Fubao" is the more representative word. Because "Fubao" is concentrated in a few corpus documents and the TF-IDF algorithm itself depends strongly on word frequency, the weight of "Fubao" is weakened and therefore needs to be adjusted.
For example, for "Fubao", assuming "Fubao" is word A, to increase the weight value of word A the adjustment factor needs to be set greater than 0; here μA = 1 is set, and the weight value is calculated as f(A) = mi(A) × TF + 1.
For example, for "equity", assuming "equity" is word B, to decrease the weight value of word B the adjustment factor needs to be set less than 0; here μB = -1 is set, and the weight value is calculated as f(B) = mi(B) × TF - 1. Thus f(A) > f(B), i.e., f(Fubao) > f(equity).
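The preset formula f(x) = mi(x) × TF + μx can be sketched as follows (a minimal illustration; the numeric inputs are made up for the sketch and are not taken from the embodiment):

```python
def weight_value(mi_x: float, tf: float, mu_x: float) -> float:
    """Weight of a target word: binary mutual information times word
    frequency, plus the adjustment factor mu."""
    return mi_x * tf + mu_x

# Word A ("Fubao"): boosted with mu = +1; word B ("equity"): demoted with mu = -1.
f_a = weight_value(mi_x=1.0, tf=1.0, mu_x=1.0)    # 2.0
f_b = weight_value(mi_x=0.1, tf=10.0, mu_x=-1.0)  # 0.0
print(f_a > f_b)  # True: the boosted keyword outranks the frequent word
```

Even though word B has a much higher raw frequency, the adjustment factor lets the more representative word A end up with the larger weight.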
107. And determining the keywords of the target question text according to the weight value of each target word, and retrieving the corresponding answers according to the keywords.
And the server determines the keywords of the target question text according to the weight value of each target word, and retrieves the corresponding answers according to the keywords.
It will be appreciated that a keyword corresponds to one or more answers, for example, if the keyword is "profitability," the corresponding answer may be "5%", "10%" or "20%", but may also be other values. If the keyword is "scene", the corresponding answer may be "loan", "regular financing" or "mortgage", or may be other scene types.
According to the embodiment of the invention, the weight value of the word frequency is adjusted by adding one adjustment factor, the influence of different document distribution conditions on the weight value is eliminated by adopting the mutual information factor, the dependence degree of the TF-IDF algorithm on the word frequency is reduced, the accuracy of the weight value is improved, the accuracy of extracting the key words is increased, and further, the effective message response is searched.
Referring to fig. 2, another flowchart of a word retrieval method based on binary mutual information provided in an embodiment of the present invention specifically includes:
201. and acquiring a target question text sent by a target user, wherein the target question text is used for indicating to acquire an answer corresponding to the target question text.
The server acquires a target question text sent by a target user, wherein the target question text is used for indicating to acquire an answer corresponding to the target question text.
For example, in a chat client robot, the target user asks the question "What rights does Fubao insurance have?", and the server needs to retrieve the rights associated with "Fubao" and take the retrieved rights as the answer to the question.
It can be understood that the execution subject of the present invention may be a word retrieval device based on binary mutual information, and may also be a terminal or a server, which is not limited herein. The embodiment of the invention is described by taking a server as the execution subject as an example.
202. And segmenting the target problem text to obtain a plurality of candidate words, wherein each candidate word has uniqueness.
The server performs word segmentation on the target problem text to obtain a plurality of candidate words, and each candidate word has uniqueness. In this embodiment, the server uses a preset word segmentation algorithm to segment the target problem text. The preset word segmentation algorithm comprises a word segmentation algorithm based on character string matching, a word segmentation algorithm based on understanding and a word segmentation algorithm based on statistics.
It should be noted that: (1) For the word segmentation algorithm based on character string matching, the character string to be analyzed is matched against the entries in a "sufficiently large" machine dictionary according to a certain policy; if a certain character string is found in the dictionary, the matching succeeds (a word is identified). According to the scanning direction, string matching word segmentation can be divided into forward matching and reverse matching; according to which length is matched preferentially, it can be divided into maximum (longest) matching and minimum (shortest) matching; and according to whether it is combined with a part-of-speech tagging process, it can be divided into simple word segmentation methods and integrated methods of word segmentation and tagging.
(2) The word segmentation algorithm based on understanding achieves word recognition by having the computer simulate a person's understanding of the sentence. The basic idea is to perform syntactic and semantic analysis while segmenting, and to use the syntactic and semantic information to resolve ambiguity. Such a system generally comprises three parts: a word segmentation subsystem, a syntactic-semantic subsystem, and a general control part. Under the coordination of the general control part, the word segmentation subsystem obtains syntactic and semantic information about the relevant words and sentences to judge word segmentation ambiguity, i.e., it simulates a person's process of understanding a sentence.
(3) For the statistical word segmentation algorithm, a word is formally a stable combination of characters; therefore, the more often adjacent characters appear together in context, the more likely they are to form a word. The frequency or probability of adjacent character co-occurrence thus reflects the credibility of a candidate word. The frequency of each adjacent co-occurring character combination in the corpus can be counted, and their mutual information calculated. The mutual information reflects how tightly Chinese characters combine; when the tightness exceeds a certain threshold, the character group may be considered to constitute a word. It will be appreciated that other word segmentation methods may also be employed, which is not limited here.
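As a minimal illustration of the string-matching family described in (1), a hypothetical forward maximum matching segmenter might look like this (the dictionary and the maximum word length are assumptions made for the sketch):

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy forward maximum matching: at each position take the longest
    dictionary entry that matches; fall back to a single character when
    nothing in the dictionary matches."""
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

print(forward_max_match("abcde", {"ab", "cde"}))  # ['ab', 'cde']
```

Reverse matching would scan from the end of the string instead; minimum matching would try the shortest length first.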
For example, in a chat client robot, the target user asks the question "What rights does Fubao insurance have?"; words such as "Fubao", "insurance", "equity", "which", and "have" are obtained after word segmentation.
203. And performing deactivated word filtering processing on the candidate words to obtain a plurality of target words.
And the server performs stop-word filtering on the plurality of candidate words to obtain a plurality of target words. The stop words may include functional words, such as the Chinese equivalents of "that", "those", and "and", or English words such as "the", "a", "an", "that", and "those".
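A sketch of the stop-word filtering step, assuming a small hypothetical stop-word list:

```python
# Hypothetical stop-word list; a real list would hold the functional
# words mentioned above ("the", "a", "an", "that", "those", ...).
STOP_WORDS = {"the", "a", "an", "that", "those", "which", "have"}

def filter_stop_words(candidate_words):
    """Drop stop words so only content-bearing target words remain."""
    return [w for w in candidate_words if w.lower() not in STOP_WORDS]

print(filter_stop_words(["Fubao", "insurance", "which", "rights", "have"]))
# ['Fubao', 'insurance', 'rights']
```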
204. And calling a preset corpus to determine word frequency TF of each target word in the plurality of target words.
The server invokes a preset corpus to determine the word frequency TF of each target word in the plurality of target words. Specifically, the server obtains the preset corpus and determines the target corpus documents in the preset corpus, determines the number of occurrences T of each target word in the target corpus documents, and generates the word frequency TF of each target word.
For example, in an existing corpus of 10 documents w1, w2, w3, ..., w10, "Fubao" is concentrated in corpus documents w1 and w2, appearing once in each of w1 and w2, so the word frequency of "Fubao" is TF = 1; "equity" is distributed in corpus documents w1, w2, and w4, appearing 10 times in each of these corpus documents, so the word frequency of "equity" is TF = 10.
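A sketch of this reading of TF (the raw occurrence count T of a word within one target corpus document; this interpretation of the example is an assumption):

```python
def word_frequency(word, document_tokens):
    """TF as read from the example above: the raw count T of the word
    within one tokenized target corpus document."""
    return document_tokens.count(word)

# Toy document w1: "Fubao" appears once, "equity" appears 10 times.
w1 = ["Fubao", "equity"] + ["equity"] * 9
print(word_frequency("Fubao", w1))   # 1
print(word_frequency("equity", w1))  # 10
```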
205. And calling a preset corpus to determine the reverse document frequency IDF of each target word in the plurality of target words.
The server invokes a preset corpus to determine the reverse document frequency IDF of each target word in the plurality of target words. Specifically, the server obtains the preset corpus and determines the total number M of corpus documents in the preset corpus; the server determines the number Wi of documents containing the first target word among the M corpus documents, where i is a positive integer and the first target word is any one of the plurality of target words; the server invokes a first preset formula, IDF = log2(M / (Wi + 1)), with the document number Wi and the total number M to generate the reverse document frequency IDF of the first target word; and the server likewise generates the reverse document frequency IDF of each target word.
For example, in an existing corpus of 10 documents w1, w2, w3, ..., w10, "Fubao" is concentrated in corpus documents w1 and w2, so the reverse document frequency of "Fubao" takes the value IDF = log2(10 / (2 + 1)) = log2(10/3) ≈ 1.74; "equity" is distributed in corpus documents w1, w2, and w4, so the reverse document frequency of "equity" takes the value IDF = log2(10 / (3 + 1)) = log2(2.5) ≈ 1.32.
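A sketch of the first preset formula IDF = log2(M / (Wi + 1)) on a toy corpus of 10 documents:

```python
import math

def reverse_document_frequency(word, documents):
    """IDF = log2(M / (Wi + 1)), where M is the total number of corpus
    documents and Wi the number of documents containing the word."""
    m = len(documents)
    wi = sum(1 for doc in documents if word in doc)
    return math.log2(m / (wi + 1))

# 10 documents; "Fubao" appears in 2 of them, "equity" in 3.
docs = [{"Fubao", "equity"}, {"Fubao", "equity"}, {"other"}, {"equity"}] + [{"other"}] * 6
print(round(reverse_document_frequency("Fubao", docs), 2))   # 1.74
print(round(reverse_document_frequency("equity", docs), 2))  # 1.32
```

The +1 in the denominator keeps the formula defined even for a word that appears in no document.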
206. And calculating the binary mutual information of each target word in the plurality of target words according to a preset formula.
And the server calculates the binary mutual information of each target word in the plurality of target words according to a preset formula. The method specifically comprises the following steps:
(1) The server selects any one target word from the plurality of target words as a candidate target word.
(2) The server determines the count of the candidate target word appearing in pairs of two consecutive corpus documents, denoted count(X | wi, wi+1), to obtain a first count.
Wherein count(X | wi, wi+1) counts the occurrences of the target word X in two consecutive documents wi, wi+1; e.g., if documents w1, w2, and w3 all contain word A, the first count of A is 2 (the consecutive pairs w1, w2 and w2, w3).
(3) The server determines the count of the candidate target word appearing in pairs of two sequential corpus documents, denoted count(X | wi, wi+k), as a second count.
Wherein count(X | wi, wi+k) counts the occurrences of the target word X in two sequential documents wi, wi+k (ordered but not necessarily adjacent); e.g., if documents w1, w2, and w3 all contain word A, the second count of A is 3 (the sequential pairs w1, w2; w1, w3; and w2, w3).
(4) The server calculates the ratio of the first count to the second count to obtain a first ratio p(X|wi, wi+k, wi+1);
Wherein the first ratio p(X|wi, wi+k, wi+1) is the ratio of the count of word X appearing in pairs of consecutive documents to the count of X appearing in pairs of sequential documents. For example, w1, w2 form a consecutive pair; w1, w3 form a sequential but non-consecutive pair; and w2, w1 is not a sequential pair, since the order of the documents matters.
(5) The server determines the binary mutual information mi(X, wi, wi+k, wi+1) = log2 p(X|wi, wi+k, wi+1) of the candidate target word according to the first ratio p(X|wi, wi+k, wi+1);
For example, for keywords A and B, there are 10 corpus documents w1, w2, w3, ..., w10, where corpus document w1 contains x1 occurrences of keyword A and y1 occurrences of keyword B, corpus document w2 contains x2 occurrences of keyword A and y2 occurrences of keyword B, and corpus document w10 contains y3 occurrences of keyword B. The binary mutual information factors of keywords A and B are calculated as mi(A) and mi(B), respectively, according to the formula above.
(6) The server generates binary mutual information of other target words in the target words to obtain the binary mutual information of each target word.
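Steps (1) through (6) can be sketched as follows, under the assumed reading that the first count tallies consecutive document pairs containing the word and the second count tallies all sequential (ordered) pairs containing it:

```python
import math
from itertools import combinations

def binary_mutual_information(word, documents):
    """mi = log2(first count / second count): first count is the number of
    consecutive pairs (wi, wi+1) both containing the word; second count is
    the number of sequential pairs (wi, wi+k) both containing it."""
    present = [word in doc for doc in documents]
    first = sum(1 for i in range(len(documents) - 1)
                if present[i] and present[i + 1])
    second = sum(1 for i, j in combinations(range(len(documents)), 2)
                 if present[i] and present[j])
    if first == 0 or second == 0:
        return float("-inf")  # the ratio is undefined without co-occurrences
    return math.log2(first / second)

# Word A contained in w1, w2, w3: first count 2, second count 3.
docs = [{"A"}, {"A"}, {"A"}]
print(round(binary_mutual_information("A", docs), 3))  # -0.585
```

A word concentrated in adjacent documents drives the ratio toward 1 (mi toward 0), while a word scattered across distant documents yields a smaller ratio and a more negative mi.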
207. And acquiring an adjustment factor of each target word in the plurality of target words according to a preset algorithm.
And the server acquires the adjustment factor of each target word in the plurality of target words according to a preset algorithm. Specifically, the server determines the current service scene based on the target question text, divides the plurality of target words into spoken words and keywords based on the current service scene, sets the adjustment factor corresponding to each spoken word to a negative number and the adjustment factor corresponding to each keyword to a positive number, and thereby generates the adjustment factor of each target word.
For example, the weight of the spoken (filler) word "hello" needs to be reduced, so "hello" is given a μ value smaller than 0 (default -1) at initialization; the weight of the keyword "Fubao" needs to be increased, so "Fubao" is given a μ value larger than 0 (default 1) at initialization. That is, a weighted bag of words is initialized: ["hello": -1, "Fubao": 1].
It should be noted that μ may also be fine-tuned. For example, a keyword a extracted according to the word weights calculated by the BOOST algorithm is compared with a preset keyword b: if a is related to b, μa is fine-tuned to μa = μa + w, where w may be set to 1/100 of the initial value of μa; if a is not related to b, μa is fine-tuned to μa = μa - w.
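The initialization and fine-tuning of μ described above can be sketched as follows (the step size of 1/100 of the initial value follows the example; the function names are hypothetical):

```python
def init_adjustment_factors(spoken_words, keywords):
    """Initialize the weighted bag of words: mu = -1 for spoken (filler)
    words, mu = +1 for business keywords, as in the defaults above."""
    mu = {w: -1.0 for w in spoken_words}
    mu.update({w: 1.0 for w in keywords})
    return mu

def fine_tune(mu, word, related_to_preset_keyword, initial_mu):
    """Nudge mu by w = initial_mu / 100: upward when the extracted keyword
    is related to the preset keyword, downward otherwise."""
    step = abs(initial_mu) / 100.0
    mu[word] += step if related_to_preset_keyword else -step
    return mu

mu = init_adjustment_factors(spoken_words=["hello"], keywords=["Fubao"])
print(mu)  # {'hello': -1.0, 'Fubao': 1.0}
fine_tune(mu, "Fubao", related_to_preset_keyword=True, initial_mu=1.0)
print(mu["Fubao"])  # 1.01
```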
208. And calculating the weight value of each target word according to the adjustment factor of each target word and the binary mutual information of each target word.
And the server calculates the weight value of each target word according to the adjustment factor of each target word and the binary mutual information of each target word. Specifically, the server selects one target word from the plurality of target words as a second target word, determines the adjustment factor μx and the binary mutual information mi(x) corresponding to the second target word, calculates the weight value f(x) of the second target word according to a preset calculation formula f(x) = mi(x) × TF + μx, and then calculates the weight values of the other target words among the plurality of target words to obtain the weight value of each target word.
For example, let "Fubao" be word A and "equity" be word B. If calculated according to the existing TF-IDF algorithm, TF×IDF of A is smaller than TF×IDF of B in corpus document w1, so the weight of "equity" would be greater than the weight of "Fubao", even though "Fubao" is the more representative word. Because "Fubao" is concentrated in a few corpus documents and the TF-IDF algorithm itself depends strongly on word frequency, the weight of "Fubao" is weakened and therefore needs to be adjusted.
For example, for "Fubao", assuming "Fubao" is word A, to increase the weight value of word A the adjustment factor needs to be set greater than 0; here μA = 1 is set, and the weight value is calculated as f(A) = mi(A) × TF + 1.
For example, for "equity", assuming "equity" is word B, to decrease the weight value of word B the adjustment factor needs to be set less than 0; here μB = -1 is set, and the weight value is calculated as f(B) = mi(B) × TF - 1. Thus f(A) > f(B), i.e., f(Fubao) > f(equity).
209. And determining the keywords of the target question text according to the weight value of each target word, and retrieving the corresponding answers according to the keywords.
And the server determines the keywords of the target question text according to the weight value of each target word, and retrieves the corresponding answers according to the keywords.
It will be appreciated that a keyword corresponds to one or more answers, for example, if the keyword is "profitability," the corresponding answer may be "5%", "10%" or "20%", but may also be other values. If the keyword is "scene", the corresponding answer may be "loan", "regular financing" or "mortgage", or may be other scene types.
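A minimal sketch of step 209: keyword selection by maximum weight followed by an answer-index lookup (the index contents and function names are hypothetical):

```python
def pick_keyword(weight_values):
    """Take the target word with the highest weight value as the keyword."""
    return max(weight_values, key=weight_values.get)

def retrieve_answers(keyword, answer_index):
    """A keyword maps to one or more stored answers."""
    return answer_index.get(keyword, [])

weights = {"Fubao": 2.0, "equity": 0.0, "which": -1.0}
index = {"Fubao": ["equity description 1", "equity description 2"]}
keyword = pick_keyword(weights)
print(keyword)                           # Fubao
print(retrieve_answers(keyword, index))  # ['equity description 1', 'equity description 2']
```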
According to the technical scheme, a target question text sent by a target user is obtained, the target question text being used for indicating acquisition of an answer corresponding to the target question text; word segmentation is performed on the target question text to obtain a plurality of candidate words, each candidate word having uniqueness; a preset corpus is invoked to determine the word frequency TF and the reverse document frequency IDF of a plurality of target words among the plurality of candidate words; the binary mutual information of each target word in the plurality of target words is calculated according to a preset formula; the adjustment factor of each target word is acquired according to a preset algorithm; the weight value of each target word is calculated according to the adjustment factor and the binary mutual information of each target word; the keywords of the target question text are determined according to the weight value of each target word, and the corresponding answers are retrieved according to the keywords. According to the embodiment of the invention, the weight value of the word frequency is adjusted by adding an adjustment factor, the influence of different document distribution conditions on the weight value is eliminated by adopting the mutual information factor, the dependence of the TF-IDF algorithm on word frequency is reduced, the accuracy of the weight value is improved, the accuracy of keyword extraction is increased, and thus an effective message response is retrieved.
The word searching method based on the binary mutual information in the embodiment of the present invention is described above, and the word searching device based on the binary mutual information in the embodiment of the present invention is described below, referring to fig. 3, one embodiment of the word searching device based on the binary mutual information in the embodiment of the present invention includes:
a first obtaining unit 301, configured to obtain a target question text sent by a target user, where the target question text is used for indicating acquisition of an answer corresponding to the target question text;
The word segmentation unit 302 is configured to segment the target question text to obtain a plurality of candidate words, where each candidate word has uniqueness;
a calling determining unit 303, configured to call a preset corpus to determine word frequencies TF and inverse document frequencies IDF of a plurality of target words in the plurality of candidate words;
A first calculating unit 304, configured to calculate binary mutual information of each target word in the plurality of target words according to a preset formula;
a second obtaining unit 305, configured to obtain, according to a preset algorithm, an adjustment factor of each target word in the plurality of target words;
A second calculating unit 306, configured to calculate a weight value of each target word according to the adjustment factor of each target word and the binary mutual information of each target word;
A determining and retrieving unit 307, configured to determine a keyword of the target question text according to the weight value of each target word, and retrieve a corresponding answer according to the keyword.
According to the embodiment of the invention, the weight value of the word frequency is adjusted by adding one adjustment factor, the influence of different document distribution conditions on the weight value is eliminated by adopting the mutual information factor, the dependence on the word frequency is reduced, the accuracy of the weight value is improved, the accuracy of extracting the key words is increased, and then the effective message response is searched.
Referring to fig. 4, another embodiment of a word retrieving apparatus based on binary mutual information in an embodiment of the present invention includes:
a first obtaining unit 301, configured to obtain a target question text sent by a target user, where the target question text is used for indicating acquisition of an answer corresponding to the target question text;
The word segmentation unit 302 is configured to segment the target question text to obtain a plurality of candidate words, where each candidate word has uniqueness;
a calling determining unit 303, configured to call a preset corpus to determine word frequencies TF and inverse document frequencies IDF of a plurality of target words in the plurality of candidate words;
A first calculating unit 304, configured to calculate binary mutual information of each target word in the plurality of target words according to a preset formula;
a second obtaining unit 305, configured to obtain, according to a preset algorithm, an adjustment factor of each target word in the plurality of target words;
A second calculating unit 306, configured to calculate a weight value of each target word according to the adjustment factor of each target word and the binary mutual information of each target word;
A determining and retrieving unit 307, configured to determine a keyword of the target question text according to the weight value of each target word, and retrieve a corresponding answer according to the keyword.
Optionally, the call determining unit 303 includes:
A filtering module 3031, configured to perform deactivated word filtering processing on the multiple candidate words to obtain multiple target words;
A first determining module 3032, configured to invoke a preset corpus to determine word frequency TF of each target word in the plurality of target words;
A second determining module 3033 is configured to invoke a preset corpus to determine a reverse document frequency IDF of each of the plurality of target words.
Optionally, the first determining module 3032 is specifically configured to:
Obtain a preset corpus and determine the target corpus documents in the preset corpus; determine the number of occurrences T of each target word in the target corpus documents; and generate the word frequency TF of each target word.
Optionally, the second determining module 3033 is specifically configured to:
Obtain a preset corpus and determine the total number M of corpus documents in the preset corpus; determine the number Wi of documents containing a first target word among the M corpus documents, where i is a positive integer and the first target word is any one of the plurality of target words; invoke a first preset formula, IDF = log2(M / (Wi + 1)), with the document number Wi and the total number M to generate the reverse document frequency IDF of the first target word; and generate the reverse document frequency IDF of each target word.
Optionally, the first computing unit 304 is specifically configured to:
Select any one target word from the plurality of target words as a candidate target word; determine the count of the candidate target word appearing in pairs of two consecutive corpus documents as a first count; determine the count of the candidate target word appearing in pairs of two sequential corpus documents as a second count; calculate the ratio of the first count to the second count to obtain a first ratio p(X|wi, wi+k, wi+1); determine the binary mutual information mi(X, wi, wi+k, wi+1) = log2 p(X|wi, wi+k, wi+1) of the candidate target word according to the first ratio p(X|wi, wi+k, wi+1); and generate the binary mutual information of the other target words among the plurality of target words to obtain the binary mutual information of each target word.
Optionally, the second obtaining unit 305 is specifically configured to:
Determine the current service scene based on the target question text; divide the plurality of target words into spoken words and keywords based on the current service scene; set the adjustment factor corresponding to each spoken word to a negative number and the adjustment factor corresponding to each keyword to a positive number; and generate the adjustment factor of each target word.
Optionally, the second computing unit 306 is specifically configured to:
Select one target word from the plurality of target words as a second target word; determine the adjustment factor μx and the binary mutual information mi(x) corresponding to the second target word; calculate the weight value f(x) of the second target word according to the preset calculation formula f(x) = mi(x) × TF + μx; and calculate the weight values of the other target words among the plurality of target words to obtain the weight value of each target word.
The device obtains a target question text sent by a target user, the target question text being used for indicating acquisition of an answer corresponding to the target question text; performs word segmentation on the target question text to obtain a plurality of candidate words; invokes a preset corpus to determine the word frequency TF and the reverse document frequency IDF of a plurality of target words among the plurality of candidate words; calculates the binary mutual information of each target word in the plurality of target words according to a preset formula; acquires the adjustment factor of each target word according to a preset algorithm; calculates the weight value of each target word according to the adjustment factor and the binary mutual information of each target word; determines the keywords of the target question text according to the weight value of each target word, and retrieves the corresponding answers according to the keywords. According to the embodiment of the invention, the weight value of the word frequency is adjusted by adding an adjustment factor, the influence of different document distribution conditions on the weight value is eliminated by adopting the mutual information factor, the dependence on word frequency is reduced, the accuracy of the weight value is improved, the accuracy of keyword extraction is increased, and an effective message response is then retrieved.
The word retrieving device based on the binary mutual information in the embodiment of the present invention is described in detail from the point of view of the modularized functional entity in the above fig. 3 to 4, and the word retrieving device based on the binary mutual information in the embodiment of the present invention is described in detail from the point of view of hardware processing in the following.
Fig. 5 is a schematic structural diagram of a word retrieval device based on binary mutual information according to an embodiment of the present invention, where the word retrieval device 500 based on binary mutual information may differ considerably in configuration or performance, and may include one or more processors (central processing units, CPU) 501 (e.g., one or more processors), a memory 509, and one or more storage media 508 (e.g., one or more mass storage devices) storing application programs 507 or data 506. The memory 509 and the storage medium 508 may be transitory or persistent storage. The program stored on the storage medium 508 may include one or more modules (not shown), each of which may include a series of instruction operations on the word retrieval device based on binary mutual information. Further, the processor 501 may be configured to communicate with the storage medium 508 to execute the series of instruction operations in the storage medium 508 on the word retrieval device 500 based on binary mutual information.
The binary mutual information based word retrieval device 500 may also include one or more power supplies 502, one or more wired or wireless network interfaces 503, one or more input/output interfaces 504, and/or one or more operating systems 505, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and the like. It will be appreciated by those skilled in the art that the word retrieval device structure shown in fig. 5 does not constitute a limitation of the word retrieval device based on binary mutual information, and the device may include more or fewer components than illustrated, may combine certain components, or may have a different arrangement of components. The processor 501 may perform the functions of the first acquisition unit 301, the word segmentation unit 302, the call determination unit 303, the first calculation unit 304, the second acquisition unit 305, the second calculation unit 306, and the determination retrieval unit 307 in the above-described embodiments.
The following describes the respective constituent elements of the word retrieval device based on the binary mutual information in detail with reference to fig. 5:
The processor 501 is the control center of the word retrieval device based on binary mutual information, and performs processing according to the word retrieval method based on binary mutual information described above. The processor 501 connects the various parts of the whole word retrieval device using various interfaces and lines, and executes the various functions and data processing of the device by running or executing the software programs and/or modules stored in the memory 509 and calling the data stored in the memory 509, thereby reducing the dependence of the TF-IDF algorithm on word frequency, improving the accuracy of the weight values, increasing the accuracy of keyword extraction, and then retrieving an effective message response. The storage medium 508 and the memory 509 are both carriers for storing data; in the embodiment of the present invention, the storage medium 508 may refer to an internal memory with a small storage capacity but high speed, while the memory 509 may be an external memory with a large storage capacity but lower speed.
The memory 509 may be used to store software programs and modules, and the processor 501 performs various functional applications and data processing of the binary mutual information-based word retrieval device 500 by running the software programs and modules stored in the memory 509. The memory 509 may mainly include a storage program area that may store an operating system, an application program required for at least one function (such as word segmentation of a target question text to obtain a plurality of candidate words each having uniqueness), etc., and a storage data area that may store data created according to use of a word retrieval device based on binary mutual information (such as a weight value of each target word, etc.), etc. In addition, the memory 509 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. The word retrieval method program and received data stream based on binary mutual information provided in the embodiment of the present invention are stored in the memory, and when necessary, the processor 501 is called from the memory 509.
When the computer program instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, twisted pair) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, hard disk, or magnetic tape), an optical medium (e.g., a compact disc), or a semiconductor medium (e.g., a solid-state drive (SSD)), etc.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided in the present invention, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiment of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods according to the embodiments of the present invention. The storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or any other medium capable of storing program code.
While the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the foregoing embodiments may still be modified, or some of their features may be replaced with equivalents, and such modifications or substitutions do not depart from the spirit and scope of the embodiments of the present invention.

Claims (7)

1. A word retrieval method based on binary mutual information, characterized by comprising the following steps:
acquiring a target question text sent by a target user, wherein the target question text is used for indicating to acquire an answer corresponding to the target question text;
performing word segmentation on the target question text to obtain a plurality of candidate words, each candidate word being unique;
Invoking a preset corpus to determine word frequencies TF and reverse document frequencies IDF of a plurality of target words in the plurality of candidate words;
Calculating binary mutual information of each target word in the plurality of target words according to a preset formula;
Acquiring an adjustment factor of each target word in the plurality of target words according to a preset algorithm;
Calculating a weight value of each target word according to the adjustment factor of each target word and the binary mutual information of each target word;
Determining a keyword of the target question text according to the weight value of each target word, and retrieving a corresponding answer according to the keyword;
the calculating the binary mutual information of each target word in the plurality of target words according to a preset formula comprises the following steps:
Selecting any one target word from the plurality of target words as a candidate target word;
determining the count of occurrences of the candidate target word in two consecutive corpus documents to obtain a first count;
determining the count of occurrences of the candidate target word in two sequential corpus documents as a second count;
calculating the ratio of the first count to the second count to obtain a first ratio;
determining the binary mutual information of the candidate target word according to the first ratio;
generating the binary mutual information of the other target words among the plurality of target words to obtain the binary mutual information of each target word;
the obtaining the adjustment factor of each target word in the plurality of target words according to a preset algorithm comprises the following steps:
determining a current business scenario based on the target question text;
dividing the plurality of target words into spoken words and key words based on the current business scenario;
setting the adjustment factors corresponding to the spoken words to negative numbers and the adjustment factors corresponding to the key words to positive numbers;
generating an adjustment factor for each target word;
the calculating to obtain the weight value of each target word according to the adjustment factor of each target word and the binary mutual information of each target word comprises the following steps:
selecting one target word from the plurality of target words as a second target word, and determining the adjustment factor and the binary mutual information corresponding to the second target word;
obtaining the weight value of the second target word according to a preset calculation formula;
calculating the weight values of the other target words among the plurality of target words to obtain the weight value of each target word.
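The three calculations of claim 1 — the count ratio, the scenario-dependent adjustment factor, and the final weight — can be sketched as below. The patent's formula images are not reproduced in this text, so the logarithmic form of the binary mutual information and the multiplicative combination of factor and mutual information are assumptions made for illustration only, and the helper names are hypothetical.

```python
import math

def binary_mutual_information(first_count, second_count):
    # First ratio per claim 1: the count in two consecutive corpus
    # documents divided by the count in two sequential corpus documents.
    # Assumption: the binary mutual information is taken as the log of
    # this ratio (the usual PMI form); the patent's exact formula image
    # is not reproduced here.
    return math.log(first_count / second_count)

def adjustment_factor(word, spoken_words, key_words):
    # Claim 1: spoken (colloquial) words receive a negative factor and
    # business key words a positive one. The magnitudes are illustrative.
    if word in spoken_words:
        return -1.0
    if word in key_words:
        return 1.0
    return 0.0  # assumption: other words are neither boosted nor penalized

def weight_value(factor, bmi):
    # Assumption: the (unreproduced) preset formula combines the
    # adjustment factor and the binary mutual information multiplicatively.
    return factor * bmi
```

A spoken word thus ends up with a negative weight and is never selected as the keyword, which matches the claim's intent of suppressing colloquial terms.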
2. The method for word retrieval based on binary mutual information according to claim 1, wherein the invoking a preset corpus to determine word frequencies TF and reverse document frequencies IDF of a plurality of target words in the plurality of candidate words comprises:
performing stop-word filtering on the plurality of candidate words to obtain a plurality of target words;
Invoking a preset corpus to determine word frequency TF of each target word in a plurality of target words;
And calling a preset corpus to determine the reverse document frequency IDF of each target word in the plurality of target words.
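The stop-word filtering step of claim 2 can be sketched as below; the stop-word list shown is a hypothetical placeholder, since the patent does not enumerate one.

```python
# Hypothetical stop-word list; a real system would load one from a file
# or a linguistic resource for the target language.
STOP_WORDS = {"the", "a", "of", "is"}

def filter_stop_words(candidate_words):
    # Keep only candidates that are not stop words. The candidates are
    # already unique per claim 1, so order-preserving filtering suffices.
    return [w for w in candidate_words if w not in STOP_WORDS]
```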
3. The method for word retrieval based on binary mutual information according to claim 2, wherein the invoking the preset corpus to determine the word frequency TF of each target word of the plurality of target words comprises:
acquiring a preset corpus and determining a target corpus document in the preset corpus;
And determining the occurrence times T of each target word in the target corpus document, and generating the word frequency TF of each target word.
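Claim 3 defines TF directly as the number of occurrences T of a target word in the target corpus document. A minimal sketch, using raw counts with no length normalization since the claim specifies none:

```python
def term_frequency(target_word, target_document_tokens):
    # TF per claim 3: the number of times T the target word occurs
    # in the (tokenized) target corpus document.
    return sum(1 for token in target_document_tokens if token == target_word)
```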
4. The method for word retrieval based on binary mutual information according to claim 2, wherein the invoking the preset corpus to determine the reverse document frequency IDF of each of the plurality of target words comprises:
acquiring a preset corpus, and determining the total number M of corpus documents in the preset corpus;
determining the number of documents that include a first target word among the M corpus documents, wherein the first target word is any one of the plurality of target words;
invoking a first preset formula with the number of documents and the total number M to generate the reverse document frequency IDF of the first target word;
generating the reverse document frequency IDF of each target word.
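Claim 4 builds the IDF of a target word from the total document count M and the number of documents containing the word. The first preset formula appears as an image and is not reproduced in this text, so the classic smoothed form log(M / (m + 1)) is assumed here purely for illustration:

```python
import math

def inverse_document_frequency(total_docs, docs_with_word):
    # Assumption: IDF = log(M / (m + 1)); the +1 in the denominator
    # guards against words that appear in no document. The patent's
    # exact preset formula may differ.
    return math.log(total_docs / (docs_with_word + 1))
```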
5. A word retrieval device based on binary mutual information, comprising:
The first acquisition unit is used for acquiring a target question text sent by a target user, wherein the target question text is used for indicating to acquire an answer corresponding to the target question text;
the word segmentation unit is used for performing word segmentation on the target question text to obtain a plurality of candidate words, each candidate word being unique;
the invoking determining unit is used for invoking a preset corpus to determine word frequencies TF and reverse document frequencies IDF of a plurality of target words in the plurality of candidate words;
The first calculation unit is used for calculating binary mutual information of each target word in the plurality of target words according to a preset formula;
The second acquisition unit is used for acquiring the adjustment factor of each target word in the plurality of target words according to a preset algorithm;
the second calculation unit is used for calculating the weight value of each target word according to the adjustment factor of each target word and the binary mutual information of each target word;
the determining and searching unit is used for determining the keywords of the target question text according to the weight value of each target word and searching the corresponding answers according to the keywords;
the calculating the binary mutual information of each target word in the plurality of target words according to a preset formula comprises the following steps:
Selecting any one target word from the plurality of target words as a candidate target word;
determining the count of occurrences of the candidate target word in two consecutive corpus documents to obtain a first count;
determining the count of occurrences of the candidate target word in two sequential corpus documents as a second count;
calculating the ratio of the first count to the second count to obtain a first ratio;
determining the binary mutual information of the candidate target word according to the first ratio;
generating the binary mutual information of the other target words among the plurality of target words to obtain the binary mutual information of each target word;
the obtaining the adjustment factor of each target word in the plurality of target words according to a preset algorithm comprises the following steps:
determining a current business scenario based on the target question text;
dividing the plurality of target words into spoken words and key words based on the current business scenario;
setting the adjustment factors corresponding to the spoken words to negative numbers and the adjustment factors corresponding to the key words to positive numbers;
generating an adjustment factor for each target word;
the calculating to obtain the weight value of each target word according to the adjustment factor of each target word and the binary mutual information of each target word comprises the following steps:
selecting one target word from the plurality of target words as a second target word, and determining the adjustment factor and the binary mutual information corresponding to the second target word;
obtaining the weight value of the second target word according to a preset calculation formula;
calculating the weight values of the other target words among the plurality of target words to obtain the weight value of each target word.
6. A binary mutual information based word retrieval device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the binary mutual information based word retrieval method according to any one of claims 1-4 when executing the computer program.
7. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, which when executed by a processor implements the binary mutual information based word retrieval method according to any one of claims 1-4.
CN202010146242.7A 2020-03-05 2020-03-05 Word retrieval method, device, equipment and storage medium based on binary mutual information Active CN111401039B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010146242.7A CN111401039B (en) 2020-03-05 2020-03-05 Word retrieval method, device, equipment and storage medium based on binary mutual information


Publications (2)

Publication Number Publication Date
CN111401039A CN111401039A (en) 2020-07-10
CN111401039B (en) 2025-04-11

Family

ID=71430502

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010146242.7A Active CN111401039B (en) 2020-03-05 2020-03-05 Word retrieval method, device, equipment and storage medium based on binary mutual information

Country Status (1)

Country Link
CN (1) CN111401039B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112036159B (en) * 2020-09-01 2023-11-03 北京金堤征信服务有限公司 Word cloud data generation method and device
CN112184027B (en) * 2020-09-29 2023-12-26 壹链盟生态科技有限公司 Task progress updating method, device and storage medium
CN114764435A (en) * 2021-01-14 2022-07-19 西门子股份公司 Fault mode prediction method, device and computer readable medium
CN114417863A (en) * 2021-07-13 2022-04-29 北京金山数字娱乐科技有限公司 Word weight generation model training method and device, word weight generation method and device
CN113609248B (en) * 2021-08-20 2024-10-15 北京金山数字娱乐科技有限公司 Word weight generation model training method and device, and word weight generation method and device

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108763196A (en) * 2018-05-03 2018-11-06 上海海事大学 A kind of keyword extraction method based on PMI



Similar Documents

Publication Publication Date Title
CN111401039B (en) Word retrieval method, device, equipment and storage medium based on binary mutual information
CN108304378B (en) Text similarity computing method, apparatus, computer equipment and storage medium
US11301637B2 (en) Methods, devices, and systems for constructing intelligent knowledge base
CN108197111B (en) An Automatic Text Summarization Method Based on Fusion Semantic Clustering
US8666984B2 (en) Unsupervised message clustering
WO2020119063A1 (en) Expert knowledge recommendation method and apparatus, computer device, and storage medium
CN113326420B (en) Problem retrieval method, device, electronic device and medium
CN105045781B (en) Query term similarity calculation method and device and query term search method and device
CN106407280B (en) Query target matching method and device
US20160275178A1 (en) Method and apparatus for search
CN109657053B (en) Multi-text abstract generation method, device, server and storage medium
US20100241647A1 (en) Context-Aware Query Recommendations
KR101423549B1 (en) Sentiment-based query processing system and method
CN110162630A (en) A kind of method, device and equipment of text duplicate removal
CN110427626B (en) Keyword extraction method and device
CN105653553B (en) Word weight generation method and device
CN114880447A (en) Information retrieval method, device, equipment and storage medium
CN112926297A (en) Method, apparatus, device and storage medium for processing information
CN106569989A (en) De-weighting method and apparatus for short text
EP2766826A1 (en) Searching information
CN112417845B (en) Text evaluation method, device, electronic device and storage medium
CN119493778A (en) A method, system, device and storage medium for compressing multimodal weight files
CN111651596A (en) Text clustering method, text clustering device, server and storage medium
CN113962221A (en) A text abstract extraction method, device, terminal device and storage medium
CN118364174A (en) Content recommendation method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant