Disclosure of Invention
The main object of the present invention is to provide a method, a device, equipment and a storage medium for mining synonyms based on network search, so as to solve the technical problem that existing synonym mining is inefficient because data sources are few and mining is inaccurate.
To achieve the above object, the present invention provides a method for mining synonyms based on network search, the method including the following steps:
detecting a first word and a second word to be subjected to similarity comparison, and inputting the first word and the second word into a preset model to obtain a first word vector corresponding to the first word and a second word vector corresponding to the second word;
calculating the similarity of the first word vector and the second word vector;
comparing the similarity with a preset threshold, and if the similarity is greater than the preset threshold, inputting the first word and the second word into a preset search engine for searching to respectively obtain first search information corresponding to the first word and second search information corresponding to the second word;
and determining an approximation degree comparison result of the first word and the second word according to the first search information and the second search information.
Optionally, before the step of detecting a first word and a second word to be subjected to similarity comparison, and inputting the first word and the second word into a preset model to obtain a first word vector corresponding to the first word and a second word vector corresponding to the second word, the method includes:
receiving a synonym mining instruction, and acquiring a historical library corresponding to the synonym mining instruction;
performing word segmentation processing on the historical library to obtain a historical word bank, and performing denoising processing on the historical word bank to obtain a preprocessed word bank;
and receiving a synonym judgment instruction, and acquiring a first word and a second word corresponding to the synonym judgment instruction from the preprocessed word bank.
Optionally, the calculating the similarity between the first word vector and the second word vector includes:
acquiring a vector included angle between the first word vector and the second word vector, and calculating a length ratio of the first word vector to the second word vector;
and determining the similarity of the first word vector and the second word vector according to the vector included angle and the length ratio.
Optionally, the step of inputting the first word and the second word into a preset search engine for searching to obtain first search information corresponding to the first word and second search information corresponding to the second word includes:
inputting the first word and the second word into a preset search engine, adjusting the preset search engine to an initial state, and adjusting the search quantity parameter of the preset search engine to a target value;
searching the first word and the second word, and acquiring first search information and second search information output by the preset search engine, wherein the first search information corresponds to the first word, the second search information corresponds to the second word, and the number of search links in the first search information and the number of search links in the second search information are both equal to the target value.
Optionally, the step of determining an approximate degree comparison result of the first word and the second word according to the first search information and the second search information includes:
screening target search links contained in both the first search information and the second search information, and determining the target number of the target search links;
calculating a target ratio of the target number to the target value, taking a search link in the first search information as a first search link, and taking a search link in the second search information as a second search link;
calculating a similarity value of first link display information and second link display information, wherein the first link display information corresponds to the first search link and the second link display information corresponds to the second search link;
and determining an approximation degree comparison result of the first word and the second word according to the target ratio and the similarity value.
Optionally, the step of calculating a similarity value between the first link display information and the second link display information includes:
acquiring a first resource positioning mark corresponding to the first link display information and a second resource positioning mark corresponding to the second link display information;
screening out target resource positioning marks contained in both the first resource positioning mark and the second resource positioning mark;
and determining the similarity value of the first link display information and the second link display information according to the target number and the number of the target resource positioning marks.
Optionally, after the step of calculating a similarity value between the first link display information and the second link display information, the method includes:
determining first link classification information corresponding to the first search link and second link classification information corresponding to the second search link according to a preset link classification label;
calculating the similarity degree of the first link classification information and the second link classification information;
the determining an approximation degree comparison result of the first word and the second word according to the target ratio and the similarity value comprises:
and determining an approximation degree comparison result of the first word and the second word according to the target ratio, the similarity value and the similarity degree.
In addition, in order to achieve the above object, the present invention further provides a network search-based synonym mining device, including:
the vector output module is used for detecting a first word and a second word to be subjected to similarity comparison, and inputting the first word and the second word into a preset model to obtain a first word vector corresponding to the first word and a second word vector corresponding to the second word;
the similarity calculation module is used for calculating the similarity of the first word vector and the second word vector;
the search module is used for comparing the similarity between the first word vector and the second word vector with a preset threshold value, and if the similarity between the first word vector and the second word vector is greater than the preset threshold value, inputting the first word and the second word into a preset search engine for searching to respectively obtain first search information corresponding to the first word and second search information corresponding to the second word;
and the approximation degree determination module is used for determining an approximation degree comparison result of the first word and the second word according to the first search information and the second search information.
In addition, in order to achieve the above object, the present invention further provides network search-based synonym mining equipment, including: a memory, a processor, and a network search-based synonym mining program stored on the memory and executable on the processor, wherein the network search-based synonym mining program, when executed by the processor, implements the steps of the network search-based synonym mining method described above.
In addition, to achieve the above object, the present invention further provides a storage medium having stored thereon a network search-based synonym mining program which, when executed by a processor, implements the steps of the network search-based synonym mining method described above.
The embodiments of the invention provide a method, a device, equipment and a storage medium for mining synonyms based on network search. In the embodiments of the invention, the first word and the second word to be subjected to similarity comparison are input into the preset model to obtain the first word vector corresponding to the first word and the second word vector corresponding to the second word. When the similarity of the first word vector and the second word vector is greater than the preset threshold, the first word and the second word are input into the preset search engine for searching to obtain the first search information corresponding to the first word and the second search information corresponding to the second word. Finally, the approximation degree comparison result of the first word and the second word is determined according to the first search information and the second search information.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In the following description, suffixes such as "module", "component" or "unit" used to denote elements are used only to facilitate the description of the present invention and have no specific meaning in themselves. Thus, "module", "component" and "unit" may be used interchangeably.
The network search-based synonym mining terminal (also called the terminal, equipment or terminal equipment) in the embodiments of the invention may be a PC, or a mobile terminal device with a display function, such as a smartphone, a tablet computer or a portable computer.
As shown in fig. 1, the terminal may include: a processor 1001 (such as a CPU), a network interface 1004, a user interface 1003, a memory 1005 and a communication bus 1002. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display screen and an input unit such as a keyboard, and may optionally also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a storage device separate from the processor 1001.
Optionally, the terminal may further include a camera, a radio frequency (RF) circuit, sensors, an audio circuit, a Wi-Fi module and the like. The sensors include, for example, light sensors, motion sensors and other sensors. Specifically, the light sensors may include an ambient light sensor, which can adjust the brightness of the display screen according to the brightness of ambient light, and a proximity sensor, which can turn off the display screen and/or the backlight when the mobile terminal is moved to the ear. As one kind of motion sensor, a gravity acceleration sensor can detect the magnitude of acceleration in each direction (generally three axes), can detect the magnitude and direction of gravity when the mobile terminal is stationary, and can be used for applications that recognize the attitude of the mobile terminal (such as horizontal and vertical screen switching, related games and magnetometer attitude calibration), for vibration-recognition functions (such as pedometer and tapping), and the like. Of course, the mobile terminal may also be configured with other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer and an infrared sensor, which are not described here again.
Those skilled in the art will appreciate that the terminal structure shown in fig. 1 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
As shown in fig. 1, a memory 1005, which is a storage medium, may include therein an operating system, a network communication module, a user interface module, and a network search-based synonym mining program.
In the terminal shown in fig. 1, the network interface 1004 is mainly used for connecting to a backend server and performing data communication with the backend server; the user interface 1003 is mainly used for connecting to a client (user side) and performing data communication with the client; and the processor 1001 may be configured to invoke the network search-based synonym mining program stored in the memory 1005, which, when executed by the processor, implements the operations of the network search-based synonym mining method provided by the embodiments described below.
Based on the above hardware structure of the equipment, embodiments of the method for mining synonyms based on network search are provided.
Referring to fig. 2, in a first embodiment of the method for mining a synonym based on a web search according to the present invention, the method for mining a synonym based on a web search includes:
Step S10, detecting a first word and a second word to be subjected to similarity comparison, inputting the first word and the second word into a preset model, and obtaining a first word vector corresponding to the first word and a second word vector corresponding to the second word.
In the field of natural language processing, data preprocessing is an important step and includes Chinese word segmentation, synonym replacement, noise word cleaning and the like. Synonym replacement plays a significant role in calculating the similarity between two sentences: for example, when the similarity between two sentences is calculated, two synonyms are converted into the same word, which improves the reliability of the similarity calculation. The similarity comparison in this embodiment refers to the process of judging whether two words (i.e., the first word and the second word in this embodiment) are synonyms. Two words judged to be synonyms are stored in a synonym library, so that synonym replacement can be performed during natural language data preprocessing.
It is easy for a human to judge whether two words are synonyms, but difficult for a computer based on logical operations. The first word and the second word in this embodiment are the words for which synonym judgment is required, and the preset model is a word vector generation model. When judging whether two words are synonyms, the computer can judge according to how similar the features of the two words are. The preset model in this embodiment is a model that extracts the features of a word and generates a corresponding vector from those features, and such vectors can be compared directly. The first word vector is the vector generated from the first word by the preset model, and the second word vector is the vector generated from the second word by the preset model. Specifically, the features of a word include the position of the word in the original sentence, the part of speech of the word in the original sentence (noun, adjective, verb, etc.) and the like; a vector corresponding to the word, with comparable attributes such as direction and length, is determined from these features.
Step S20, calculating a similarity between the first word vector and the second word vector.
The first word vector and the second word vector have the general properties of vectors, i.e., magnitude and direction. The magnitude of a vector is measured by its length, and the included angle between the two vectors is obtained by making the starting points of the first word vector and the second word vector coincide. Two vectors are identical when they have the same direction and the same length. If the lengths of the two vectors are equal, their length ratio is 1; the closer the length ratio is to 1, the higher the similarity of the two vectors, and the smaller the included angle, the higher the similarity. In summary, the similarity of the first word vector and the second word vector depends on the length ratio and the included angle of the two vectors, and is calculated by assigning a weight to each. For example, if the weight of the length ratio is 0.3, the weight of the included angle is 0.7, the length ratio is 0.8 and the included angle is 30 degrees (i.e., 1/6 of a straight angle of 180 degrees), then the similarity of the first word vector and the second word vector is 0.3 × 0.8 + (0.7 − 0.7 × 1/6) ≈ 0.82.
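The worked example above can be written as a general formula. This is one reading of the text, assuming the length ratio r is taken as the shorter length divided by the longer length (so that 0 < r ≤ 1) and the included angle θ is normalised by a straight angle of 180 degrees; the symbols s, r, θ, w_len and w_ang are introduced here for illustration only.

```latex
s = w_{\text{len}}\, r + w_{\text{ang}} \left(1 - \frac{\theta}{180^\circ}\right),
\qquad
s = 0.3 \times 0.8 + 0.7 \left(1 - \frac{30^\circ}{180^\circ}\right)
  = 0.24 + 0.5833\ldots \approx 0.82
```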
Step S30, comparing the similarity with a preset threshold, and if the similarity is greater than the preset threshold, inputting the first word and the second word into a preset search engine for searching to obtain first search information corresponding to the first word and second search information corresponding to the second word, respectively.
The similarity of the first word vector and the second word vector is determined comprehensively by obtaining and comparing the included angle and the lengths of the two word vectors. For example, the included angle of the two word vectors corresponds to different weights in different value intervals, and the smaller the included angle, the larger the weight; the length ratio of the two word vectors likewise corresponds to different weights in different value intervals, and the closer the length ratio is to 1, the larger the weight. The two weights are added to obtain a comprehensive weight, which represents the similarity, and the comprehensive weight is then compared with the preset threshold. The preset threshold is a value set in advance from experience. If the similarity is less than or equal to the preset threshold, it is directly determined that the first word and the second word are not synonyms; if the similarity is greater than the preset threshold, it is preliminarily determined that the first word and the second word are synonyms, where "preliminarily" means that the two words are likely, but not yet confirmed, to be synonyms.
Further, the present scheme uses a search engine to judge a second time whether the first word and the second word are synonyms, so as to improve the accuracy of synonym judgment. Specifically, the first word and the second word are input into a preset search engine for searching, and search information containing a plurality of search links is obtained, where the first search information is obtained by searching the first word and the second search information is obtained by searching the second word. Each search link corresponds to link display information, and there are cases where the search links are the same but the link display information is different. For example, when two words that both mean "understanding" are searched, the search links of both words may include an entry on the same encyclopedia website describing the respective word. The two words then share a search link to that encyclopedia website, but the content with which the website describes each word (i.e., the link display information) is different; that is, the search link is the same but the link display information is different.
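The interval-based weighting and the threshold check can be sketched as follows. This is a minimal illustration only: the interval boundaries, the weights, the threshold value and the function names are all hypothetical, since the text does not specify them.

```python
from bisect import bisect_right

# Hypothetical interval tables (not from the source text).
# Each list of bounds splits the value range into intervals; the weight list
# has one more entry than the bounds list, one weight per interval.
ANGLE_BOUNDS = [15.0, 45.0, 90.0]        # included angle, in degrees
ANGLE_WEIGHTS = [0.6, 0.4, 0.2, 0.0]     # smaller angle -> larger weight

RATIO_BOUNDS = [0.5, 0.8, 0.95]          # shorter length / longer length
RATIO_WEIGHTS = [0.0, 0.1, 0.3, 0.4]     # ratio closer to 1 -> larger weight


def combined_weight(angle_deg, length_ratio):
    """Look up the weight of each interval and add the two weights."""
    angle_w = ANGLE_WEIGHTS[bisect_right(ANGLE_BOUNDS, angle_deg)]
    ratio_w = RATIO_WEIGHTS[bisect_right(RATIO_BOUNDS, length_ratio)]
    return angle_w + ratio_w


def passes_first_stage(angle_deg, length_ratio, threshold=0.7):
    """True only if the combined weight is strictly greater than the threshold."""
    return combined_weight(angle_deg, length_ratio) > threshold
```

With these assumed tables, `passes_first_stage(10, 0.97)` passes the preliminary check, while a pair with a wide angle and a low length ratio is rejected without any search being made.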
Step S40, determining an approximation degree comparison result of the first word and the second word according to the first search information and the second search information.
Specifically, in this embodiment the first search information and the second search information are compared to obtain a comparison result. The comparison result may be a numerical value that measures a degree, or a yes-or-no conclusion such as whether the first search information is similar to the second search information. If the comparison result is a degree of similarity, it is a numerical value and may be determined as follows. The search information includes search links and link display information, i.e., each search link has link display information. To make the degree of similarity easy to calculate, the number of search links in the first search information is the same as that in the second search information (how many links a search returns can be set manually). The number of identical search links in the first search information and the second search information is obtained by counting, and the link display information output by each identical search link is then compared.
For example, the first word is "Lu Xun" and the second word is "Zhou Shuren". Searching "Lu Xun" and searching "Zhou Shuren" each display 100 search links, of which 80 are identical; the identical search links therefore account for 80% of all search links, and this 80% corresponds to one weight. The link display content corresponding to each identical search link is then compared to obtain a similarity for each, and the average similarity is calculated; if the average similarity is 65%, this corresponds to a second weight. The degree of similarity of the first search information and the second search information is determined by calculating the sum of the two weights, and if it is greater than a threshold, the first word and the second word are determined to be synonyms.
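A minimal sketch of how the two quantities in this example might be combined is given below. The text says only that each quantity corresponds to a weight and that the two weights are summed; the weighted sum, the default weights of 0.5 each and the function name are assumptions made for illustration.

```python
def search_info_similarity(overlap_ratio, content_similarities,
                           overlap_weight=0.5, content_weight=0.5):
    """Combine the share of identical links with the average content similarity.

    overlap_ratio: identical links divided by links per search (e.g. 80/100).
    content_similarities: one similarity value per identical link.
    The two weights are hypothetical; the source does not give them.
    """
    if content_similarities:
        average = sum(content_similarities) / len(content_similarities)
    else:
        average = 0.0  # no identical links, so nothing to compare
    return overlap_weight * overlap_ratio + content_weight * average


# The example from the text: 80 of 100 links identical, average similarity 65%.
score = search_info_similarity(80 / 100, [0.65] * 80)
```

Under these assumed weights the example yields a score of about 0.725, which would then be compared with a threshold chosen by the implementer.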
Specifically, the steps before step S10 include:
Step a1, receiving a synonym mining instruction, and acquiring a historical library corresponding to the synonym mining instruction.
Step a2, performing word segmentation processing on the historical library to obtain a historical word bank, and performing denoising processing on the historical word bank to obtain a preprocessed word bank.
Step a3, receiving a synonym judgment instruction, and acquiring a first word and a second word corresponding to the synonym judgment instruction from the preprocessed word bank.
The historical library in this embodiment is the database on which synonym mining is performed, for example conversations between clients and customer service. By collecting this text information, the present scheme makes the finally mined synonym library better fit the actual application scenario, and improves the reliability of sentence similarity calculation where synonym replacement is required. Specifically, the historical library includes the text content input by clients and customer service, and this text content exists in the form of sentences, so word segmentation must be performed on these sentences before synonym mining. Word segmentation in the present scheme divides a sentence into a number of words according to literal meaning; for example, the word segmentation result of the sentence "today is good weather" is "today", "is", "good" and "weather". There are many ways to segment a sentence; to separate accurate words from a sentence, the present scheme may combine several methods and may be trained to obtain the best word segmentation result.
Manually input text may contain many items that are useless for synonym mining, such as punctuation marks, foreign-language text and sensitive words (abusive words and illegal words). The present scheme cleans these items out, which is the denoising of the historical word bank in this embodiment. After denoising, a preprocessed word bank containing many words is obtained. Each word in the preprocessed word bank is bound to its position and its part of speech in the original sentence, so that word vectors can be obtained subsequently. The first word and the second word requiring synonym judgment are determined manually or selected automatically by a program, and may come from the preprocessed word bank or from outside it.
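Segmentation followed by denoising can be sketched as below. This is a simplified illustration under stated assumptions: the segmenter just splits on whitespace (Chinese text would need a real word segmenter), the corpus language is assumed to be English so that non-ASCII tokens count as foreign, the sensitive-word list is a placeholder, and part-of-speech tagging is left out. The names `Token`, `segment` and `build_preprocessed_bank` are hypothetical.

```python
import re
from typing import NamedTuple


class Token(NamedTuple):
    word: str
    sentence_id: int  # which sentence of the historical library
    position: int     # index of the word within that sentence


# Placeholder for a real sensitive-word dictionary (assumption).
SENSITIVE_WORDS = {"damn"}


def segment(sentence):
    """Stand-in segmenter: whitespace split only."""
    return sentence.split()


def build_preprocessed_bank(sentences):
    """Segment each sentence, drop noise, and bind each word to its position."""
    bank = []
    for sentence_id, sentence in enumerate(sentences):
        for position, raw in enumerate(segment(sentence)):
            word = re.sub(r"[^\w]", "", raw).lower()  # strip punctuation
            if not word:
                continue  # token was pure punctuation
            if not word.isascii():
                continue  # treated as foreign-language text in this sketch
            if word in SENSITIVE_WORDS:
                continue
            bank.append(Token(word, sentence_id, position))
    return bank


bank = build_preprocessed_bank(["Today is good weather!", "damn , ok", "今天 hello"])
```

Keeping the sentence index and word position on every token is what lets a later stage recover the features from which the word vectors are built.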
Specifically, the refinement of step S20 includes:
Step b1, obtaining a vector included angle between the first word vector and the second word vector, and calculating a length ratio of the first word vector to the second word vector.
Step b2, determining the similarity of the first word vector and the second word vector according to the vector included angle and the length ratio.
As described for step S20, the first word vector and the second word vector each have a magnitude, measured by length, and a direction. The included angle is obtained by making the starting points of the two vectors coincide; the closer the length ratio is to 1 and the smaller the included angle, the higher the similarity. The similarity is calculated by weighting the length ratio and the included angle respectively. For example, with a length-ratio weight of 0.3, an included-angle weight of 0.7, a length ratio of 0.8 and an included angle of 30 degrees (i.e., 1/6 of a straight angle of 180 degrees), the similarity is 0.3 × 0.8 + (0.7 − 0.7 × 1/6) ≈ 0.82.
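Steps b1 and b2 can be sketched in code as follows. This is one possible reading, assuming the length ratio is the shorter length divided by the longer one and the included angle is normalised by 180 degrees, which reproduces the 0.82 of the example; the function name and the default weights are illustrative.

```python
import math


def vector_similarity(v1, v2, length_weight=0.3, angle_weight=0.7):
    """Weighted similarity from the length ratio and the included angle."""
    len1 = math.sqrt(sum(x * x for x in v1))
    len2 = math.sqrt(sum(x * x for x in v2))
    if len1 == 0 or len2 == 0:
        raise ValueError("a zero-length vector has no direction")

    # Step b1: length ratio (assumed shorter / longer, so it lies in (0, 1]).
    length_ratio = min(len1, len2) / max(len1, len2)

    # Step b1: included angle, from the normalised dot product.
    cos_theta = sum(a * b for a, b in zip(v1, v2)) / (len1 * len2)
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding error
    angle_deg = math.degrees(math.acos(cos_theta))

    # Step b2: weighted combination; an angle of 0 contributes the full weight.
    return length_weight * length_ratio + angle_weight * (1 - angle_deg / 180)


# Two vectors with length ratio 0.8 and an included angle of 30 degrees.
first = (1.0, 0.0)
second = (0.8 * math.cos(math.radians(30)), 0.8 * math.sin(math.radians(30)))
similarity = vector_similarity(first, second)
```

For the two example vectors the result is approximately 0.823, matching the worked figure in the text.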
In this embodiment, the first word and the second word to be subjected to similarity comparison are input into the preset model to obtain a first word vector corresponding to the first word and a second word vector corresponding to the second word. When the similarity of the first word vector and the second word vector is greater than the preset threshold, the first word and the second word are input into the preset search engine for searching to obtain first search information corresponding to the first word and second search information corresponding to the second word. Finally, the approximation degree comparison result of the first word and the second word is determined according to the first search information and the second search information.
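The overall control flow of this embodiment can be sketched as a two-stage check. The sketch is deliberately abstract: the word vector model, the two similarity measures and the search call are passed in as functions, and every name and parameter here is hypothetical rather than taken from the source.

```python
def compare_words(first_word, second_word, embed, vector_similarity,
                  search, search_similarity,
                  vector_threshold, search_threshold):
    """Return True if the two words are judged to be synonyms.

    Stage 1 (steps S10 and S20): compare the word vectors.
    Stage 2 (steps S30 and S40): only if stage 1 passes, compare search results.
    """
    first_vector = embed(first_word)
    second_vector = embed(second_word)
    if vector_similarity(first_vector, second_vector) <= vector_threshold:
        return False  # rejected directly, no search is issued

    first_info = search(first_word)
    second_info = search(second_word)
    return search_similarity(first_info, second_info) > search_threshold
```

The point of the ordering is that the comparatively expensive search step runs only for word pairs that the vector comparison has already marked as likely synonyms.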
Further, referring to fig. 3, on the basis of the above embodiment of the present invention, a second embodiment of the method for mining a synonym based on web search according to the present invention is provided.
This embodiment refines step S30 of the first embodiment, and differs from the above embodiment of the present invention as follows:
Step S31, inputting the first word and the second word into a preset search engine, adjusting the preset search engine to an initial state, and adjusting a search quantity parameter of the preset search engine to a target value.
Step S32, searching the first word and the second word, and acquiring first search information and second search information output by the preset search engine, where the first search information corresponds to the first word, the second search information corresponds to the second word, and the number of search links in the first search information and the number of search links in the second search information are both equal to the target value.
The preset search engine in this embodiment is a system that collects information from the Internet according to a certain policy, organizes and processes the collected information, provides a search service for users, and displays the retrieved related information to the user. The first word and the second word are input into the preset search engine, and the search quantity parameter of the preset search engine is adjusted to a specific value (i.e., the target value in this embodiment). The search quantity parameter is the number of items of related information that the preset search engine retrieves at one time according to the information input by the user. The retrieved related information exists in the form of links, so the number of links corresponding to all related information retrieved at one time is the search quantity parameter in this embodiment, and the preset search engine in this embodiment supports adjusting this parameter. For example, if the search quantity parameter is adjusted to 100 and the first word and the second word are respectively input into the preset search engine, the number of links in the first search information is 100 and the number of links in the second search information is also 100. The two numbers are set equal to make it easy to compare the first search information with the second search information.
The preset search engine may adjust the search information it outputs according to the search habits of the user. In this scheme, therefore, the preset search engine is initialized before the first word or the second word is input. The purpose of the initialization is to clear previously stored user search records, so that the output search information is not affected by factors other than the input information (i.e., the first word and the second word in this embodiment).
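Steps S31 and S32 can be illustrated with a toy, in-memory stand-in for the preset search engine. Nothing here corresponds to a real search API: the class `FakeSearchEngine`, its `reset` method, its `result_count` attribute and its crude history-based re-ranking are all assumptions made to show why initialization matters.

```python
class FakeSearchEngine:
    """In-memory stand-in for the preset search engine (hypothetical)."""

    def __init__(self, index):
        self._index = index     # word -> ranked list of result links
        self._history = []      # previously searched words
        self.result_count = 10  # the "search quantity parameter"

    def reset(self):
        """Return to the initial state by clearing stored search records."""
        self._history.clear()

    def search(self, word):
        results = list(self._index.get(word, []))
        if self._history:
            # Toy personalisation: links seen in the previous search move up.
            boosted = set(self._index.get(self._history[-1], []))
            results.sort(key=lambda link: link not in boosted)  # stable sort
        self._history.append(word)
        return results[: self.result_count]


def fetch_search_info(engine, first_word, second_word, target_value):
    """Set the result count, then search each word from a clean state."""
    engine.result_count = target_value
    engine.reset()
    first_info = engine.search(first_word)
    engine.reset()
    second_info = engine.search(second_word)
    return first_info, second_info


index = {"a": ["u1", "u2", "u3"], "b": ["u4", "u3", "u5"]}
first_info, second_info = fetch_search_info(FakeSearchEngine(index), "a", "b", 2)
```

In this toy model, searching "b" straight after "a" without a reset would promote the shared link "u3" and change the second result list, which is the kind of interference the initialization step is meant to rule out.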
Specifically, the refinement of step S40 includes:
Step c1, screening the target search links contained in both the first search information and the second search information, and determining the target number of the target search links.
Step c2, calculating the target ratio of the target number to the target value, taking a search link in the first search information as a first search link, and taking a search link in the second search information as a second search link.
Step c3, calculating a similarity value between first link display information and second link display information, wherein the first link display information corresponds to the first search link, and the second link display information corresponds to the second search link.
Step c4, determining the approximation degree comparison result of the first word and the second word according to the target ratio and the similarity value.
It should be noted that the target search links in this embodiment are the search links that exist in both the first search information and the second search information. By definition, a link is a connection relationship from one target (hereinafter referred to as target one) to another target (hereinafter referred to as target two). The targets may be web pages, pictures, text, and so on, but whatever the targets are, each has an exact address, so a link is a connection relationship from one address (hereinafter referred to as the starting point) to another address (hereinafter referred to as the end point). In this embodiment, the first search information and the second search information are both output by the same preset search engine, so the starting points of all the search links in the first search information and the second search information are the same. Therefore, if the end point of a search link a in the first search information is the same as the end point of a search link b in the second search information, the search link a and the search link b are target search links in this embodiment.
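A minimal sketch of this screening, assuming each search link is represented by its starting point and end point addresses, might look as follows. The `SearchLink` type and the function `screen_target_links` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class SearchLink:
    start: str  # address of the page that shows the link (the search engine)
    end: str    # address the link points to


def screen_target_links(
    first_links: Iterable[SearchLink], second_links: Iterable[SearchLink]
) -> Set[str]:
    """Return the end points that occur in both sets of search links.

    All links come from the same search engine, so their starting points
    coincide and only the end points need to be compared.
    """
    first_ends = {link.end for link in first_links}
    second_ends = {link.end for link in second_links}
    return first_ends & second_ends
```

The size of the returned set is the target number used in the subsequent steps.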
After the target search links are screened out from the first search information and the second search information, the number of the target search links (i.e., the target number in this embodiment) can be determined. As can be seen from the above, the target value is the number of links corresponding to all related information retrieved by the preset search engine at one time, i.e., the number of the first search links and the number of the second search links in this embodiment. For example, if the target value is 100, the number of the first search links and the number of the second search links are both 100, and if the number of the target search links is 30, the target ratio in this embodiment is 0.3.
The link display information in this embodiment refers to the information displayed on target two, where the first link display information corresponds to the first search links and the second link display information corresponds to the second search links. Each piece of link display information corresponds to one URL (Uniform Resource Locator), which can be understood as a unique identification tag of the information stored at an end point. For example, suppose that the starting point and the end point of a search link c and a search link d are the same, that the end point of the search link c stores information e, and that the end point of the search link d stores information f. If the uniform resource locator of the information e is the same as the uniform resource locator of the information f, the search link c and the search link d are completely the same; if the two uniform resource locators are different, the search link c and the search link d are only partially the same.
By acquiring the uniform resource locators corresponding to the link display information and judging whether the uniform resource locator corresponding to the first link display information is the same as the uniform resource locator corresponding to the second link display information, the identical link display information in the first link display information and the second link display information can be determined. The similarity value between the first link display information and the second link display information can then be determined according to the number of pieces of identical link display information. Specifically, the greater the proportion of identical link display information, the greater the similarity value. For example, if the target value is 200, the target number is 50, and the number of pieces of identical link display information in the first link display information and the second link display information is 2, then the similarity value is 2/50 = 4%, the target ratio is 50/200 = 25%, and the approximation degree of the first word and the second word is 81%. It can be understood that the approximation degree in this embodiment is determined by combining the target ratio and the similarity value: when the target ratio is fixed, the greater the similarity value, the greater the approximation degree; when the similarity value is fixed, the greater the target ratio, the greater the approximation degree.
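The arithmetic of the target ratio, and one possible way of combining it with the similarity value, can be sketched as follows. The description does not state the combining formula, so the weighted sum in `combine_approximation` and its `ratio_weight` parameter are assumptions chosen only because the result increases with both inputs; the sketch is not expected to reproduce the percentage figures of the example.

```python
def target_ratio(target_number: int, target_value: int) -> float:
    """Ratio of shared (target) search links to links retrieved per word."""
    if target_value <= 0:
        raise ValueError("target value must be positive")
    return target_number / target_value


def combine_approximation(
    ratio: float, similarity_value: float, ratio_weight: float = 0.5
) -> float:
    """Hypothetical combination: a weighted sum of the two quantities.

    Any function that is increasing in both arguments would satisfy the
    monotonic behaviour described in the text; this is only one choice.
    """
    if not 0.0 <= ratio_weight <= 1.0:
        raise ValueError("ratio_weight must lie in [0, 1]")
    return ratio_weight * ratio + (1.0 - ratio_weight) * similarity_value
```

For instance, `target_ratio(30, 100)` corresponds to the case of 30 target search links among 100 retrieved links per word.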
Specifically, the refinement of step c3 includes:
Step d1, acquiring first resource locators corresponding to the first link display information and second resource locators corresponding to the second link display information.
Step d2, screening out the target resource locators contained in both the first resource locators and the second resource locators.
Step d3, determining the similarity value of the first link display information and the second link display information according to the target number and the number of the target resource locators.
It should be noted that if a first resource locator and a second resource locator are the same, the link display information corresponding to the first resource locator is the same as the link display information corresponding to the second resource locator. Therefore, after the first resource locators corresponding to the first link display information and the second resource locators corresponding to the second link display information are acquired, the target resource locators contained in both the first resource locators and the second resource locators are screened out, and the similarity value of the first link display information and the second link display information is finally determined according to the target number and the number of the target resource locators. For example, if the target value is 500, the target number is 80, and the number of pieces of identical link display information in the first link display information and the second link display information is 2, the similarity value is 2/80 = 2.5%.
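Steps d1 to d3 can be sketched in Python as follows, assuming the resource locators have already been collected as plain strings. The function name `similarity_value` and the handling of a zero target number are assumptions made for illustration.

```python
from typing import Iterable


def similarity_value(
    first_locators: Iterable[str],
    second_locators: Iterable[str],
    target_number: int,
) -> float:
    """Share of target search links whose displayed information is identical.

    first_locators / second_locators: URLs of the first and second link
    display information (step d1).
    target_number: number of target search links from step c1.
    """
    if target_number <= 0:
        # Assumption: with no target search links the similarity value is zero.
        return 0.0
    # Step d2: locators present on both sides.
    target_locators = set(first_locators) & set(second_locators)
    # Step d3: relate their count to the target number.
    return len(target_locators) / target_number
```

With two shared locators and a target number of 80, the function returns 2/80, matching the 2.5% of the example.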
Specifically, after step c3, the method further includes:
Step e1, determining first link classification information corresponding to the first search links and second link classification information corresponding to the second search links according to preset link classification labels.
Step e2, calculating the similarity degree between the first link classification information and the second link classification information.
Specifically, the refinement of step c4 includes:
Step e3, determining the approximation degree comparison result of the first word and the second word according to the target ratio, the similarity value and the similarity degree.
It can be understood that, besides the target search links contained in both the first search information and the second search information, the search links that differ between the first search information and the second search information also have reference value. In this scheme, before the near meaning word comparison is performed, search links may be classified in advance to obtain link classification labels, and the link classification labels may be divided into multiple levels. For example, the first-level labels of search links may be text, video, picture, and the like, and the second-level labels under text may be books, documents, user public logs, and the like. The first search links and the second search links are classified according to the preset link classification labels to obtain the first link classification information and the second link classification information. The link classification information describes how all the search links are classified, including the number of search links under each level of classification label.
The similarity degree is then calculated according to the first link classification information and the second link classification information. Specifically, the difference between the number of first search links and the number of second search links under the same classification label may be calculated, and the average of these differences over all classification labels is then calculated. The smaller the average, the greater the similarity degree between the first link classification information and the second link classification information; the larger the average, the smaller the similarity degree. Finally, the approximation degree of the first word and the second word can be determined from the target ratio, the similarity value and the similarity degree, and the specific determination method is as described in the above embodiment.
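The classification-based comparison can be illustrated with the following sketch. The `classify` callable, which maps a link to a classification label, is hypothetical, and the mapping `1 / (1 + average)` from the average difference to a similarity degree is an assumption: the text only requires that a smaller average yields a larger similarity degree.

```python
from collections import Counter
from typing import Callable, Iterable


def classification_info(links: Iterable[str], classify: Callable[[str], str]) -> Counter:
    """Count the search links under each classification label."""
    return Counter(classify(link) for link in links)


def mean_count_difference(first_info: Counter, second_info: Counter) -> float:
    """Average absolute difference in link counts over all labels."""
    labels = set(first_info) | set(second_info)
    if not labels:
        return 0.0
    # A Counter yields 0 for labels it has never seen.
    total = sum(abs(first_info[label] - second_info[label]) for label in labels)
    return total / len(labels)


def similarity_degree(first_info: Counter, second_info: Counter) -> float:
    """Assumed mapping: 1 for identical distributions, decreasing toward 0."""
    return 1.0 / (1.0 + mean_count_difference(first_info, second_info))
```

Only a single level of labels is handled here; multi-level labels could be represented, for example, by using tuples such as `("text", "books")` as label values.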
In this embodiment, the search information is classified by the preset link classification labels, and the approximation degree comparison result of the first word and the second word is then determined according to the similarity degree of the classification information together with the search information, which improves the precision of near meaning word mining.
In addition, referring to fig. 4, an embodiment of the present invention further provides a network search-based synonym mining device, where the network search-based synonym mining device includes:
the vector output module 10 is configured to detect a first word and a second word to be subjected to similarity comparison, input the first word and the second word into a preset model, and obtain a first word vector corresponding to the first word and a second word vector corresponding to the second word;
a similarity calculation module 20, configured to calculate a similarity between the first word vector and the second word vector;
the search module 30 is configured to compare a similarity between the first word vector and the second word vector with a preset threshold, and if the similarity between the first word vector and the second word vector is greater than the preset threshold, input the first word and the second word into a preset search engine for search to obtain first search information corresponding to the first word and second search information corresponding to the second word, respectively;
and the approximation calculation module 40 is configured to determine an approximation degree comparison result of the first word and the second word according to the first search information and the second search information.
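The cooperation of the four modules can be sketched as a single function in which each module is passed in as a callable. All parameter names below are hypothetical, and returning `None` when the similarity does not exceed the preset threshold is an assumption, since the description does not state what happens in that case.

```python
from typing import Any, Callable, List, Optional, Sequence


def compare_words(
    first_word: str,
    second_word: str,
    embed: Callable[[str], Sequence[float]],              # vector output module
    vector_similarity: Callable[[Sequence[float], Sequence[float]], float],
    search: Callable[[str], List[str]],                   # search module
    approximation: Callable[[List[str], List[str]], Any], # approximation module
    threshold: float,
) -> Optional[Any]:
    """Run the search-based comparison only for sufficiently similar vectors."""
    first_vector = embed(first_word)
    second_vector = embed(second_word)
    similarity = vector_similarity(first_vector, second_vector)
    if similarity <= threshold:
        # Assumption: the pair is not examined further below the threshold.
        return None
    first_info = search(first_word)
    second_info = search(second_word)
    return approximation(first_info, second_info)
```

Gating the search on the vector similarity means that the search engine is queried only for word pairs that already pass the preset threshold.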
Optionally, the apparatus for mining a synonym based on web search further includes:
the mining instruction receiving module is used for receiving a near meaning word mining instruction and acquiring a historical library corresponding to the near meaning word mining instruction;
the word segmentation and denoising module is used for carrying out word segmentation on the historical library to obtain a historical word bank and carrying out denoising on the historical word bank to obtain a preprocessed word bank;
and the near meaning word judgment instruction receiving module is used for receiving a near meaning word judgment instruction and acquiring a first word and a second word corresponding to the near meaning word judgment instruction from the preprocessing word bank.
Optionally, the similarity calculation module 20 includes:
the length ratio calculation unit is used for acquiring a vector included angle between the first word vector and the second word vector and calculating the length ratio of the first word vector to the second word vector;
and the similarity determining unit is used for determining the similarity of the first word vector and the second word vector according to the vector included angle and the length ratio.
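A minimal sketch of these two units, using only the Python standard library, is given below. The way the included angle and the length ratio are merged in `vector_similarity` (cosine of the angle multiplied by the shorter-to-longer length ratio) is an assumption, since the description does not give the formula.

```python
import math
from typing import Sequence


def _norm(vector: Sequence[float]) -> float:
    return math.sqrt(sum(x * x for x in vector))


def included_angle(u: Sequence[float], v: Sequence[float]) -> float:
    """Angle between two word vectors, in radians."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    norm_u, norm_v = _norm(u), _norm(v)
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("the angle is undefined for a zero vector")
    cosine = sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, cosine)))


def length_ratio(u: Sequence[float], v: Sequence[float]) -> float:
    """Length of the first word vector divided by that of the second."""
    norm_v = _norm(v)
    if norm_v == 0.0:
        raise ValueError("second vector has zero length")
    return _norm(u) / norm_v


def vector_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Assumed combination: cos(angle) scaled by the shorter/longer ratio."""
    angle = included_angle(u, v)      # also rejects zero vectors
    ratio = length_ratio(u, v)
    return math.cos(angle) * min(ratio, 1.0 / ratio)
```

Under this assumed combination, two vectors score 1 only when they point in the same direction and have the same length.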
Optionally, the search module 30 includes:
the parameter adjusting unit is used for inputting the first words and the second words into a preset search engine, adjusting the preset search engine to be in an initial state, and adjusting the search quantity parameters of the preset search engine to be target numerical values;
the search information acquisition unit is configured to search the first word and the second word and acquire first search information and second search information output by the preset search engine, where the first search information corresponds to the first word, the second search information corresponds to the second word, and the number of search links in the first search information and the number of search links in the second search information are both the target numerical value.
Optionally, the approximation calculation module 40 includes:
the first screening unit is used for screening target search links contained in the first search information and the second search information and determining the target number of the target search links;
a target ratio calculation unit, configured to calculate a target ratio between the target number and the target value, use a search link in the first search information as a first search link, and use a search link in the second search information as a second search link;
a similarity value calculation unit configured to calculate a similarity value between first link display information and second link display information, where the first link display information corresponds to the first search link and the second link display information corresponds to the second search link;
and the first comparison result determining unit is used for determining an approximate degree comparison result of the first word and the second word according to the target ratio and the similarity value.
Optionally, the similarity value calculating unit includes:
a resource locator acquiring unit, configured to acquire a first resource locator corresponding to the first link display information, and a second resource locator corresponding to the second link display information;
a second screening unit, configured to screen out target resource locators included in both the first resource locator and the second resource locator;
and the similarity value determining unit is used for determining the similarity value of the first link display information and the second link display information according to the target number and the number of the target resource locators.
Optionally, the similarity value calculating unit further includes:
the classified information determining unit is used for determining first link classified information corresponding to the first search link and second link classified information corresponding to the second search link according to a preset link classified label;
a similarity degree calculation unit configured to calculate a similarity degree between the first link classification information and the second link classification information;
the first comparison result determination unit includes:
and the second comparison result determining unit is used for determining the similarity comparison result of the first word and the second word according to the target ratio, the similarity value and the similarity degree.
In addition, the embodiment of the present invention further provides a storage medium, where a network search-based synonym mining program is stored on the storage medium, and when executed by a processor, the network search-based synonym mining program implements the operations in the network search-based synonym mining method provided in the above embodiment.
The method executed by each program module can refer to each embodiment of the method of the present invention, and is not described herein again.
It is noted that, herein, relational terms such as first and second may be used solely to distinguish one entity/action/object from another entity/action/object without necessarily requiring or implying any actual such relationship or order between such entities/actions/objects; the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
For the apparatus embodiment, since it is substantially similar to the method embodiment, it is described relatively simply, and reference may be made to some descriptions of the method embodiment for relevant points. The above-described apparatus embodiments are merely illustrative, in that elements described as separate components may or may not be physically separate. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the invention. One of ordinary skill in the art can understand and implement it without inventive effort.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method for mining synonyms based on network search according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.