CN112307183B - Search data identification method, apparatus, electronic device and computer storage medium
- Publication number
- CN112307183B CN202011191952.8A
- Authority
- CN
- China
- Prior art keywords
- search
- candidate
- rewritten
- preset
- characteristic information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9532—Query formulation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3322—Query formulation using system suggestions
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/247—Thesauruses; Synonyms
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Mathematical Physics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The disclosure relates to a search data identification method, a search data identification apparatus, an electronic device, and a storage medium. The method comprises: in response to an input data search request, analyzing whether search words in the data search request include first characteristic information that satisfies a preset characteristic condition; performing word segmentation on the first characteristic information according to a preset word segmentation strategy to obtain a plurality of groups of rewritten candidate words corresponding to the first characteristic information; sorting the plurality of groups of rewritten candidate words according to a preset sorting algorithm to obtain a sorted candidate search set; and acquiring first recall result data according to the candidate search set. The disclosure can improve the accuracy of search results and return results closer to those the user expects.
Description
Technical Field
The present disclosure relates to the field of data processing, and in particular, to a search data identification method, apparatus, electronic device, and computer storage medium.
Background
In information search scenarios, when a user searches with a term learned by hearsay or with an unfamiliar term, the user is likely to type pinyin or a pinyin abbreviation. For example, a user who wants to search for "Tianyancha" but is unsure of the exact characters may search for "tianyancha" instead. Likewise, when the user types in a hurry or the pinyin input method does not offer the correct candidate word, the user tends to enter uncertain pinyin, a pinyin abbreviation, or an incomplete pinyin fragment directly, such as "Wuhan Ji wuz" (for "Wuhan Ji Materials") or "China Postal Express Logistics Share Limited g" (for "China Postal Express Logistics Share Limited Company"). When a search is expressed with such incomplete pinyin, the real search intent is difficult to identify, and the search mostly returns no results or returns results that are inaccurate and deviate from what the user expects.
Accordingly, there is a need for one or more approaches to address the above-described problems.
It should be noted that the information disclosed in the above background section is only for enhancing understanding of the background of the present disclosure and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
It is an object of the present disclosure to provide a search data identification method, apparatus, electronic device, and computer-readable storage medium, which overcome, at least in part, one or more of the problems due to the limitations and disadvantages of the related art.
According to one aspect of the present disclosure, there is provided a search data identification method including:
responding to an input data search request, and analyzing whether search words in the data search request comprise first characteristic information meeting preset characteristic conditions;
when the first characteristic information meeting the preset characteristic conditions is included, word segmentation processing is carried out on the first characteristic information according to a preset word segmentation strategy, and a plurality of groups of rewritten candidate words corresponding to the first characteristic information are obtained;
sorting the plurality of groups of rewritten candidate words according to a preset sorting algorithm to obtain a sorted candidate search set;
and acquiring first recall result data according to the candidate search set.
In an exemplary embodiment of the disclosure, the analyzing whether the search term in the data search request includes first feature information that satisfies a preset feature condition includes:
detecting whether a search word in the data search request comprises a pinyin syllable segment;
If the search word comprises a pinyin syllable segment, determining that the search word in the data search request comprises first characteristic information meeting preset characteristic conditions, wherein the pinyin syllable segment is the first characteristic information.
In an exemplary embodiment of the present disclosure, the ranking the plurality of sets of rewritten candidate words according to a preset ranking algorithm to obtain a ranked candidate search set includes:
calculating the scores of each group of rewritten candidate words according to a preset ordering algorithm to obtain a scoring result;
and sorting the rewritten candidate words according to the scoring result to obtain a sorted candidate search set.
In an exemplary embodiment of the present disclosure, the ranking of the plurality of sets of rewritten candidate words according to a preset ranking algorithm, to obtain a ranked candidate search set, including any one or more of the following:
judging the number of independent syllables of each rewritten candidate word, and sorting the rewritten candidate words of each group according to the number of independent syllables to obtain a sorted candidate search set;
or
determining the syllable prefix matching degree of each rewritten candidate word, and ordering the rewritten candidate words of each group according to the syllable prefix matching degree to obtain an ordered candidate search set.
In an exemplary embodiment of the present disclosure, after obtaining the plurality of sets of rewrite candidate words corresponding to the first feature information, the method further includes:
acquiring fuzzy search result data according to each group of candidate rewritten words;
determining word segmentation frequency of each group of candidate rewritten words in the fuzzy search result data;
sorting the plurality of groups of rewritten candidate words according to a preset sorting algorithm to obtain a sorted candidate search set, wherein the sorting comprises:
and ordering the rewritten candidate words of each group according to the word segmentation frequency to obtain an ordered candidate search set.
In an exemplary embodiment of the present disclosure, the ranking the plurality of sets of rewritten candidate words according to a preset ranking algorithm to obtain a ranked candidate search set includes:
performing perplexity calculation on the plurality of groups of rewritten candidate words after word segmentation to obtain perplexity scores;
and selecting a preset number of rewritten candidate words with the lowest perplexity scores and sorting them in ascending order of perplexity score to obtain an ordered candidate search set.
In an exemplary embodiment of the present disclosure, performing word segmentation on the first feature information according to a preset word segmentation policy includes:
segmenting the first characteristic information according to an initial and final comparison table; or
performing word segmentation processing on the first characteristic information with a forward maximum matching segmentation algorithm based on a pinyin byte dictionary.
In an exemplary embodiment of the present disclosure, the method further comprises:
Acquiring second characteristic information included in search words in the data search request;
extracting word granularity and phrase granularity of the second characteristic information;
Acquiring second recall result data corresponding to the second feature information according to the word granularity and the phrase granularity of the second feature information;
And taking the first recall result data and the second recall result data as response information of the data search request.
In one aspect of the present disclosure, there is provided a search data identification apparatus including:
the characteristic analysis module is used for responding to an input data search request and analyzing whether search words in the data search request comprise first characteristic information meeting preset characteristic conditions or not;
The word segmentation processing module is used for carrying out word segmentation processing on the first characteristic information according to a preset word segmentation strategy when the first characteristic information meeting the preset characteristic condition is included, and obtaining a plurality of groups of rewritten candidate words corresponding to the first characteristic information;
The candidate word ordering module is used for ordering the plurality of groups of rewritten candidate words according to a preset ordering algorithm to obtain an ordered candidate search set;
And the result recall module is used for acquiring first recall result data according to the candidate search set.
In one aspect of the present disclosure, there is provided an electronic device comprising:
A processor; and
A memory having stored thereon computer readable instructions which, when executed by the processor, implement a method according to any of the above.
In one aspect of the present disclosure, a computer readable storage medium is provided, on which a computer program is stored, which when executed by a processor, implements a method according to any of the above.
The search data identification method in an exemplary embodiment of the present disclosure comprises: in response to an input data search request, analyzing whether search words in the data search request include first characteristic information that satisfies a preset characteristic condition; performing word segmentation on the first characteristic information according to a preset word segmentation strategy to obtain a plurality of groups of rewritten candidate words corresponding to the first characteristic information; sorting the plurality of groups of rewritten candidate words according to a preset sorting algorithm to obtain a sorted candidate search set; and acquiring first recall result data according to the candidate search set. The disclosure can improve the accuracy of search results and return results closer to those the user expects.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The above and other features and advantages of the present disclosure will become more apparent by describing in detail exemplary embodiments thereof with reference to the attached drawings.
FIG. 1 illustrates a flowchart of a search data identification method according to an exemplary embodiment of the present disclosure;
FIG. 2 illustrates a schematic block diagram of a search data identification apparatus according to an exemplary embodiment of the present disclosure;
FIG. 3 schematically illustrates a block diagram of an electronic device according to an exemplary embodiment of the present disclosure; and
FIG. 4 schematically illustrates a computer-readable storage medium according to an exemplary embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments can be embodied in many forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the disclosed aspects may be practiced without one or more of the specific details, or with other methods, components, materials, devices, steps, etc. In other instances, well-known structures, methods, devices, implementations, materials, or operations are not shown or described in detail to avoid obscuring aspects of the disclosure.
The block diagrams depicted in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, these functional entities may be implemented in software, or in one or more software-hardened modules, or in different networks and/or processor devices and/or microcontroller devices.
In the present exemplary embodiment, a search data identification method is first provided. Referring to FIG. 1, the search data identification method may include the following steps:
Step S110, in response to an input data search request, analyzing whether search words in the data search request include first characteristic information meeting preset characteristic conditions;
Step S120, when first characteristic information meeting the preset characteristic conditions is included, performing word segmentation processing on the first characteristic information according to a preset word segmentation strategy, and obtaining a plurality of groups of rewritten candidate words corresponding to the first characteristic information;
Step S130, sorting the plurality of groups of rewritten candidate words according to a preset sorting algorithm to obtain a sorted candidate search set;
Step S140, acquiring first recall result data according to the candidate search set.
Next, a search data identification method in the present exemplary embodiment will be further described.
In step S110, in response to an input data search request, it may be analyzed whether a search term in the data search request includes first feature information satisfying a preset feature condition.
In existing search scenarios, the input often combines Chinese characters with pinyin letters or pinyin abbreviations, either because of input problems or because the user is unsure of the content. Existing search algorithms generally build a huge pinyin prefix tree to recognize pinyin letters. For some specific searches, such as proper nouns like company names, this approach not only occupies a large amount of memory but also fails to recognize the proper nouns quickly and accurately, which degrades the user experience.
When a data search request input by a user is received, the request is responded to and information such as the search terms carried in it is acquired. The first feature information may be a pinyin syllable segment of a search term in the data search request, and each search term may include one or more pinyin syllable segments, continuous or discontinuous. Analyzing whether the search term in the data search request includes first feature information satisfying the preset feature condition includes: detecting whether the search term includes a pinyin syllable segment; and, if it does, determining that the search term includes first feature information satisfying the preset feature condition, the pinyin syllable segment being the first feature information. That is, the method analyzes whether the search term combines Chinese characters with pinyin syllable segments, or Chinese characters with pinyin syllable abbreviations (such as the initial letter of a pinyin syllable). If so, it is determined that the data search request includes first feature information satisfying the preset feature condition, and the first feature information is the pinyin syllable segment, or the abbreviation of a pinyin syllable segment, in the search term. For example, if the search term of the data search request is "san Diego hyan", which includes the pinyin syllable segment "hyan", then "hyan" is determined as the first feature information.
Because English words and pinyin syllable segments both consist of letters, a misjudgment can easily occur when determining whether a letter string is a pinyin syllable segment. When determining the first feature information, it can therefore be further determined whether the letter string is an English word. If it is, the string is determined not to satisfy the preset feature condition and is ignored. For example, in the search term "WIFI diz", "WIFI" is identified as a non-pinyin segment, while "diz" is the first feature information (a pinyin syllable segment) satisfying the preset feature condition.
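The detection in step S110 can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the `ENGLISH_WORDS` set, the function name `find_first_feature_info`, and the rule "any run of Latin letters that is not a known English word is a pinyin segment" are all assumptions made for the example.

```python
import re

# Hypothetical, deliberately tiny English word list (an assumption for this
# sketch); a real system would consult a proper dictionary.
ENGLISH_WORDS = {"wifi", "app", "email"}

# A run of Latin letters is treated as a candidate pinyin syllable segment.
LATIN_RUN = re.compile(r"[A-Za-z]+")


def find_first_feature_info(search_term: str) -> list[str]:
    """Return the letter runs that are kept as pinyin syllable segments."""
    segments = []
    for run in LATIN_RUN.findall(search_term):
        if run.lower() in ENGLISH_WORDS:
            # English word: does not satisfy the preset feature condition.
            continue
        segments.append(run.lower())
    return segments


if __name__ == "__main__":
    print(find_first_feature_info("WIFI diz"))
```

An empty result means the search term carries no first feature information, in which case the rewrite steps would simply be skipped.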
In step S120, word segmentation processing may be performed on the first feature information according to a preset word segmentation policy, and multiple sets of rewrite candidate words corresponding to the first feature information are obtained.
The preset word segmentation strategy may be any one or more of tree-based segmentation, segmentation by an initial and final comparison table, forward (front-to-back) maximum matching segmentation, and the like. With the preset word segmentation strategy, multiple groups of rewritten candidate words can be derived from the first feature information containing pinyin syllable segments.
When the first feature information satisfying the preset feature condition is included, performing word segmentation on the first feature information according to the preset word segmentation strategy includes: segmenting the first feature information according to an initial and final comparison table; or segmenting the first feature information with a forward maximum matching segmentation algorithm based on a pinyin byte dictionary. For example, for the search term "network technology youxiangs", once the first feature information "youxiangs" is determined, it is segmented by either of these strategies to obtain multiple groups of segments, such as "you-xian-g-s", "you-xi-an-g-s", "you-xiang-s", and the like. The segments can then be rewritten to obtain multiple groups of rewritten candidate words, for example "Limited company", "gambling location company", "similar to", and so on. In this example embodiment, different candidate word rewrite rules may be preset for different application scenarios. For example, in the enterprise information query industry, enterprise information rules may be summarized from enterprise information and the rewritten candidate words computed according to those rules: because enterprise names commonly end with the suffix "limited company", the rewritten candidate word "limited company" can be obtained for the pinyin syllable segments according to the enterprise suffix in the enterprise information rules.
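The two segmentation strategies named above can be sketched as follows. The `SYLLABLES` and `INITIALS` tables are tiny, hypothetical stand-ins for a full initial and final comparison table or pinyin dictionary, and treating a lone initial as an abbreviated syllable is an assumption of this sketch.

```python
# Tiny illustrative tables (assumptions); real tables are much larger.
SYLLABLES = {"you", "xi", "xian", "xiang", "an", "ang"}
INITIALS = {"g", "s", "x", "y"}  # a lone initial stands for an abbreviated syllable


def enumerate_segmentations(fragment: str) -> list[list[str]]:
    """Enumerate every split of `fragment` into full syllables or lone initials."""
    if not fragment:
        return [[]]
    results = []
    for end in range(1, len(fragment) + 1):
        piece = fragment[:end]
        if piece in SYLLABLES or piece in INITIALS:
            for rest in enumerate_segmentations(fragment[end:]):
                results.append([piece] + rest)
    return results


def forward_maximum_match(fragment: str, max_len: int = 6) -> list[str]:
    """Greedy left-to-right split that takes the longest known entry each time."""
    pieces, i = [], 0
    while i < len(fragment):
        for size in range(min(max_len, len(fragment) - i), 0, -1):
            piece = fragment[i:i + size]
            # A single unknown letter is kept as-is so the loop always advances.
            if piece in SYLLABLES or piece in INITIALS or size == 1:
                pieces.append(piece)
                i += size
                break
    return pieces


if __name__ == "__main__":
    for seg in enumerate_segmentations("youxiangs"):
        print("-".join(seg))
    print(forward_maximum_match("youxiangs"))
```

Exhaustive enumeration yields several groups of segments for later ranking, whereas forward maximum matching yields a single greedy split; which one fits depends on how many rewrite candidates the later ranking step is meant to consider.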
In step S130, the multiple sets of rewritten candidate words may be ranked according to a preset ranking algorithm, so as to obtain a ranked candidate search set.
Specifically, the multiple sets of rewritten candidate words are ranked according to a preset ranking algorithm, so as to obtain a ranked candidate search set, which may include various manners, for example:
1. Calculating the scores of each group of rewritten candidate words according to a preset ordering algorithm to obtain a scoring result; and sorting the rewritten candidate words according to the scoring result to obtain a sorted candidate search set.
For example, the perplexity of the multiple groups of rewritten candidate words after word segmentation can be calculated to obtain a language model perplexity score (ppl). The lower the ppl score, the higher the overall score should be. For example, for the input "kender j", the rewritten candidate "kender" (ppl = 0.91) is better than the rewritten candidate "kender chicken" (ppl = 11.2). If the scoring result (perplexity score ppl) of a rewritten candidate word is higher than a preset rejection threshold, that candidate can be rejected. The remaining rewritten candidate words are then sorted in ascending order of the scoring result (perplexity score) to obtain the sorted candidate search set. Specifically, the perplexity of the candidate rewritten words can be calculated with an n-gram statistical language model; the lower the perplexity score, the more fluent the sentence and the better the rewrite. The language model needs to be trained on relevant corpora collected for the business scenario, and it can receive candidate rewritten words online in real time and return the corresponding perplexity scores. Sorting by perplexity score then generates the sorted candidate search set.
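The perplexity-based ranking described above can be sketched with a toy bigram model. The choice of a bigram order, add-one smoothing, the toy corpus, and the function names are all assumptions for illustration; the disclosure only states that an n-gram statistical language model trained on business corpora is used.

```python
import math
from collections import Counter


def train_bigram(corpus: list[list[str]]):
    """Count bigrams and their contexts over a tokenized corpus."""
    contexts, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence + ["</s>"]
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    vocab = {t for s in corpus for t in s} | {"<s>", "</s>"}
    return contexts, bigrams, len(vocab)


def perplexity(tokens: list[str], model) -> float:
    """Perplexity of `tokens` under the bigram model with add-one smoothing."""
    contexts, bigrams, vocab_size = model
    padded = ["<s>"] + tokens + ["</s>"]
    log_prob = 0.0
    for prev, cur in zip(padded, padded[1:]):
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(padded) - 1))


def rank_by_perplexity(candidates, model, reject_threshold: float):
    """Drop candidates above the threshold, then sort ascending by perplexity."""
    scored = [(perplexity(c, model), c) for c in candidates]
    kept = [pair for pair in scored if pair[0] <= reject_threshold]
    return [c for _, c in sorted(kept, key=lambda pair: pair[0])]


if __name__ == "__main__":
    corpus = [["technology", "limited", "company"], ["trading", "limited", "company"]]
    model = train_bigram(corpus)
    candidates = [["limited", "chicken"], ["limited", "company"]]
    print(rank_by_perplexity(candidates, model, reject_threshold=100.0))
```

With this toy corpus, the candidate that matches the training data scores a lower perplexity and is ranked first; lowering `reject_threshold` removes the implausible candidate entirely.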
2. Judging the number of independent syllables of each rewritten candidate word; and ordering the rewritten candidate words of each group according to the number of independent syllables to obtain an ordered candidate search set.
When sorting by the number of independent syllables of each rewritten candidate word, the rewritten candidate words can be sorted in descending order of that number to obtain the candidate search set. For example, if the data search request is "beijing jetty kejyouxgs", the first feature information is determined to be "kejyouxgs"; word segmentation is performed on the first feature information, and rewritten candidate words including "beijing jetty limited science and technology" and the like are obtained. Because the rewritten candidate word "beijing jetty limited science and technology" has the largest number of independent syllables, it is ranked before "beijing jetty limited science" and the other rewritten candidate words. The remaining candidates are ordered in the same way by their number of independent syllables, which generates the sorted candidate search set.
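A minimal sketch of this ordering follows. It assumes that the "number of independent syllables" of a candidate is simply the number of pinyin syllables it spells out; the disclosure does not define the count more precisely, and the candidate strings below are illustrative pinyin stand-ins rather than values from the source.

```python
def rank_by_syllable_count(candidates: dict[str, list[str]]) -> list[str]:
    """Sort candidate words by how many syllables they cover, largest first.

    `candidates` maps each rewritten candidate word to its syllable list.
    """
    return sorted(candidates, key=lambda word: len(candidates[word]), reverse=True)


if __name__ == "__main__":
    candidates = {
        "keji": ["ke", "ji"],
        "keji youxian gongsi": ["ke", "ji", "you", "xian", "gong", "si"],
        "keji youxian": ["ke", "ji", "you", "xian"],
    }
    print(rank_by_syllable_count(candidates))
```

Python's sort is stable, so candidates with the same syllable count keep their original relative order.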
3. Determining syllable prefix matching degree of each rewritten candidate word; and ordering the rewritten candidate words of each group according to the syllable prefix matching degree to obtain an ordered candidate search set.
When sorting by the syllable prefix matching degree of each rewritten candidate word, the rewritten candidate words can be sorted in descending order of that degree to obtain the candidate search set. For example, if the data search request is "good all dia", the first feature information is determined to be "dia"; word segmentation is performed on the first feature information, and rewritten candidate words including "good all electricity" and the like are obtained. Because the rewritten candidate word "good all electricity" has the highest syllable prefix matching degree, it is ranked before the other rewritten candidate words. The remaining candidates are ordered in the same way by syllable prefix matching degree, which generates the sorted candidate search set.
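One way to make the syllable prefix matching degree concrete is sketched below. The definition used here, the share of typed letters that agree with the start of the corresponding candidate syllable, is an assumption of this example; the disclosure does not give a formula.

```python
def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_match_degree(segments: list[str], syllables: list[str]) -> float:
    """Fraction of typed letters matched position-wise by candidate syllable prefixes."""
    total = sum(len(s) for s in segments)
    if total == 0:
        return 0.0
    matched = sum(common_prefix_len(s, t) for s, t in zip(segments, syllables))
    return matched / total


def rank_by_prefix_match(segments: list[str], candidates: dict[str, list[str]]) -> list[str]:
    """Sort candidates by prefix matching degree, highest first."""
    return sorted(
        candidates,
        key=lambda word: prefix_match_degree(segments, candidates[word]),
        reverse=True,
    )


if __name__ == "__main__":
    candidates = {"ding": ["ding"], "dian": ["dian"], "da": ["da"]}
    print(rank_by_prefix_match(["dia"], candidates))
```

For the typed fragment "dia", a candidate pronounced "dian" matches all three letters and is ranked ahead of candidates that share only "di" or "d".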
4. Acquiring fuzzy search result data according to each group of candidate rewritten words; determining word segmentation frequency of each group of candidate rewritten words in the fuzzy search result data; and ordering the rewritten candidate words of each group according to the word segmentation frequency to obtain an ordered candidate search set. For example:
Fuzzy search result data is obtained according to each group of candidate rewritten words, and the groups of rewritten candidate words are sorted in descending order of their word segmentation frequency in the fuzzy search result data. For example, when the data search request is "san Diego hyan", the first feature information is determined to be "hyan", and a fuzzy search on the first feature information returns a plurality of results including "Vannisha san Diego", "Ji Nima san Diego", "Navier san Diego", "Katsujia", and the like. Among all the fuzzy search result data, the segment "san Diego" has the highest word segmentation frequency, so the rewritten candidate word corresponding to it is ranked before the other rewritten candidate words. The remaining groups of rewritten candidate words are ordered in the same way by their word segmentation frequency in the fuzzy search result data, which generates the sorted candidate search set.
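The frequency-based ordering can be sketched as follows, assuming the fuzzy search results have already been split into segments by some tokenizer. The function name and the lower-cased sample data are hypothetical.

```python
from collections import Counter


def rank_by_segment_frequency(
    candidates: list[str], fuzzy_results: list[list[str]]
) -> list[str]:
    """Sort candidates by how often they occur as a segment in the fuzzy results."""
    freq = Counter(token for result in fuzzy_results for token in result)
    return sorted(candidates, key=lambda word: freq[word], reverse=True)


if __name__ == "__main__":
    fuzzy_results = [
        ["vannisha", "san diego"],
        ["navier", "san diego"],
        ["katsujia"],
    ]
    print(rank_by_segment_frequency(["katsujia", "san diego", "unknown"], fuzzy_results))
```

A candidate that never appears in the fuzzy results gets a frequency of zero and falls to the end of the list.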
In step S140, first recall result data may be obtained from the candidate search set.
In this example embodiment, the search result data corresponding to each rewritten candidate word in the sorted candidate search set may be recalled, with the order of the candidate search set used as the recall order of the search result data. The search result data obtained in this recall order is the first recall result data, which is used as the response information of the data search request.
In an embodiment of the present example, the method further comprises: acquiring second characteristic information included in search words in the data search request; extracting word granularity and phrase granularity of the second characteristic information; acquiring second recall result data corresponding to the second feature information according to the word granularity and the phrase granularity of the second feature information; and taking the first recall result data and the second recall result data as response information of the data search request.
In this example embodiment, in addition to using the pinyin syllable segment in the data search request as the first feature information, the Chinese character text portion of the data search request may be used as the second feature information, and the word granularity and phrase granularity of the Chinese character text portion are extracted. Granularity is a measure of the amount of information contained in the text: the more information the text contains, the larger the granularity; the less information it contains, the smaller the granularity. Corresponding search result data is retrieved from the database according to the word granularity and phrase granularity of the Chinese character text portion to acquire the second recall result data corresponding to the second feature information, and the second recall result data and the first recall result data are used together as the response information of the data search request.
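The second recall and the combined response can be sketched as follows. The in-memory `index` dictionary, the use of adjacent token pairs as the phrase granularity, the phrase-before-word lookup order, and the de-duplication when merging are all assumptions of this sketch rather than details given in the disclosure.

```python
def second_recall(tokens: list[str], index: dict[str, list[str]]) -> list[str]:
    """Look up phrase-level keys (adjacent token pairs), then word-level keys."""
    phrases = [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]
    hits = []
    for key in phrases + list(tokens):
        for doc_id in index.get(key, []):
            if doc_id not in hits:
                hits.append(doc_id)
    return hits


def build_response(first: list[str], second: list[str]) -> list[str]:
    """Return first recall results followed by any new second recall results."""
    merged = list(first)
    for doc_id in second:
        if doc_id not in merged:
            merged.append(doc_id)
    return merged


if __name__ == "__main__":
    index = {
        "network": ["d1", "d2"],
        "technology": ["d2"],
        "network technology": ["d3"],
    }
    second = second_recall(["network", "technology"], index)
    print(build_response(["d2", "d9"], second))
```

Placing the first recall results ahead of the second is only one possible policy; the disclosure states that both are returned as response information without fixing their relative order.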
It should be noted that although the steps of the methods of the present disclosure are illustrated in a particular order in the figures, this does not require or imply that the steps must be performed in that particular order or that all of the illustrated steps must be performed in order to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform, etc.
Further, in the present exemplary embodiment, there is also provided a search data identification apparatus. Referring to fig. 2, the search data identification apparatus 200 may include: feature analysis module 210, word segmentation processing module 220, candidate word ordering module 230, and result recall module 240. Wherein:
A feature analysis module 210, configured to, in response to an input data search request, analyze whether a search term in the data search request includes first feature information that satisfies a preset feature condition;
The word segmentation processing module 220 is configured to perform word segmentation processing on the first feature information according to a preset word segmentation policy, and obtain a plurality of groups of rewritten candidate words corresponding to the first feature information;
The candidate word ordering module 230 is configured to order the plurality of sets of rewritten candidate words according to a preset ordering algorithm, so as to obtain an ordered candidate search set;
The result recall module 240 is configured to obtain first recall result data according to the candidate search set.
The specific details of each module of the search data identification apparatus have already been described in detail in the corresponding search data identification method, and are therefore not repeated here.
It should be noted that although several modules or units of the search data identification apparatus 200 are mentioned in the above detailed description, such division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
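One way to picture the four modules of apparatus 200, and the point that their division is not mandatory, is a small pipeline whose stages are interchangeable callables. This is a hypothetical sketch: the class name `SearchDataIdentifier`, the field names, and the toy stage implementations are assumptions for illustration and are not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Candidates = List[List[str]]


@dataclass
class SearchDataIdentifier:
    analyze: Callable[[str], Optional[str]]   # feature analysis module 210
    segment: Callable[[str], Candidates]      # word segmentation module 220
    rank: Callable[[Candidates], Candidates]  # candidate word ordering module 230
    recall: Callable[[Candidates], List[str]] # result recall module 240

    def handle(self, query: str) -> List[str]:
        feature = self.analyze(query)
        if feature is None:  # no first feature information in the query
            return []
        return self.recall(self.rank(self.segment(feature)))


# Toy stages, for illustration only.
identifier = SearchDataIdentifier(
    analyze=lambda q: q if q.isascii() and q.isalpha() else None,
    segment=lambda f: [[f], list(f)],
    rank=lambda cands: sorted(cands, key=len),
    recall=lambda ranked: [" ".join(c) for c in ranked],
)
results = identifier.handle("hua")
```

Because each stage is just a callable, two stages can be merged into one function or one stage split into several without changing `handle`.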
In addition, in an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided.
Those skilled in the art will appreciate that the various aspects of the invention may be implemented as a system, method, or program product. Accordingly, aspects of the invention may be embodied in the following forms: an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.), or an embodiment combining hardware and software aspects, any of which may be referred to herein as a "circuit," "module," or "system."
An electronic device 300 according to such an embodiment of the invention is described below with reference to fig. 3. The electronic device 300 shown in fig. 3 is merely an example and should not be construed as limiting the functionality and scope of use of embodiments of the present invention.
As shown in fig. 3, the electronic device 300 is embodied in the form of a general purpose computing device. Components of electronic device 300 may include, but are not limited to: the at least one processing unit 310, the at least one memory unit 320, a bus 330 connecting the different system components (including the memory unit 320 and the processing unit 310), and a display unit 340.
The storage unit 320 stores program code that is executable by the processing unit 310, such that the processing unit 310 performs the steps according to various exemplary embodiments of the present invention described in the above-mentioned "exemplary methods" section of the present specification. For example, the processing unit 310 may perform steps S110 to S140 as shown in fig. 1.
The storage unit 320 may include a storage medium in the form of a volatile storage unit, such as a Random Access Memory (RAM) 3201 and/or a cache memory 3202, and may further include a Read Only Memory (ROM) 3203.
The storage unit 320 may also include a program/utility 3204 having a set (at least one) of program modules 3205, such program modules 3205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
Bus 330 may be one or more of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 300 may also communicate with one or more external devices 370 (e.g., keyboard, pointing device, bluetooth device, etc.), one or more devices that enable a user to interact with the electronic device 300, and/or any device (e.g., router, modem, etc.) that enables the electronic device 300 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 350. Also, electronic device 300 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet, through network adapter 360. As shown, the network adapter 360 communicates with other modules of the electronic device 300 over the bus 330. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with electronic device 300, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or in software combined with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and which includes several instructions to cause a computing device (which may be a personal computer, a server, a terminal device, a network device, etc.) to perform the method according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, a computer-readable storage medium having stored thereon a program product capable of implementing the method described above in the present specification is also provided. In some possible embodiments, the various aspects of the invention may also be implemented in the form of a program product comprising program code for causing a terminal device to carry out the steps according to the various exemplary embodiments of the invention as described in the "exemplary methods" section of this specification, when said program product is run on the terminal device.
Referring to fig. 4, a program product 400 for implementing the above-described method according to an embodiment of the present invention is described. It may take the form of a portable compact disc read-only memory (CD-ROM) that includes program code, and may be run on a terminal device such as a personal computer. However, the program product of the present invention is not limited thereto; in this document, a readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electromagnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a storage medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java or C++ and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. Where a remote computing device is involved, it may be connected to the user's computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or it may be connected to an external computing device (e.g., via the Internet using an Internet service provider).
Furthermore, the above-described drawings are only schematic illustrations of processes included in the method according to the exemplary embodiment of the present invention, and are not intended to be limiting. It will be readily appreciated that the processes shown in the above figures do not indicate or limit the temporal order of these processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, for example, among a plurality of modules.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure that follow the general principles of the disclosure and include such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.
Claims (9)
1. A method of identifying search data, the method comprising:
Responding to an input data search request, and analyzing whether search words in the data search request comprise first characteristic information meeting preset characteristic conditions or not;
if the first characteristic information meeting the preset characteristic conditions is included, word segmentation processing is carried out on the first characteristic information according to a preset word segmentation strategy, and a plurality of groups of rewriting candidate words corresponding to the first characteristic information are obtained;
sorting the plurality of groups of rewritten candidate words according to a preset sorting algorithm to obtain a sorted candidate search set;
Acquiring first recall result data according to the candidate search set;
the analyzing whether the search word in the data search request includes first feature information meeting a preset feature condition includes:
detecting whether a search word in the data search request comprises a pinyin syllable segment;
if the search word comprises a pinyin syllable segment, determining that the search word in the data search request comprises first characteristic information meeting a preset characteristic condition, wherein the pinyin syllable segment is the first characteristic information, and the preset characteristic condition is a non-English word;
the word segmentation processing is performed on the first characteristic information according to a preset word segmentation strategy, and the word segmentation processing comprises the following steps:
dividing the first characteristic information according to an initial and final comparison table; or
performing word segmentation processing on the first characteristic information using a forward maximum matching word segmentation algorithm based on a pinyin byte dictionary.
2. The method of claim 1, wherein ranking the plurality of sets of rewritten candidate words according to a preset ranking algorithm to obtain a ranked candidate search set comprises:
calculating the scores of each group of rewritten candidate words according to a preset ordering algorithm to obtain a scoring result;
and sorting the rewritten candidate words according to the scoring result to obtain a sorted candidate search set.
3. The method of claim 1, wherein the ranking the plurality of sets of rewritten candidate words according to a preset ranking algorithm to obtain a ranked candidate search set comprises any one or more of:
Judging the number of independent syllables of each rewritten candidate word; sorting the rewritten candidate words of each group according to the number of independent syllables to obtain a sorted candidate search set;
or
Determining syllable prefix matching degree of each rewritten candidate word; and ordering the rewritten candidate words of each group according to the syllable prefix matching degree to obtain an ordered candidate search set.
4. The method of claim 1, wherein after obtaining the plurality of sets of rewrite candidate words corresponding to the first feature information, the method further comprises:
acquiring fuzzy search result data according to each group of candidate rewritten words;
determining word segmentation frequency of each group of candidate rewritten words in the fuzzy search result data;
sorting the plurality of groups of rewritten candidate words according to a preset sorting algorithm to obtain a sorted candidate search set, wherein the sorting comprises:
and ordering the rewritten candidate words of each group according to the word segmentation frequency to obtain an ordered candidate search set.
5. The method according to any one of claims 1-2, wherein ranking the plurality of sets of rewritten candidate words according to a preset ranking algorithm to obtain a ranked candidate search set comprises:
Performing confusion degree calculation on a plurality of groups of rewritten candidate words after word segmentation to obtain a confusion degree score;
selecting a preset number of the rewritten candidate words having the lowest confusion degree scores, and sorting them in ascending order of confusion degree score, to obtain a sorted candidate search set.
6. The method of claim 1, wherein the method further comprises:
Acquiring second characteristic information included in search words in the data search request;
And acquiring corresponding search result data according to the second characteristic information and a plurality of groups of rewritten candidate words in the sorted candidate search set, wherein the search result data is the first recall result data.
7. A search data identification apparatus, the apparatus comprising:
the characteristic analysis module is used for responding to an input data search request and analyzing whether search words in the data search request comprise first characteristic information meeting preset characteristic conditions or not;
The word segmentation processing module is used for carrying out word segmentation processing on the first characteristic information according to a preset word segmentation strategy when the first characteristic information meeting the preset characteristic condition is included, and obtaining a plurality of groups of rewritten candidate words corresponding to the first characteristic information;
The candidate word ordering module is used for ordering the plurality of groups of rewritten candidate words according to a preset ordering algorithm to obtain an ordered candidate search set;
The result recall module is used for acquiring first recall result data according to the candidate search set;
the analyzing whether the search word in the data search request includes first feature information meeting a preset feature condition includes:
detecting whether a search word in the data search request comprises a pinyin syllable segment;
if the search word comprises a pinyin syllable segment, determining that the search word in the data search request comprises first characteristic information meeting a preset characteristic condition, wherein the pinyin syllable segment is the first characteristic information, and the preset characteristic condition is a non-English word;
the word segmentation processing is performed on the first characteristic information according to a preset word segmentation strategy, and the word segmentation processing comprises the following steps:
dividing the first characteristic information according to an initial and final comparison table; or
performing word segmentation processing on the first characteristic information using a forward maximum matching word segmentation algorithm based on a pinyin byte dictionary.
8. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-6.
9. A computer storage medium having stored thereon a computer program, which when executed by a processor implements the method according to any of claims 1-6.
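For readers unfamiliar with the forward maximum matching segmentation named in claims 1 and 7, a minimal sketch follows. It is an illustration under assumptions, not the claimed implementation: the tiny `SYLLABLES` set is a hypothetical stand-in for the pinyin dictionary, and emitting an unmatched character as its own token is a fallback chosen for the example.

```python
from typing import List, Set

# Hypothetical miniature dictionary of pinyin syllables.
SYLLABLES: Set[str] = {"liu", "de", "hua", "hu", "a", "yan", "chang", "hui"}


def forward_max_match(text: str, vocab: Set[str] = SYLLABLES) -> List[str]:
    """Scan left to right, always taking the longest dictionary entry."""
    max_len = max((len(entry) for entry in vocab), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest possible piece first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            # No entry matches here: emit one character (assumed fallback).
            tokens.append(text[i])
            i += 1
    return tokens


segments = forward_max_match("liudehua")
```

Because the longest match is preferred at each position, "hua" is taken as one syllable rather than "hu" followed by "a"; alternative splits of that kind would be the other groups of rewritten candidate words to be ranked.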
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011191952.8A CN112307183B (en) | 2020-10-30 | 2020-10-30 | Search data identification method, apparatus, electronic device and computer storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112307183A CN112307183A (en) | 2021-02-02 |
CN112307183B true CN112307183B (en) | 2024-04-19 |
Family
ID=74333104
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011191952.8A Active CN112307183B (en) | 2020-10-30 | 2020-10-30 | Search data identification method, apparatus, electronic device and computer storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112307183B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115017344A (en) * | 2021-03-05 | 2022-09-06 | 北京奇虎科技有限公司 | Method, device, device and storage medium for keyword recommendation |
CN113569010B (en) * | 2021-07-23 | 2023-12-12 | 北京百度网讯科技有限公司 | Method, device, equipment and storage medium for filtering search result |
CN114064857A (en) * | 2021-11-24 | 2022-02-18 | 北京房江湖科技有限公司 | Method for analyzing search request |
CN114528463A (en) * | 2022-01-26 | 2022-05-24 | 北京三快在线科技有限公司 | Searching method and device, electronic equipment and readable storage medium |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107609098A (en) * | 2017-09-11 | 2018-01-19 | 北京金堤科技有限公司 | Searching method and device |
CN108108497A (en) * | 2018-01-29 | 2018-06-01 | 上海名轩软件科技有限公司 | Keyword recommendation method and equipment |
CN108170293A (en) * | 2017-12-29 | 2018-06-15 | 北京奇虎科技有限公司 | Input the personalized recommendation method and device of association |
CN109828981A (en) * | 2017-11-22 | 2019-05-31 | 阿里巴巴集团控股有限公司 | A kind of data processing method and calculate equipment |
CN110619076A (en) * | 2018-12-25 | 2019-12-27 | 北京时光荏苒科技有限公司 | Search term recommendation method and device, computer and storage medium |
WO2020062680A1 (en) * | 2018-09-30 | 2020-04-02 | 平安科技(深圳)有限公司 | Waveform splicing method and apparatus based on double syllable mixing, and device, and storage medium |
CN111324700A (en) * | 2020-02-21 | 2020-06-23 | 北京声智科技有限公司 | Resource recall method and device, electronic equipment and computer-readable storage medium |
CN111369996A (en) * | 2020-02-24 | 2020-07-03 | 网经科技(苏州)有限公司 | Method for correcting text error in speech recognition in specific field |
CN111428494A (en) * | 2020-03-11 | 2020-07-17 | 中国平安人寿保险股份有限公司 | Intelligent error correction method, device and equipment for proper nouns and storage medium |
CN111488426A (en) * | 2020-04-17 | 2020-08-04 | 支付宝(杭州)信息技术有限公司 | Query intention determining method and device and processing equipment |
CN111737977A (en) * | 2020-06-24 | 2020-10-02 | 平安科技(深圳)有限公司 | Data dictionary generation method, data query method, device, equipment and medium |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120297294A1 (en) * | 2011-05-17 | 2012-11-22 | Microsoft Corporation | Network search for writing assistance |
US9852188B2 (en) * | 2014-06-23 | 2017-12-26 | Google Llc | Contextual search on multimedia content |
CN107491518B (en) * | 2017-08-15 | 2020-08-04 | 北京百度网讯科技有限公司 | A search and recall method and device, server and storage medium |
Non-Patent Citations (2)
Title |
---|
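Document word segmentation in the petroleum domain based on an adaptive hidden Markov model; 宫法明; 朱朋海; 计算机科学 (Computer Science); 20180615(S1); 110-113 * |
Research on statistical modeling of raw Mongolian corpora; 白双成; 中文信息学报 (Journal of Chinese Information Processing); 20170115(01); 123-130 * |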
Also Published As
Publication number | Publication date |
---|---|
CN112307183A (en) | 2021-02-02 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |