HK1172124A - Selection of atoms for search engine retrieval - Google Patents
Selection of atoms for search engine retrieval Download PDFInfo
- Publication number
- HK1172124A HK1172124A HK12113013.2A HK12113013A HK1172124A HK 1172124 A HK1172124 A HK 1172124A HK 12113013 A HK12113013 A HK 12113013A HK 1172124 A HK1172124 A HK 1172124A
- Authority
- HK
- Hong Kong
- Prior art keywords
- document
- atom
- atoms
- search
- documents
- Prior art date
Links
Description
Background
The amount of information and content available on the internet continues to grow rapidly. In view of the vast amount of information, search engines have been developed to facilitate searching for electronic files. In particular, users may search for information and files by entering a search query containing one or more terms (term) that may be of interest to the user. After receiving a search query from a user, the search engine identifies relevant files and/or web pages based on the search query. Because of its utility, web searching (i.e., the process of finding relevant web pages and documents for a user-issued search query) has arguably become the most popular business on the internet today.
The search engine operates by crawling (crawl) files and indexing (index) information related to the files in a search index. When a search query is received, the search engine uses the search index to identify documents that are relevant to the search query. Using the search index in this manner allows information to be quickly retrieved for a query. Without a search index, a search engine would need to search a corpus of documents for relevant results, which would take an unacceptable amount of time.
As the internet continues to grow, the number of searchable files that can be crawled and indexed in search indexes has become very large. As a result, it is not feasible for a search engine to index information about all network files. For example, an excessively large amount of hardware memory would be required. In addition, the processing time required to retrieve results from a very large index would be unacceptable. Nonetheless, search engines strive to target as many files as possible to provide search results to any query, while being cost-effective and capable of providing relevant results within an amount of time acceptable to the end user.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
Embodiments of the present invention relate to populating (populating) one or more search indexes with atoms that have been identified from a plurality of files. Atoms can be unigrams (unigrams), n-grams (n-grams), or n-tuples (n-tuples). A list of atom/document pairs is generated such that atoms can be identified as coming from a particular document based on, for example, a document identification. For each atom/document pair, an information metric is computed that represents the proximity of how relevant the atom is to a particular document. Many factors are used to compute information metrics, such as how often an atom appears in a document, the proximity of terms that include the atom in the document, the degree to which the terms are related, whether the terms have been linked together by examining the query log, and so forth. In some instances, machine learning tools are used to compute information metrics. Atom/document pairs whose information metrics meet or exceed a particular threshold are indexed in the search index, while those that do not are discarded and therefore not indexed.
Brief Description of Drawings
The invention is described in detail below with reference to the attached drawing figures, wherein:
FIG. 1 is a block diagram of an exemplary computing environment suitable for use in the implementation of embodiments of the present invention;
FIG. 2 is a diagram illustrating a smart funnel (smart funnel) for reducing document candidates to derive a ranked set of documents, according to an embodiment of the present invention;
FIG. 3 is a block diagram of an exemplary system in which embodiments of the invention may be used;
FIG. 4 is a flow diagram illustrating a method for staged processing to return search results in response to a search query, in accordance with an embodiment of the present invention;
FIG. 5 is a flow diagram illustrating a method for generating a search index during a pre-calculation/indexing phase in accordance with an embodiment of the present invention;
FIG. 6 is a flow diagram illustrating a method for identifying an initial set of matching files during a matching phase in accordance with an embodiment of the present invention;
FIG. 7 is a flow diagram illustrating a method for pruning documents from an initial set of matching documents during a pruning (prune) phase in accordance with an embodiment of the present invention;
FIG. 8 illustrates an exemplary system in which embodiments of the invention may be used;
9A, 9B, and 9C illustrate examples of entries in a unigram search index, an n-metagram search index, and an n-tuple search index, respectively, according to embodiments of the invention;
FIG. 10 is a flow diagram illustrating a method for populating one or more search indexes with atoms identified in a plurality of documents, in accordance with an embodiment of the present invention;
FIG. 11 is a flow diagram illustrating a method for populating one or more search indexes with atoms identified in a plurality of documents, in accordance with an embodiment of the present invention; and
FIG. 12 is a flow diagram illustrating a method for populating one or more search indexes with atoms identified in a plurality of documents, in accordance with an embodiment of the present invention.
Detailed Description
The subject matter of the present invention is described with specificity herein to meet statutory requirements. However, the description itself is not intended to limit the scope of this patent. Rather, the inventors have contemplated that the claimed subject matter might also be embodied in other ways, to include different steps or combinations of steps similar to the ones described in this document, in conjunction with other present or future technologies. Moreover, although the terms "step" and/or "block" may be used herein to connote different elements of methods employed, the terms should not be interpreted as implying any particular order among or between various steps herein disclosed unless and except when the order of individual steps is explicitly described.
Embodiments of the present invention provide an indexing and search process that allows large numbers of files to be indexed and retrieved in a cost-effective manner and that meets stringent latency constraints. According to an embodiment of the present invention, a process of estimating and pruning document candidates in multiple stages is utilized. Conceptually, this process looks like a funnel (funnel), and document candidates are estimated and pruned as the analysis becomes more complex over the entire stage. As this process continues throughout the phase, more expensive calculations are applied and the number of candidate files can be reduced by orders of magnitude. Different policies are applied at each stage to allow search results to be returned from a large number of files in a fast and efficient manner. In addition, the policies used at each stage may be designed to complement the policies used at other stages, thereby making the process more efficient.
The search index used by embodiments of the present invention indexes higher-order primitives (primative) or "atoms" from a file, as opposed to simply indexing a single word. As used herein, "atom" may refer to a variety of units of a query or document. These units may include, for example, words, n-grams, n-tuples, k-neighborhood n-tuples, and the like. The words are mapped down to a single symbol or word, as defined by the particular tokenizer (tokenizer) technique used. In one embodiment, a word is a single character. In another embodiment, a word is a single word or a combination of words. An n-gram is a sequence of "n" consecutive or nearly consecutive words that can be extracted from a document. An n-gram is said to be "tight" if it corresponds to a string of consecutive words, and "loose" if it contains words in the order they appear in the document, but the words are not necessarily consecutive. A loose n-gram is typically used to represent a class of equivalent phrases that differ in the absence of a significant word (e.g., "i will be wet if it rains" and "i will be wet if it rains"). An n-tuple, as used herein, is a set of "n" words that co-occur (order independent) in a document. Further, as used herein, a k-adjacent n-tuple refers to a set of "n" words that co-occur in a window of "k" words in the document. Thus, an atom is generally defined as a generalization of all of the above. Implementations of embodiments of the present invention may use different kinds of atoms, but as used herein, atoms generally describe each of the above-described kinds.
When building the search index, each document is analyzed to identify atoms in the document and to generate a pre-computed score or rank for each atom that represents the importance of the atom or its relevance to the context of the document. The search index stores information about pre-computed scores generated for document/atom pairs, which are used during the funnel process.
FIG. 2 illustrates stages of funnel processing according to one embodiment of the present invention. The stages of processing shown in FIG. 2 are performed after receiving a search query and include: the L0 matching stage 202, the L1 temporary ranking stage 204, and the L2 final ranking stage 206. As shown in fig. 2, as this process proceeds, the number of candidate files is reduced.
When a search query is received, the search query is analyzed to identify atoms. This atom is used during the L0 matching phase 202 to query the search index and identify an initial set of matching documents that contain the atom from the search query. As shown in FIG. 2, this may reduce the number of candidate documents from all documents indexed in the search index to those documents that match the atoms from the search query.
At the L1 temporary ranking stage 204, a preliminary score is calculated for the candidate documents retained from the L0 matching stage 202 using a simplified scoring function. The simplified scoring function operates on, among other things, pre-computed scores stored in the search index for document/atom pairs. In some embodiments, the simplified scoring function may be used as an approximation of a final ranking algorithm that will ultimately be used to rank the document. However, simplifying the scoring function provides a cheaper operation than the final ranking algorithm, which allows a large number of candidate documents to be processed quickly. Based on the preliminary score, the candidate documents are pruned. For example, only the top N files with the highest preliminary scores may be retained.
At the L2 final ranking stage 206, candidate files retained from the L1 temporary ranking stage 204 are evaluated using a final ranking algorithm. The final ranking algorithm is a more expensive operation with a large number of ranking features compared to the simplified scoring function used during the L1 provisional ranking stage 204. However, the final ranking algorithm is applied to a much smaller number of candidate files. The final ranking algorithm provides a ranked set of documents, and in response to the original search query, search results are provided based on the ranked set of documents.
Thus, in one aspect, embodiments of the present invention are directed to one or more computer storage media storing computer-useable instructions that, when used by a computing device, cause the computing device to perform a method. The method includes receiving a search query and rewriting (reformulating) the search query to identify one or more atoms. The method also includes identifying an initial set of documents from the search index based on the one or more atoms. The method further includes calculating a preliminary score for each document in the initial set of documents using a simplified scoring function and the pre-calculated scores stored in the search index for the document/atom pairs for the one or more atoms and the initial set of documents. The method also includes selecting a pruned set of documents from the initial set of documents based on the preliminary scores. The method further includes calculating a ranking score for each document in the pruned set of documents using a full ranking algorithm to provide a ranked set of documents. The method still further includes providing search results based on the ranked set of files for presentation to an end user.
In another embodiment of the present invention, an aspect is directed to a computerized system comprising at least one processor and one or more computer storage media. The system includes a query rewrite component that analyzes a received search query to identify one or more atoms based on terms contained in the received search query and generates a rewritten query. The system also includes a document matching component that queries the search index using the rewritten query to identify an initial set of matching documents. The system also includes a document pruning component that computes a preliminary score for each document of the initial set of matching documents using a simplified scoring function and identifies a pruned set of documents based on the preliminary score. The system still further includes a final document ranking component that calculates a ranking score for each document in the pruned set of documents using a full ranking algorithm.
A further embodiment of the present invention is directed to a method of providing search results responsive to a search query using a hierarchical process. The method includes receiving a search query and identifying one or more atoms from the search query. The method also includes identifying an initial set of documents that includes one or more atoms, calculating a preliminary score for each document in the initial set of documents using a simplified scoring function, and selecting a subset of documents for further processing based on the preliminary scores. The method further includes calculating a ranking score for each document in the subset of documents using a final ranking algorithm. The method still further includes providing a set of search results based on the ranking score.
In addition to the embodiments described above, methods for identifying relevant atoms from documents and indexing atom/document pairs are also described herein. For example, atoms (which may be classified as unigrams, n-grams, or n-tuples) are identified or extracted from the document. An information metric is calculated for each atom/document pair. The computation of the information metric can be based on many factors and can even be done by a machine learning tool that can learn how to compute the information metric. A threshold is used to discard those atom/document pairs that are deemed not to be as relevant or useful in parsing the query as other atom/document pairs based on the information metric. Those that are deemed most relevant are indexed in the search index for use when a search query is received in the future.
According to a first aspect of the present invention, a method is provided for populating one or more search indexes with atoms identified in a plurality of documents. The method includes identifying a set of documents to be indexed in a search index, and for each document in the set of documents, identifying a plurality of atoms, the plurality of atoms including one or more unigrams, one or more n-grams, and one or more n-tuples. Additionally, the method includes generating a list of atom/document pairs based on the identified collection of documents and the plurality of atoms, and calculating an information metric for each atom/document pair, wherein the information metric represents a ranking of atoms associated with a particular document. Additionally, the method includes selecting a subset of atom/document pairs that are most relevant to the particular document for which the atom is identified based on the information metric for each atom/document pair. The method further includes populating a search index with a subset of the atom/document pairs for the particular document.
According to a second aspect of the present invention, one or more computer storage media storing computer-useable instructions are provided that, when used by a computing device, cause the computing device to perform a method of populating one or more search indexes with atoms identified in a plurality of files. The method includes identifying a plurality of atoms from a first document of the plurality of documents to be indexed, classifying each of the plurality of atoms according to one or more of a unigram, an n-gram, or an n-tuple, and calculating an information metric for each of the plurality of atoms associated with the first document. Further, the method includes determining whether the information metric of each of the plurality of atoms meets a predetermined threshold. The atoms meeting the predetermined threshold are those most relevant to the first document. The method also includes discarding atoms that do not meet a predetermined threshold and incorporating atoms associated with the first document that meet the predetermined threshold into one or more search indexes.
According to a third aspect of the present invention, one or more computer storage media storing computer-useable instructions are provided that, when used by a computing device, cause the computing device to perform a method of populating one or more search indexes with atoms identified in a plurality of documents. The method includes extracting a plurality of atoms from the document, the plurality of atoms including one or more unigrams, one or more n-grams, and one or more n-tuples, and for each of the plurality of atoms, computing an information metric representing a ranking of a particular atom related to the document. The information metric is calculated based on one or more of: the frequency of the atom in the document, the proximity of two or more terms of the atom in the document, the relevance of two or more terms of the atom, or whether two or more terms of the atom have been previously linked together as evidenced by an examination of the query log. The method further includes determining an information metric threshold. Atom/document pairs whose information metric meets or exceeds the information metric threshold are indexed. Additionally, the method includes discarding a portion of the atom/document pairs based on the information metric. The information metric corresponding to the discarded atom/document pair is below the information metric threshold. One or more search indexes are populated by indexing atom/document pairs whose information metrics meet or exceed an information metric threshold, wherein unigrams, n-grams, and n-tuples are individually indexed. One or more search indexes are accessed to identify relevant documents for atoms in the query.
Having described an overview of embodiments of the present invention, an exemplary operating environment in which embodiments of the present invention may be implemented is described below in order to provide a general context for various aspects of the present invention. Referring initially to FIG. 1 in particular, an exemplary operating environment for implementing embodiments of the present invention is shown and designated generally as computing device 100. Computing device 100 is but one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the invention. Neither should the computing device 100 be interpreted as having any dependency or requirement relating to any one or combination of components illustrated.
The invention may be described in the general context of computer code or machine-useable instructions, including computer-executable instructions such as program modules, being executed by a computer or other machine, such as a personal digital assistant or other handheld device. Generally, program modules including routines, programs, objects, components, data structures, etc., refer to code that perform particular tasks or implement particular abstract data types. The invention may be implemented in a variety of system configurations, including hand-held devices, consumer electronics devices, general-purpose computers, more specialty computing devices, and the like. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network.
Referring to FIG. 1, computing device 100 includes a bus 110 that directly or indirectly couples the following devices: memory 112, one or more processors 114, one or more presentation components 116, input/output (I/O) ports 118, input/output components 120, and an illustrative power supply 122. Bus 110 represents what may be one or more busses (such as an address bus, data bus, or combination thereof). Although the various blocks of FIG. 1 are shown with lines for the sake of clarity, in reality, delineating various components is not so clear, and metaphorically, the lines would more accurately be dark and fuzzy. For example, a presentation component such as a display device may be considered an I/O component. In addition, the processor has a memory. The inventors recognize that such is the nature of the art, and reiterate that the diagram of FIG. 1 is merely illustrative of an exemplary computing device that can be used in connection with one or more embodiments of the present invention. There is no distinction between categories such as "workstation," server, "" laptop, "" handheld device, "etc., as all are contemplated to be within the scope of fig. 1 and are referred to as" computing devices.
Computing device 100 typically includes a variety of computer-readable media. Computer readable media can be any available media that can be accessed by computing device 100 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computing device 100. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term "modulated data signal" means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as ultrasonic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.
The memory 112 includes computer storage media in the form of volatile and/or nonvolatile memory. The memory may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid-state memory, hard disk drives, optical disk drives, and the like. Computing device 100 includes one or more processors that read data from various entities such as memory 112 or I/O components 120. The presentation component(s) 116 present data indications to a user or other device. Exemplary presentation components include a display device, speaker, printing component, vibrating component, and the like.
I/O ports 118 allow computing device 100 to be logically coupled to other devices including I/O components 120, some of which may be built-in. The illustrated components include a microphone, joystick, game pad, satellite dish, scanner, printer, wireless device, and the like.
Referring now to FIG. 3, a block diagram is provided illustrating an exemplary system 300 in which embodiments of the present invention may be used. It should be understood that this and other arrangements described herein are presented by way of example only. Other arrangements and elements (e.g., machines, interfaces, functions, orders, and combinations of functions, etc.) can be used in addition to or in place of those shown, and some elements may be omitted altogether. Further, many of the elements described herein are functional entities that may be implemented as discrete or distributed components or in conjunction with other components, and in any suitable combination and location. Various functions described herein as being performed by one or more entities may be carried out by hardware, firmware, and/or software. For example, various functions may be carried out by a processor executing instructions stored in a memory.
Among other components not shown, the system 300 may include a user device 302, a content server 304, and a search engine server 306. Each of the components shown in FIG. 3 may be any type of computing device, such as computing device 100 described with reference to FIG. 1, for example. The components may communicate with each other via a network 308, which may include, but is not limited to, one or more Local Area Networks (LANs) and/or Wide Area Networks (WANs). Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets and the Internet. It should be understood that any number of user devices, content servers, and search engine servers may be used in the system 300 within the scope of the present invention. Each may comprise a single device or multiple devices cooperating in a distributed environment. For example, the search engine server 306 may include multiple devices arranged in a distributed environment that collectively provide the functionality of the search engine server 306 described herein. Additionally, other components not shown may also be included in the system 300.
The search engine server 306 generally operates to receive a search query from a user device, such as the user device 302, to provide search results responsive to the search query. The search engine server 306 includes, among other components, an indexing component 310, a user interface component 312, a query rewrite component 314, a document matching component 316, a document pruning component 318, and a final document ranking component 320.
The indexing component 310 operates to index data about files maintained by a content server, such as content server 304. For example, a crawling component (not shown) may be used to crawl content servers and access information related to files maintained by the content servers. Indexing component 310 thus indexes data related to crawled files in search index 322. In an embodiment, the indexing component 310 indexes atoms found in documents and scoring information for documents in which each atom is found, indicating the importance of the atoms in the context of the documents. Any number of algorithms may be used to calculate the scores of atoms found in a document. By way of example only, the score may be based on a word frequency inverse document frequency (TF/IDF) function as known in the art. For example, a BM25F ranking (ranking) function may be used. The scores generated for the document/atom pairs are stored as pre-computed scores in search index 322.
In an embodiment, the indexing component 310 analyzes each document to identify terms, n-grams, n-tuples and determine which of these atoms should be indexed for the document. During processing of the document to be indexed, statistics on query distribution, term distribution, and/or simplified scoring functions to be used during funnel processing may be used to statistically select the best set of atoms to represent the document. These selected atoms are indexed in the search index 322 with pre-computed scores, which allows for efficient pruning of documents early in the funnel process.
Although not required, in some embodiments of the invention, search index 322 may include a reverse index (in atomic order) and a forward index (in file order). The inverted index may include a plurality of markup lists (posing lists), each markup list pointing to an atom and listing documents that include the atom, with a pre-computed score for each document/atom pair. As will be described in more detail below, the reverse index and forward index may be used at different stages of the funnel process.
The user interface component 312 provides an interface to user devices, such as the user device 302, that allows a user to submit search queries to the search engine server 306 and receive search results from the search engine server 306. The user device 302 may be any type of computing device used by a user to submit search queries and receive search results. By way of example only and not limitation, the user device 302 may be a desktop computer, laptop computer, tablet computer, mobile device, or other type of computing device. The user device 302 may include an application that allows a user to enter a search query and submit the search query to the search engine server 306 to retrieve search results. For example, the user device 302 may include a web browser that includes a search input box or allows a user to access a search page to submit a search query. Other mechanisms for submitting a search query to a search engine are also contemplated within the scope of embodiments of the present invention.
When a search query is received via the user interface component 312, the query rewrite component 314 operates to rewrite the query. The query is adapted from its free-text form into a format that facilitates querying the search index 322 based on how the data is indexed in the search index 322. In an embodiment, the terms of the search query are analyzed to identify atoms that may be used to query the search index 322. The atoms may be identified using techniques similar to those used to identify atoms in documents when indexing documents in search index 322. For example, atoms may be identified based on statistics of terms and query distribution information. Query rewrite component 314 may provide a set of atom conjunctions (connections) and cascading variables (cascading variables) for these atoms.
The document matching component 316 uses the rewritten query to query the search index 322 and identify a set of matching documents. For example, the rewritten query may include two or more atoms and the document matching component 316 may take the intersection of the tagged lists of these atoms to provide an initial set of matching documents.
The document pruning component 318 operates by pruning documents from the initial set of matching documents. This may include calculating a preliminary score for each document from the initial set of matching documents using the pre-calculated scores for document/atom pairs stored in the search index 322. The preliminary score may be based on a simplified scoring function that is tuned for performance and retrieval (recall). In some embodiments, the simplified scoring function used to generate the preliminary scores is established based on a full ranking algorithm that is then used to provide the final set of ranked documents. In this way, the simplified scoring function serves as an approximation of the final ranking algorithm. FOR example, a method such as that described in US patent application No. (not yet assigned) (attorney docket number mfcp.157122), entitled "simplified RANKING FOR efficiency scoring precomputation" may be used to establish the simplified scoring function. In some embodiments, the simplified scoring function includes a subset of the ranking features from the final ranking algorithm.
A number of different approaches may be used by the document pruning component 318 to prune the initial set of documents. In some implementations, the document pruning component 318 can retain a predetermined number of matches in the initial set of documents, while removing other documents from consideration (i.e., the top N matches). For example, the document pruning component 318 may retain the one thousand documents with the highest preliminary scores. The number of matches retained by the document pruning component 318 may be based on the fidelity confidence of the simplified scoring function used to generate the preliminary score. The fidelity confidence represents the ability of the simplified scoring function to provide a set of documents that match the set of documents that would be provided by the full ranking algorithm. For example, an average of 1200 documents may be taken from the simplified scoring function to get the top 1000 documents to be provided by the final ranking algorithm. In other implementations, instead of retaining a predetermined number of documents, the document pruning component 318 may retain documents having preliminary scores above a particular threshold.
In some implementations, the document matching component 316 and the document pruning component 318 can be tightly coupled such that document matching and pruning are combined in a single process for multiple iterations. For example, a preliminary score may be calculated because matching documents are identified and used to remove documents that would likely be discarded by the full ranking algorithm.
In some embodiments, a search INDEX using a hierarchical tag list (such as that described in U.S. patent application No. (not yet assigned) (attorney docket No. mfcp.157121) entitled "matching OF posing LISTS IN SEARCH ENGINE INDEX") may be used to facilitate this matching/pruning process. Each tag list will be associated with a given atom and will include layers ordered based on a pre-computed score assigned to the document, representing the relevance of the given atom to the context of each document. In each layer, the tags (posing) may be ordered internally by file. Using such a search index, document matching component 314 would take the initial set of documents using the first tier (with the highest pre-computed scores) and prune the initial set of documents using a simplified scoring function. If a sufficient number of files are provided, the matching/pruning process may end. Alternatively, if a sufficient number of files are not provided, matching and pruning may be repeatedly performed at a lower level layer until a sufficient number of files are retained.
The set of documents retained by the matching and pruning processes provided by the document matching component 316 and the document pruning component 318 are evaluated by a final document ranking component 320 to provide a final set of ranked documents. The final document ranking component 320 uses a full ranking algorithm that can operate on the original search query and the document collection retained by the matching and pruning process. The full ranking algorithm uses more ranking features and more data from the document than are used by the simplified scoring function used during the pruning process. Thus, the full ranking algorithm is a more expensive operation that requires more processing and takes longer to compute. However, because the set of candidate documents has been pruned, the full ranking algorithm is performed on a smaller set of documents.
The final file ranking component 320 provides a final set of ranked files that are indicated to the user interface component 312. The user interface component 312 then communicates the search results, including at least a portion of the final set of ranked files, to the user device 302. For example, the user interface component 312 may generate or otherwise provide a Search Engine Results Page (SERP) listing the search results based on the final set of ranked files.
Turning next to FIG. 4, a flow diagram is provided that illustrates an overall method 400 for using staged processing to return search results for a search query in accordance with an embodiment of the present invention. The staged processing begins with a pre-calculation/indexing stage, as shown at block 402. This phase is an offline phase, i.e., it is performed separately from any received search query. In the pre-computation/indexing stage 402, a file is crawled and data about the file is indexed in a search index. The process of indexing file data during the pre-calculation/indexing phase 402 according to one embodiment is discussed in further detail below with reference to FIG. 5.
After the pre-computation/indexing stage 402, the stages shown in FIG. 4 include an online stage in which a search query is received and search results are returned in response. The first phase of the online phase is the matching phase, as shown at block 404. During the matching phase 404, a search query is received and rewritten, which is used to identify matching files from the search index. The process for identifying matching files during the matching phase 404 according to one embodiment is discussed in further detail below with reference to FIG. 6.
The next stage after matching is the pruning stage, as shown at block 406. The pruning stage 406 takes the initial set of documents from the matching stage 404 and determines a preliminary score for each document using a simplified scoring function. Based on the preliminary score, documents are pruned from the initial set of documents. The process of pruning documents from the initial set of matching documents according to one embodiment is discussed in further detail below with reference to FIG. 7.
In some embodiments, the matching stage 404 and the pruning stage 406 may alternate. In particular, pruning may be performed when matching documents are identified to discard candidates earlier for further consideration where the preliminary score indicates that the document will likely be discarded by the final ranking algorithm.
The set of candidate documents retained after the matching stage 404 and the pruning stage 406 are further evaluated during the final ranking stage, as shown at block 408. During the final ranking stage 408, a full ranking algorithm is used to determine the final score of the retained documents. In some embodiments, a full ranking algorithm may be performed on the original search query and the retained data for each file. The full rating algorithm may use a number of different rating features to determine the final set of rated files. In response to the search query, search results are provided based on the final set of ranked files, as shown at block 410.
Turning now to FIG. 5, a flow diagram is provided that illustrates a method 500 for pre-computing scores for document/atom pairs and indexing data, in accordance with an embodiment of the present invention. Initially, as indicated at block 502, a file is accessed. For example, a crawler may be used to crawl documents and retrieve document data. At block 504, the file is processed. The document is processed to identify atoms contained in the document. As mentioned above, this processing includes analyzing the text of the document to identify words, n-grams, and n-tuples, and determining which of these atoms should be indexed for the document. Statistics on query distribution, term distribution, and/or a simplified scoring function to be used during funnel processing may be used to statistically select the best set of atoms to represent the document.
As shown at block 506, a score is generated for each atom identified in the document. The score represents the importance of the atom in the context of the document. Any number of algorithms may be used to calculate the scores of atoms found in a document. By way of example only, the score may be based on a word frequency inverse file frequency (TF/IDF) function as known in the art. For example, a BM25F ranking function may be used.
As shown at block 508, the data is indexed in the search index. This may include storing information about atoms found in the document and a score for each document/atom pair. These scores include pre-computed scores, which may be used during funnel processing. In some embodiments, a tag list is created for each atom. Each token list may include a list of documents that contain the atom and an indication of the pre-computed score for each document/atom pair.
Referring next to FIG. 6, a flow diagram is provided that illustrates a method 600 for retrieving an initial set of matching files during a matching phase, in accordance with an embodiment of the present invention. As shown at block 602, a search query is initially received. The search query may contain one or more search terms entered by a user using the user device.
As shown at block 604, the received search query is rewritten. In particular, the terms of the search query are analyzed to identify one or more atoms that may be used to query the search index. The analysis may be similar to the analysis used to identify atoms in documents when indexing document data. For example, statistics of terms and search queries may be used to identify atoms in the search queries. The rewritten query may include a set of conjunctions of atoms and their cascading variables (cascading variants).
As shown at block 606, the rewritten query is used to identify a set of matching files from the search index. In particular, the atoms identified from the original query are used to query the search index and identify matching documents. As indicated above, the search index may include a tagged list of the various atoms identified in the indexed document. A list of tokens corresponding to the atoms identified by the rewritten query may be identified and used to identify matching documents. For example, an initial set of matching documents may be provided based on the intersection of the token lists of the atoms of the rewritten query.
Turning to FIG. 7, a flow diagram is provided that illustrates a method 700 for pruning a document from an initial set of matching documents during a pruning stage in accordance with an embodiment of the present invention. As shown at block 702, a preliminary score is calculated for each document using the pre-calculated scores stored in the search index. This may include taking a pre-computed score for each atom of the document and using the pre-computed score in a simplified scoring function to generate a preliminary score for the document. The simplified scoring function may be established in such a way that: which provides an estimate of the final score provided by the full ranking algorithm. For example, the simplified scoring function may include a subset of the features used by the full ranking algorithm. In some embodiments, the simplified scoring function is defined using a process such as that described in U.S. patent application No. (not yet assigned) (attorney docket No. mfcp.157122), entitled "simple scoring FOR efficiency scoring precomputation".
As shown at block 704, documents are pruned from the initial set of matching documents based on the preliminary scores. In some embodiments, the top N documents are retained, i.e., the N documents with the highest preliminary scores are retained for further processing. The number of documents retained may be based on the fidelity of the simplified scoring function used to calculate the preliminary score. The fidelity of the simplified scoring function represents the ability of the simplified scoring function to provide a set of ranked documents similar to those provided by the final ranking algorithm. This knowledge can be used to determine the number of documents retained from the pruning stage if the correlation between the final ranking algorithm including the error in the simplified scoring function and the simplified scoring function is known. For example, if it is desired to provide 1000 search results and it is known that the top 1200 documents from the simplified scoring function, on average over all queries, will include the top 1000 documents from the final ranking algorithm, then the top 1200 documents will be retained from the pruning stage.
In some embodiments of the present invention, the funnel process may use a search index that includes a reverse index and a forward index. The inverted index is sorted by atom. This would facilitate fast retrieval of data during the matching and pruning stages of the funnel process. In particular, when a search query is received and an atom is identified from the search query, a list of tokens in an inverted index corresponding to the atom identified from the search query may be quickly accessed and used to identify matching documents, as well as to retrieve a pre-computed score used by the simplified scoring function. The forward index is sorted by file. This will facilitate the final classification stage of the funnel process. In particular, a pruned set of documents will be provided as a result of the matching and pruning stages. The pruned set of files will be relatively small. In this way, the forward index stores document data that is retrieved for documents in the pruned set of documents and used by the final ranking algorithm to provide a final set of ranked documents. In some embodiments, the forward index may be constructed as described in U.S. patent application No. (not yet assigned) (attorney docket No. mfcp.157165), entitled "EFFICIENT FORWARD RANKING IN A SEARCH ENGINE". Additionally, in some embodiments, a HYBRID DISTRIBUTION MODEL may be used FOR reverse and forward indexing, such as described in U.S. patent application No. (not yet assigned) (attorney docket No. mfcp.157166), entitled "HYBRID DISTRIBUTION MODEL FOR SEARCH ENGINE INDEXES" (the entire contents of which are incorporated herein by reference).
Turning now to FIG. 8, an exemplary system is illustrated in which embodiments of the present invention may be used. While some embodiments of the invention (as discussed herein) are directed to a funnel process that estimates and prunes document candidates in multiple stages, other embodiments are directed to identifying the most useful and relevant atoms in a document and indexing those atoms in a search index that is relevant to a particular document. Atoms can take a variety of forms, including words or unigrams, n-grams, or n-tuples. Although only a single word is commonly referenced herein, as will be discussed below, some types of atoms have multiple words, and thus, combinations of words may be commonly referenced. As used herein, a unigram maps to a single symbol or word (word) as defined by the employed tokenizer technique. Thus, a unigram may be a single word found in a document. An n-gram model is a sequence of "n" consecutive or nearly consecutive words extracted from a file. The n-gram may be tight or loose. An n-gram is said to be compact if it corresponds to a series of consecutive words. The loose n-gram contains words in the order they appear in the document, but the words need not be contiguous. A loose n-gram model is typically used to represent a class of equivalent phrases that are distinguished by a non-heavy word (word) (e.g., "i will be wet if it rains" versus "i will be wet if it rains"). For example, a bigram is two words with "n" equal to 2. Similarly, a trigram is three words with "n" equal to 3. An n-tuple, as used herein, is a set of "n" words that co-occur in a document, the order of which is independent. The atoms identified in the document are indexed into one or more search indexes. In one embodiment, there are respective indices for the unigram model, the n-gram model, and the n-tuple.
Returning to FIG. 8, it should be understood that this and other arrangements described herein are presented by way of example only. Other arrangements and elements (e.g., machines, interfaces, functions, orders, and combinations of functions, etc.) can be used in addition to or instead of those shown, and some elements may be omitted altogether. Further, many of the elements described herein are functional entities that may be implemented as discrete or distributed components or in conjunction with other components, and in any suitable combination and location. Various functions described herein as being performed by one or more entities may be carried out by hardware, firmware, and/or software. For example, various functions may be performed by a processor executing instructions stored in memory.
Among other components not shown, system 800 may include user device 802, index server 804, search index generator 808, and search index 818. Each of the components shown in FIG. 8 may be any type of computing device, such as computing device 100 described with reference to FIG. 1, for example. The components may communicate with each other via a network 806, which network 806 may include, but is not limited to, one or more Local Area Networks (LANs) and/or Wide Area Networks (WANs). Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets and the Internet. It should be understood that any number of user devices, index servers, search index generators, and search indexes may be used within system 800 within the scope of the present invention. Each may comprise a single device or multiple devices cooperating in a distributed environment. For example, the index server 804 may include multiple devices arranged in a distributed environment that collectively provide the functionality of the index server 804 described herein. Likewise, there may be multiple search indexes, as described herein. These may be stored in the search index 818 or may be stored in a separate location. Additionally, other components not shown may also be included in the system 800.
Index server 804 generally operates to receive search queries from user devices, such as user device 802, and to provide search results responsive to the search queries by searching one or more search indexes. Search index generator 808 includes, among other things, an atom identification component 810, an information metric computation component 812, an atom pruning component 814, and a search index component 816. Generally, search index generator 808 is responsible for generating or populating existing search indexes with atom/document pairs that are determined to be most useful or relevant for future queries. The atom identification component 810 is generally responsible for examining documents and extracting individual terms from the documents. Additionally, the atom identification component 810 identifies those atoms that are n-gram models and n-grams. For example, the atom identification component 810 can potentially identify an n-gram by determining the location of various terms relative to one another. As mentioned, the words comprising the n-tuple are location independent and therefore can be located anywhere in the file. The description of FIGS. 9A, 9B, and 9C further explains the n-tuple model and n-tuple below.
The information metric computation component 812 computes an information metric. The atoms identified in the document may be selected as most relevant or useful to the particular document based on the information metrics. Typically, an information metric is a ranking of atoms relative to a particular document from which the atoms were identified or parsed. Information metrics estimate the usefulness of an atom in parsing a general query. In one embodiment, the information metric computation component 812 utilizes an algorithm to compute the information metric for each atom/document pair. A variety of factors may be used in conjunction with the algorithm to calculate the information metric. For purposes of example only, these factors may include the information score, the frequency of atoms in the document, the separation of words if the atom is an n-gram or n-tuple, the number of times the terms in the atom appear individually and together, and whether the atom or the term including the atom appears in the query log. This last factor proves that the atomic terms are somehow related and that the terms have been previously searched. With respect to the number of times a word appears in a document, if each word in an atom appears multiple times, but is not close in distance to each other, this may mean that the words simply happen to be in the same document without deeper meaning. It becomes more meaningful if the words appear closer to each other than would be expected by chance.
The atom pruning component 814 is responsible for pruning the number of atom/document pairs for each document such that those atoms that are unlikely to be relevant or important to a particular document are not indexed and thus do not take up too much storage space. For a file with 400 different words and therefore 400 entries in the search index, there would be 80000 pairs of words for this single file if a bigram was also identified in this file. This number grows even larger if the trigram and n-tuple are also indexed. Not only is the number of atom/document pairs increasingly large, but the location of each term can also be stored in the search index, which takes up storage space as does the atom/document pairs themselves. As mentioned, based on a number of factors, some of which are listed above, the algorithm computes an information metric that is later used to determine whether a particular atom/document pair is to be indexed. This determination is based on a threshold. The threshold is set in one embodiment by checking previous runs, such as the previous day, and also based on initial testing. There are many ways to calculate the threshold, and the above is provided for exemplary purposes only. The threshold value is therefore typically a predetermined value. Thus, based on the threshold, the atom pruning component 814 examines the information metrics of each atom/document pair and makes a decision as to whether each pair should be indexed or discarded.
Once the atom pruning component 814 has pruned the number of atom/document pairs, the search indexing component 816 can generate a search index or add entries to an existing search index, as described above. In one example, the search index may be generated during the process described above, and those entries in the search index may be incorporated into an existing search index, such as the primary index. Search index 818 stores a plurality of search indexes in one embodiment. Likewise, as previously mentioned, there may be separate search indices for various types of atoms, including a unigram index, an n-gram index, and an n-tuple index. A unigram index is a mapping from a given term to a list of file identification/rating records. In one embodiment, the pruning process is not applied to the meta-models, and may not need to be pruned, or at least need not be pruned as much as the n-grams and n-tuples, since the number of meta-models is typically manageable. The n-gram index includes the n-grams identified in the document for a given "n" by a sliding window algorithm. For example, for the word stream t1t2t3t4t5, where n =2, then the n-gram atoms include (t 1t 2), (t 2t 3), (t 3t 4), and (t 4t 5). Thus, from a string of five words of n =2, there are four atoms generated. These atoms are indexed and stored in a (docID, hierarchical) record, where the "hierarchy" is an approximation of the hierarchy of two consecutive words in the document.
In some implementations, the ranking or information metrics are not stored in the index, but instead are used only to determine which atoms are indexed and which are discarded. The n-tuple index is similar to the n-tuple model index described herein, except that there are an exponential number of n-tuples identified from the file, since the locations of the words of the n-tuples can be considered irrelevant. Thus, n-tuples are typically pruned more than n-grams and unigrams. Further, in some cases, the n-gram and n-tuple can be replicated, so the replica is discarded during the pruning process. Once identified and indexed, in one embodiment, the atoms (unigrams, n-grams, n-tuples) are stored in a dictionary using PRIORITY HASH indices, such as described in U.S. patent application No. 12/980582 (attorney docket No. mfcp.157119) entitled "PRIORITY HASH INDEX" (the entire contents of which are incorporated herein by reference).
FIGS. 9A, 9B, and 9C illustrate examples of entries in a unigram search index, an n-metagram search index, and an n-tuple search index, respectively, according to embodiments of the invention. The embodiments of FIGS. 9A, 9B and 9C use a sampling word string of "Holistic Approach in Southern Sweden". FIG. 9A illustrates a unigram model 900 identified from the word string. As shown, there are 5 identified unigrams, each consisting of a single word. FIG. 9B illustrates an n-gram model 910 identified from the sample string. Since the n-grams are close or adjacent to each other, 7 n-grams are identified, which is a higher number than the identified meta-models. FIG. 9C illustrates an n-tuple 920 identified from the sample string of words. As shown, a very large number of n-tuples are identified as compared to the unigram or n-gram model, since the n-tuples may be paired or otherwise matched together, even if they are not adjacent or near each other in the document. 13 n-tuples are identified from a sample word string consisting of 5 words. 9A, 9B, and 9C are shown to illustrate how the number of n-tuples is typically much larger than the number of unigrams or n-grams.
Referring to FIG. 10, a flow diagram of a method 1000 according to an embodiment of the present invention is shown, the method 1000 for populating one or more search indexes with atoms identified in a plurality of documents. Initially, a set of files to be indexed in a search index is identified at step 1010. The files are generally indexed so that when a search query is received, the most relevant files can be easily found for the user by accessing the search index. At step 1012, atoms are identified in each document. As mentioned, an atom may be one or more of a unigram, an n-gram, or an n-tuple. A meta-model is typically a single symbol or word, whereas when "n" is greater than one, an n-meta-model is a plurality of words or symbols that are disposed adjacent or near each other in a document. For example, an n-gram may be a sequence of consecutive or nearly consecutive words extracted from a particular document, where "n" is the number of consecutive or nearly consecutive words. An N-tuple is a plurality of words or symbols co-occurring in the same file, but not necessarily located adjacent or near each other in the file. In one example, the words that contain the n-tuple may not be close to each other at all, such as in different parts of the file. Furthermore, the n-tuples are order independent.
At step 1014, a list of atom/document pairs is generated. An atom/document pair is an atom identified in a document and a document identification corresponding to the document from which the atom is identified. An information metric is computed for each atom/document pair at step 1016. The information metrics represent a ranking of atoms that are related to a particular document, such as how related the atoms are with respect to the document in parsing the search query. In one embodiment, a machine learning tool is used to calculate an information metric for each atom, in addition to selecting the most relevant atom/document pair that is determined to be most relevant to the document from which the atom is identified based on the information metric and other factors. The calculation of the information metric may use an algorithm that utilizes a variety of factors. For purposes of example only, these factors may include the information scores of words in any corpus, the frequency of one or more words in a document that contain an atom, the spacing of words, how many times one or more words appear independently and how many times they appear together, and whether and how often an atom appears in a query log. There are other factors that may be used and are contemplated to be within the scope of the present invention.
A subset of atom/document pairs is selected at step 1018 as being most relevant to the particular document. This selection at step 1018 is based on the information metrics computed for the atom/document pair. Typically, a threshold is determined such that those information metrics above the threshold are considered relevant and those below are not considered relevant or at least not so relevant. In one embodiment, selecting the subset of atom/document pairs includes using a pruning algorithm to prune or limit the number of atom/document pairs to a smaller number such that less relevant atom/document pairs than other more relevant atom/document pairs are discarded and therefore not indexed. At step 1020, the search index is populated with a subset of the atom/document pairs for the particular document. As mentioned, all atom/document pairs may be initially indexed in a separate index, and then only those selected as most relevant are populated or indexed into the main search index. Additionally, as mentioned, there may be more than one search index, so in one embodiment, a unigram is indexed in a unigram index, an n-gram is indexed in an n-gram index, and an n-tuple is indexed in an n-tuple index.
In one embodiment, a search query is received. The search query may be rewritten as at least one of a unigram, an n-gram, an n-tuple, or a combination thereof. The search index in which the atoms have been indexed is accessed to determine the most relevant documents for the rewritten search query.
FIG. 11 is a flow diagram illustrating a method 1100 for populating one or more search indexes with atoms identified in a plurality of documents, in accordance with an embodiment of the present invention. Initially at step 1110, an atom is identified from a first document. Each of these atoms is classified in step 1112 as a unigram, n-gram, n-tuple, or a combination thereof. At step 1114, an information metric is computed for each of the identified atoms. As mentioned, the information metric represents the ranking of an atom/document pair, as it is useful in parsing a general query. Factors used in calculating the information metric include, but are not limited to, the frequency of the atom in the first document, the proximity of the location of two or more terms of the atom in the first document, the relevancy of the terms of the atom, and whether the terms of the atom have been previously linked together as evidenced by an examination of the query log. At step 1116, a determination is made as to whether the information metric for each atom meets a predetermined threshold. Atoms meeting the threshold are those that are deemed or known to be most relevant with respect to the first document. In one embodiment, the threshold may be arbitrary, or in another embodiment may be purely based on a quantity such as how many atoms are indexed. In yet another embodiment, the threshold is based on a prior attempt performed with respect to an atom found to be related to the particular document.
At step 1118, atoms that do not meet the predetermined threshold are discarded. Atoms that meet the threshold are incorporated into one or more search indexes, shown at step 1120. In one embodiment, the one or more search indexes include a unigram index, an n-metagram index, and an n-tuple index. In one embodiment, as previously mentioned, all of the unigrams identified in the indexed documents may be incorporated into the search index and thus not pruned. Additionally, the same process is appropriate for n-tuples in one embodiment. Alternatively, the n-gram model may be pruned to a certain degree but not as much as the n-grams. In this way, a greater percentage of n-tuples than n-grams and unigrams may be discarded. In addition, some n-tuples may also be identified as n-grams, so duplicates may be discarded during the pruning process.
Another embodiment of FIG. 11 contains an identification of an atom from a second document. Each of these atoms is classified as a unigram, n-gram, n-tuple, or a combination thereof. An information metric is calculated for each atom associated with the second document. Some atoms may be the same as or similar to those identified from the first document, but may have different information metrics based on the different documents from which the atoms were identified. Thereby determining whether the information metric for each atom meets a predetermined threshold. Those that are eligible are considered most relevant with respect to the second file. Those that do not meet the threshold are discarded. Those atoms that meet the threshold are incorporated into the search index.
FIG. 12 is a flow diagram illustrating a method 1200 for populating one or more search indexes with atoms identified in a plurality of documents, in accordance with an embodiment of the present invention. At step 1210, atoms are extracted from the document. Atoms can be classified as unigrams, n-grams, or n-tuples. For each atom, an information metric is computed at step 1212. The information metric represents a ranking of a particular atom with respect to the document. Further, the computation of the information metric may be based on, for example, the frequency of the atoms in the document, the proximity of the terms of the atoms in the document, the relevance of the terms of the atoms, and whether the terms of the atoms have been previously linked together as evidenced by the examination of the query log. Other factors may also be used and are contemplated to be within the scope of the present invention.
At step 1214, an information metric threshold is determined such that those atom/document pairs whose information metric meets or exceeds the information metric threshold are indexed. At step 1216, based on the information metric, a portion of the atom/file pairs are discarded, such as if the information metric does not meet a threshold. The one or more search indexes are populated at step 1218 with atom/document pairs for which the information metric meets or exceeds the information metric threshold. In one embodiment, the unigram, the n-gram, and the n-tuple are indexed separately. At step 1220, the search index is accessed to identify documents that are relevant to the atoms in the received search query.
As can be appreciated, embodiments of the present invention provide for the calculation of an information metric for each atom/document pair and use the information metric to determine which atom/document pairs are indexed and which are discarded. The present invention has been described in relation to particular embodiments, which are intended in all respects to be illustrative rather than restrictive. Alternative embodiments will become apparent to those skilled in the art to which the present invention pertains without departing from its scope.
From the foregoing, it will be seen that this invention is one well adapted to attain all the ends and objects set forth above, together with other advantages which are obvious and inherent to the system and method. It will be understood that certain features and subcombinations are of utility and may be employed without reference to other features and subcombinations. This is contemplated by and is encompassed by the scope of the claims.
Claims (11)
1. A method for populating one or more search indexes with atoms identified in a plurality of documents, the method comprising:
identifying (1010) a set of files to be indexed in a search index;
for each document of the collection of documents, identifying (1012) a plurality of atoms, the plurality of atoms comprising one or more unigrams, one or more n-grams, and one or more n-tuples;
generating (1014) a list of atom/document pairs based on the identified collection of documents and the plurality of atoms;
calculating (1016) an information metric for each atom/document pair, wherein the information metric represents a ranking of atoms associated with a particular document;
selecting (1018), based on the information metrics for each atom/document pair, a subset of atom/document pairs that are most relevant to the particular document from which the atom was identified; and
the search index is populated (1020) with a subset of atom/document pairs for a particular document.
2. The method of claim 1, wherein the search index comprises one or more search indexes, wherein the one or more search indexes comprise a unigram index, an n-metagram index, and an n-tuple index.
3. The method of claim 1, wherein selecting the subset of atom/document pairs that are most relevant to the particular document further comprises: the number of atom/document pairs is truncated to a smaller number using a pruning algorithm, such that atom/document pairs that are less relevant than other atom/document pairs are not indexed.
4. The method of claim 1, wherein a machine learning tool is used to compute information metrics for atom/document pairs and select a subset of atom/document pairs that are most relevant to the particular document from which the atom was identified.
5. The method of claim 1, further comprising:
receiving a search query;
rewriting the search query as at least one of one or more unigrams, one or more n-grams, or one or more n-tuples; and
using the rewritten search query, a search index is accessed to determine the most relevant files for the search query.
6. A method (1100) of populating one or more search indexes with atoms identified in a plurality of documents, the method comprising:
identifying (1110) a plurality of atoms from a first file of a plurality of files to be indexed;
classifying (1112) each atom of the plurality of atoms as one or more of a unigram, an n-gram, or an n-tuple;
calculating (1114), for each atom of the plurality of atoms, an information metric related to the first document;
determining (1116) whether the information metric for each atom of the plurality of atoms meets a predetermined threshold, wherein the atoms meeting the predetermined threshold are those that are most relevant for the first document;
discarding (1118) atoms that do not meet a predetermined threshold;
atoms that meet a predetermined threshold for the first document are incorporated (1120) into one or more search indexes.
7. The method of claim 6, wherein the information metric for the first atom identified in the first document represents a ranking of the first atom as to how useful the first atom is in parsing the search query with the first atom for the first document.
8. The method of claim 6, wherein the calculation of the information metric for each atom of the plurality of atoms is based on one or more of: a frequency of the atom in the first document, a proximity of two or more terms of the atom in the first document, a relevance of two or more terms of the atom, or whether two or more terms of the atom have been previously linked together as evidenced by an examination of the query log.
9. The method of claim 9, further comprising:
identifying a plurality of atoms from a second document;
classifying each atom of the plurality of atoms as one or more of a univariate model, an n-gram model, or an n-gram;
calculating, for each atom of the plurality of atoms, an information metric about the second document;
determining whether the information metric of each atom of the plurality of atoms meets a predetermined threshold, wherein the atoms meeting the predetermined threshold are those most relevant for the second document;
discarding atoms that do not meet a predetermined threshold; and
atoms meeting a predetermined threshold for the second document are incorporated into one or more search indexes.
10. A method (1200) for populating one or more search indexes with atoms identified in a plurality of documents, the method comprising:
extracting (1210) a plurality of atoms from the document, the plurality of atoms comprising one or more unigrams, one or more n-grams, and one or more n-tuples;
for each atom of the plurality of atoms, calculating (1212) an information metric representing a ranking of the particular atom with respect to the document, wherein the calculation of the information metric is based on one or more of: a frequency of atoms in the document, a proximity of two or more terms of atoms in the document, a relevance of two or more terms of atoms, or whether two or more terms of atoms have been previously linked together as evidenced by an examination of a query log;
determining (1214) an information metric threshold, wherein atom/document pairs for which the information metric meets or exceeds the information metric threshold are indexed;
discarding (1216) a portion of the atom/file pairs based on the information metric, wherein the information metric corresponding to the discarded atom/file pair is below the information metric threshold;
populating (1218) the one or more search indexes by indexing atom/document pairs for which an information metric meets or exceeds the information metric threshold, wherein the unigram, the n-gram, and the n-tuple are individually indexed; and
accessing (1220) the one or more search indexes to identify relevant files in the query for the atom.
11. One or more computer storage media storing computer-useable instructions that, when used by a computing device, cause the computing device to perform the method of any of claims 1-10.
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US13/045278 | 2011-03-10 |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| HK1172124A true HK1172124A (en) | 2013-04-12 |
| HK1172124B HK1172124B (en) | 2018-04-20 |
Family
ID=
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US9342582B2 (en) | Selection of atoms for search engine retrieval | |
| US8620907B2 (en) | Matching funnel for large document index | |
| US8463593B2 (en) | Natural language hypernym weighting for word sense disambiguation | |
| US8037068B2 (en) | Searching through content which is accessible through web-based forms | |
| Medelyan et al. | Domaināindependent automatic keyphrase indexing with small training sets | |
| US8051084B2 (en) | System and method for measuring the quality of document sets | |
| US9529908B2 (en) | Tiering of posting lists in search engine index | |
| US8805755B2 (en) | Decomposable ranking for efficient precomputing | |
| MXPA05005220A (en) | Method and system for schema matching of web databases. | |
| WO2009079875A1 (en) | Systems and methods for extracting phrases from text | |
| US20190362003A1 (en) | Techniques for processing long-tail search queries against a vertical search corpus | |
| CN120256648B (en) | A retrieval enhancement processing method and system based on a large model | |
| Bassil | A survey on information retrieval, text categorization, and web crawling | |
| CN118193681A (en) | A text information extraction method, device, equipment and storage medium | |
| AU2021102702A4 (en) | A process for query reformulation system using rank aggregation and genetic approach | |
| Maryamah et al. | Hybrid information retrieval with masked and permuted language modeling (MPNet) and BM25L for Indonesian drug data retrieval | |
| Gupta et al. | A review on important aspects of information retrieval | |
| HK1172124A (en) | Selection of atoms for search engine retrieval | |
| HK1172124B (en) | Selection of atoms for search engine retrieval | |
| CN102682073B (en) | Selection of atoms for search engine retrieval | |
| Kashefi et al. | Optimizing Document Similarity Detection in Persian Information Retrieval. | |
| Schedl et al. | Automatically detecting members and instrumentation of music bands via web content mining | |
| Zheng et al. | An improved focused crawler based on text keyword extraction | |
| Bhatia et al. | Contextual paradigm for ad hoc retrieval of user-centric web data | |
| Sharma et al. | Improved stemming approach used for text processing in information retrieval system |