HK1173524B

HK1173524B - Dynamic query master agent for query execution

Info

Publication number: HK1173524B
Application number: HK13100619.6A
Authority: HK
Inventors: Magne Risvik Knut; Hopcroft Michael; Kalyanaraman Karthik; Chilimbi Trishul; Setiawan Henry; W. Anderson Christopher
Original assignee: Microsoft Technology Licensing, Llc
Priority date: 2011-03-25
Filing date: 2013-01-15
Publication date: 2018-06-29

Description

Dynamic query master agent for query execution

Cross Reference to Related Applications

This application is a continuation-in-part application entitled "HYBRID-DISTRIBUTION MODEL FOR search engine indexing" filed on 11/2010 and 22/2010, and having application number 12/951,815 (attorney docket number mfcp. 157166), the entire contents of which are incorporated herein by reference.

Background

The amount of information and content available on the internet continues to grow very quickly. Given the vast amount of information, search engines have been developed to facilitate searches for electronic documents. In particular, users may search for information and documents by entering a search query that includes one or more terms that may be of interest to the user. After receiving a search query from a user, the search engine identifies relevant documents and/or web pages based on the search query. Because of its utility, web searching, i.e., the process of finding relevant web pages and documents for user-issued search queries, has proven to become one of the most popular services on the internet today.

In addition, search engines typically use a single-step process that utilizes a search index to identify relevant documents to be returned to a user based on a received search query. However, the search engine ranking (ranking) function has emerged as a very complex function that can be time consuming and expensive if used for each document that is indexed. In addition, the storage of the data required for these complex formulas can also cause problems, especially when stored in a reverse index, which is typically indexed with words or phrases. The extraction of relevant data required by complex formulas is inefficient when stored in the reverse index.

Disclosure of Invention

This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.

Embodiments of the present invention involve employing both atomic-sharded and document-sharded distributions across the same set of nodes, such that each node or machine stores both a portion of the reverse index (e.g., sharded by atom) and a portion of the forward index (e.g., sharded by document). A segment (segment) may be assigned a set of documents for which it is responsible. The set of documents is indexed both by atom and by document such that there is a reverse index and a forward index associated with the set of documents. Each segment includes a plurality of nodes, and each node may be assigned a different portion of both the reverse and forward index. In addition, each node is responsible for performing multiple ordering calculations using both the reverse and forward index portions stored thereon. For example, the preliminary ranking process may utilize a reverse index and the final ranking process may utilize a forward index. These ranking processes form an overall ranking process that is used to identify the most relevant documents based on the received search query.

Other embodiments of the present invention are directed to the selection of preliminary segment roots and final segment roots. In general, the preliminary segment root is selected based on any known information when it is selected, and is only used temporarily until the final segment is selected. In an embodiment, the preliminary segment root utilizes an algorithm to select a final segment root based on statistical data received from the various nodes or machines that make up the segment. As will be explained in more detail herein, there are many segments used to decompose (resolve) a search query, each segment including multiple nodes or machines. The preliminary segment root is selected only from those nodes whose search index contains terms or atoms that exist in the search query that has been received. The set of nodes includes only the nodes that will be used to execute the particular search query. Once more information can be provided, such as input/output load, current and expected load, including query queues, problem signals associated with the nodes, etc., the final segment root is selected such that a minimum amount of data is transmitted across the network, thus reducing the overall cost of executing the search query.

Drawings

The invention is described in detail below with reference to the attached drawing figures, wherein:

FIG. 1 is a block diagram of an exemplary computing environment suitable for use in implementing embodiments of the present invention;

FIG. 2 is a block diagram of an exemplary system in which embodiments of the invention may be employed;

FIG. 3 is an exemplary diagram of a hybrid-distribution system in accordance with an embodiment of the present invention;

FIG. 4 is an exemplary diagram of a hybrid-distribution system illustrating payload requirements, in accordance with an embodiment of the present invention;

FIG. 5 is a flow diagram illustrating a method for utilizing a hybrid-distribution system for identifying relevant documents based on a search query, in accordance with embodiments of the present invention;

FIG. 6 is a flow diagram illustrating a method for generating a hybrid-distribution system for a multi-process document retrieval system in accordance with an embodiment of the present invention;

FIG. 7 is a flow diagram illustrating a method for utilizing a hybrid-distribution system for identifying relevant documents based on a search query, in accordance with embodiments of the present invention; and

8-10 are flow diagrams illustrating various methods for identifying a segment root from a plurality of nodes in accordance with embodiments of the present invention.

Detailed Description

The subject matter of the present invention is described with specificity herein to meet statutory requirements. However, the description itself is not intended to limit the scope of this patent. Rather, the inventors have contemplated that the claimed subject matter might also be embodied in other ways, to include different steps or combinations of steps similar to the ones described in this document, in conjunction with other present or future technologies. Moreover, although the terms "step" and/or "block" may be used herein to connote different elements of methods employed, the terms should not be interpreted as implying any particular order among or between various steps herein disclosed unless and except when the order of individual steps is explicitly described.

As described above, embodiments of the present invention provide that nodes form a segment, such that each stores a portion of the forward index and reverse index for the segment. For example, among the total number of documents to index (e.g., one trillion), each segment may be assigned a certain portion of documents, such that the segment is responsible for indexing and performing ranking calculations for those documents. The portion of the reverse and forward index stored on the particular segment is the complete reverse and forward index relative to the documents assigned to the segment. Each segment includes a plurality of nodes, which are essentially machines or computing devices with storage capabilities. Separate portions of the reverse index and the forward index are assigned to each node in the segment so that each node may be employed to perform various ordering calculations. Thus, each node has stored thereon a subset of the reverse index and forward index of the segment and is responsible for accessing each in various ordering processes within the segment. For example, the overall ranking process may include a matching phase, a preliminary ranking phase, and a final ranking phase. The matching/preliminary stage may entail identifying a first set of documents relevant to the search query using those nodes whose reverse index has indexed an atom from the search query. The first set of documents is a set of documents from the documents assigned to the segment. Subsequently, those nodes whose forward index has indexed document identifications associated with documents in the first set of documents may be used to identify a second set of documents that are more relevant to the search query. In one embodiment, the second set of documents is a subset of the first set of documents. This overall process may be used to limit a set of documents to those found to be relevant, so that a final ranking process, which is typically more time consuming and expensive than the preliminary ranking process, is used to rank fewer documents than would be possible if each document in the index was ranked (whether relevant or not).

Accordingly, in one aspect, embodiments of the present invention are directed to one or more computer storage media storing computer-useable instructions that, when used by a computing device, cause the computing device to perform a method for utilizing a hybrid-distribution system to identify relevant documents based on a search query. The method includes assigning a set of documents to a segment, the set of documents indexed by atom with a reverse index and indexed by document with a forward index, and storing different portions of the reverse index and the forward index on each of a plurality of nodes forming the segment. In addition, the method includes accessing a reverse index portion stored on each of the first set of nodes to identify a first set of documents relevant to the search query. The method additionally includes accessing a forward index portion stored on each of a second set of nodes based on document identifications associated with the first set of documents to restrict a number of related documents in the first set of documents to the second set of documents.

In another embodiment, aspects of the invention are directed to one or more computer storage media storing computer-useable instructions that, when used by a computing device, cause the computing device to perform a method for generating a hybrid-distribution system for a multi-process document retrieval system. The method includes receiving an indication of a set of documents assigned to a segment, the segment including a plurality of nodes. For segments, the method further includes indexing the assigned set of documents by atom to generate a reverse index and indexing the assigned set of documents by document to generate a forward index. The method additionally includes assigning a portion of the reverse index and a portion of the forward index to each of a plurality of nodes forming the segment such that each of the plurality of nodes has stored a different portion of the forward index and a different portion of the reverse index.

Another embodiment of the invention is directed to one or more computer storage media storing computer-useable instructions that, when used by a computing device, cause the computing device to perform a method for utilizing a hybrid-distribution system to identify relevant documents based on a search query. The method includes receiving a search query, identifying one or more atoms in the search query, and transmitting the one or more atoms to a plurality of segments, each segment having assigned thereto a set of documents indexed by atom and by document, such that a reverse index and a forward index are generated and stored at each of the plurality of segments. Each of the plurality of segments includes a plurality of nodes, each node being assigned a portion of a forward index and a reverse index. Based on the one or more atoms, the method identifies a first set of nodes at the first segment whose reverse index portion contains at least one of the one or more atoms from the search query. Additionally, the method includes accessing the reverse index portion stored at each of the first set of nodes to identify a first set of documents found to be related to the one or more atoms and identifying a second set of nodes whose forward index portion contains one or more of the document identifications associated with the first set of documents based on the document identifications associated with the first set of documents. The method also includes accessing a forward index portion stored at each of a second set of nodes to identify a second set of documents that is a subset of the first set of documents.

In other embodiments of the invention, the segment root is selected from the nodes used to execute a particular query, such as those nodes whose search index includes terms or atoms present in the search index. Initially, a preliminary segment root is selected and temporarily used to collect data from other nodes. In one embodiment, the preliminary segment root collects and aggregates statistics from other nodes that will be used to execute the search query. The preliminary revision uses an algorithm to determine which node is best suited to serve as the final segment root for the particular query. For example, a segment root with the most data to transmit may be a good choice for a final segment root because it is not necessary to transmit a large amount of data to another node for aggregation. In one embodiment, the primary concern in selecting the final segment root is cost, such as timeliness and ease of transferring data from one node to the final segment root. For example, when a node involved in the execution of a particular query identifies documents relevant to a search query, this data must be transferred to the final segment root that aggregates this information from many nodes. The final segment root passes the aggregated query extraction data from many nodes to another component that assembles similar data from multiple segments. Likewise, the goal is to select a final segment root that will make the overall query execution process more cost effective.

Likewise, an aspect is directed to one or more computer storage media storing computer-useable instructions that, when used by a computing device, cause the computing device to perform a method for assigning segment roots. The method includes receiving a search query, identifying a set of nodes in a segment that will be used to decompose the search query, and selecting a preliminary segment root from the set of nodes. Further, the method includes receiving, at the preliminary segment root, statistical data from each node in the identified set of nodes, the statistical data indicating a capability of each node to act as a final segment root responsible for assembling results of query execution from the set of nodes based on the search query. The method additionally includes algorithmically selecting a final segment root from the set of nodes based on the statistical data and notifying the set of nodes of the final segment root so that the nodes know where to send their respective query execution results.

A second aspect is directed to one or more computer storage media storing computer-useable instructions that, when used by a computing device, cause the computing device to perform a method for assigning segment roots. The method includes receiving a search query to be executed at a segment that includes a plurality of nodes, and identifying a set of nodes from the plurality of nodes that will be used to execute the search query. Prior to executing the search query, the method includes selecting a preliminary segment root from the plurality of nodes, the selecting based on one or more of an expected load for each node or a random selection. Further, the method includes receiving statistical data at the preliminary segment root from each node in the set of nodes that will be used to execute the search query. The statistical data includes current load and cost data associated with sending data across the network. Based on the statistical data, a final segment root is selected that will aggregate query execution data from the set of nodes during query execution. The search query is then executed.

A third aspect is directed to one or more computer storage media storing computer-useable instructions that, when used by a computing device, cause the computing device to perform a method for assigning a segment root. The method includes receiving a search query at a subject root (corpus root) that includes a plurality of segments. Each of the plurality of segments includes a plurality of nodes. Each node has a portion of the search index stored thereon. The method also includes identifying a set of nodes in each segment to be used to execute the received search query and identifying a preliminary segment root from the set of nodes for each of the plurality of segments. In addition, statistical data is requested from each node in the set of nodes that will be used to execute the received search query. The statistics are received from each node in the set of nodes, the statistics indicating an availability of each node to act as a final segment root that collects query execution data from the set of nodes in its respective segment. The method additionally includes selecting a final segment root for each segment based on the statistical data and executing the search query.

Having briefly described an overview of embodiments of the present invention, an exemplary operating environment in which embodiments of the present invention may be implemented is described below in order to provide a general context for various aspects of the present invention. Referring initially to FIG. 1 in particular, an exemplary operating environment for implementing embodiments of the present invention is shown and designated generally as computing device 100. Computing device 100 is but one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the invention. Neither should the computing device 100 be interpreted as having any dependency or requirement relating to any one or combination of components illustrated.

The invention may be described in the general context of computer code or machine-useable instructions, including computer-executable instructions such as program modules, being executed by a computer or other machine, such as a personal data assistant or other handheld device. Generally, program modules including routines, programs, objects, components, data structures, etc., refer to code that perform particular tasks or implement particular abstract data types. The invention may be implemented in a variety of system configurations, including hand-held devices, consumer electronics, general-purpose computers, more specialty computing devices, and the like. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network.

Referring to FIG. 1, computing device 100 includes a bus 110 that directly or indirectly couples the following devices: memory 112, one or more processors 114, one or more presentation components 116, input/output (I/O) ports 118, input/output components 120, and an illustrative power supply 122. Bus 110 may represent one or more buses, such as an address bus, a data bus, or a combination thereof. Although the various blocks of FIG. 1 are shown with lines for the sake of clarity, in reality, delineating various components is not so clear, and metaphorically, the lines would more accurately be grey and fuzzy. For example, one may consider a presentation component such as a display device to be an I/O component. In addition, the processor has a memory. The inventors recognize that such is the nature of the art, and reiterate that the diagram of FIG. 1 is merely illustrative of an exemplary computing device that can be used in connection with one or more embodiments of the present invention. As is contemplated within the scope of fig. 1 and with reference to "computing devices" in their entirety, no distinction is made between categories such as "workstation," "server," "laptop," "handheld device," and so forth.

Computing device 100 typically includes a variety of computer-readable media. Computer readable media can be any available media that can be accessed by computing device 100 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can accessed by computing device 100. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term "modulated data signal" means a signal that has one or more of its characteristics set or modified in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.

The memory 112 includes computer storage media in the form of volatile and/or nonvolatile memory. The memory may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Computing device 100 includes one or more processors that read data from various entities such as memory 112 or I/O components 120. The presentation component 116 presents data indications to a user or other device. Exemplary presentation components include a display device, speaker, printing component, vibrating component, and the like.

I/O ports 118 allow computing device 100 to be logically coupled to other devices including I/O components 120, some of which may be built-in. Illustrative components include a microphone, joystick, game pad, satellite dish, scanner, printer, wireless device, and the like.

Referring now to FIG. 2, a block diagram is provided illustrating an exemplary system 200 in which embodiments of the present invention may be employed. It should be understood that this and other arrangements described herein are set forth only as examples. Other arrangements and elements (e.g., machines, interfaces, functions, orders, and groupings of functions, etc.) can be used in addition to or instead of those shown, and some elements may be omitted altogether. Further, many of the elements described herein are functional entities that may be implemented as discrete or distributed components or in conjunction with other components, and in any suitable combination and location. Various functions described herein as being performed by one or more entities may be carried out by hardware, firmware, and/or software. For example, various functions may be performed by a processor executing instructions stored in a memory.

Among other components not shown, system 200 includes user device 202, segment 204, and hybrid-distribution system server 206. Each of the components shown in FIG. 2 may be any type of computing device, such as computing device 100 described with reference to FIG. 1, for example. The components may communicate with each other via a network 208, which network 208 may include, without limitation, one or more Local Area Networks (LANs) and/or Wide Area Networks (WANs). Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets and the Internet. It should be understood that any number of user devices, segments, and hybrid distribution system servers may be employed within the system 200 within the scope of the present invention. Each may comprise a single device or multiple devices cooperating in a distributed environment. For example, the segment may include multiple devices arranged in a distributed environment that collectively provide the functionality of the segment 204 described herein. In addition, other components not shown may also be included within the system 200, and the components shown in FIG. 2 may be omitted in certain embodiments.

The user device 202 may be any type of computing device owned and/or operated by a user having access to the network 200. For example, the user device 202 may be a desktop computer, a laptop computer, a tablet computer, a mobile device, or any other device that may access a network. Generally, an end user may employ the user device 202 to, among other things, access electronic documents by submitting a search query to a search engine. For example, an end user may employ a web browser on the user device 202 to access and view electronic documents stored in the system.

The segment 204 typically includes a plurality of nodes, also referred to as blades. In fig. 2, two nodes are illustrated, including node 1, numbered 210, and node 2, numbered 212. Although two nodes are illustrated in the embodiment of fig. 2, a segment may include many more nodes than two (e.g., 10, 40, 100). Two nodes are illustrated for exemplary purposes only. Each segment, such as segment 204, is assigned a set of documents for which it is responsible. Likewise, the reverse index and forward index are generated and stored at segment 204. While the reverse index and forward index for a particular segment may be generated at the segment 204 itself, in alternative embodiments, the index is generated at some other location or on some other computing device and sent to the segment 204. Further, once the reverse index and forward index are generated based on the set of documents assigned to segment 204, the two indexes are divided into portions. In one embodiment, the number of portions is equal to the number of nodes associated with a particular segment. Thus, in the case of 40 nodes in a particular segment, both indexes are divided by 40 ways, such that each node is responsible for a different portion of each of the reverse index and forward index. As shown, node 1 has a reverse index portion 214 and a forward index portion 216. Node 2 also has a reverse index portion 218 and a forward index portion 220. Although shown separately from the nodes, in one embodiment, the indexes are stored on the nodes themselves. In any case, each node is responsible for indexing a portion of the reverse index and forward index based on the set of documents assigned to the segment.

As noted, the index may be indexed or sliced by atom or sliced by document. As used herein, sharding refers to the process of indexing a set of documents by atom or by document. There are pros and cons to use each method separately without the other. For example, when sharding by document, advantages include isolation of processing between shards such that only consolidation of results is required. Furthermore, the information of each document is easily aligned with the matching. Furthermore, network traffic is small. Conversely, disadvantages include the need for each shard to process any particular query. If the reverse index data is placed on disk, a minimum of O (KN) disk lookups (disk seek) are required for K atomic queries over N shards. When sharding by atom, advantages include reduced computations such that only K shards are needed to process K atom queries. If the reverse index data is placed on disk, O (K) disk lookups are required for K atomic queries. However, the disadvantages include, in contrast, the need for connected processing, such that all shards storing atoms participating in a query need to collaborate. In addition to the information of each document being not easily managed, network traffic is also significant. Embodiments of the present invention require less management of data per document than conventional approaches. Reasons for this include that some scores are pre-computed and scored with an index (such as a reverse index), and further refinement and filtering of the document also occurs after the matching stage (L0). Also, the above disadvantages are greatly reduced with respect to the management of data for each document.

In addition, each node in a particular segment can perform various functions, including ranking functions that allow for identification of relevant search results. In some embodiments, the search engine may employ a staged process to select search results FOR the search query, such as the staged approach described in U.S. patent application entitled "MATCHING FUNNEL FOR LARGE DOCUMENT INDEX" (application number not yet assigned) (attorney docket No. mfcp.157120). Here, each node may be able to employ multiple stages of the overall ranking process. An exemplary ranking process is described below, but is merely one example of a ranking process that may be employed by each node. An overall ranking process may be employed when a search query is received to pair the number of matching documents down to a manageable size. When a search query is received, the search query is analyzed to identify atoms. The atoms are then used during various stages of the overall ranking process. These stages may be referred to as the L0 stage (matching stage) to query the search index and identify an initial set of matching documents that contain atoms from the search query. This initial process may reduce the number of candidate documents from all documents indexed in the search index to those documents that match the atoms from the search query. For example, a search engine may search millions or even trillions of documents to determine those that are most relevant to a particular search query. Once the L0 matching phase is complete, the number of candidate documents is greatly reduced. However, many algorithms for locating the most relevant documents are expensive and time consuming. In this way, two other stages may be employed, including a preliminary ranking stage and a final ranking stage.

The preliminary ranking stage, also referred to as the L1 stage, employs a simplified scoring function that is used to calculate a preliminary score or ranking for candidate documents retained from the L0 matching stage described above. The preliminary ranking component 210 is likewise responsible for providing a preliminary ranking for each candidate document retained from the L0 matching stage. Alternatively, the candidate documents may be scored and likewise given an absolute number rather than a ranking. The preliminary ranking stage is simplified when compared to the final ranking stage because it employs only a subset of the ranking features used by the final ranking stage. For example, one or more (but in some embodiments not all) of the ranking features used in the final ranking stage are employed by the preliminary ranking stage. In addition, features not employed by the final ranking stage may be employed by the preliminary ranking stage. In embodiments of the invention, the ranking features used in the preliminary ranking stage do not have atomic interdependencies, such as word compactness and word co-occurrence. For example, the ordering attributes used in the preliminary ordering stage may include static attributes and dynamic atom isolated components (atom-isolated components), for exemplary purposes only. Generally, static features are those components that only investigate query-independent features. Examples of static features include page order, spam ranking of particular web pages, and the like. A dynamic atom isolation component is a component that only looks at features related to a single atom at a time. Examples may include, for example, the BM25f, the frequency of an atom in a document, the location (context) of an atom in a document (e.g., title, URL, anchor, header, body, service, class, attribute), and so forth.

Once the number of candidate documents has been reduced again by the preliminary ranking stage, a final ranking stage, also referred to as the L2 stage, ranks the candidate documents provided to it by the preliminary ranking stage. The algorithm used in conjunction with the final ranking stage is a more expensive operation with a greater number of ranking features when compared to the ranking features used in the preliminary ranking stage. However, the final ranking algorithm is applied to a much smaller number of candidate documents. The final ranking algorithm provides a set of ranked documents and provides search results in response to the original search query based on the set of ranked documents. In some embodiments, the final ranking stage as described herein may employ a forward index, as described in U.S. patent application entitled "EFFICIENT FORWARD RANKING IN A SEARCHENGINE (efficient forward ranking in search engines)" (attorney docket number mfcp. 157165), which has not been assigned.

Returning to FIG. 2, hybrid-distribution system server 206 includes document assignment component 222, query parsing component 224, query distribution component 226, and result merging component 228. The document assignment component 222 is generally responsible for assigning documents to the various segments used in a given ranking system. For exemplary purposes only, if there are 1 million documents that need to be indexed and there are 100 segments available, 1 million documents may be assigned to each segment. Alternatively, on a larger scale, if there are 1 trillion documents that need to be indexed and there are 100,000 segments available, 1 million documents may be assigned to each segment. As illustrated with the above example, the documents may be evenly distributed among the segments, or may be distributed in a different manner such that each segment does not have exactly the same number of documents for which it is responsible.

For example, when a search query is received via a user interface on the user device 202, the query parsing component 224 operates to reformat the query. The query is reformatted from its free-text form into a format that facilitates querying the search index, such as reverse index and forward index, based on how the data is indexed in the search index. In embodiments, the terms of the search query are parsed and analyzed to identify atoms that may be used to query the search index. Atoms may be identified using similar techniques used to identify atoms in documents when indexing documents in a search index. For example, atoms may be identified based on query distribution information and statistics of terms. The query parsing component 224 can provide a set of connections of atoms and cascading variants of these atoms.

An atom or atomic unit, as used herein, may refer to a variety of units of a query or document. These units may include, for example, words, n-grams (n-grams), n-tuples (n-tuple), near k n-tuples (k-near n-tuple), and so on. The words are mapped down to a single symbol or word defined by the particular tokenizer technique being used. In one embodiment, the words are simple characters. In another embodiment, the term is a single word or group of words. An n-gram is a sequence of "n" consecutive or nearly consecutive terms that can be extracted from a document. An n-gram is said to be "compact" if it corresponds to a run of consecutive words, and "loose" if it contains words in the order in which they appear in the document, but the words are not necessarily consecutive. Loose n-grams are typically used to represent a class of equivalent phrases that differ by insignificant words (e.g., "if it rains, I will get wet" and "if it rains, I will get wet"). An n-tuple, as used herein, is a set of "n" words that appear together (order independent) in a document. Further, a near k n tuple, as used herein, refers to a set of "n" words that appear together within a window of "k" words in a document. Thus, an atom is generally defined as a generalization of all of the above. Implementations of embodiments of the present invention may use different kinds of atoms, but as used herein, atoms generally describe each of the various kinds described above.

The query distribution component 226 is essentially responsible for receiving and distributing submitted search queries among the segments. In one embodiment, each search query is distributed to each segment such that each segment provides a preliminary set of search results. For example, when a segment receives a search query, the segment or a component within the segment determines which nodes are to be tasked with performing a preliminary ranking function that utilizes the reverse index portions stored on the nodes. In one case, the selected nodes that are part of the first set of nodes are those whose reverse index has indexed one or more of the atoms that have been parsed from the search query, as described above. Thus, when the search query is reformatted, one or more atoms are identified and sent to each segment. Each of the first set of nodes returns a first set of documents found to be relevant to the search query based on a preliminary ranking function, as briefly described above. Subsequently, a second set of nodes is determined. In one embodiment, each of these nodes has stored in its respective forward index at least one of the documents in the first set of documents. Each of the second set of nodes performs a final ranking function using the forward index data and other considerations, and as a result, a second set of documents is identified. In one embodiment, each document in the second set is included in the first set because the document identification associated with the first set of documents is used in the final ranking stage.

The result merging component 228 is given the search results (e.g., document identifications and snippets) from each segment and forms a merged final search result list from those results. There are various ways to form the final search result list, including simply removing any duplicate documents and placing each document in the list in an order determined by the final ranking. In one embodiment, a component similar to result merging component 228 exists on each segment such that the results produced by each node are merged into a single list at the segment before the list is sent to result merging component 228.

Turning now to FIG. 3, an exemplary diagram of a hybrid-distribution system 300 is shown, in accordance with an embodiment of the present invention. FIG. 3 illustrates various components, including a principal manager 310, a principal root 312, and two segments, segment 314 and segment 316. More than two segments may be provided, as indicated by ellipsis 310. The principal manager 310 maintains state of which processes serve which shard of the forward index and reverse index. It also maintains the temperature and state of each process. This data is used to generate a set of processes for federating queries with different segments. The subject root 312 is a top-level root process that also performs query planning functions. The subject root 312 will scatter the query and collect and merge results across all required segments, and may include custom logic. Each segment has a segment root, such as segment root 320 and segment root 322. The segment root acts as a process for federation of queries and aggregation of results from the processes being federated. The fragment root 322 is likely a dynamic process that is reassigned to the leaf or node that is best for the final query assembly (assembly).

As shown, each segment root includes a plurality of nodes. Due to spatial constraints, three nodes are illustrated for segment root 320 and segment root 332. Segment root 320 includes node 322, node 324, and node 326. Ellipses 328 indicate that more than three nodes are contemplated to be within the scope of the present invention. Segment root 332 includes node 334, node 336, and node 338. Ellipses 340 indicate any additional number of nodes, as any number of nodes may constitute a segment root. As noted, each node is a machine or computing device capable of performing multiple computations (such as ranking functions). For example, in one embodiment, each node includes an L01 matcher 322A and an L2 ranker (ranker)322B as shown at node 322. Likewise, node 334 includes an L01 matcher 334A and an L2 ranker 334B. These are described in more detail above, but the L0 matching and L1 ranking stages (preliminary ranking stages) of the overall ranking process may be combined and collectively referred to as the L01 matcher. Since each node includes an L01 matcher and an L2 ranker, each node must also have stored a portion of the reverse index and forward index, since in one embodiment, the L01 matcher utilizes the reverse index and the L2 ranker utilizes the forward index. As mentioned, each node may be assigned a portion of the reverse and forward indices that belong to the segment. The segment communication bus 330 associated with segment 314 and the segment communication bus 342 associated with segment 316 allow each node to communicate with, for example, the segment root when necessary.

Fig. 4 is an exemplary diagram of a hybrid-distribution system 400 illustrating payload requirements, in accordance with an embodiment of the present invention. System 400 is an illustration of a single segment root 410 having multiple nodes. Here, six nodes (including nodes numbered 412, 414, 416, 418, 420, and 422) are illustrated. Although six nodes are illustrated, any number may be utilized to implement embodiments of the present invention. As indicated previously, each node has the functionality to perform various sort calculations, including those in the matching phase (L0), the preliminary sort phase (L1), and the final sort phase (L2). As such, the node 412 has, for example, both an L01 matcher 412A for the L0 and L1 stages described herein and an L2 ranker 412B for the L2 stage described herein. However, the payloads for the different stages may vary widely. To better illustrate this, the payload for the L01 matcher is shown in a first pattern, as indicated by numeral 424, and the payload for the L2 matcher is shown in a second pattern, as indicated by numeral 426.

A group of documents assigned to a particular segment is indexed or sharded by atom (reverse index) and by document (forward index). The indices are divided into equal parts as the number of nodes that make up the particular segment. In one embodiment, there are forty nodes, and thus, each of the reverse index and the forward index is divided into forty portions and stored at each of the nodes. When a search query is submitted to a search engine, the query is sent to each segment. It is the responsibility of the segment to identify a first set of nodes whose reverse index has one or more of the atoms from the indexed query. Using this approach, if the query is resolved into two atoms, such as "William" and "Shakespeare" from the query "William Shakespeare", the maximum number of nodes in the fragment that will be responsible for the L01 matcher will be two. This is shown in fig. 4 because the L01 matchers associated with node 412 and node 416 are the only two identified for use in the L01 matching process. Since the documents assigned to each segment are indexed by atom, each atom is indexed only once in the reverse index, such that any particular atom appears only in one of the reverse index portions assigned to the nodes of the segment. In an exemplary scenario, once the first set of nodes is identified, atoms from the search query that match atoms in the reverse index are sent to the appropriate nodes. The node performs a number of calculations so that a set of documents is identified. This first set of documents, in one embodiment, includes those documents that have received the highest ranking from the preliminary ranking stage.

This first set of documents is collected at segment root 410 from each node from the first set of nodes, including nodes 412 and 416. These results are combined in any of a number of ways so that the segment root 410 can next identify a second set of nodes to be used in conjunction with the final ranking stage. As shown, each L2 sorter is used in the final sort stage or L2 stage. This is because each node has stored a portion of the forward index for that segment, so there is a good chance that most or all of the forward indices will need to be accessed in the final stage of the sorting. In the final ranking stage, each node in the second set of nodes is given the document identification it contains in its forward index so that the node can rank that document based at least on the data found in the forward index. Since most or all of the nodes are used in the final ranking stage, the payload for the final ranking stage is typically larger than the payload for the matching/preliminary ranking stage, as shown in the system 400 of fig. 4. The segment communication bus 428 allows nodes to communicate with other components, such as the segment root 410.

Referring to FIG. 5, a flow diagram illustrates a method 500 for utilizing a hybrid-distribution system for identifying relevant documents based on a search query, in accordance with an embodiment of the present invention. Initially, a set of documents is assigned to a segment at step 510. The set of documents is both indexed by atom in the reverse index and by document in the forward index before or after the segment is received, as indicated by step 512. Thus, documents indexed by the forward index are documents that include a set of documents assigned to the segment and atoms in the reverse index are parsed from the content of the documents. At step 514, a portion of the reverse index and the forward index are stored at each node in the segment. Generally, a segment includes a plurality of nodes. Each node is a machine or computing device capable of performing ranking calculations based on the reverse index portion and the forward index portion stored thereon. In one embodiment, each node stores different or unique portions of the reverse index and forward index of the segment.

Step 516 indicates accessing the reverse index portion at each node of the first set of nodes. Each node in the first set of nodes has been identified as having indexed one of the atoms of the received search query. A first set of documents is identified at step 518. In one embodiment, the documents have been ranked using a preliminary ranking function so that the most relevant documents can be identified. This step may, for example, correspond to the L1 preliminary ranking stage and/or the L0 matching stage. The forward index portion is accessed at each of a second set of nodes based on document identifications associated with documents in the first set of documents, shown at step 520. This step may correspond to the L2 final sort stage. This effectively limits the number of relevant documents for a particular search query. Thus, the number of documents is limited to the second set of documents, shown at step 522. In many or most cases, the number of nodes in the second set is greater than the number of nodes in the first set, as described in more detail above. This is because the search query may have only two atoms, such that at most two nodes are required for the L01 matching stage, but thousands of documents are identified as being related to two atoms of the search query, so that many more nodes may be employed to perform the final ranking calculation using their respective forward indices to identify a second set of documents. Further, in an embodiment, since the final ranking function utilizes the document identifications generated from the preliminary ranking function, the number of documents in the second group is less than the number of documents in the first group, such that each document in the second group is also included in the first group.

In one embodiment, the overall process may involve receiving a search query. One or more atoms in the search query are identified, and once each segment is aware of the one or more atoms, a first set of nodes is identified in the segment that contains at least one of the one or more atoms from the search query. For example, each node in the first set of nodes sends a first set of documents (e.g., document identifications) to the segment root, enabling the segment root to join (e.g., de-duplicate) and merge the results. The second set of nodes then sends a second set of documents to the segment root. Likewise, the segment root joins and merges the results to produce a final set of documents that are presented to the user in response to the search query.

Turning to FIG. 6, a flow diagram is illustrated of a method 600 for generating a hybrid-distribution system for a multi-process document retrieval system, in accordance with an embodiment of the present invention. At step 610, an indication of a set of documents is received, the set of documents being assigned to the segment in which the set of documents was received. The segment includes a plurality of nodes (e.g., ten, forty, fifty). The set of documents is indexed by atom to generate a reverse index, shown at step 612. At step 614, the set of documents is indexed by document to generate a forward index. At step 616, a portion of the reverse index and a portion of the forward index are assigned to each node comprising the segment. In an embodiment, each node is assigned a different portion of the reverse and forward indexes such that a particular atom is indexed in the forward index of only one node within the segment.

In an embodiment, at a segment, an indication of one or more atoms that have been identified from a search query is received. A first set of nodes is identified whose reverse index portion includes at least one of the one or more atoms. Each of these nodes is capable of performing various sequencing functions. A first set of documents is identified based on the reverse index portion of the first set of nodes. Each node in the first group may generate and send the first group to the segment root so that the various first group nodes can be joined and merged. In one example, a first set of documents is generated via a preliminary ranking process that utilizes a multi-stage ranking process of the reverse index portion stored thereon. In addition, a second set of nodes may then be identified whose forward index portion has indexed one or more document identifications corresponding to the first set of documents. A second set of documents may then be identified based in part on the data stored in the forward index, and may be characterized in real-time rather than using pre-computed scores. The second set of documents may be identified based on a final ranking process of the multi-stage ranking process utilizing the forward index. Once the second set of documents from each node in the second set of nodes is joined and merged, it is also merged with the second set of documents from all other segments so that a final set of documents is formed and returned to the user as search results.

FIG. 7 is a flow diagram illustrating a method 700 for utilizing a hybrid-distribution system to identify relevant documents based on a search query, in accordance with embodiments of the present invention. Initially, at step 710, a search query is received. In one embodiment, the query is supplemented or modified, such as using a spelling correction tool or stemming. Atoms in the search query are identified at step 712. At step 714, the atoms are transferred to various fragments. Each segment has been assigned a set of documents that are indexed both atomically and by document to form a reverse index and a forward index that are stored at each segment. Each segment includes a plurality of nodes, each node assigned a portion of a reverse index and a forward index. At step 716, a first set of nodes is identified whose reverse index portion contains at least one of the atoms from the search query. At step 718, the reverse index portion of each node in the first set of nodes is accessed to identify a first set of related documents. Based on the document identification associated with each document in the first set of documents, a second set of nodes is identified at step 720. Each node in the second set of nodes has stored at least one of the document identifications in its respective forward index portion so that the node can perform a ranking process on each document. The forward index portion at each node in the second set of nodes is accessed at step 722 to limit the number of relevant documents. In one embodiment, each document in the second set of documents is also included in the first set of documents. Based on the second set of documents, search results are generated (e.g., by compiling the second set of documents from the plurality of segments) and presented to the user.

Referring now to FIG. 8, a method 800 for assigning a segment root to each segment of a subject root is provided. In embodiments such as those described above, the segment root may take the form of a preliminary segment root and a final segment root. For example, the preliminary segment root may be selected based on information known at the time (such as the current or expected load of each node in the segment), or may even be selected randomly, such as on a recurring schedule. The final segment root is selected based on a number of factors and will be discussed in more detail below. In general, the segment root acts as a process for federation of queries and aggregation of results from the processes being federated. The process of selecting preliminary and final segment roots, such as segment root 322 shown in fig. 3, may be a dynamic process. For example, the process of assigning preliminary and final segment roots may occur for each received query and across multiple segments simultaneously. The node selected for the final segment root is considered optimal for the final query compilation.

As shown in fig. 3 herein, each segment root includes a plurality of nodes. Due to spatial constraints, three nodes are illustrated for segment root 320 and segment root 332. Segment root 320 includes node 322, node 324, and node 326. Ellipses 328 indicate that more than three nodes are contemplated to be within the scope of the present invention. Segment root 332 includes node 334, node 336, and node 338. Ellipses 340 indicate any additional number of nodes, as any number of nodes may constitute a segment root. As mentioned, each node is a machine or computing device capable of performing multiple computations (such as ranking functions). For example, in one embodiment, each node includes an L01 matcher 322A and an L2 ranker 322B as shown at node 322. Likewise, node 334 includes an L01 matcher 334A and an L2 ranker 334B. These are described in more detail above, but the L0 matching and L1 ranking stages (preliminary ranking stages) of the overall ranking process may be combined and collectively referred to as the L01 matcher. Since each node includes an L01 matcher and an L2 ranker, each node must also have stored a portion of the reverse index and forward index, since in one embodiment, the L01 matcher utilizes the reverse index and the L2 ranker utilizes the forward index. As mentioned, each node may be assigned a portion of the reverse and forward indices that belong to the segment. The segment communication bus 330 associated with segment 314 and the segment communication bus 342 associated with segment 316 allow each node to communicate with, for example, the segment root when necessary.

Returning to FIG. 8, a search query is initially received at step 810. Referring to FIG. 3, a search query may be received at a subject root 312, which subject root 312 then distributes the search query, or portions thereof, to segments. At step 812, a set of nodes in the segment is identified based on whether each node will be used to decompose the search query. As mentioned, each node is assigned a set of documents from which a reverse index and a forward index are generated. Thus, depending on the atoms in the search query, some nodes will be used for a particular query and some nodes will not. The set of nodes identified at step 812 is identified based on the hash of the nodes or atoms in the query that will be used to resolve the particular query. The hash function takes the terms or atoms in the search query and determines which nodes have stored thereon a list of records (posting list) with that particular term or atom. This allows for the identification of the nodes that will be used to resolve the particular search query. The list of records is simply a list of terms and those documents that contain the terms. An algorithm may be used for the hash function. Exemplary algorithms include MD5 or CRC, but others are certainly contemplated within the scope of the present invention.

At step 814, a preliminary segment root is selected from the set of nodes. In one embodiment, the preliminary segment roots are selected randomly, such as cyclically. However, in another embodiment, the preliminary segment root is selected based on expected loads such that the node with the lowest expected load is selected as the preliminary segment root. In one example, this may be the node with the lowest number of recorded outstanding requests (such as current load), such that the node with the lowest current load of outstanding queries is selected as the preliminary segment root. Once the preliminary segment root is selected from the set of nodes, statistics from each node in the set of nodes are received at the preliminary segment root, shown at step 816. In one embodiment, the preliminary segment root requests that this data be sent by means of, for example, a communication bus connected to each node. The statistics may indicate the ability of each node to act as a final segment root. The final segment root is responsible for compiling query execution results from the set of nodes based on the search query. For exemplary purposes only, the statistics may include the length of the record list for each node, the input/output load, the problem signal associated with a particular node, or the amount of data that will be required to be transferred to the final segment root. Generally, the node that is considered to incur the least cost (e.g., time, money) when acting as the final segment root is selected.

For example, if a particular node has an extremely long list of records, then a large amount of data would need to be transmitted across the network to the final segment root, enabling the final segment root to aggregate query extraction data from all nodes. In one embodiment, this particular node with a large amount of data to transmit may be selected as the final segment root so that its data does not have to be sent to another node, which would result in a high cost data transmission. As previously mentioned, the node sending out the problem signal may have some problems. Many types of auxiliary signals may indicate that the performance of the node to fetch data is impaired. Also as described, input/output load may be considered when selecting the final segment root. This may cover, for example, the length of the node queue that extracts the data from the hard disk. Furthermore, if there are three nodes whose record list includes the word "dog", then when a query is received that also includes the word "dog", the node with the lowest load may be selected because it will have more time to act as the final segment root. Other factors including bandwidth may also be included when determining the final segment root. In one example, the preliminary segment root actually executes an algorithm that determines the final segment root.

At step 818, a final segment root is algorithmically selected from the set of nodes based on the statistical data. In some embodiments, the preliminary segment root and the final segment root are the exact same node, but in other embodiments they are different nodes. An algorithm may be used to make the determination as to which node will be used as the final segment root for a particular query. The query will take into account the above statistics. At step 820, the group of nodes is informed of the identity of the final segment root so that the nodes know where to send their respective query execution results. In one embodiment, the preliminary segment root itself undertakes the task of transmitting itself to the final segment root, or if it is selected as the final segment root, the preliminary segment root may communicate to other nodes that it is now the final segment root.

In an embodiment, the search query is executed using the set of nodes previously identified. As described herein, a first set of nodes may be identified as participating in a preliminary ranking stage (e.g., utilizing reverse indexes stored on the nodes), and a second set of nodes may be identified as participating in a final ranking stage (e.g., utilizing forward indexes stored on the nodes). The final segment root may be used to collect and aggregate data from both the preliminary and final ranking stages. For example, the nodes involved in the preliminary ranking stage return a list of documents that contain a certain term or atom in the search query. The nodes involved in the final ranking stage return documents, such as document identifications, that are most relevant to the search query. As such, the query execution results may refer to results from the preliminary ranking stage, the final ranking stage, or both. Still further, a multi-step ordering process may not be used. In instances where there is a single ranking or search process, a single set of results is collected and aggregated by the final segment root.

The method described in fig. 8 may be used as a system. For example, various system components may be used to select the preliminary and final segment roots. For exemplary purposes only, these components may include a preliminary segment root selection component, a statistics receiving component, a final segment root selection component, and a query execution component. These components may communicate with each other over a network, such as network 208 shown in FIG. 2. The preliminary segment root selection component is responsible for aggregating data and performing a hash calculation to determine which nodes in the particular segment will be used to execute the search query. As described, each node has a portion of a record list and a search index. The list of records includes the atom and the document in which the atom exists. The statistics receiving component may request and receive statistics from the nodes indicating the availability and ability of each node to act as a final segment root. The final segment root selection component is responsible for selecting the final segment root according to statistics received from the nodes that will be used to execute the search query. In one embodiment, the preliminary segment root uses an algorithm to make this determination. Finally, the query execution component distributes the search query, or portions thereof, to the final segment root, which then distributes the search query portions to the appropriate nodes in the segment. The node determines which documents are most relevant to the query or portion of the query and transfers the data to the final segment root by means of, for example, a communication bus. As described, this process occurs simultaneously for multiple fragments.

Turning to FIG. 9, a method 900 for selecting a segment root from a plurality of nodes is illustrated. Initially, at 910, a search query is received at a segment that includes a plurality of nodes. As noted, there are multiple segments (e.g., hundreds of segments) that each execute a search query simultaneously. Each segment includes a plurality of nodes, each of which is assigned a portion of a document indexed with one or more search indexes. Thus, each node has stored a portion of the reverse and forward indexes, indexed by atom and document, respectively. At step 912, a set of nodes from the plurality of nodes is identified as being used to execute the received search query. The determination as to which nodes are to be used may be made, for example, using a hash function. This may depend on the index stored on each node, such that those nodes whose index already stores a particular term or atom in the search query are identified as being used to resolve the particular search query. At step 914, a preliminary segment root is selected from the plurality of nodes prior to executing the search query. The selection of the preliminary segment root is based on the expected load of each node, the current load of each node, a random selection, and the like. Information available at the time of preliminary segment root selection is used to make the selection. The preliminary segment root acts as the preliminary segment root until the final segment root is selected.

Statistics are received at the selected preliminary segment root at step 916. The statistics may be requested from each node in the set of nodes used to execute the search query prior to being received and also received at the preliminary segment root. As previously described, the statistics may include one or more of the length of the record list for each node or other data that will have to be transmitted across the domain network, the input/output load on the node (such as how long the queue length will be extracting data from the hard disk), problem signals associated with the node, costs, and the like. At step 918, a final segment root is selected based on the received statistics. The final segment root collects and aggregates query execution data from the set of nodes during query execution. In one embodiment, the final segment root accepts search queries from an external source, such as a server. The final segment root may also or alternatively accept search queries from a subject root, such as subject root 312 shown in FIG. 3. At step 920, a search query is executed. As noted, there may be one or more phases of query execution, such as a preliminary ranking phase and a final ranking phase.

FIG. 10 illustrates a method 1000 for selecting a segment root. At a subject root having one or more segments, a search query is received at step 1010. Each segment in the subject root has a plurality of nodes, each having a portion of the search index stored thereon. For example, each node may have stored thereon a portion of the reverse index organized atomically and a portion of the forward index organized by documents. This may be the case when multiple ranking or search stages are used to provide search results based on a search query. At step 1012, a set of nodes to be used to execute the received search query is identified in each segment. Multiple segments simultaneously perform the process of selecting the preliminary and final segment roots. Further, this process may occur for each search query received at the subject root. At step 1014, for each segment in the subject root, a preliminary segment root is identified from the set of nodes. Statistics are requested, at step 1016, from each node in the set of nodes that will be used to execute the received search query, e.g., by the preliminary segment root. The statistics include any data indicating the ability or availability of a node to be selected as the final segment root. The overall goal is to transfer as little data as possible over the network, e.g., from the node to the final segment root. The most cost effective node is selected as the final segment root. In some instances, the overall goal may be adjusted more toward load, but in other instances, the preliminary segment root may be more favorable to network capacity.

At step 1018, statistics are received from each node in the set of nodes at the preliminary segment root. As described, the statistics indicate the availability of each node to act as a final segment root that collects query execution data from the set of nodes in its respective segment. In one embodiment, a communication bus is used to communicate statistics and other data from the nodes to the preliminary or final segment root. A final row segment root is selected for each segment based on the statistics at step 1020. At step 1022, a search query is executed. In some embodiments, prior to executing the search query, an end message is transmitted to the plurality of nodes, or at least the identified set of nodes indicating the final segment root, so that the nodes know where to send their respective data including the query execution data. The final segment root receives query execution data from each node in the set of nodes after executing the query.

The present invention has been described in relation to particular embodiments, which are intended in all respects to be illustrative rather than restrictive. Alternative embodiments will become apparent to those skilled in the art to which the present invention pertains without departing from its scope.

From the foregoing, it will be seen that this invention is one well adapted to attain all the ends and objects set forth above, together with the advantages which are obvious and inherent to the system and method. It will be understood that certain features or sub-combinations are of utility and may be employed without reference to other features and sub-combinations. This is contemplated by and is within the scope of the claims.

Claims

1. A method (800) for assigning a segment root, the method comprising:

receiving (810) a search query;

identifying (812) a set of nodes in a segment to be used to resolve a search query, the segment assigned a set of documents, the set of documents indexed by atom with a reverse index and indexed by document with a forward index, wherein respective portions of the reverse index and forward index are assigned to each node in the set of nodes, and wherein the reverse index is used for a preliminary ranking process and the forward index is used for a final ranking process;

selecting (814) a preliminary segment root from the set of nodes;

receiving (816) statistical data from each node in the identified set of nodes at the preliminary segment root, the statistical data indicating a capability of each node to act as a final segment root responsible for compiling query execution results from the set of nodes based on the search query;

algorithmically selecting (818) a final segment root from the set of nodes based on the statistical data; and

the final segment root is notified (820) to the set of nodes so that the nodes know where to send their respective query execution results.

2. The method of claim 1, wherein each node in the set of nodes has stored thereon a portion of a search index used to execute the search query.

3. The method of claim 1, wherein the preliminary segment root is selected based on expected loads such that a node with a lowest expected load is selected as the preliminary segment root.

4. The method of claim 1, wherein the preliminary segment root is selected based on current load such that a node with the lowest current load of outstanding queries is selected as the preliminary segment root.

5. The method of claim 1, wherein the preliminary segment root and the final segment root are selected for each segment in response to a received search query.

6. The method of claim 1, wherein the statistical data includes one or more of a length of a record list for each node, an input/output load, a problem signal associated with a particular node, or an amount of data to be required to be transmitted to a final segment root.

7. A method (900) for assigning a segment root, the method comprising:

receiving (910), at a segment comprising a plurality of nodes, a search query to be executed;

identifying (912), from the plurality of nodes in the segment, a set of nodes to be used to perform a search query, the segment assigned a set of documents, the set of documents indexed by atom with a reverse index and indexed by document with a forward index, wherein respective portions of the reverse index and forward index are assigned to each node in the set of nodes, and wherein the reverse index is used for a preliminary ranking process and the forward index is used for a final ranking process;

selecting (914) a preliminary segment root from the plurality of nodes prior to executing the search query, the selecting based on one or more of an expected load of each node or a random selection;

receiving (916) statistical data at a preliminary segment root from each node in the set of nodes to be used to perform the search query, wherein the statistical data includes a current load and cost data associated with transmitting data across the network;

selecting (918), based on the statistical data, a final segment root from which query execution data is to be aggregated from the set of nodes during query execution; and

a search query is executed (920).

8. The method of claim 7, wherein each node of the plurality of nodes has stored thereon a portion of a search index that is used to identify relevant documents based on a search query.

9. The method of claim 7, wherein each of the plurality of nodes has stored thereon a portion of a reverse index indexed by atom and a portion of a forward index indexed by document.

10. A method (1000) for assigning a segment root, the method comprising:

receiving (1010) a search query at a subject root comprising a plurality of segments, wherein each of the plurality of segments comprises a plurality of nodes each having a portion of a search index stored thereon, the search index comprising a set of documents indexed by atom with a reverse index and indexed by document with a forward index, and wherein each of the plurality of nodes has stored thereon a respective portion of the reverse index and forward index assigned to each node, and wherein the reverse index is used for a preliminary ranking process and the forward index is used for a final ranking process;

identifying (1012) a set of nodes in each segment to be used to execute the received search query;

identifying (1014), for each of the plurality of segments, a preliminary segment root from the set of nodes;

requesting (1016) statistical data from each node in the set of nodes to be used to execute the received search query;

receiving (1018) statistical data from each node in the set of nodes, the statistical data indicating an availability of each node to act as a final segment root that collects query execution data from the set of nodes in its respective segment;

selecting (1020) a final segment root for each segment based on the statistical data; and

the search query is executed (1022).

11. An apparatus for assigning a segment root, the apparatus comprising:

means for receiving (810) a search query;

means for identifying (812) a set of nodes in a segment to be used to resolve a search query, the segment assigned a set of documents, the set of documents indexed by atom with a reverse index and indexed by document with a forward index, wherein respective portions of the reverse index and forward index are assigned to each node in the set of nodes, and wherein the reverse index is used for a preliminary ranking process and the forward index is used for a final ranking process;

means for selecting (814) a preliminary segment root from the set of nodes;

means for receiving (816), at the preliminary segment root, statistical data from each node in the identified set of nodes, the statistical data indicating a capability of each node to act as a final segment root responsible for compiling query execution results from the set of nodes based on the search query;

means for algorithmically selecting (818) a final segment root from the set of nodes based on the statistical data; and

means for notifying (820) the set of nodes of the final segment root so that the nodes know where to send their respective query execution results.

12. An apparatus for assigning a segment root, the apparatus comprising:

means for receiving (910), at a segment comprising a plurality of nodes, a search query to be executed;

means for identifying (912) a set of nodes to be used to perform a search query from the plurality of nodes in the segment, the segment assigned a set of documents indexed by atom in a reverse index and indexed by document in a forward index, wherein a respective portion of the reverse index and forward index are assigned to each node in the set of nodes, and wherein the reverse index is used for a preliminary ranking process and the forward index is used for a final ranking process;

means for selecting (914) a preliminary segment root from the plurality of nodes prior to executing the search query, the selecting based on one or more of an expected load of each node or a random selection;

means for receiving (916), at a preliminary segment root, statistical data from each node in the set of nodes to be used to perform the search query, wherein the statistical data includes a current load and cost data associated with sending data across the network;

means for selecting (918), based on the statistical data, a final segment root that will aggregate query execution data from the set of nodes during query execution; and

means for performing (920) a search query.

13. An apparatus for assigning a segment root, the apparatus comprising:

means for receiving (1010) a search query at a subject root comprising a plurality of segments, wherein each of the plurality of segments comprises a plurality of nodes each having a portion of a search index stored thereon, the search index comprising a set of documents indexed by atom with a reverse index and indexed by document with a forward index, and wherein each of the plurality of nodes has stored thereon a respective portion of the reverse index and forward index assigned to each node, and wherein the reverse index is used for a preliminary ranking process and the forward index is used for a final ranking process;

means for identifying (1012) a set of nodes in each segment to be used to execute the received search query;

means for identifying (1014), for each of the plurality of segments, a preliminary segment root from the set of nodes;

means for requesting (1016) statistics from each node in the set of nodes to be used to execute the received search query;

means for receiving (1018) statistical data from each node in the set of nodes, the statistical data indicating an availability of each node to act as a final segment root that collects query execution data from the set of nodes in its respective segment;

means for selecting (1020) a final segment root for each segment based on the statistical data; and

means for executing (1022) the search query.