HK1174111B - Filtering queried data on data stores
Description
Technical Field
The invention relates to filtering queried data on data stores.
Background
Within the field of computing, many scenarios involve queries to be applied to data sets stored by one or more data stores. For example, a user or data-driven process may request a particular subset of data by submitting to the data store a query specified in a query language, such as Structured Query Language (SQL). The data store may receive the query, process the query using a query processing engine (e.g., a software pipeline that includes components performing various parsing operations on the query, such as associating names in the query with named objects of a database and identifying the operations specified by various operators), apply the operations specified by the parsed query to the stored data, and return the query results specified by the query. The query results may include a set of records specified by the query, a set of attributes of the records, and results computed from the data (e.g., a count of records matching particular query criteria). The results may also include reports of actions taken with respect to the stored data, such as creating or modifying a table, or inserting, updating, or deleting records in a table.
In many such cases, the database may be distributed over several, and possibly a large number of, data stores. For example, in a distributed database, different portions of the stored data may be stored in one or more data stores in a server farm. When a query is received to be applied to a data set, the machine receiving the query may identify which data stores may contain the data targeted by the query, and may send the query to one or more of those data stores. Each such data store may apply the query to the data stored therein, and may send the query results back. If the query is applied by two or more data stores, the query results may be combined to generate an aggregated query result. In some cases, one machine may coordinate the process of distributing the query to the involved data stores and aggregating the query results. Techniques such as the MapReduce framework have been designed to enable such distribution and aggregation in an efficient manner.
The database engines used by such data stores can be quite complex, and can apply many complex computational processes to the stored data, such as database transactions, logging, execution of stored procedures, and acceptance and execution of agents. The query language itself may add to the complexity of the queries to be processed by the data store, through features such as nesting, computationally intensive similarity comparisons of strings and other data types, and modifications to the structure of the database. In addition, the logical processes applied by the query processing engine of the data store can answer complex queries in an efficient manner, and can even improve the query by using techniques such as query optimization. As a result of these and other processes, the evaluation of queries by the data store may consume a significant amount of computing resources.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key factors or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
While it may be advantageous to equip a data store with a complex query processing engine that can handle complex transactions, some disadvantages may also arise. In particular, configuring the data store to perform complex queries on locally stored data may be disadvantageous or inefficient. For example, a data store may happen to store particularly high-demand data, and its query processing engine may be taxed by the application of complex queries to the stored data while other queries (some of which may be simple) remain pending. Thus, complex queries may create bottlenecks that reduce the capacity and throughput of query evaluation.
As a second example, a distributed database architecture in which the data store also performs complex queries may compromise some security principles, in that the machine that stores the data is also permitted to perform potentially dangerous or malicious operations on the data. Additionally, the query processing engine may even permit execution of arbitrary code on the stored data (e.g., an agent scenario where an executable module is received from a third party and executed against the stored data). The security principle of separating the storage of data (on a first set of machines) from the execution of complex computations, including arbitrary code, on the data (distributed to a second set of machines) may present several security advantages, such as isolating the stored data from machines that may become compromised.
These and other advantages may result from removing the complex processing of data from the data store (e.g., a machine of a server farm configured to store data of a distributed database). However, it may also be disadvantageous to configure a data store with no processing capability at all, e.g., as a data store that is used purely as a data storage device, capable only of providing requested data objects (e.g., entire tables) or making specified changes thereto. For example, another machine may request only a subset of data from the data store, such as a subset of records from a table that satisfy particular filter criteria. However, if the request specifies only a small number of records in a table containing many records, sending the entire table may be overly inefficient, particularly given the bandwidth constraints between the machine and the data store in a networked environment.
Presented herein are techniques for configuring a data store to fulfill requests for data stored therein. According to these techniques, the data store does not utilize a query processing engine, which may incur significant computational costs, reduce performance in fulfilling requests, and/or permit execution of arbitrary code on the stored data. However, the data store is still able to provide only a subset of the data stored therein. The data store achieves this result by accepting requests that specify one or more filter criteria, each of which reduces the requested amount of data in a particular manner. For example, the request may include a filter criterion specifying a particular filter criterion value, and may request only the records having that filter criterion value for the particular filter criterion (e.g., in a data store configured to store data representing events, the filter criterion may identify the type of event or the time at which the event occurred). Thus, the request specifies only various filter criteria, and the data store is capable of providing data that satisfies the filter criteria, but is not configured to process queries that may specify complex operations. This configuration may thus facilitate partitioning a distributed database into a set of data nodes configured to store and provide data and a set of compute nodes capable of applying complex queries (including arbitrary code).
To the accomplishment of the foregoing and related ends, the following description and annexed drawings set forth certain illustrative aspects and implementations. These aspects and implementations are indicative of but a few of the various ways in which one or more aspects may be employed. Other aspects, advantages, and novel features of the invention will become apparent from the following detailed description when considered in conjunction with the drawings.
Drawings
FIG. 1 is a diagram representing an exemplary scenario in which a query is applied to a data set distributed over several data stores.
FIG. 2 is an illustration of an exemplary scenario featuring an application of a request for data from a data set stored by a data store.
FIG. 3 is an illustration of an exemplary scenario featuring a request to apply data from a data set stored by a data store featuring at least one filter criterion in accordance with the techniques presented herein.
FIG. 4 is a flow diagram illustrating an exemplary method of fulfilling a request targeting a data set.
FIG. 5 is a flow diagram illustrating an exemplary method of applying a query to a data set stored by a data store.
FIG. 6 is an illustration of an exemplary computer-readable medium comprising processor-executable instructions configured to embody one or more of the provisions set forth herein.
FIG. 7 is an illustration of an exemplary scenario featuring data items stored in indexed data item sets.
FIG. 8 is an illustration of an exemplary scenario featuring data items stored in a partitioned data set.
FIG. 9 is an illustration of an exemplary scenario featuring a data item processor set comprising data item processors configured to filter data items in response to a request featuring at least one filter criterion.
FIG. 10 illustrates an exemplary computing environment wherein one or more of the provisions set forth herein may be implemented.
Detailed Description
The claimed subject matter is now described with reference to the drawings, wherein like reference numerals are used to refer to like elements throughout. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the claimed subject matter. It may be evident, however, that the claimed subject matter may be practiced without these specific details. In other instances, various structures and devices are shown in block diagram form in order to facilitate describing the claimed subject matter.
In the field of computing, many scenarios involve data sets, such as databases, stored by data stores. A data store may include a computer equipped with a storage component (e.g., a memory circuit, hard disk drive, solid state storage device, or magnetic or optical storage disk) on which the data set is stored, and may be configured to execute software that satisfies requests to access the data, which may be received from various users and/or processes. In many such cases, the stored data may be large, scaling to millions or billions of records stored in one table and/or a large number of tables, and/or may be complex, featuring a large number of interrelationships between records and tables and constraints that restrict the types of data that may be stored therein.
In some such cases, the data set may be stored on multiple data stores. As a first example, two or more data stores may store identical copies of a data set. This configuration may be advantageous to facilitate availability (e.g., one data store may respond to a request for data while another data store is occupied or offline). As a second example, the data sets may be distributed over the data stores such that each data store stores a portion of the data set. This configuration may be advantageous for improving efficiency (e.g., the distribution of computational burden to satisfy a request for a particular set of data (such as a particular record) may be limited to the data store that is storing the requested data). In many such examples, dozens or hundreds of data stores may be provided, such as in a server farm that includes a very large number of data stores that store and provide access to very large data sets together.
FIG. 1 presents an exemplary scenario 10 featuring a first architecture for applying a query 14 submitted by a user 12 to a data set 20, including a set of data tables 22 each storing a set of records 26 having particular attributes 24. In this exemplary scenario 10, the data set 20 has been distributed across many data stores 18 in various ways. As a first example, the data set 20 may be distributed vertically; for example, the data set 20 may include several data tables 22 storing different types of records 26, and a first data store 18 may store the records 26 of a first data table 22, while a second data store 18 may store the records 26 of a second data table 22. As a second example, the data set 20 may be distributed horizontally; for example, for a particular data table 22, a first data store 18 may store a first set of records 26, while a second data store 18 may store a second set of records 26. The distribution may be arbitrary or may be based on particular attributes 24 of the data table 22 (e.g., for an attribute 24 specifying a string of letters, the first data store 18 may store records 26 beginning with the letters "A" through "L," while the second data store 18 may store records 26 beginning with the letters "M" through "Z"). Other ways of distributing the data tables 22 and records 26 may also be devised; for example, for a particular data table 22, a first data store 18 may store a first set of attributes 24 of a record 26, while a second data store 18 may store a second set of attributes 24 of the record 26, or both data stores 18 may redundantly store the same record 26 to facilitate availability of the record 26 and rapid evaluation of queries involving the record 26.
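As a concrete illustration of the horizontal, attribute-based distribution described above, the following minimal Python sketch routes records of a single data table to one of two data stores according to the first letter of a chosen attribute. The store names, attribute name, and routing rule are illustrative assumptions rather than details taken from the patent.

# Route each record to a data store by the first letter of a chosen attribute.
def choose_store(record: dict, attribute: str) -> str:
    """Return 'store_a_l' for values beginning with A-L, otherwise 'store_m_z'."""
    first_letter = record[attribute].strip().upper()[:1]
    return "store_a_l" if first_letter <= "L" else "store_m_z"

stores = {"store_a_l": [], "store_m_z": []}
for rec in [{"Name": "Alice"}, {"Name": "Mallory"}, {"Name": "Zoe"}]:
    stores[choose_store(rec, "Name")].append(rec)

print(stores)  # Alice lands in store_a_l; Mallory and Zoe land in store_m_z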
In many such scenarios, a user or process may submit a query to be applied to the data set 20. For example, a Structured Query Language (SQL) query may include one or more operations to be applied to the data set 20, such as selecting records 26 from one or more data tables 22 having particular values for particular attributes 24, projecting particular attributes 24 of such records 26, combining attributes 24 of different records 26 to create a composite record 26, and applying various other operations to the selected data (e.g., sorting, grouping, or counting records) before presenting the query results. The query may also specify various changes to the data set 20, such as inserting a new record 26, setting various attributes 24 of one or more records 26, deleting a record 26, establishing or terminating relationships between semantically related records 26, and changing the layout of the data set 20, such as by inserting, modifying, or deleting one or more data tables 22. These operations may also be linked together into a collection, sequence, or conditional hierarchy of such operations. Variants of the structured query language also support more complex operations, such as complex data searches (e.g., support for identifying records that match regular expressions), logging (e.g., recording the application of operations so that they can later be reversed), and transactions (e.g., sets of two or more operations of which either all are successfully performed or none is applied). Other variants of the structured query language may also support execution of code on the data store; for example, the query may specify or invoke a stored procedure performed on the stored data by the data store, or may include an agent, such as an interpretable script or executable binary, that is provided to the data store for local execution. To evaluate and fulfill such queries, the data store 18 may include a query processing engine, such as a software pipeline that includes components to perform various parsing operations on the query, such as associating names in the query with named objects of the database and identifying the operations specified by various operators. The data store 18 can enable evaluation and fulfillment of a query by lexically parsing the language of the query (e.g., identifying individual components of the query according to syntactic rules of the query language), identifying the operations specified by each component of the query and the logical structure and sequence of those operations, and invoking components capable of fulfilling the operations.
In these and other situations, the task of applying complex queries to data sets distributed across many data stores can present many implementation challenges. Many techniques and architectural frameworks have been proposed to implement such applications in an efficient and automated manner.
The exemplary scenario 10 of FIG. 1 also presents one technique that is commonly used to apply a query 14 to a data set 20 that is distributed across many data stores 18. In this exemplary scenario 10, a user 12 may submit a query 14 that includes a set of operations 16 that may be applied against the data set 20. Further, the operations 16 may be linked together in a logical sequence, e.g., using Boolean operators to specify how the results of particular operations 16 are to be combined. The query 14 may be delivered to a MapReduce server 28, comprising a computer configured to apply "MapReduce" techniques to distribute the query 14 across the data stores 18 that are storing various portions of the data set 20. For example, the MapReduce server 28 may identify that various operations 16 within the query 14 target various portions of the data set 20 that are respectively stored by particular data stores 18. For example, a first operation 16 may target data stored by the first data store 18 (e.g., a Select operation applied to a table 22 and/or set of records 26 stored by the first data store 18), while a second operation 16 may target data stored by the second data store 18. Thus, the MapReduce server 28 may decompose the query 14 into various query portions 30, each query portion including one or more operations to be performed by a particular data store 18. Each data store 18 may receive its query portion 30, apply the operations 16 specified therein, and generate query results 32 that may be delivered to the MapReduce server 28 (or to another data store 18 for further processing). The MapReduce server 28 may then combine the query results 32 provided by the data stores 18 to generate query results 34 that may be provided to the user 12 in response to the query 14. In this manner, the data stores 18 and the MapReduce server 28 may interoperate to enable fulfillment of the query 14.
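The following hedged Python sketch illustrates the coordination pattern attributed to the MapReduce server 28: a query is decomposed into per-store portions, each portion is evaluated against the records held by one data store, and the partial results are combined into a single query result. The record layout, the selection threshold, and the function names are illustrative assumptions, not the patent's API.

# Two data stores, each holding a portion of an "Orders" table.
store_1 = [{"Id": 1, "Total": 40}, {"Id": 2, "Total": 75}]
store_2 = [{"Id": 3, "Total": 120}]

def map_portion(store_records, min_total):
    # "Map" step: a data store evaluates its query portion locally.
    return [r for r in store_records if r["Total"] >= min_total]

def reduce_results(partial_results):
    # "Reduce" step: the coordinating server combines the partial query results.
    combined = []
    for part in partial_results:
        combined.extend(part)
    return combined

partials = [map_portion(s, 50) for s in (store_1, store_2)]
print(reduce_results(partials))  # orders 2 and 3 satisfy the query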
The exemplary scenario 10 of FIG. 1 may present some advantages (e.g., automatically parceling the query 14 out to multiple data stores 18 may enable concurrent evaluation of the various query portions 30, which may speed up evaluation of the query 14). However, the exemplary scenario 10 may also present some drawbacks. In particular, it may be desirable to design an architecture for a distributed data set, such as a distributed database, in which storage of and access to data is performed on a first set of devices, while complex computing processes are performed on a second set of devices. Such partitioning may be advantageous, for example, to improve the security of the data set 20. For example, the queries 14 to be applied to the data set 20 may be computationally expensive (e.g., involving a large amount of memory), ill-formed (e.g., recursive queries that do not terminate or cannot be logically evaluated), or malicious (e.g., overtly or covertly involving unauthorized disclosure or modification of the data set 20). In some cases, the computation may involve executing code, such as a query 14 that invokes a stored procedure that has been implemented on the data store 18, or a mobile agent scenario, where a third party may provide an "agent" (e.g., an interpretable script, or a partially or fully compiled executable file) that may be applied to the data set 20. Thus, the security of the data set 20 may be improved by limiting complex computations to a particular set of computers that may be carefully monitored and may be suspended, taken offline, or replaced if such computers appear to be operating in a manner that may damage the data set 20. However, the exemplary scenario 10 of FIG. 1 does not involve such partitioning. On the contrary, the data stores 18 storing portions of the data set 20 also perform the query portions 30 on such data, and thus do not separate access to the data set 20 from the computations performed thereon.
A second drawback that may arise in the exemplary scenario 10 of FIG. 1 relates to the performance of the data set 20. For example, a particular data store 18 may store portions of the data set 20 that are frequently accessed on a temporary or long-term basis, such that the data store 18 receives and processes many queries 14 that relate to those portions of the data set 20 in a short period of time. However, if the data store 18 is also configured to perform complex computational processing of the stored data, queries 14 involving complex operations may consume computational resources (e.g., memory capacity and bandwidth) of the data store 18 that are then not available to fulfill other queries 14. Thus, a single complex query 14 may preempt the evaluation and fulfillment of other queries 14 that relate to the same data stored by the data store 18. Conversely, if complex computations involving the data are partitioned from the storage of such data, many computers may be configured to process queries 14 in parallel, and a complex query 14 that occupies the resources of one computer may not affect the evaluation or fulfillment of other queries 14 processed by other computers.
In view of these and other shortcomings resulting from the architecture presented in the exemplary scenario 10 of FIG. 1, it may be desirable to separate the storage and access of data in the data set 20 from complex computational queries that may be applied to such data. However, a rigid partitioning in which the data store 18 provides only low-level access and the compute nodes provide all computations may also be inefficient.
FIG. 2 presents an exemplary scenario 40 in which the data store 18 is configured to store a data set 20 that includes a large number of records 26 (e.g., 50,000 records). The user 12 may submit a query 14, and the query 14 may be received and fully evaluated by the compute node 42. The compute node 42 may include, for example, a query processing engine that may lexically parse the query 14, identify the operations 16 specified therein, and invoke various components to perform such operations 16, including retrieving data from the data store 18. For example, instead of sending the query 14 or a query portion 30 to the data store 18, the compute node 42 may simply send a request 44 for a particular set of records 26, such as the records 26 of a data table 22 of the data set 20. The data store 18 may respond with a request result 48 that includes the requested records 26, and the compute node 42 may apply some complex computation (e.g., the operations 16 specified in the query 14) to the request result 48 and may return the query result 34 to the user 12. However, this exemplary scenario 40 illustrates the inefficiency of this rigid division of responsibility between the compute node 42 and the data store 18. For example, the query 14 may request that a single record 26 be retrieved (e.g., the record 26 for an employee associated with a particular identifier), but the data table 22 stored by the data store 18 may include many such records 26. Thus, the data store 18 may provide a request result 48 including 50,000 records 26 to the compute node 42 even though only one such record 26 is included in the query results 34. Moreover, the record 26 may be readily identifiable from within the query 14 (e.g., if the query 14 identifies the requested record 26 according to an index field having a unique identifier for the corresponding record 26), but the data store 18 does not perform this relatively simple filtering because the data store 18 cannot perform the computations involved in evaluating the query 14. This inefficiency may become particularly apparent, for example, if the request results 48 are sent to the compute node 42 over a network 46 that may have limited capacity. Sending many records 26 over the network 46 may impose a rate-limiting factor on completing the query 14, thereby imposing a significant delay in fulfilling relatively simple queries 14 involving small query results 34. These and other drawbacks may result from a hard partitioning of responsibilities between the data stores 18 and the compute nodes 42 that access the data set 20.
Presented herein are techniques for configuring a data set 20 to evaluate a query 14. These techniques may be designed, for example, in view of the advantages and disadvantages of the exemplary scenario 10 of FIG. 1 and the exemplary scenario 40 of FIG. 2. In accordance with these techniques, the data store 18 may be configured to store one or more data items of the data set 20 (e.g., various tables 22, attributes 24, and/or records 26 of the data set 20) and to participate in the evaluation of queries 14 involving such data items. In contrast to the exemplary scenario 10 of FIG. 1, the data store 18 is not configured to evaluate the query 14; for example, the data store 18 may not include a query processing engine and may refuse to accept or evaluate queries 14 that are specified in a query language, such as a Structured Query Language (SQL) query. However, neither is the data store 18 limited to providing entire portions of the data set 20 in response to the request 44, which may result in the inefficiencies of a rigid partitioning such as illustrated in the exemplary scenario 40 of FIG. 2. Rather, in accordance with these techniques, the data store 18 is configured to accept requests 44 that include one or more filter criteria that define the filtered subset of data. For example, the data store 18 may store one or more data tables 22 that include individual records 26, and may index a small number of attributes 24 of the records 26. Filtering may involve identifying, retrieving, and providing a data subset of the data set 20 comprising the records 26 having a particular value of one of the indexed attributes 24. Since applying the filter criteria to the data set 20 may result in a significant reduction in the data to be sent in the filtered data subset 58 while consuming only a small portion of the computing resources involved in evaluating the query 14, the data store 18 may be configured to perform this filtering in response to the request 44. However, the data store 18 may be configured to refrain from performing more complex computational processes; for example, the data store 18 may omit the query processing engine altogether, may decline to accept a query 14 specified in the query language, or may deny a request 44 specifying an attribute 24 that is not indexed. In this manner, the techniques presented herein may achieve greater efficiency and security than in the exemplary scenario 10 of FIG. 1, while also avoiding the disadvantages presented in the exemplary scenario 40 of FIG. 2.
FIG. 3 presents an illustration of an exemplary scenario 50 featuring the application of the techniques presented herein to apply a query 14 submitted by a user 12 to a data set 20 storing various data items 52 in order to generate and provide query results 34. In this exemplary scenario 50, access to the data set 20 may be enabled through the data store 18, which in turn may be accessible through the compute node 42. However, if the user 12 or the compute node 42 submits the query 14 to the data store 18, the data store 18 may refuse to accept the query 14 or may not be able to evaluate the query 14. (Alternatively, the data store 18 may accept and evaluate the query 14 only in certain circumstances, such as where the query 14 is submitted by an administrator.) Instead, the user 12 (or an automated process) may submit the query 14 to the compute node 42, and the compute node 42 may endeavor to interact with the data store 18 to evaluate the query 14 and provide the query results 34. In particular, the compute node 42 may examine the query 14 to identify a request 44 that includes one or more filter criteria 54 and that may specify retrieval of particular data items 52 from the data store 18 (e.g., identifying one or more operations 16 of the query 14 that may be expressed as a request 44 for data items 52 satisfying one or more filter criteria 54). The data store 18 is configured to receive the data items 52 and store the received data items 52 in a storage component (e.g., a memory circuit, a hard disk drive, a solid state storage device, or a magnetic or optical disk) as part of the data set 20. Additionally, the data store 18 is configured to receive a request 44 that includes one or more filter criteria 54. Upon receiving the request 44, the data store 18 may perform filtering 56 to identify the data items 52 that satisfy the filter criteria 54, and generate a filtered data subset 58 to be returned to the compute node 42. The compute node 42 may receive the filtered data subset 58 and may apply the remainder of the query 14 (e.g., performing the complex computations specified by the operations 16 of the query 14 that are not expressed in the request 44). In some such cases, the compute node 42 may send a second or further request 44 to the data store 18 specifying other filter criteria 54, and may use the second or further filtered data subset 58 in the computation. Finally, the compute node 42 may generate the query results 34, and the query results 34 may be presented to the user 12 (or automated process) in response to the query 14. In this manner, configuring the data store 18 and, optionally, the compute node 42 may enable fulfillment of the query 14 in a more efficient and secure manner than presented in the exemplary scenario 10 of FIG. 1 and/or the exemplary scenario 40 of FIG. 2.
FIG. 4 presents a first embodiment of these techniques, illustrated as an exemplary method 60 of fulfilling requests 44 targeting a data set 20. The exemplary method 60 may be performed by, for example, a data store 18 configured to store or access some or all of the data set 20. Additionally, the exemplary method 60 may be implemented, for example, as a set of software instructions stored in a memory component (e.g., a system memory circuit, a platter of a hard disk drive, a solid state storage device, or a magnetic or optical disk) of the data store 18 that, when executed by a processor of the data store 18, cause the processor to perform the techniques presented herein. The exemplary method 60 begins at 62 and involves executing 64 instructions on the processor. More specifically, the instructions are configured to, upon receiving a data item 52, store 66 the data item 52 in the data set 20. The instructions are further configured to, upon receiving 68 a request 44 specifying at least one filter criterion 54, retrieve 70 the data items 52 of the data set 20 that satisfy the at least one filter criterion 54 to generate a filtered data subset 58, and send 72 the filtered data subset 58 in response to the request 44. In this manner, the exemplary method 60 enables fulfillment of requests 44 to access the data set 20 without exposing the data store 18 to the security risks, inefficiencies, and consumption of computing resources involved in evaluating the query 14, and thus ends at 74.
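A minimal Python sketch of the exemplary method 60 follows, under the assumption that a filter criterion can be represented as an attribute name paired with a required value. The data store below only stores received data items and returns the filtered data subset; it deliberately has no query processing engine. Class and method names are illustrative, not taken from the patent.

class FilteringDataStore:
    """Stores data items and fulfills requests that specify only filter criteria."""

    def __init__(self):
        self._data_set = []  # the stored data items

    def store(self, data_item: dict) -> None:
        # Upon receiving a data item, store it in the data set (element 66).
        self._data_set.append(data_item)

    def fulfill(self, filter_criteria: dict) -> list:
        # Upon receiving a request specifying filter criteria (element 68), retrieve
        # the data items satisfying every criterion (element 70) and return the
        # filtered data subset for sending (element 72).
        return [item for item in self._data_set
                if all(item.get(attr) == value for attr, value in filter_criteria.items())]

store = FilteringDataStore()
store.store({"EventType": "login", "User": "alice", "Time": "2012-03-01T08:00"})
store.store({"EventType": "logout", "User": "alice", "Time": "2012-03-01T17:00"})
print(store.fulfill({"EventType": "login"}))  # only the login event is returned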
FIG. 5 presents a second embodiment of these techniques, illustrated as an exemplary method 80 of applying a query 14 to a data set 20 stored by a data store 18. The exemplary method 80 may be performed on, for example, a device having a processor, such as the compute node 42. Additionally, the exemplary method 80 may be implemented as a set of software instructions stored, for example, in a memory component (e.g., a system memory circuit, a platter of a hard disk drive, a solid state storage device, or a magnetic or optical disk) of the compute node 42 or other device that, when executed by a processor, cause the processor to perform the techniques presented herein. The exemplary method 80 begins at 82 and involves executing 84 instructions on the processor. More specifically, the instructions are configured to generate 86, from the query 14, a request 44 specifying at least one filter criterion 54. The instructions are also configured to send 88 the request 44 to the data store 18, and, upon receiving a filtered data subset 58 from the data store 18 in response to the request 44, apply 90 the query 14 to the filtered data subset 58. In this manner, the exemplary method 80 enables fulfillment of the query 14 of the data set 20 without exposing the data store 18 to the security risks, inefficiencies, and consumption of computing resources involved in evaluating the query 14, and thus ends at 92.
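The complementary compute-node side of the exemplary method 80 can be sketched in the same hedged manner: a request containing only filter criteria is derived from a richer query, the filtered data subset is obtained from the data store, and the remaining (complex) part of the query is applied locally. The stand-in for the network round trip and all names below are illustrative assumptions.

def request_filtered_subset(data_set, filter_criteria):
    # Stand-in for sending the request 44 to the data store; only filtering happens there.
    return [item for item in data_set
            if all(item.get(attr) == value for attr, value in filter_criteria.items())]

def apply_query(data_set, filter_criteria, post_process):
    filtered_subset = request_filtered_subset(data_set, filter_criteria)  # generate and send (86, 88)
    return post_process(filtered_subset)                                  # apply the rest of the query (90)

events = [{"EventType": "login", "User": "alice"},
          {"EventType": "login", "User": "bob"},
          {"EventType": "login", "User": "alice"},
          {"EventType": "logout", "User": "alice"}]

def count_per_user(items):
    # The aggregation is performed on the compute node, not on the data store.
    counts = {}
    for item in items:
        counts[item["User"]] = counts.get(item["User"], 0) + 1
    return counts

print(apply_query(events, {"EventType": "login"}, count_per_user))  # {'alice': 2, 'bob': 1}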
Yet another embodiment relates to a computer-readable medium comprising processor-executable instructions configured to apply the techniques presented herein. Such computer-readable media may include, for example, computer-readable storage media including tangible devices, such as memory semiconductors (e.g., semiconductors using Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), and/or Synchronous Dynamic Random Access Memory (SDRAM) technology), disks of a hard drive, flash memory devices, or magnetic or optical disks (such as CD-R, DVD-R, or floppy disks), encoded with a set of computer-readable instructions that, when executed by a processor of the device, cause the device to implement the techniques presented herein. Such computer-readable media may also include (as a class of technologies that differ from computer-readable storage media) various types of communication media, such as signals that may propagate through various physical phenomena (e.g., electromagnetic, acoustic, or optical signals) as well as in various wired scenarios (e.g., via ethernet or fiber optic cables) and/or wireless scenarios (e.g., a Wireless Local Area Network (WLAN) such as WiFi, a Personal Area Network (PAN) such as bluetooth, or a cellular or radio network), and which are encoded with a set of computer-readable instructions that, when executed by a processor of a device, cause the device to implement the techniques presented herein.
An exemplary computer-readable medium that may be devised in these ways is illustrated in FIG. 6, wherein the implementation 100 comprises a computer-readable medium 102 (e.g., a CD-R, DVD-R, or a platter of a hard disk drive), on which is encoded computer-readable data 104. The computer-readable data 104, in turn, comprises a set of processor-executable instructions 106 configured to operate according to the principles set forth herein. In one such embodiment, the processor-executable instructions 106 may be configured to perform a method of fulfilling requests targeting a data set, such as the exemplary method 60 of FIG. 4. In another such embodiment, the processor-executable instructions 106 may be configured to perform a method of applying a query to a data set stored by a data store, such as the exemplary method 80 of FIG. 5. Some embodiments of the computer-readable medium may include a non-transitory computer storage medium (e.g., a hard disk drive, an optical disk, or a flash memory device) configured to store processor-executable instructions configured in this manner. Many such computer-readable media may be devised by those of ordinary skill in the art that are configured to operate in accordance with the techniques presented herein.
The techniques discussed herein may be designed with variations in many aspects, and some variations may present additional advantages and/or reduce disadvantages relative to other variations of these and other techniques. Moreover, some variations may be implemented in combination, and some combinations may feature additional advantages and/or reduced disadvantages through synergistic cooperation. Various variations may be incorporated into various embodiments (e.g., the exemplary method 60 of FIG. 4 and the exemplary method 80 of FIG. 5) to confer individual and/or synergistic advantages upon such embodiments.
A first aspect that may vary among embodiments of these techniques relates to the situation in which such techniques may be used. As a first variation, many types of data stores 18 (and/or compute nodes 42) may be used to apply queries 14 and requests 44 to data sets 20. As one such example, the data store 18 and/or compute nodes 42 may comprise different hardware devices (e.g., different machines or computers), different circuitry operating within a particular hardware device (e.g., a Field Programmable Gate Array (FPGA)), or software processes (e.g., separate threads) executing within one or more computing environments on one or more processors of a particular hardware device. The data store 18 and/or compute nodes 42 may also include virtual processes, such as distributed processes that may be incrementally executed on individual devices of a device collection. Additionally, the respective data store 18 may store internally data items 52 comprising the data set 20, or may access other data stores 18 that store internally data items 52 (e.g., a data access layer or device that interfaces with a data storage layer or device). As a second variation, many types of data sets 20 may be accessed using the techniques presented herein, such as a database, a file system, a media library, an email mailbox, a set of objects in an object system, or a combination of such data sets 20. Similarly, many types of data items 52 may be stored in the data set 20. As a third variation, queries 14 and/or requests 44 evaluated using the techniques presented herein may be specified in a number of ways. For example, the queries 14 may be specified in terms of Structured Query Language (SQL) variables, as language-integrated queries (e.g., LINQ queries), or as interpretable scripts or executable objects configured to perform various manipulations of the data items 52 within the data set 20. The request 44 may also be specified in various ways, such as simply specifying the indexed attributes 24 to be included in the filtered data subset 58 and one or more values of such attributes 24. Although the request 44 is limited to one or more filtering criteria 54 that specify data items 52 to be included in the filtered data subset 58, the language, syntax, and/or protocol by which the query 14 and request 44 are formatted does not significantly affect the application or implementation of the techniques presented herein.
A second aspect that may vary among embodiments of these techniques relates to the manner in which the data store 18 stores the data items 52 of the data set 20. As a first variation, the data store 18 may include at least one index that may correspond to one or more filter criteria 54 (e.g., a particular attribute 24, such that records 26 containing one or more values of the attribute 24 are to be included in the filtered data subset 58). The data store 18 may be configured to, upon receiving a data item 52, index the data item in the index according to the filter criterion 54 (e.g., according to the values of one or more attributes 24 of the data item 52 that may be targeted by the filter criterion 54). The data store 18 can then fulfill the request 44 by using the index corresponding to the filter criteria 54 to identify the data items 52 that satisfy the filter criteria 54 of the request 44. It may be advantageous to select for indexing the attributes 24 of the data items 52 that are likely to be targeted by the filter criteria 54 of requests 44, and to refrain from indexing other attributes 24 of the data items 52 (e.g., indexes must be maintained when the data items 52 change, and it may be disadvantageous to undertake the computational burden of such maintenance to index attributes 24 that are unlikely to be included as filter criteria 54). For example, in a database configured to track events performed by various users at various times, it may be desirable to configure the data store 18 to generate and maintain an index set that includes: an event type index specifying the type of event represented by each data item 52; a time index specifying the time of the event represented by each data item 52; and a user index specifying at least one user associated with the event represented by each data item 52. However, it may be undesirable to generate and maintain indexes for other attributes 24 of the data set 20, such as a Uniform Resource Identifier (URI) of a digital resource involved in the event, a comment field into which textual comments about a particular event may be entered by various users and administrators, or a binary large object ("blob") field holding a large data object involved in an event (e.g., a system log or a captured image depicting the event).
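A minimal sketch of this selective-indexing variation follows, assuming an event-tracking data set in which only the event type, time, and user attributes are expected to appear as filter criteria. The attribute names and index layout are illustrative assumptions.

from collections import defaultdict

INDEXED_ATTRIBUTES = ("EventType", "Time", "User")   # attributes likely to be filter criteria

data_set = []                                         # all stored data items, in arrival order
indexes = {attr: defaultdict(list) for attr in INDEXED_ATTRIBUTES}

def store_data_item(item: dict) -> None:
    position = len(data_set)
    data_set.append(item)
    # Index the item under each indexed attribute it carries; unindexed attributes
    # (e.g. a comment or blob field) impose no index-maintenance cost.
    for attr in INDEXED_ATTRIBUTES:
        if attr in item:
            indexes[attr][item[attr]].append(position)

store_data_item({"EventType": "login", "Time": "2012-03", "User": "alice", "Comment": "ok"})
store_data_item({"EventType": "error", "Time": "2012-03", "User": "bob", "Blob": b"..."})
print([data_set[i] for i in indexes["EventType"]["login"]])  # only the login event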
As another variation of this second aspect, the index may identify the data items 52 associated with one or more particular filter criterion values of a particular filter criterion 54 in various ways. As one such example, for a filter criterion value of a filter criterion 54 corresponding to the index, the index may specify a data item set identifying the data items having that filter criterion value for the filter criterion 54. For example, for each filter criterion value of the filter criterion 54, the index may store a set of references to the data items 52 associated with that filter criterion value. In addition, the data item sets stored in the index may be accessed in a variety of ways. For example, the index may permit a data item set to be written incrementally (e.g., indexing a new data item 52 by adding the data item 52 to the data item set having the filter criterion value of the filter criterion), but may permit the data item set to be read only atomically (e.g., for a request 44 specifying a particular filter criterion value of a particular filter criterion 54, the index may read and present the entire data item set, including the entire set of references to such data items 52). As another variation, the data store 18 may store data items 52 in a data item buffer upon receiving the respective data items 52, such that when the data item buffer exceeds a data item buffer size threshold (e.g., the capacity of the data item buffer), the data store 18 adds the buffered data items to the respective data item sets and empties the data item buffer.
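The buffering variation can be sketched as follows, assuming a small, illustrative buffer size threshold and a single filter criterion ("EventType"): received data items accumulate in a buffer, are appended to the per-value data item sets only when the buffer reaches the threshold, and each data item set is read atomically as a whole.

from collections import defaultdict

DATA_ITEM_BUFFER_SIZE_THRESHOLD = 3          # illustrative threshold
data_item_buffer = []
data_item_sets = defaultdict(list)           # filter criterion value -> data item set

def receive(item: dict, criterion: str = "EventType") -> None:
    data_item_buffer.append(item)
    if len(data_item_buffer) >= DATA_ITEM_BUFFER_SIZE_THRESHOLD:
        # Incremental write: flush the buffer into the per-value sets, then empty it.
        for buffered in data_item_buffer:
            data_item_sets[buffered[criterion]].append(buffered)
        data_item_buffer.clear()

def read_atomically(value) -> list:
    # Atomic read: the entire data item set for the value is returned at once.
    return list(data_item_sets[value])

for sequence in range(4):
    receive({"EventType": "login", "Seq": sequence})
print(read_atomically("login"))  # the first three items; the fourth is still buffered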
FIG. 7 presents an illustration of an exemplary scenario 110 featuring data items 52 in one or more data item sets 118 indexed according to an index 112. In this exemplary scenario 110, the data store 18 may receive various data items 52 (e.g., a set of reported events), and may store such data items 52 in the data set 20. In particular, the data store 18 may generate an index 112 comprising a set of index entries 114 containing references 116 to one or more data items 52 of one or more data item sets 118, each index entry corresponding to a different filter criterion value (e.g., the month and year of the date when an event occurred) of a filter criterion 54. Upon receiving a data item 52, the data store 18 may identify one or more filter criterion values of the data item 52 and may store a reference to the data item 52 in the index entry 114 of the index 112 corresponding to each such filter criterion value. The data store 18 may then store the data item 52 in the data set 20 (e.g., by appending the data item 52 to a list of records 26). When a user 12 submits a request 44 to the data store 18 (either directly, or indirectly by submitting a query 14 to a compute node 42 configured to generate from the query 14 a request 44 specifying one or more filter criteria 54), the data store 18 may fulfill the request 44 by retrieving the data item set 118 associated with the filter criterion value, and in particular may do so by identifying the index entry 114 of the index 112 that corresponds to the filter criterion value and identifies the data items 52 of the data item set 118. The data store 18 may then use the references 116 stored in the index entry 114 to retrieve the data items 52 of the data item set 118, and may send such data items 52 as the filtered data subset 58. In this manner, the data store 18 can fulfill the request 44 in an efficient manner by using the index 112 corresponding to the filter criterion 54 of the request 44, even if the data items 52 are stored together in an arbitrary manner. As a further variation, for a respective first filter criterion value of a first filter criterion 54, the respective index entry 114 of the index 112 may store references to data item partitions corresponding to respective second filter criterion values of a second filter criterion 54. This two-tier indexing technique may be used to store and/or retrieve data items 52. For example, storing a data item 52 may involve using the index 112 to identify the index entry 114 associated with the first filter criterion value of the first filter criterion 54 of the data item 52, examining the data item partitions referenced by the index entry 114 to identify the data item partition associated with the second filter criterion value of the second filter criterion 54 of the data item 52, and storing the data item 52 in that data item partition. Conversely, retrieving data items 52 having a particular first filter criterion value of the first filter criterion 54 and a particular second filter criterion value of the second filter criterion 54 may involve using the index 112 to identify the index entry 114 associated with the first filter criterion value; examining the data item partitions referenced by the index entry 114 to identify the data item partition associated with the second filter criterion value; and retrieving and sending that data item partition in response to the request 44.
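The two-tier arrangement described above can be sketched as follows, assuming the first filter criterion is the month of an event and the second is the event type; a request naming both values then resolves to a single data item partition. The criterion names and layout are illustrative assumptions.

from collections import defaultdict

# First tier: month -> index entry; second tier: event type -> data item partition.
index = defaultdict(lambda: defaultdict(list))

def store_item(item: dict) -> None:
    index[item["Month"]][item["EventType"]].append(item)

def retrieve(month: str, event_type: str) -> list:
    # Identify the index entry for the first value, then the partition for the second.
    return list(index[month][event_type])

store_item({"Month": "2012-03", "EventType": "login", "User": "alice"})
store_item({"Month": "2012-03", "EventType": "error", "User": "bob"})
print(retrieve("2012-03", "login"))  # only the login event for that month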
As another variation of this second aspect, the data store 18 may configure the index as a set of partitions, each partition including the data items 52 (or references thereto, e.g., memory references or URIs at which the data items 52 may be accessed, or other identifiers of the data items 52, such as key values of a key field of the data table 22) that satisfy a particular filter criterion 54. For example, the data store 18 may generate various partitions, such as various regions of storage allocated to store data items 52 having particular filter criterion values of particular filter criteria 54. Upon receiving a data item 52, the data store 18 may store the data item 52 in the respective partition; and upon receiving a request 44 specifying a filter criterion value of a particular filter criterion 54, the data store 18 may identify the data item partition storing the data items 52 having that filter criterion value and send the data item partition as the filtered data subset 58. As yet another variation, two or more indices may be used to group the data items according to two or more filter criteria 54.
FIG. 8 presents an illustration of an exemplary scenario 120 featuring the division of data items 52 into respective data item partitions 122. In this exemplary scenario 120, the data store 18 may receive various data items 52 (e.g., a set of reported events), and may store such data items 52 in the data set 20. The data store 18 may again generate an index 112 (not shown) including a set of index entries 114 containing references 116 to one or more data items 52 of one or more data item sets 118, each index entry corresponding to a different filter criterion value (e.g., the month and year of the date when an event occurred) of a filter criterion 54. However, in contrast to the exemplary scenario 110 of FIG. 7, in this exemplary scenario 120 the data items 52 are stored in a manner that is partitioned according to the filter criterion values. Upon receiving a data item 52, the data store 18 may identify one or more filter criterion values of the data item 52, and may identify the data item partition 122 associated with each filter criterion value. The data store 18 may then store the data item 52 in the data item partition 122 corresponding to the filter criterion value. When a user 12 submits a request 44 to the data store 18 (either directly, or indirectly by submitting a query 14 to a compute node 42 configured to generate from the query 14 a request 44 specifying one or more filter criteria 54), the data store 18 may fulfill the request 44 by retrieving the data item set 118 associated with the filter criterion value, and in particular may do so by identifying the data item partition 122 corresponding to the filter criterion value. The data store 18 may then retrieve the entire data item partition 122, and may send the entire data item partition 122 to the user 12. Additional data item partitions 122 may be retrieved and sent in response to other filter criteria 54 (e.g., two or more filter criterion values of a particular filter criterion 54, or a filter criterion value specified in the alternative for each of two or more different filter criteria 54). In this manner, the data store 18 can identify and provide the data items 52 that satisfy the filter criteria 54 in an efficient manner by using the data item partitions 122 corresponding to the one or more filter criteria 54 specified in the request 44. Those of ordinary skill in the art may devise many ways of storing the data items 52 of the data set 20 in accordance with the techniques presented herein.
A third aspect that may vary among embodiments of these techniques relates to configuring the data store 18 and/or the compute nodes 42 to retrieve the data items 52 that satisfy the filter criteria 54 of the request 44. As a first variation, the request 44 may include many types of filter criteria 54. In particular, the request 44 may specify a first filtered data subset 58 in relation to the data items 52 comprising a second filtered data subset 58, and the data store 18 may use the first filtered data subset 58 in generating the second filtered data subset 58. For example, the query 14 may give rise to a request 44 that specifies another filtered data subset 58 (e.g., for the query 14 "SELECT Username FROM Users WHERE User.ID IN (10, 22, 53, 67)", the request 44 is filtered according to the set of numeric user identifiers, which is presented as the filtered data subset 58). As another variation, the query 14 may give rise to a first request 44 specifying a first filtered data subset 58 that is then referenced in a second request 44 specifying a second filtered data subset 58. For example, for the query 14 "SELECT Username FROM Users WHERE User.ID IN (SELECT User FROM Events WHERE Event.Type = 12)", a first filtered data subset 58 may be generated from the Events data table using a first request 44 (e.g., "SET_1 = Events WHERE Event.Type = 12"), and the first filtered data subset 58 may then be referenced by a second request 44 (e.g., "User.ID IN SET_1"), resulting in the second filtered data subset 58. In this way, a request 44 can reference the filtered data subset 58 generated by another request 44 (including earlier requests 44 provided and processed while evaluating the same query 14).
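A hedged sketch of such request chaining follows: the first request produces a named filtered data subset, and a later request references that subset as one of its filter criteria. The "SET_1" name follows the example above; the data layout and function names are illustrative assumptions.

events = [{"User": 10, "Type": 12}, {"User": 53, "Type": 12}, {"User": 22, "Type": 7}]
users = [{"ID": 10, "Username": "alice"}, {"ID": 22, "Username": "bob"},
         {"ID": 53, "Username": "carol"}]

named_subsets = {}

def request_equals(data, criterion, value, save_as=None):
    # First request: filter by an equality criterion and optionally name the result.
    subset = [item for item in data if item[criterion] == value]
    if save_as:
        named_subsets[save_as] = subset
    return subset

def request_in_subset(data, criterion, subset_name, subset_attribute):
    # Second request: filter by membership in a previously generated filtered subset.
    allowed = {item[subset_attribute] for item in named_subsets[subset_name]}
    return [item for item in data if item[criterion] in allowed]

request_equals(events, "Type", 12, save_as="SET_1")
print(request_in_subset(users, "ID", "SET_1", "User"))  # alice and carol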
As a second variation of this third aspect, when presented with a request 44 that includes at least one filter criterion 54, the data store 18 may be configured to retrieve from the data set 20 the data items 52 that satisfy the respective filter criteria 54 of the request 44 (e.g., by using the index 112 to identify the data item set 118 and/or the data item partition 122, as in the exemplary scenario 110 of FIG. 7 and the exemplary scenario 120 of FIG. 8). Alternatively, instead of utilizing an index, the data store 18 may retrieve all of the data items 52 of the data set 20 and may send only the data items 52 that satisfy the at least one filter criterion 54 (e.g., to the compute node 42 or user 12 submitting the request 44 to the data store 18). In the former example, the filtering of data items 52 is implemented by indexing the data items 52 upon receipt; in the latter example, the filtering of the data items 52 is implemented during the sending of the data items 52. However, it may be difficult to filter all of the data items 52 in real time in order to fulfill the request 44. Alternatively or in conjunction with using the index 112 and/or the partitions 122, some techniques may therefore be used to accelerate the real-time filtering of the data items 52.
FIG. 9 presents an illustration of an exemplary scenario 130 featuring one technique for implementing real-time filtering of data items 52. In this exemplary scenario 130, the data store 18 receives a request 44 from a user 12 specifying at least one filter criterion 54 and endeavors to fulfill the request 44 by providing a filtered data subset 58 that includes only the data items 52 that satisfy the filter criteria 54 of the request 44. However, in this exemplary scenario 130, the data store 18 retrieves all of the data items 52 from the data set 20, and then applies a data item processor set 132 to the entire set of data items 52 to identify and provide only the data items 52 that satisfy the filter criteria 54. The data item processor set 132 may include, for example, a set of data item processors 134, each having a state 136 and at least one filtering condition (e.g., logic that evaluates any particular data item 52 to identify whether the filter criteria 54 are satisfied). Each data item processor 134 may be configured to update its state 136 upon receiving a data item 52; and when the state 136 of the data item processor 134 satisfies the at least one filtering condition, the data item processor 134 may authorize sending the data item 52 (e.g., by including the data item 52 in the filtered data subset 58, or by sending the data item 52 to a different data item processor 134 for further processing). Thus, the data item processors 134 may be interconnected and interoperate as a real-time processing system that uses state machines to evaluate the data items 52. The data store 18 may therefore invoke the data item processor set 132 after retrieving the data items 52 from the data set 20, and may send only the data items 52 that have been authorized for sending by the data item processor set 132. In this manner, the data store 18 may enable ad hoc, real-time evaluation of all data items 52 of the data set 20 to identify and deliver the data items 52 that satisfy the filter criteria 54 of the request 44 without the need to generate, maintain, or use an index 112 or partitions 122.
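A minimal sketch of a data item processor 134 follows, under the assumption that its filtering condition is stateful: here the (illustrative) processor authorizes sending error events only once at least three events from the same user have been observed, so the decision depends on the processor's accumulated state 136 rather than on each item in isolation.

from collections import defaultdict

class ErrorAfterRepeatsProcessor:
    """A stateful data item processor: updates its state per item and authorizes sending."""

    def __init__(self, repeat_threshold: int = 3):
        self.repeat_threshold = repeat_threshold
        self.state = defaultdict(int)   # state 136: per-user count of observed events

    def process(self, item: dict) -> bool:
        # Update the state with the received item, then test the filtering condition.
        self.state[item["User"]] += 1
        return (item["EventType"] == "error"
                and self.state[item["User"]] >= self.repeat_threshold)

processor = ErrorAfterRepeatsProcessor()
stream = [{"User": "alice", "EventType": "login"},
          {"User": "alice", "EventType": "error"},
          {"User": "alice", "EventType": "error"},
          {"User": "bob", "EventType": "error"}]
filtered_subset = [item for item in stream if processor.process(item)]
print(filtered_subset)  # only alice's third event is authorized for sending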
As a third variation of this third aspect, the data store 18 may estimate the size of the filtered data subset 58 before providing the filtered data subset 58 in response to the request 44 (and optionally before retrieving the data items 52 that match the filter criteria 54 of the request 44). For example, a request 44 received by the data store 18 may implicate a relatively large filtered data subset 58 whose retrieval and transmission in response to the request 44 may occupy a large amount of computing resources. Thus, for a request 44 received from a requestor (e.g., a particular user 12 or an automated process), an embodiment may first estimate a filtered data subset size of the filtered data subset 58 (e.g., the total estimated volume of records 26 or data items 52 included in the filtered data subset 58), and may endeavor to verify that retrieval of a filtered data subset 58 of that size is acceptable to the requestor. Accordingly, an embodiment may be configured to estimate the filtered data subset size of the filtered data subset 58 and send the filtered data subset size to the requestor before sending the filtered data subset 58 in response to the request 44, and the retrieving and sending of the filtered data subset 58 may occur only after receiving a filtered data subset authorization from the requestor. Conversely, the compute node 42 may be configured to receive an estimate of the filtered data subset size of the filtered data subset 58 from the data store 18 after sending the request 44 specifying the at least one filter criterion 54 and before receiving the filtered data subset 58 in response to the request 44, and may verify the filtered data subset size (e.g., by presenting the filtered data subset size to the user 12, or by comparing the filtered data subset size to an acceptable filtered data subset size threshold defining an acceptable utilization of the computing resources of the data store 18 and/or the network 46). If the estimated filtered data subset size is acceptable, the compute node 42 may generate and send a filtered data subset authorization to the data store 18, and may then receive the filtered data subset 58. Those of ordinary skill in the art may devise many ways of configuring the data store 18 and/or the compute nodes 42 to retrieve data items 52 from the data set 20 in accordance with the techniques presented herein.
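The size-estimation handshake can be sketched as follows, assuming the data items are already grouped by filter criterion value (so the estimate is inexpensive) and assuming an illustrative acceptability threshold supplied by the requestor; none of the names below are taken from the patent.

from collections import defaultdict

data_item_sets = defaultdict(list)
for sequence in range(500):
    data_item_sets["login"].append({"EventType": "login", "Seq": sequence})

def estimate_subset_size(value) -> int:
    # Grouping by filter criterion value makes the size estimate cheap to compute.
    return len(data_item_sets[value])

def fetch_if_authorized(value, acceptable_size_threshold: int):
    estimate = estimate_subset_size(value)
    if estimate > acceptable_size_threshold:
        return None, estimate                  # no authorization: nothing is retrieved or sent
    return list(data_item_sets[value]), estimate   # authorized: send the filtered data subset

subset, estimate = fetch_if_authorized("login", acceptable_size_threshold=100)
print(estimate, subset is None)  # 500 True - the transfer was declined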
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.
As used in this application, the terms "component," "module," "system," "interface," and the like are generally intended to refer to a computer-related entity, either hardware, a combination of hardware and software, or software in execution. For example, a component may be, but is not limited to being, a process running on a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a controller and the controller can be a component. One or more components may reside within a process and/or thread of execution and a component may be localized on one computer and/or distributed between two or more computers.
Furthermore, the claimed subject matter may be implemented as a method, apparatus, or article of manufacture using standard programming and/or engineering techniques to produce software, firmware, hardware, or any combination thereof to control a computer to implement the disclosed subject matter. The term "article of manufacture" as used herein is intended to encompass a computer program accessible from any computer-readable device, carrier, or media. Of course, those skilled in the art will recognize many modifications may be made to this configuration without departing from the scope or spirit of the claimed subject matter.
FIG. 10 and the following discussion provide a brief, general description of a suitable computing environment to implement embodiments of one or more of the principles set forth herein. The operating environment of FIG. 10 is only one example of a suitable operating environment and is not intended to suggest any limitation as to the scope of use or functionality of the operating environment. Example computing devices include, but are not limited to, personal computers, server computers, hand-held or laptop devices, mobile devices (such as mobile phones, Personal Digital Assistants (PDAs), media players, and the like), multiprocessor systems, consumer electronics, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
Although not required, embodiments are described in the general context of "computer readable instructions" being executed by one or more computing devices. Computer readable instructions may be distributed via computer readable media (discussed below). Computer readable instructions may be implemented as program modules, such as functions, objects, Application Programming Interfaces (APIs), data structures, and so forth, that perform particular tasks or implement particular abstract data types. Typically, the functionality of the computer readable instructions may be combined or distributed as desired in various environments.
FIG. 10 illustrates an example of a system 140 comprising a computing device 142 configured to implement one or more embodiments provided herein. In one configuration, computing device 142 includes at least one processing unit 146 and memory 148. Depending on the exact configuration and type of computing device, memory 148 may be volatile (such as RAM, for example), non-volatile (such as ROM, flash memory, etc., for example), or some combination of the two. This configuration is illustrated in FIG. 10 by dashed line 144.
In other embodiments, device 142 may include additional features and/or functionality. For example, device 142 may also include additional storage (e.g., removable and/or non-removable) including, but not limited to, magnetic storage, optical storage, and the like. Such additional storage is illustrated in FIG. 10 by storage 150. In one embodiment, computer readable instructions to implement one or more embodiments provided herein may be located in storage 150. Storage 150 may also store other computer readable instructions to implement an operating system, an application program, and the like. Computer readable instructions may be loaded in memory 148 for execution by processing unit 146, for example.
The term "computer-readable media" as used herein includes computer storage media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions or other data. Memory 148 and storage 150 are examples of computer storage media. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can accessed by device 142. Any such computer storage media may be part of device 142.
The device 142 may also include communication connections 156 that allow the device 142 to communicate with other devices. Communication connection(s) 156 may include, but is not limited to, a modem, a Network Interface Card (NIC), an integrated network interface, a radio frequency transmitter/receiver, an infrared port, a USB connection, or other interfaces for connecting computing device 142 to other computing devices. The communication connection 156 may comprise a wired connection or a wireless connection. Communication connection 156 may transmit and/or receive communication media.
The term "computer readable media" may include communication media. Communication media typically embodies computer readable instructions or other data in a "modulated data signal" such as a carrier wave or other transport mechanism and includes any information delivery media. The term "modulated data signal" may include a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal.
Device 142 may include input devices 154 such as a keyboard, mouse, pen, voice input device, touch input device, infrared camera, video input device, and/or any other input device. Output device(s) 152 such as one or more displays, speakers, printers, and/or any other output device may also be included in device 142. The input device 154 and the output device 152 may be connected to the device 142 via a wired connection, a wireless connection, or any combination thereof. In one embodiment, an input device or an output device from another computing device may be used as input device 154 or output device 152 of computing device 142.
Components of computing device 142 may be connected by various interconnects, such as a bus. Such interconnects may include Peripheral Component Interconnect (PCI) such as PCI Express, Universal Serial Bus (USB), firewire (IEEE 1394), optical bus structures, and the like. In another embodiment, components of computing device 142 may be interconnected by a network. For example, memory 148 may be comprised of multiple physical memory units located in different physical locations interconnected by a network.
Those skilled in the art will realize that storage devices utilized to store computer readable instructions may be distributed across a network. For example, a computing device 160 accessible via network 158 may store computer readable instructions to implement one or more embodiments provided herein. Computing device 142 may access computing device 160 and download a part or all of the computer readable instructions for execution. Alternatively, computing device 142 may download pieces of the computer readable instructions, as needed, or some instructions may be executed at computing device 142 and some at computing device 160.
Various operations of embodiments are provided herein. In one embodiment, one or more of the operations described may constitute computer readable instructions stored on one or more computer readable media, which, if executed by a computing device, will cause the computing device to perform the operations described. The order in which some or all of the operations are described should not be construed to imply that these operations are necessarily order dependent. Alternative ordering will be appreciated by those skilled in the art having the benefit of this description. Moreover, it should be understood that not all operations are necessarily present in each embodiment provided herein.
Moreover, the word "exemplary" is used herein to mean serving as an example, instance, or illustration. Any aspect or design described herein as "exemplary" is not necessarily to be construed as advantageous over other aspects or designs. Rather, use of the word exemplary is intended to present concepts in a concrete fashion. As used in this application, the term "or" is intended to mean an inclusive "or" rather than an exclusive "or". That is, "X employs a or B" means any of the natural inclusive permutations unless specified otherwise or clear from the context. That is, if X employs A; b is used as X; or X employs both A and B, then "X employs A or B" is satisfied under any of the above circumstances. In addition, the articles "a" and "an" as used in this application and the appended claims may generally be construed to mean "one or more" unless specified otherwise or clear from context to be directed to a singular form.
Also, although the invention has been shown and described with respect to one or more implementations, equivalent alterations and modifications will occur to others skilled in the art based upon a reading and understanding of this specification and the annexed drawings. The present invention includes all such modifications and alterations, and is limited only by the scope of the appended claims. In particular regard to the various functions performed by the above described components (e.g., elements, resources, etc.), the terms used to describe such components are intended to correspond, unless otherwise indicated, to any component which performs the specified function of the described component (e.g., that is functionally equivalent), even though not structurally equivalent to the disclosed structure which performs the function in the herein illustrated exemplary implementations of the invention. In addition, while a particular feature of the invention may have been disclosed with respect to only one of several implementations, such feature may be combined with one or more other features of the other implementations as may be desired and advantageous for any given or particular application. Furthermore, to the extent that the terms "includes," "has," "contains," and "with," and variants thereof, are used in either the detailed description or the claims, such terms are intended to be inclusive in a manner similar to the term "comprising."
Claims (10)
1. A method (60) of fulfilling requests (44) targeting a data set (20) on a data store (18) having a processor (146), the data store comprising at least one index for at least one filtering criterion, the method (60) comprising:
storing a data item (52) in the data set (20) after receiving the data item (52); and
after receiving (68) a request (44) specifying at least one filter criterion (54):
indexing the data items in at least one index according to at least one filter criterion, wherein indexing data items in the at least one index comprises: adding, for a respective filter criterion, the data item to a set of data items for data items having a filter criterion value of the filter criterion;
retrieving (70) data items (52) of the data set (20) that satisfy the at least one filtering criterion (54) to generate a filtered data subset (58), wherein retrieving data items of the data set that satisfy the at least one filtering criterion comprises: identifying, for a respective filter criterion, data items that satisfy the filter criterion by using an index corresponding to the filter criterion, wherein the index specifies, for a filter criterion value of the filter criterion corresponding to the index, a set of data items that identify data items having a filter criterion value of the filter criterion, wherein identifying the data items that satisfy the filter criterion having a filter criterion value comprises: retrieving a set of data items for data items having filter criteria values for the filter criteria; and
sending (72), in response to the request (44), the filtered subset of data (58); wherein the data store has a data item buffer configured to store received data items, and storing the data item comprises: when the data items stored in the data item buffer exceed a data item buffer size threshold:
adding the respective data items of the data item buffer to the set of data items, and
emptying the data item buffer.
2. The method of claim 1,
the data store comprises at least one data item partition configured to store data items having filter criteria values of filter criteria; and
retrieving data items of the data set that satisfy the at least one filtering criterion comprises: retrieving data items stored in the data item partition storing data items having the filter criterion values of the filter criterion for at least one filter criterion value of a respective filter criterion.
3. The method of claim 2, wherein the method comprises, after receiving a data item:
identifying at least one filter criterion value of at least one filter criterion corresponding to the index;
identifying a data item partition storing data items having filter criteria values of the filter criteria; and
storing the data item in the data item partition.
4. The method of claim 2,
the data storage includes:
at least one data item partition configured to store data items having a first filter criterion value of a first filter criterion and a second filter criterion value of a second filter criterion; and
at least one index configured to identify, for data items having a first filter criterion value of the first filter criterion, respective data item partitions storing data items also having a respective second filter criterion value of the second filter criterion.
5. The method of claim 1,
the request specifies a first filtered subset of data for generating the filtered subset of data; and
retrieving data items of the data set includes: retrieving data items of the data set that satisfy the at least one filtering criterion and generating a filtered data subset using the first filtered data subset.
6. The method of claim 1, wherein retrieving the data items comprises:
retrieving all data items of the data set; and
sending only data items satisfying the at least one filter criterion.
7. The method of claim 6,
the data store comprises a set of data item processors including at least one data item processor having a status and at least one filter condition, and the at least one data item processor is configured to:
updating a state of the data item processor after receiving a data item; and
authorizing transmission of the data item after the state of the data item processor satisfies the at least one filter condition; and
sending only data items that satisfy the at least one filter criterion includes:
providing respective data items to said set of data item processors, and
sending the data items authorized for sending by the set of data item processors.
8. The method of claim 1,
receiving the request from a requestor; and
the method comprises, prior to sending the filtered subset of data:
estimating a filtered data subset size of the filtered data subset;
sending the filtered data subset size to the requestor; and
after receiving a filtered subset of data authorization from the requestor, sending the filtered subset of data in response to the request.
9. A method (80) of applying a query (14) to a data set (20) stored by a data store (18), the method (80) being performed by a device (142) having a processor (146) and comprising:
generating (86), from the query (14), a request (44) specifying at least one filtering criterion (54);
sending (88) the request (44) to the data store (18); and
applying (90) the query (14) to the filtered data subset (58) after receiving the filtered data subset (58) from the data store (18) in response to the request (44);
wherein the query comprises:
a first filtering criterion generating a first filtered subset of data, and
a second filtering criterion generating a second filtered subset of data using the first filtered subset of data;
generating the request includes: generating a first request specifying the first filtered subset of data according to the first filter criterion value;
sending the request to the data store comprises: sending the first request to the data store; and
applying the query comprises:
after receiving the first filtered subset of data from the data store in response to the first request:
generating a second request specifying the second filtered subset of data according to the second filtering criteria and using the first filtered subset of data; and
sending the second request to the data store; and
after receiving the second filtered subset of data from the data store in response to the second request, applying the query to the second filtered subset of data.
10. The method of claim 9, comprising, prior to receiving the filtered subset of data from the data store:
receiving a filtered data subset size of the filtered data subset from the data store;
verifying the filtered data subset size to generate a filtered data subset authorization; and
after generating the filtered data subset authorization, sending the filtered data subset authorization to the data store.
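For illustration only, the sketch below walks through the two-stage flow of claims 9 and 10 in Python; the in-memory DataStore, the order fields, and the collapsed authorization check are assumptions chosen for clarity and are not the claimed implementation (in claim 10 the requestor, not the store, verifies the reported subset size).

```python
# Sketch of the client-side flow: generate a first request from the query,
# use the first filtered data subset to generate a second request, and apply
# the remainder of the query to the second filtered data subset.

from typing import Callable, Optional

class DataStore:
    """Toy in-memory stand-in for the remote data store of claims 9 and 10."""
    def __init__(self, data_set: list[dict], size_threshold: int = 1_000):
        self.data_set = data_set
        self.size_threshold = size_threshold

    def fulfill(self, criterion: Callable[[dict], bool]) -> Optional[list[dict]]:
        subset = [item for item in self.data_set if criterion(item)]
        # The subset size is checked before delivery; here the requestor's
        # verification step is collapsed into a simple threshold comparison.
        if len(subset) > self.size_threshold:
            return None  # no filtered data subset authorization
        return subset

def apply_two_stage_query(store: DataStore) -> list[dict]:
    # First request: filter on the first filtering criterion (orders from 2010).
    first_subset = store.fulfill(lambda item: item["year"] == 2010) or []
    # The second request is generated *using* the first filtered data subset:
    # restrict to customers that appeared in the first subset.
    customers = {item["customer"] for item in first_subset}
    second_subset = store.fulfill(
        lambda item: item["customer"] in customers and item["total"] > 100) or []
    # Finally, apply the rest of the query locally to the second subset.
    return sorted(second_subset, key=lambda item: item["total"], reverse=True)

orders = [{"customer": "a", "year": 2010, "total": 250},
          {"customer": "b", "year": 2009, "total": 500}]
print(apply_two_stage_query(DataStore(orders)))
```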
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US12/979,467 (US10311105B2) | 2010-12-28 | 2010-12-28 | Filtering queried data on data stores |
| US12/979,467 | 2010-12-28 | | |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| HK1174111A1 (en) | 2013-05-31 |
| HK1174111B (en) | 2016-02-26 |