METHOD AND SYSTEM FOR INDEXING AND SEARCHING OF SEMI-STRUCTURED DATA
FIELD OF THE INVENTION
[0001] The present invention relates generally to data processing, and, more particularly, to a method and system for indexing and searching of semi- structured data.
BACKGROUND OF THE INVENTION
[0002] Extensible Markup Language (XML) is a meta language for exchanging content among different platforms such as the World Wide Web. XML that allows data to have nested substructures with free text ("semi- structured data"). In particular, XML allows for user-defined tags ("meta data") to structure data in which the data can have any number of nested or hierarchical levels ("hierarchical data structure"). The flexibility of XML to define tags allows XML data to have a wide variety of formats that can evolve with changing conditions ("evolving XML data"). As such, XML is popular with business partners or customers allowing them to exchange evolving XML data over the Internet.
[0003] Traditional data management systems are not well suited to handle evolving XML data. In particular, traditional systems require re-engineering or modification of their underlying data models to handle evolving XML data. For example, entity A may want to partner with entity B to exchange data regarding their employees. Entity A may store, e.g., a record of <person> with a description of programming skill sets using an XML format as illustrated below.
<person> <name><John</name> <skill>
<programming>
<novice><language>Java</language></novice> <expert><language>Python</language></expert> </programming> <skill> <person>
[0004] Entity B may store, e.g., a record of <person> with a description of programming skill sets using a different XML format as illustrated below.
<person>
<name>Bill</name>
<skill>Python Programmer</skill> </person>
[0005] Although entities A and B both use a <person> record, there remains diversity in the underlying XML structure for their definition of <person> records. Specifically, entity A has a more sophisticated nested structure for the <person> record than entity B. The above records can evolve if the person acquires new and different types of skills. As a result, the underlying data structure for the <person> records may need modification. [0006] With traditional management systems, to handle such modification to XML data, a standard XML format is needed. Updating XML data into a standard format, however, is time extensive. Furthermore, each change to XML data may require updating the underlying schema or format of the XML data. For business entities that update or change data formats frequently, traditional systems are inadequate to handle XML data under dynamic conditions.
[0007] One traditional system for handling XML data is a Relational Database Management System ("RDBMS"). RDBMS is a management system that models data using arrays of tables. For a hierarchical data structure such as XML, RDBMS splits levels or substructures of XML data into separate sections of tables. This requires a mapping to define relationships between data in each substructure to a particular section of tables. Thus, one limitation with RDBMS is that a new mapping process is required when adding new substructures to XML data. This requires a time extensive mapping process each time a new substructure is added to XML data. [0008] Another limitation with RDBMS is that the structure of the tables is fixed and well-defined. For RDBMS, if a new substructure is to be added, RDBMS requires a change to the underlying structure of the tables in RDBMS. This type of change requires complicated and time consuming tasks such as, for example, backing up existing data in tables, deleting tables, and re-creating new tables for the new substructure. In addition, RDBMS is not well suited for storing and querying data in hierarchical, sub-structured data such as XML. [0009] Another system for handling XML data uses a "free-text" unstructured text search engine ("free-text system"). A free-text system typically stores data as documents. The free-text system indices documents based on content using an inverted tree indexing scheme. This type of indexing does not allow for searching in a specific substructure within hierarchical, semi-structured data such as XML. Thus, one limitation with a free-text system is that it does not differentiate, e.g., between text strings in underlying substructures of XML data. In particular, a user may want to search
:or the text string "Java programming" in the skill sets category of an employee -ecord and not for the text string in other substructures. The free-text system, nowever, would search for the text string in all substructures. As such, the data retrieval and querying abilities of a free-text system do not effectively leverage hierarchical data structures to search for particular data within a specific substructure.
[0010] Another system for handling XML data is a hybrid of RDBMS and the free-test system ("hybrid system"). The hybrid system typically splits unstructured XML data (e.g., free text) into a free-text system and structured XML data (e.g., substructure data such as a path) into a RDBMS. The hybrid system splits data retrieval between the two different systems. For example, a query for a keyword is performed in the free-text system and a query for structured data is performed in the RDBMS. Thus, one limitation with the hybrid system is that it requires an additional application specific layer to manage the hybrid implementation. The hybrid system, like RDBMS, is still tightly mapped to a specific domain and structure. Thus, the hybrid system is not well-suited to handle evolving XML data.
[0011] There exists, therefore, a need to handle evolving hierarchical, semi- structured data such as, for example, XML data, in a form native to the hierarchical and semi-structured environment.
SUMMARY OF THE INVENTION
[0012] Methods and system consistent with the present invention, as embodied and broadly described herein, provide indexing and searching of semi-structured data.
[0013] Consistent with the invention, a method for indexing semi-structured data includes determining an element within the semi-structured data, the element being associated with a path; and creating a plurality of indices for the element, the indices including an index by keyword or phrase for the element, an index by the path for the element, an index to support word stems for the element, a keys index for path manipulation of the element, a soundex index for the element, and a synonym index for the element. [0014] Consistent with the invention, a method for processing a query request regarding semi-structured data, the data including at least one element associated with a path, the method includes parsing the query request into query components, each query component representing a search request for an element within the semi-structured data; and processing the query components to search for the associated elements using a plurality of indices including an index by keyword or phrase for the element, an index by the path for the element, an index to support word stems for the element, a keys index for path manipulation of the element, a soundex index for the element, and a synonym index for the element.
[0015] Consistent with the invention, a method for request and index management in a system for processing index and modification requests based on defined rules and conditions, including determining to process at least one
of the subsequent processing steps based on the defined rules and conditions of the system, the step of determining including: processing at least one modification request into new index files if at least one pending modification request exits and a number of index files in the system is less than a first threshold limit; merging a first size of index files into a first merged file if no modification request is pending or the number of index files in the system is larger than a second threshold limit; merging a second size of index files into a second merged file if the step of merging the first size of index files is not being processed and the number of index files is larger than second threshold limit; reclaiming freed memory space by deleting the merged first size and second size of index files; and optimizing the system by building additional index files if space has been reclaimed and optimization is required. [0016] Consistent with the invention, a method for processing an event trigger on semi-structured data includes parsing the event trigger into components; processing the components to determine at least one path used to invoke the event trigger; creating a segmented index based on each determined path; and if a modification is performed on the semi-structured data, determining at least one path affected by the request; determining if the path affected by the modification is in the segmented index; and notifying at least one party of the modification if the affected path is in the segmented index.
[0017] Consistent with the invention, a method for merging semi-structured data, the data including primary and secondary semi-structured record types as being subject to declarative requests, the method includes parsing a declarative
request into components; identifying primary and secondary semi-structured record types being requested and a least one relationship between the requested semi-structured record types using the parsed components; building a temporary index based on the identified secondary record types; and merging the primary record types with the secondary record types based the temporary index.
BRIEF DESCRIPTION OF THE DRAWINGS
[0018] The accompanying drawings, which are incorporated in, and constitute a part of, this specification illustrate exemplary embodiments of the invention, and together with the description, serve to explain the advantages and principles of the invention. In the drawings,
[0019] FIG. 1 A is block diagram of an exemplary system architecture in which methods and system consistent with the invention may be implemented;
[0020] FIG. 1 B is an internal block diagram of an exemplary computer system in which methods and system consistent with the invention may be implemented;
[0021] FIG. 2 is a flow diagram of a method for indexing semi-structured data;
[0022] FIG. 3 is a flow diagram of a method for querying semi-structured data;
[0023] FIG. 4A is a flow diagram of a method for evaluating a WHERE clause in FIG. 3;
[0024] FIG. 4B is a flow diagram of a method for evaluating group of expressions in FIG. 4A;
[0025] FIG. 4C is a flow diagram of a method for loading and reconstructing results in FIG. 3;
[0026] FIG. 5A is a flow diagram of a method for processing a query request by a query engine;
[0027] FIGS. 5B is a flow diagram of a method for an index engine to process queued index or modification requests;
[0028] FIG. 6 is an exemplary XML event trigger in which methods consistent with the invention may be implemented;
[0029] FIG. 7A is a flow diagram of a method for processing an XML event trigger;
[0030] FIG. 7B is a flow diagram of a method for executing an XML event trigger; and
[0031] FIG. 8 is a flow diagram of a method for merging semi-structured data.
DETAILED DESCRIPTION
[0032] Reference will now be made in detail to implementations of the present invention as illustrated in the accompanying drawings. The same reference numbers may be used throughout the drawings and the following description refer to the same or like parts. Implementations of indexing and searching of semi-structured data are described. Indexing and searching may be implemented for an element associated with a path within semi-structured data, such as, for example, Extensible Markup Language (XML) data. [0033] Nevertheless, implementations described herein may apply to other types of hierarchical, semi-structured data such as, for example, Hypertext Markup Language (HTML) data, Standard Markup Language (SGML) data, Wireless Markup Language (WML) data, or other like types of semi-structured data, consistent with the invention.
EXEMPLARY SYSTEM ARCHITECTURE Basic Architecture [0034] FIG. 1 A is block diagram of an exemplary system architecture 100 in which methods and system consistent with the invention may be implemented. System architecture 100 includes clients 104 and 106 connected to a query server 110 via a network 102. Query server 110 is connected to an index , server 112 and an XML repository 120. XML repository 120 stores XML data and index files consistent with the invention. In one embodiment, XML repository 120 is a database system including on or more storage devices. XML repository 120 may store other types of information such as, for example,
configuration or storage use information. Network 102 may be the Internet, a local area network (LAN), or a wide area network (WAN). System architecture 100 is suitable for use with the Java™, Python, C++, SQL™ programming languages, and other like programming languages.
[0035] Clients 104 and 106 include user interfaces such as, for example, a web browser 103 and client application 105, respectively, to send query or index requests to query engine 111 operating in query server 110. A query request is a search request for desired data in XML repository 120. An index request ("modification request") is a modification request to XML repository 120. In one embodiment, a modification request is used for an update, delete, or insert to XML repository 120. Clients 104 and 106 can send a query or modification request to query engine 111 of query server 110 using a standard protocols such as Hypertext Markup Transfer Protocol (HTTP) or Structured Query Language (SQL) protocol.
[0036] Query engine 111 determines whether a request from clients 104 or 106 is a query or modification request. Query engine 111 processes a query request from clients 104 and 106 by parsing the query request for execution of a search consistent with the invention. Query engine 111 may use index files in XML repository 120. Query engine 111 loads search results of records that match the query request and returns the results to requesting clients 104 or 106. In one embodiment, query engine 111 stores modification requests in an index request queue 115 ("modification request queue 115") for an index engine 113. Query engine 111 notifies index engine 113 of an inbound modification request placed in the modification request queue 115 via a
"notification" communication path. The notification path may be a direct connection or a logical connection path to send notification messages. [0037] Index engine 113 creates indices related to particular elements of XML data or documents consistent with the invention. In one embodiment, index engine 113 stores indices in index files consistent with the invention. Index engine 113 also loads and processes index requests from the index request queue 115. After processing an index request, index engine 113 updates' or creates indices in XML repository 120 related to an update, delete, or insert request. In one embodiment, index engine 113 periodically polls modification request queue 115 to determine if any new pending modification requests exist. Index engine 113 notifies query engine 111 of completed processed modification requests and of updates to indices in XML repository 120 such that query engine 111 can use updated indices for query searching in XML repository 120.
Exemplary Computer System [0038] FIG. 1B is an internal block diagram of an exemplary computer system 150 in which methods and system consistent with the invention may be implemented. Computer system 150 may represent the internal components of the clients or servers of exemplary system architecture 100 in FIG. 1. In one embodiment, a query engine or an index engine, consistent with the invention, may be implemented in computer system 150. In another embodiment, both a query engine and an index engine may be implemented in computer system 150.
[0039] Computer system 150 includes several components all interconnected via a system bus 160. Bus 160 may be, for example, a bidirectional system bus that connects the components of computer system 150. For example, bus 160 may contain thirty-two address lines for addressing a memory 165 and thirty-two bit data lines for transferring data among the components. Alternatively, multiplexed data/address lines may be used instead of separate data and address lines. Computer system 150 may communicate with other computing systems on network 102 via network interface 185, examples of which include Ethernet or dial-up telephone connections. [0040] Computer system 150 contains a central processing unit (CPU) 155 connected to a memory 165. CPU 155 may be a microprocessor such as the Pentium® family microprocessors manufactured by Intel Corporation. However, any other suitable microprocessor, micro-, mini-, or mainframe computer, may be used. Memory 165 may include a random access memory (RAM), a readonly memory (ROM), a video memory, or mass storage. The mass storage may include both fixed and removable media (e.g., magnetic, optical, or magnetic optical storage systems or other available mass storage technology). [0041] Memory 165 may contain a program, an application programming interface (API), and other instructions for performing the methods consistent with the invention. The query engine 111 and index engine 113 may be implemented as software programs in memory 165 executed by CPU 155. In one embodiment, query engine 111 and index engine 113 are computer programs suitable for the Python programming language.
[0042] Computer system 150 may also receive input via input/output (I/O) devices 170, which may include a keyboard, pointing device, or other like input devices. Computer system 150 may also present information and interfaces via display 180.
XML INDEXING AND FILE MANAGEMENT [0043] FIG. 2 is a flow diagram of a method for indexing semi-structured data. The method provides indices for flexible path searching of semi- structured data. For purposes of explanation, the method refers to the following exemplary XML document.
(Exemplary XML Document)
<person>
<userjd>gkill2</user_id> <pass_hash>92818181728</pass_hash> <soc_sec_no>555-555-5555</soc_sec__no> <name> <first>gill</first> <last>killroy</last> </name>
<email>gill@killroy.com</email> <phone> <office>(408) 364-1741 </office> </phone>
<interest>programming</interest> <interest> sky diving & bungee jumping</interest> <skill>
<programming>
<languages> java</languages> <languages> perl</languages> <languages> c</languages> </programming
<architecture>ejb</architecture> <skill> </person>
[0044] The above exemplary XML document represents exemplary semi- structured data. The exemplary XML document includes a person/user's profile
with elements or sections describing the person's name, email, phone number, interest, skill, and etc. The exemplary XML document includes three nested levels (person > skill > programming) and includes fields with data types. The method does not require XML documents to conform to any particular schema or format. In one embodiment, the method builds index files in an ASCII
Contiguous Index Sequential Access Method (CISAM) format.
[0045] At stage 202, a keyword or phrase index file is built by performing an inverted tree algorithm on one or more elements in the XML document.
Appropriate inverted tree algorithms may be found in Donald Knuth, The Art of
Computer Programming Sorting and Searching, (1998) ("the Knuth
Reference"). Each element is associated with a path. For example, the element "sky diving & bungee jumping" is associated with a path
"person. interest." The keyword or phrase index file stores indices of records to keywords or phrases for each element within the XML document. Each index relates to a word or phrase that occurs in an element of a structure within the
XML document. For the exemplary XML document, using phrases "sky diving" or "bungee jumping" the element <interest> sky diving & bungee jumping
</interest> would translate into the following indices for the keyword or phrase index file. bungeeΛperson.interestA0000000126 divingAperson.interestA0000000126 jumpingΛperson.interestA0000000126 skyAperson.interest 0000000126
where the first entry "bungee" occurs in the path "person. interest" of record "000000126" that represents the ID of the record ("record ID") containing the keyword "bungee."
[0046] In one embodiment, an entire phrase that occurs in a given element can be indexed, e.g., "sky diving & bungee jumping" and stored in the keyword and phrase index file. In another embodiment, to improve query searching efficiency, if the size of the phrase is larger than a pre-defined threshold (e.g., 1 K), the phrase can be hashed to store a hash code, which may be used in place of the phrase. In this case, a hash code to phrase mapping is maintained in a separate index file.
[0047] The keyword and phrase index file may store individual words and its associated paths, e.g., .bungeeAperson. interest. This allows for advanced parametric and step searching capabilities. In particular, such indices can be used to find all paths where the word "bungee" occurs. Additionally, such indices are useful for determining the actual structure where certain words or phrases occur.
[0048] At stage 204, an index by path file is built by performing btree algorithms on one or more elements in the XML document. Appropriate btree algorithms may be found in the Knuth reference. The index by path file stores indices of records based on the path for each element within the XML document. Each index refers to the path where the element is found in the XML document. For the exemplary XML document, indices by path for elements in the XML document would translate into the following index by path entries.
person. emailAgill@killroy. com 0000000126 person. interestAprogrammingA0000000126 person.inierestAsky diving & bungee jumpingAOOOOOO0126 person, name. first gillAO0000O0126 person.name.lastAkillroy 0000000126 person. pass_hashA000000092818181728 0000000126 person.phone.officeA(408)364-1741A0000000126 person.skill. architectureAejbA0000000126 person.skill. programming. languagesAc 0000000126 person.skill. programming. languagesAja va A0000000126 person.skill. programming. languagesAperl 0000000126 person. soc_sec_noA555-555-5555A0000000126 person. user_idAgkill2 0000000126
[0049] Taking the first entry as an example, an index to "gill@killroy.com" is designated by its path "person. email" for the record "0000000126." An index by path provides useful indices to search for elements within hierarchical data structure such an XML document that is changing and evolving. As will be described below, a query request may use special semantics for a query search, e.g., SELECT WHERE person*skill CONTAINS "Java". This query looks for the text string "Java" at the path "person*skill" where " * " indicates that there can be zero or more undetermined nested levels between the person and the skill attribute. Thus, in order to resolve variable path queries, an index by path is useful for semi-structured data.
[0050] The index by path also indicates the occurrence of a specific path for a given word or phrase. In one embodiment, repeating elements with different values (e.g., <languages>) are represented individually in the index by path file.
In one embodiment, entries that are similar, e.g., "bungee", are stored in the index by path file to enable advanced step and parametric query searches.
[0051] At stage 206, an index file to support word stems is built by performing word stemming algorithms for one or more elements in the XML document. Appropriate word stemming algorithms may be found in Martin
Porter, Readings in Information Retrieval, (1997). This index file stores indices that refer to records containing variations of keywords or phrases related to an element of the XML document (e.g., rates - rate). One or more elements may or may not have a corresponding index to a word stem record. A stemming algorithm can be applied to determine if a keyword or phrase occurs in the beginning, end, or in the middle of the actual word thus providing very advanced query capabilities, e.g., select * where person* skill CONTAINS "Java" may match "javascript", "hotjava", "Java" and other variations of the word Java that occur in the <person> .... <skill> section.
[0052] At stage 208, a keys index file is built for path manipulation by maintaining a stack of previously processed paths on one or more elements in the XML document. The keys index file stores keys referring to resolved paths for a particular element within an XML document. For the above example XML document, the keys for resolved paths of the element "person" are provided below. person person. email person. interest person. name person. name. first person. name.last person. pass_hash person.phone
person. phone. office person.skill person.skill. architecture person.skill. programming person.skill.programming. languages person. soc_sec_no person.userjd
[0053] The keys index file is useful to resolve query searches such as, for example, retrieving all elements that contain a certain word, retrieving all elements that begin with a certain word, retrieving all elements that are children of a certain element within a given path by using the key for the given path.
[0054] At stage 210, a soundex index file is built by performing soundex algorithms on one or more elements in the XML document. Appropriate soundex algorithms may found in the Knuth reference. A soundex index file stores indices to records having similar sounding words of one or more elements in an XML document. For example, the query command SELECT *
WHERE person*name like Javier" using a soundex index file may return records for a person having a name Xavier, Javier, Havier, and similar sounding variations. At stage 212, a synonym index file is built by performing a synonym algorithm on one or more elements in the XML document. Appropriate synonym algorithms may be found in the Knuth reference. A synonym index file stores indices of records having synonyms to the element in the XML document.
[0055] At stage 214, an identification (ID) file is built to maintain the records of the index files. In one embodiment, the ID file is a component of an XML repository. The ID file stores record IDs and all paths for that record ID. Thus, the ID file represents a master record file for the complete set of elements in an
XML document. If an index has been used to resolve a matching record, the complete record is loaded from the ID file and reconstructed before delivery to a client.
[0056] The ID file may include meta data information about an XML repository or database (e.g., number of fields/elements within a given document/record, etc.) This information is useful for optimized query searches and for building alternative query execution plans. The ID file may also include an array index that lists a plurality of values for an element, which are represented with their literal array path. For example, person. skill. programming. languages has more than one entry and can be distinguished using an array index ".1" as shown below in example entries of an ID file for the exemplary XML document.
0000000126Ano_fieldsA13
0000000126Aperson.emailAperson. email . Agill@killroy. com
0000000126Aperson. interestAperson. interest.1 A.Asky diving \u0026 bungee jumping
0000000126Aperson. interest person. interest A. programming
00000001 '26 <Aperson. name. firstAperson. name. firstA.Agill
0000000126Aperson.name.last person.name.last .Akillroy
0000000126Aperson.pass_hash Aperson.pass_hash .A92818181728
0000000126Aperson.phone.officeAperson.phone.officeA. (408)364-1741
0000000126Aperson.skill.architectureAperson.skill.architectureA.Aejb
0000000126Aperson. skill. programming. languages person. skill, programming, la nguages.1A.Aperl
0000000126Aperson.skill.programming.languagesAperson.skill. programming. la nguages.2A.Ac
0000000126Aperson. skill, programming. languagesAperson.skill. programming, la nguages .Ajava
0000000126Aperson.soc_sec_noAperson.soc_sec_noA. A555-555-5555
0000000126Aperson. user_idAperson. userJdA. Agkill2
[0057] The above exemplary index file representations can be used for storing and searching of other types of structured or semi-structured data such
as HTML data. Furthermore, index files can be compressed for memory optimization.
QUERY PROCESSING OF SEMI-STRUCTURED XML CONTENT [0058] FIG. 3 is a flow diagram of a method for querying semi-structured data consistent with the invention. At stage 302, a query request is received. In one embodiment, the query request is provided in the form of SQL type statements. In other embodiments, a query request may be provided in other types of query command formats. The request may include a "dynamic query request" or a "server-side query request." Exemplary dynamic and server-side query requests are illustrated below for querying semi-structured data such as XML data. The requests include keywords such as "SELECT" and "WHERE." The "SELECT" keyword sets forth the substructures or path for the query search. The "WHERE" keyword sets forth the specific text string(s) or data for the desired search. [0059]
(Dynamic Query Request)
<xml_select>
SELECT person*name, person*skill
WHERE person*skill contains "Java" </xml_select>
(Server-side Query Request)
<xsql:query connection- 'demo" mlns:xsql="um:pybiz- xsql">
SELECT person. name, person*skill
WHERE person*skill contains "{Java}" </xsql:query>
[0060] In the above examples, unique extensions are used such as, for example, " * " and " . " in the query command format. Other symbols may also be used for extensions. The extensions indicate substructures/semi-structures for the specific type of information requested, e.g., SELECT person*skill
WHERE person*skill contains "Java," in XML data. The extensions allow for specific queries to any number of paths within XML data using any number of types of query command statements.
[0061] At stage 304, the query request is parsed into query elements based on keywords or clauses in the query request such as, for example, based on the "SELECT" clause and the "WHERE" clause. The elements related to the
SELECT clause are parsed into its components, e.g., (person*skill, person. name). The WHERE clause is parsed into components based on expressions and operators within the expressions.
[0062] An expression includes operators such as, for example,
"CONTAINS", "EQ", "BETWEEN", "LIKE", "AND", "OR." For each operator, there are two components - one for the left and right side (LHS), (RHS) of the operator. For example, the above example expression (person*skill contains
"Java") includes the operator CONTAINS having a LHS component
(person*skill) and a RHS component ("Java"). A WHERE clause may be associated with any number of expressions. An example model of a WHERE clause having a number of expressions is illustrated below.
(where
Expression 1 = LHS operator RHS (e.g., person*skill contains "Java") i. AND ii. Expression2 = LHS operator RHS v. OR v. Expression 3 = LHS operator RHS)
[0063] As shown in the above example, the WHERE clause is associated with three expressions (Expressions 1-3) connected by AND and OR logical connectors for specific search requirements.
[0064] At stage 306, a check is made to determine if there is a "WHERE" clause. If there is no "WHERE" clause, at stage 307, all the record IDs associated with "SELECT" clause are retrieved from a master ID file. Although not shown, from stage 307, the method can continue to stage 310. At stage 308, if there is a WHERE clause, the where expressions are evaluated in a manner described below to obtain the desired search results. The search results include all or parts of the record match the request. At stage 310, the results are loaded and reconstructed for the query request as described below with regard to FIG. 4C. At stage 312, the results are returned or delivered to the requesting client(s).
[0065] FIG. 4A is a flow diagram of a method consistent with the invention for evaluating a WHERE clause at stage 308 in FIG. 3. At stage 402, the various WHERE clause expressions connected by "OR" are grouped. For example, in the above example regarding FIG. 3, expressions (i) and (ii) would be grouped together while expression (iii) would be in a separate group. At stage 404, each expression in the group that are connected by "AND" is evaluated as described below. At stage 406, a union of record IDs is performed if multiple groups were separated by "OR" to obtain a complete set of record IDs.
[0066] FIG. 4B is a flow diagram of a method consistent with the invention for evaluating each expression of the groups at stage 404 of FIG. 4A. At stage 408, each expression is tokenized or parsed into components such as {LHS (left handside), operator (e.g. LIKE, CONTAINS, EQ, BETWEEN, etc.), RHS (right hand side)}. At stage 410, the various paths are resolved for the LHS(e.g., person*skill can be based on the input data). The following are exemplary matching paths:
1. <person><skiII>;
2. <person><business><skill>;
3. <person><technical><skill>; and
4. <person><technical><software><programming> <skill>.
[0067] At stage 412, an index/search method is determined based on the type of operator. For example, if the operator is "LIKE," a soundex index can be used to perform a soundex method consistent with the invention. If the operator is "EQ," an index by keyword or phrase can be used to perform a keyword search. Any combination of indices may be used for the query request as described above consistent with the invention. At stage 414, the record IDs that match one of the resolved paths and satisfy the current expression in the WHERE clause are loaded. At stage 416, if the search fails for the expression, the method exits the expression. At stage 418, all record IDs that do not exist in the current expression are removed.
[0068] FIG. 4C is a flow diagram of a method consistent with the invention for loading and reconstructing results of FIG. 3. In one embodiment, the
following method is iterated for each record ID in the matching result set. At stage 422, all the fields/ elements associated with the record ID (e.g., from an
ID file) are loaded. At stage 424, the SELECT clause is resolved to determine the substructures that need to be projected in the returned result (e.g. select person*id, person. name) would translate into:
<person><id_section_if_exists><id></id></id_section_if_exists><name></ name>
[0069] At stage 426, a projection is performed from the set of elements and fragments obtained at stage 422 and the results are filtered that match the above expressions. The record IDs that match the expressions can be stored in another index file for later use.
RULES BASED REQUEST PROCESSING [0070] FIG. 5A is a flow diagram of a method for a query engine to process a query request. At stage 502, a request is received from a client application. At stage 504, the type of the request is determined, i.e., is it a query request or an index request (insert, update, or delete). Exemplary index requests for an insert, update, and delete are illustrated below.
(Insert)
<xml_insert>
<!-- A set of records is to be added for this insert ->
<record> <person> <userjd>gkill2</user_id> <passjιash>92818181728</pass_hash> <name> <first>gill</first>
<last>killroy</last> </name>
<interest>programming</interest> </person> </record>
<record> <service> <id>pjourn</id> <admin>jaky98</admin> <uri>http://programmersJoumal</uri> <topioprogramming</topic> </service> </record> </xml insert>
(Update)
<xml_update> <update> <cmd> UPDATE MERGE WHERE person.name.first EQ "gill" AND person. name. last EQ "killroy"; </cmd> <record> <person> <contact>
<email>gill@killroy.com</email> <email>gkillroy@hotmail.com</email> <phone> <office>(408)364-1741 </office> <home>(650)447-1990</home> <cell>(980)652-9822</cell> </phone> <home> <address>3386 valley corp way</address> <city>San Jose</city> <state>California</state> <zip>95117</zip> </home> </contact> </person> </record> </update> </xml_update>
(Delete)
xml_delete>
<delete> DELETE WHERE person.contact-info.name.first CONTAINS "gr";
</delete> </xml delete>
[0071] At stage 506, if the request is a query request, a search is made through all index files and associated index extent files to process the query request. The results are returned and delivered to the client. In one embodiment, a query request is given priority over a modification request and processed before a modification request. The query requests can be processed using the methods described above with respect to FIGS. 3, 4A, 4B, and 4C. At stage 508, if the request is a modification request, i.e., an insert, update, or delete, the query engine queues the modification requests in modification request queue for an index engine.
[0072] FIGS. 5B is a flow diagram of a method for an index engine to process queued modification requests. The method is used to maintain and optimize the indices while there are query requests to process based on defined rules and conditions in a system. In one embodiment, new rules can bee added and order the changed. The following method is an exemplary embodiment to perform rules based processing of queued modification requests. Exemplary rules are defined at stages 526, 530, 534, 538, and 542 (i.e., the conditions to perform those stages).
[0073] At stage 524, a determination is made if there is a need for a next step where the next step will be defined based on defined rules and the current state of a system. At stage 526, a first rule is evaluated to determine if there
are modification requests present in the modification request queue and the size of the index files are less than a small merge threshold (i.e., the number index files that is considered a "small number" of index files. At stage 528, if the first rule is met, the modification request is processed and a set of new index files are generated to represent the modification request. Once the request has been processed, the method returns to stage 524. At stage 530, if the first rule is not met, a second rule is evaluated to determine if there are no pending modification requests or the size of the index files is greater than the small merge threshold.
[0074] At stage 523, if the second rule is met, a small merge of index files is performed resulting from stage 528. The small index files are merged into larger index files to improve query performance. The method continues back to stage 524. At stage 534, if the second rule is not met, a third rule is evaluated to determine if the number of index files is greater than a large merge threshold limit. At stage 530, if the third rule is met, the small index files generated by the small merge at stage 532 are merged into larger index files. In one embodiment, the large merge is performed on hard disk. The method continues back to stage 524.
[0075] At stage 538, if the third rule is not met, a fourth rule is evaluated to determine if the small merge, large merge, and pending requests are completed. At stage 540, if the fourth rule is met, unused memory space is reclaimed by deleting the index files that were merged into larger index files. At stage 542, if the fourth rule is not met, a fifth rule is evaluated to determine if optimization of the indices is required. At stage 544, if the fifth rule is met, an
optimization process is performed by building secondary indices (e.g., bitmap index, segmented index). At stage 546, if the fifth rule is not met, the system is quiescent and is in silent mode performing no activity. The method then continues to stage 524. It should be noted that additional rules may be defined and existing rules modified to implement the method.
XML EVENT TRIGGER
[0076] FIG. 6 is an exemplary XML event trigger. An XML event trigger is a mechanism to notify interested parties if a specific change occurs to an XML document. The XML event trigger allows for notifications regarding specific changes to a substructure in an XML document. The following define nomenclature for the XML event trigger.
[0077] The XML event trigger definition includes a "TRIGGER" section 602, a "PERFORM" section 604, and "NOTIFY" sections 606 and 608. The XML event trigger, consistent with the present invention, requires certain syntax rules. For example, commas are used to separate key value attribute specifications in the various sections. In addition, the ordering of the keywords and sections, as described below, is required. The "TRIGGER" section 602 of
FIG. 6 is reproduced below.
TRIGGER AFTER INSERT at_risk_sales_over_50K WHEN service_request.sales__manager EQ "John" AND service_request.company_id EQ "ZIPPY"
[0078] TRIGGER section 602 , the "TRIGGER" keyword is reserved to specify an XML event trigger. The keywords "AFTER INSERT" indicate the type of modification, i.e., INSERT, to evaluate the XML event trigger.
Specifically, in the above section, the XML trigger is evaluated after an INSERT modification. Other keywords may be used such as, for example, "AFTER UPDATE," "AFTER DELETE," BEFORE INSERT," BEFORE UPDATE," BEFORE DELETE," to specify other types of conditions for evaluating the XML event trigger. Following the keywords "AFTER INSERT" is the trigger name, which in the above example is "at_risk_sales_over_50K." In one embodiment, the trigger name is globally unique for all triggers used in a system. [0079] The "WHEN" keyword specifies a clause that sets forth the conditions to invoke the event trigger. In the above example, the event trigger is invoked WHEN the substructure "service_request._sales_manger" EQ "john" AND the substructure "service_request.company_id" EQ "ZIPPY" are true. In the "true" case, the operations specified in the PERFORM section 604 and NOTIFY sections 606 and 608 are performed. If the WHEN clause is not satisfied or not true, then no action is performed for the XML event trigger. Thus, if a change occurs to an XML document, the WHEN clause of an XML event trigger is checked to determine if the operations in the PERFORM section 604 and NOTIFY sections 606 and 608 are to be performed. [0080] PERFORM section 604 of FIG. 6 is reproduced below.
PERFORM get_open_bids_over_one_rnillion.xsql AS PR2 company_id = service_request.company_id, service_id = service_request.id, format = "x1924"
[0081] PERFORM section 604 is optional for an XML event trigger. If a PERFORM section 604 is present, the PERFORM section must occur after the WHEN clause and before and NOTIFY clauses. The "PERFORM" executes an
xsql file "get_open_bids_over_one_million.xsql." Executing an xsql file performs a query that returns a set of XML records, which are sent to parties specified in the NOTIFY sections. The PERFORM transforms results from above into a default data structure as "d.result.1" where .1 indicates the results of the first perform that occurs in the XML trigger event and .2 would be the results of the second perform and so on.
[0082] The default data structure for a PERFORM can be an XML structure or like semi-structured data such as HTML used in the NOTIFY sections. If there are no results, subsequent PERFORM or NOTIFY sections that attempt to reference this result will be skipped. For example, if AS is used to set the results to d.pr2 and a subsequent NOTIFY specifies d.pr.2 as a data parameter that NOTIFY would be skipped. If the AS is specified, the results will not be the default input to the notify clauses as specified below. Attributes that are not reserve words in the PERFORM section 604 are passed as query parameters to a query enginer. The input to a PERFORM section is automatically the input data if the PERFORM section is in the first statement executed or the results of the previous PERFORM. This sequence can be overridden by using a data = d. input attribute. [0083] NOTIFY sections 606 and 608 of FIG. 6 are reproduced below.
NOTIFY type = smtp, mailto = "john@acme.com", from = "jack@acme.com", accout = "9998", server = "maple.he.net", user = "jim", pass = "J1974",
type = "smtp", data = d. input
NOTIFY type = http_post, uri = "http://acme.com/deals .at. .risk" account = "9998", data = = d.PR2
[0084] NOTIFY sections must occur after all PERFORM sections. NOTIFY sections dictate the manner in which to transmit a notice based on an XML event trigger. In one embodiment, a HTTP post or a SMTP email is used to provide the notice. The "type" attribute sets the notice type. The number and use of additional attributes varies depending on the desired type of delivery desired. For example, the HTTP post supports a content-type attribute which will be sent as part of the header for the post.
[0085] The input data for the NOTIFY sections is specified using the keyword "DATA." If there is no DATA attribute set, it will default to the output of the last perform that was not redirected using the AS clause or possibly input data if there have been no un-redirected PERFORM clauses executed. If the default or specified data parameter was the result of a previous perform and that perform did not find any records in its nested queries, this perform will be skipped.
[0086] The input data structure can be a sub portion of one of the query available data structures. For example, it could be data = d.input.person. address, which is a valid XML data structure. In one embodiment, attributes that are not considered reserved words for the selected
type of handler agent will be passed as query parameters wherever possible. Common input attributes are provided below.
Common Input Attributes
TYPE: This filed sets forth the type of delivery agent chosen for the notification.
DATA: This field sets forth the main data structure to be sent in the notification.
XSL: This field sets for the XSL template which the data structure is transformed into prior to sending it in the HTTP post. The XSL sheet may produce HTML, XML or other data types as allowed by the processor. row-set: This field sets forth a mechanism to wrap the data being sent in. For examples row-set = "orders" would allow an array of orders to be pulled out of one of the perform query sets and then wrap it in the <orders> tag to make it valid XML. Unless specified it is assumed that there is already a valid xml enclosure present. If there is a XSL template specified then this is setting is applied to the data element and the result is passed to the XSL template and the result of it is sent in the body of the notification.
HTTP Post Special Attributes
uri: This field sets forth the port numbers for sending the post proxy: This field sets for the proxy server to proxy the request content-type: This field sets forth the typical content type for delivery, e.g., text/xml.
SMTP Special Attributes
sendto: This field sets forth the email address of the recipient of the notification. from: This field sets for the email address of the sender. account: This field sets forth the user id to use to access the smtp server.
• server: This field sets forth the name or IP address of the smtp server to send the email to.
• pass: This field sets forth the sending users password on the smtp- server.
• data: This field sets forth the source of the data structure to send to the remote smtp server
[0087] In one embodiment, HTTP posts can deliver results back on their own and it is possible to add those results to the internal event data model using the AS keyword as detailed below. For example, the following NOTIFY section uses an AS keyword for a HTTP post type of notification.
NOTIFY AS NR1X type = httppost, uri = "http://acme.com/deals_at_risk", account = "9998"
[0088] In the above example, the AS clause places the result in d.NRIX as an element that can be drawn off of by name by subsequent notifications. In general, NOTIFY operations can be considered un-ordered in that they are dumped into a queue and, as the remote resources necessary to deliver the notification become available, the NOTIFY operations are delivered. However, when the AS clause is introduced in this context, it becomes necessary for the event notification engine to stop what it is doing, perform the post, and receive the results before continuing to process the subsequent NOTIFY operations.
[0089] The AS feature can be very powerful since it makes the XML event trigger effectively enabled for distributed gathering of data and interaction before sending the final notification to the recipient. That means that if some of the needed data is present, for example, a server which has XSQL installed in it becomes relatively easy to fetch that data as part of processing this XML trigger
event. If the content-type of the data from the notify action is text xml or xml/xml then it will be parsed stored in the as parameter in a form that allows access to sub components.
[0090] FIG. 7A is a flow diagram of a method for processing an XML event trigger. At stage 702, a trigger definition is parsed into components as explained in the TRIGGER section above. At stage 704, the various paths of the "WHEN" clause are determined. In the exemplary XML event trigger described above, the paths are "service_request.sales_manager" and "service_request.company_id."
[0091] At stage 706, a segmented index entry for determined paths is created. The segmented index refers to the path, the word or phrase being referenced at that path, and ID of the trigger that contains the path, e.g., the path definition in the trigger is service_request.sales_manage EQ "John" would translate into the index entry of
"service_request.sales_manaerΛjohnΛ000912345. Such an indexing scheme allows for a large number of triggers that can be evaluated at high-speeds. [0092] FIG. 7B is a flow diagram of a method for executing an XML event trigger. At stage 708, paths that have changed as a result of an index or modification request are determined. For example, the paths are determined that will change based on the modification request - UPDATE service_request.sales_manager = "Michael" WHERE service_request.sales_manager EQ "John." This request will change the value at the service_request.sales_manager section.
[0093] At stage 710, using the trigger index previously explained, a search is performed and determine whether there are any triggers associated with the above determined paths. If there are no triggers, the method ends. If a trigger is found, load the trigger definition at stage 712 if any triggers were found matching the determined path. At stage 714, a determination is made if the input data structure of the loaded trigger definition matches the "WHEN" clause.
If there is no match the method ends. If there is a match, at stage 716, the
PERFORM clauses, if any, are executed. At stage 718, the NOTIFY clauses are executed.
RELATIONAL MERGE OF XML/HIERARCHICAL DATA STRUCTURES
[0094] FIG. 8 is flow diagram of a method for merging semi-structured data.
The method allows a user to merge XML records into a separate, merged XML record for a high-speed query search. In one embodiment, the method provides a declarative approach to specify a relational merge or join of XML records. The following is an exemplary query command to specify a relational merge or join to assist in explaining the method.
SELECT person* ALIAS A
WHERE person/Zname// CONTAINS "JOE"
MERGE MANDATORY(
SELECT * AS $A//birthday/wishlist ALIAS B
WHERE //category CONTAINS $A//interest) MERGE OPTlONAL(
SELECT * AS $A//birthday/wishlist ALIAS C
WHERE *manufact* EQ "redwing"
AND //type EQ "hiking"
AND //size EQ "10"
AND //class EQ $A//interest)
[0095] The above query command is to retrieve/create a birthday wishlist for a person called "joe" by merging information from three different XML document types (person, book, boot). Specifically, the method retrieves all person records where the person's name section contains "joe". A birthday wish list section is added to the retrieved person records. The birthday wishlist can be created by matching the person's interests against any section called "category" defined in any other documents in an XML repository. Alternatively, the birthday wishlist can contain all data found by matching the person's interests to any section in any document called "class," where the manufacturer is "redwing", having attributes type of "hiking", size of "10" and class of "$A". [0096] The command may use the following syntax rules. The identifier "//" is used as a global or wildcard identifier. The identifier " is used to indicate the ending of one element and the beginning of the another element. The parenthesis after the keywords "OPTIONAL", MANDATORY", "MERGE" are mandatory. The use of "$A" or an variable having "$" used in the context of a WHERE clause is used to pull that value from a previously resolved master record and substitutes its value into the WHERE clause. In one embodiment, only scalar items or simple arrays can be specified.
[0097] If the named alias resolves to an array set then each value from the array will be interpolated. For example, in the above exemplary query command, $A is bound to the previously defined Alias A and indicates that the result of the nested SELECT will be merged into and will become part of the record originally named by the ALIAS A.
[0098] The following are sample data records to explain a relational join method. If there are more than one records matching the join criteria, they will be returned as array elements in the output/container object (temporary storage data structure). This ability to create an array of return values by merging data from multiple records is useful for joining or merging hierarchical data structures.
SAMPLE DATA
Record #R1
<person> <name>
<first>Joe</first>
<last>Ellsworth</last> </name>
<birthday><day>sept 23</dayx/birthday> <interest>web script</interest> <interest>java</interest> <interest>shoe</interest> </person>
Record #R2
<book>
<title>Perl programming for beginners</title>
<price>99.85</price>
<category>web script programming</category> </book>
Record #R3
<book>
<title>Java programming for beginners</title> <author>Jack Smith</author> <publisher>Redmond</publisher> <category>java</java>
<category>beginners programming</category> </book>
Record #R4 <boot>
<class>shoe</class>
<manufact>redwing</manufact> <type>hiking</type> <size>10</size> <vendor>wallmart</vendor> </boot>
Resulting Record <person> <name>
<f i rst> J oe</f i rst> <last>Ellsworth</last> </name> <birthday>
<day>sept 23</day> <wishlist> <book>
<title>Perl programming for beginners</title> <price>99.85</price>
<category>web script programming</category> </book> <book>
<title>Java programming for beginners</title> <author>Jack Smith</author> <publisher>Redmond</publisher> <category>java</java>
<category>beginners programming</category> </book> <boot>
<class>shoe</class> <manufact>redwing</manufact> <type>hiking</type> <size>10</size> <vendor>wallmart</vendor> </boot> </wishlist> </birthday>
<interest>web script</interest> <interest>java</java> </person>
[0099] The relational join method consistent with the invention will now be explained with regards to the above sample data. At stage, 802 the query request is parsed to generate a standard parse tree. At stage 804, the different type of discrete records are analyzed by analyzing the parse tree. In one
embodiment, the type of records is determined by Aliases. For the above example, a record Type 1 is identified by the alias A as a master record (select person* ALIAS A where person/Zname// CONTAINS "JOE'). A record Type 2 is identified by the alias B as a merge type record (select *As $A//birthday/wishlist ALIAS B WHERE //category contains $A//category). A record Type 3 is identified by the alias C as a merge type record (select *AS $A//birthday/wishlist ALIAS C ...).
[00100] At stage 805, the relations and merge criteria are identified between various requested records. Regarding the above sample data, the mapping filed for Record Type 1 is person/Zinterest. The criteria for mapping Record Type 1 to Record Type 2 is //category contains person/Zinterest. The criteria for mapping Record Type 1 to Record Type 3 is //class EQ //interest. [00101] At stage 806, the primary record type is identified. The primary record type is the base record type into which the secondary records are merged to generate the final output record. In one embodiment, the primary record type is identified based on the cardinality of data for each record type. For example, the record type with the lowest cardinality is chosen as the primary record type. In the above example data set, assume Record Type 1 has 10 potential values, Record Type 2 has 100,000 potential values, and Record Type 3 has 200,000 potential values. The primary record type will be Record Type 1 because it has the lowest cardinality, i.e., 10. [00102] At stage 808, the primary records are loaded. In one embodiment, all records are searched for primary record types that matches the specified WHERE clause (e.g., SELECT person * WHERE person//name// contains
"joe"). A primary result set out of these records is built. These records can be stored as the result set in a temporary file in memory.
[00103] At stage 810, merge specifications for a merge are built by evaluating each primary record at stage 808. A merge search is built based on the merge criteria defined at stages 802 and 804. The values for each record are extracted into a match set based on the mapping built at stage 804. In the above example, #R1 all possible values are extracted from the field "person/interest" ( "web script", "Java", "shoe"). Each value is added to the merge search. The result is a complete set of discrete search values which represents a logical join of those data values across the entire set, e.g., {'person/interest' : 'recs' : {{'web_script' : [#R1], 'Java' : [#R1], 'shoe' : ['#R1 ']}, match : [ {'op' : 'contains', 'path' : '//class'}, {'op' : 'contains', 'path' : '//category'}]}. It may be infeasible to keep the resulting set or the entire set of master records in memory so the filter here is applied one at a time as the master records are pulled and evaluated by the search filter. [00104] At stage 812, a super set is built of all secondary records that match any of the merge criteria for primary records by evaluating each of the merge search specification built at stage 806. For example, record #R1 - it translates into the following specific searches:
i. //category CONTAINS 'web script'
ii. //category CONTAINS 'Java'
iii. //category CONTAINS 'shoe'
iv. //class CONTAINS 'web script'
v. //class CONTAINS 'Java'
vi. //class CONTAINS 'shoe'
[00105] Each secondary record is stored in a set of secondary result set files as temp files on disk or in memory. The secondary result sets are stored by search phrase. In the above example, there is a result set for each one of the searches for stages 802 through 808. A query optimization can be performed to build the secondary result sets by evaluating the other WHERE clause criteria (e.g. type EQ "hiking" and size EQ "10" and manufact EQ "redwing") and determining whether that returns fewer records. The objective is to build the smallest possible secondary result set, since the cost in execution time and memory of merging the primary and secondary result set can be very high. [00106] At stage 814, the primary and secondary records are merged. Each primary record at stage 806 is evaluated. That is, the extract / Identify merge criteria (e.g. //category and //class) and candidate values (e.g. web script, Java, shoe) are determined from the primary record. The intermediate structure/place in the output record is then created to store the merged records ( e.g. person/birthday/wishlist). Auxiliary records which match the merge criteria and its candidate values identified above are located and retrieved from the secondary result set. These records are evaluates with WHERE clause (if any) associated with the auxiliary record (e.g. where type EQ "hiking" AND size EQ "10" AND *manufact* EQ "redwing". Filter out any auxiliary records which do not satisfy the where clause (refer to the SQL processing section for an explanation of how the where clause gets evaluated are then filtered out ). Output result structures are constructed by inserting the new records into the
merge record. If all previous filter criteria, merge criteria and where clause evaluations were successful, then add this record to the output result set. [00107] As described in detail above, methods and system consistent with the invention provide a query engine and index engine to provide flexible indexing and high-speed searching of semi-structured data such as XML. The foregoing description of an implementation of the invention has been presented for purposes of illustration and description. Modifications and variations are possible in light of the above description or may be acquired from practicing the invention. For example, the foregoing description is based on XML data, however, other types of semi-structured data such as HTML may be used to implement the invention. Furthermore, the foregoing description is based on a client-server architecture, but other types of architectures may be employed such as a peer-to-peer architecture consistent with the invention. Moreover, although the described implementations include software, the invention may be implemented as a combination of hardware and software or in hardware alone. Additionally, although aspects of the invention are described as being stored in memory, other types of computer-readable media, such as secondary storage devices, like hard disks, floppy disks, or compact disc ROM (CD-ROM); a carrier wave from the Internet; or other forms of RAM or ROM. The scope of the invention is therefore defined by the claims and their equivalents.