WO2010085523A1 - Storage of a graph - Google Patents
- Publication number: WO2010085523A1
- Authority: WIPO (PCT)
- Legal status: Ceased
- Classification: G06T11/26
Definitions
- the invention relates to the organization and use of information. More particularly, the invention relates to a graph store.
- the Web also affords a new form of communication. Those who grew up with hypertext, or have otherwise become accustomed to it, find the linear arrangement of textbooks and articles confining and inconvenient. In this respect, the Web is clearly better than conventional text.
- the invention addresses the problem of providing a system that has a very large, e.g. multi-gigabyte, database of knowledge to a very large number of diverse users, which include both human beings and automated processes.
- Managing a very large database is one of them.
- Connecting related data objects is another.
- Providing a mechanism for creating and retrieving metadata about a data object is a third.
- various approaches have been used to solve different parts of this problem.
- the World Wide Web for example, is an attempt to provide a very large database to a very large number of users. However, it fails to provide reliability or data security, and provides only a limited amount of metadata, and only in some cases.
- Large relational database systems tackle the problem of reliability and security very well, but are lacking in the ability to support diverse data and diverse users, as well as in metadata support.
- the ideal system should permit the diverse databases that exist today to continue to function, while supporting the development of new data. It should permit a large, diverse set of users to access this data, and to annotate it and otherwise add to it through various types of metadata. Users should be able to obtain a view of the data that is complete, comprehensive, valid, and enhanced based on the metadata.
- the system should support data integrity, redundancy, availability, scalability, ease of use, personalization, feedback, controlled access, and multiple data formats.
- the system must accommodate diverse data and diverse metadata, in addition to diverse user types.
- the access control system must be sufficiently flexible to give different users access to different portions of the database, with distributed management of the access control. Flexible administration must allow portions of the database to be maintained independently, and must allow for new features to be added to the system as it grows.
- a collaborative database must, of necessity, support the creation or modification of schema long after data have been entered. While the relational model is quite general, current implementations map tables more-or-less directly into btree-based storage. This structure yields optimal performance but renders applications quite brittle. This invention supports a 'Schema Last' approach.
- a conventional table-of-tuples implementation is problematic, even on a modern column store.
- the starting point, a table of tuples and indexes with compound keys which are permutations of subject-predicate-object, is well studied and subject to obvious limitations of index size and self-join performance.
- Attempting to optimize an existing relational store for this tuple access pattern, while possible, is burdened both by compatibility with a relational model that is far more general than needed, and by an SQL interface in which it is difficult to say what is really meant.
- a new database design is implemented in which everything in the database is modeled with primitives, including the links and nodes for a graph store.
- a query syntax provides a nested tree of constraints with a single global schema.
- An aspect of the invention concerns the process of flushing the log preemptively as well as replicating the database.
- Various optimization techniques for queries are also described.
- Figures 1A and 1B are block schematic diagrams showing a graph using primitives according to the invention.
- Figure 2 is a block schematic diagram showing a graph built from the primitive model according to the invention.
- Figure 3 is a block schematic diagram showing a query model according to the invention.
- Figure 4 is a block schematic diagram showing relaxed ACID according to the invention.
- Figure 5 is a further block schematic diagram showing relaxed ACID according to the invention.
- Figure 6 is a block schematic diagram showing query optimization according to the invention.
- Figures 7A and 7B are block schematic diagrams showing replication according to the invention.
- Figures 8A and 8B are block schematic diagrams showing virtual time according to the invention.
- Figure 9 is a block schematic diagram showing replication and virtual time according to the invention.
- Figure 10 is a block schematic diagram showing index flushing according to the invention.
- Figure 11 is a block schematic diagram showing versioning according to the invention.
- Figure 12 is a block schematic diagram showing versioning according to the invention.
- Figure 13 is a further block schematic diagram showing versioning according to the invention.
- Figure 14 is a further block schematic diagram showing versioning according to the invention.
- Figures 15A and 15B are block schematic diagrams showing iterators according to the invention.
- Figures 16A and 16B are block schematic diagrams showing budget according to the invention.
- Figure 17 is a block schematic diagram showing transforming of sets according to the invention.
- Figures 18A-18C are block schematic diagrams showing freezing and thawing according to the invention.
- Figures 19A and 19B are block schematic diagrams showing cursors according to the invention.
- Figure 20 is a block schematic diagram showing suspended cursors according to the invention.
- Figure 21 is a block schematic diagram showing cursor paging according to the invention.
- Figure 22 is a block schematic diagram showing cursors and virtual time according to the invention.
- Figure 23 is a block schematic diagram showing a scheduler according to the invention.
- Figure 24 is a further block schematic diagram showing a scheduler according to the invention.
- Figure 25 is a block schematic diagram showing a concentric graph according to the invention.
- Figure 26 is a block schematic diagram showing poison queries according to the invention.
- Figure 27 is a block schematic diagram showing SMP according to the invention.
- Figure 28 is a block schematic diagram of a machine in the exemplary form of a computer system within which a set of instructions may be programmed to cause the machine to execute the logic steps of the invention.
- a primitive model is provided in which everything in a database is modeled with the primitives, including links and nodes.
- a query syntax provides a nested tree of constraints with a single global schema.
- An aspect of the invention concerns the process of flushing the log preemptively.
- Various optimization techniques are also provided.
- An embodiment of the invention is implemented in an application found at freebase.com, which is a data search and query facility that is powered by the proprietary tuple store, i.e. the graph store described herein, and which is referred to as graphd.
- the present embodiment of graphd is a C/Unix server that processes commands in a simple template-based query language.
- An embodiment of the invention provides a computer implemented method and apparatus for establishing a primitive-based graph database.
- a processor is configured to provide a plurality of primitives that are identified by globally unique identifiers (GUIDs) which consist of a database id and a primitive id (100). Once written, primitives are read only. The primitives collectively establish a log-structured or append-only database.
- the processor assigns primitive ids to primitives in the database sequentially as the primitives are written (110) and provides a plurality of fields for each primitive (120). Any or all of the fields may be null.
- the fields comprise a left field comprising a guid representing a first end of a relationship arrow; a right field comprising a guid representing a second end of said relationship arrow; a type field comprising a guid that is used in conjunction with said left field and right field to specify a type of a relationship; a scope field comprising a guid that identifies a creator of a given primitive; and a value field comprising a string that carries any of literal values, strings, numbers, and dates.
- the processor responds (130) to user interaction with the database to establish one or more primitives that comprise either nodes without a left field or a right field, or links which always have a left field and which can also have a right field.
- Nodes represent identities and carry no other data; links represent properties of an identity, where the links comprise either a literal value or a relationship.
- a node and its associated links comprise an object with fields (properties) described by one or more types.
- the processor modifies a primitive (140) to change the primitive's value by writing a new primitive carrying a modification and marking said new primitives as a replacement for the primitive that it is replacing.
- the processor deletes a primitive (150) by writing a new primitive which marks the primitive to be deleted as deleted.
- the processor removes deleted or versioned primitives (160) during query execution.
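The primitive model and its append-only update discipline described in steps (100)-(160) can be sketched in Python; all names here are illustrative stand-ins, not graphd's actual C API:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)             # once written, primitives are read-only
class Primitive:
    id: int
    left: Optional[int] = None      # guid of one end of a relationship arrow
    right: Optional[int] = None     # guid of the other end
    type: Optional[int] = None
    scope: Optional[int] = None
    value: Optional[str] = None
    previous: Optional[int] = None  # id of the primitive this one replaces
    deleted: bool = False           # tombstone marking `previous` as deleted

class GraphStore:
    """Append-only store: primitive ids are assigned sequentially on write."""
    def __init__(self):
        self.log: List[Primitive] = []

    def write(self, **fields) -> int:
        pid = len(self.log)
        self.log.append(Primitive(id=pid, **fields))
        return pid

    def modify(self, old_id: int, **fields) -> int:
        # modification = a new primitive marked as a replacement for the old
        return self.write(previous=old_id, **fields)

    def delete(self, old_id: int) -> int:
        # deletion = a new primitive marking the old one as deleted
        return self.write(previous=old_id, deleted=True)

    def live(self) -> List[Primitive]:
        # deleted or versioned primitives are weeded out at query time
        superseded = {p.previous for p in self.log if p.previous is not None}
        return [p for p in self.log
                if p.id not in superseded and not p.deleted]

g = GraphStore()
node = g.write()                           # a node: neither left nor right
link = g.write(left=node, value="Arnold")  # a link carrying a literal value
g.modify(link, left=node, value="Bob")     # versions the old link
```

After the modification, only the node and the replacement link survive the query-time weeding; the original link remains in the log but is no longer visible.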
- Graphd primitives are identified by Globally Unique Identifiers (GUIDs) which consist of a database id and a primitive id.
- Within a database, primitive ids are assigned sequentially as primitives are written.
- 9202a8c04000641f8000000000006567 is the GUID which corresponds to the person known as "Arnold Schwarzenegger”.
- the front part, 9202a8c04000641f8 is the database id and the back part, 6567, is the primitive id.
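Assuming the split point implied by the example (the database id is the leading 17 hex digits, and the trailing digits, zero-padded, are the primitive id), the decomposition can be sketched as:

```python
def split_guid(guid_hex: str):
    """Split a GUID into database id and primitive id. The 17-character
    split point is inferred from the example above and is an assumption."""
    db_id, prim_hex = guid_hex[:17], guid_hex[17:]
    return db_id, int(prim_hex, 16)   # leading zeros pad the primitive id

db_id, prim_id = split_guid("9202a8c04000641f8000000000006567")
```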
- Each graphd primitive consists of a number of fields:
- Type A guid, used in conjunction with left and right to specify the type of a relationship.
- Scope A guid, identifying the creator of a given primitive.
- Value A string used to carry literal values, strings, numbers, dates, etc.
- primitives are read only.
- Graphd is a log-structured or append-only store.
- To modify a primitive for example by changing the value, one writes a new primitive carrying the modification and marks it as a replacement for the old.
- To delete a primitive one writes a new primitive which marks the primitive one wishes to delete as deleted. Deleted or versioned primitives are weeded out during query execution.
- a log-structured database makes it easy to run queries as of a certain date.
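Because ids are sequential and nothing is overwritten in place, an "as of" query is just a filter over a prefix of the log. A minimal sketch, using plain dicts rather than graphd's real structures:

```python
def as_of(log, horizon):
    """Return the live primitives as of a given point in the log,
    identified by a primitive id acting as a timestamp."""
    visible = [p for p in log if p["id"] <= horizon]
    superseded = {p.get("previous") for p in visible
                  if p.get("previous") is not None}
    return [p for p in visible
            if p["id"] not in superseded and not p.get("deleted", False)]

log = [
    {"id": 0, "value": "Arnold"},
    {"id": 1, "previous": 0, "value": "Bob"},   # a later rename
]
```

Reading at horizon 0 sees the original value; reading at horizon 1 sees only the replacement.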
- primitives may be regarded as either nodes, i.e. things without a left or right, or as links which always have a left and usually a right.
- Nodes such as...6567, are used to represent identities, and carry no other data.
- Links are used to represent properties of an identity, either a literal value, for example "height,” or a relationship, for example "employs/employed by.”
- a relationship is similar, but has a right instead of a value.
- the property such as /type/object/type, which identifies an object as being an instance of a particular type, is represented by a guid ...c. That Arnold is typed as a person is indicated by a primitive whose left is ...6567, type is ...c, and right is ...1237.
- a node and its associated links are regarded as an "object" with fields (properties) described by one or more types.
- objects map naturally into the dictionary-based objects supported by dynamic languages, such as Python and Perl.
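That mapping from a node plus its links to a dictionary-based object can be sketched as follows; the `names` table translating type guids to readable property names is hypothetical:

```python
def to_object(node_id, primitives, names):
    """Fold a node and its links into a dictionary-style object.
    `names` maps type guids to property names (an illustrative device)."""
    obj = {"id": node_id}
    for p in primitives:
        if p.get("left") == node_id and p.get("type") in names:
            # a literal link carries a value; a relationship carries a right
            obj[names[p["type"]]] = p.get("value", p.get("right"))
    return obj

TYPE_NAME, TYPE_HEIGHT = 100, 101          # made-up type guids
prims = [
    {"id": 1, "left": 0, "type": TYPE_NAME, "value": "Arnold"},
    {"id": 2, "left": 0, "type": TYPE_HEIGHT, "value": "1.88"},
]
obj = to_object(0, prims, {TYPE_NAME: "name", TYPE_HEIGHT: "height"})
```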
- An embodiment of the invention illustrated in Figure 3, provides a query model that operates in connection with a processor configured for positing a query to the database (200).
- the query comprises a template expression that comprises a parenthesized, comma-separated list of zero or more constraints.
- One or more of the constraints define which primitives match the query, and one or more of the constraints define instructions that detail what information is returned for query matches and/or how the information is sorted and formatted;
- each of a plurality of parenthesized nesting levels of templates is evaluated to a list of primitive components that match the query. If a constraint holds subconstraints, the subconstraints are evaluated to nested lists of primitives that match a particular subconstraint. Multiple constraints combine disjunctively.
- a plurality of resulting query trees are generated (220) which are expanded to yield the query results.
- MQL looks up the GUIDs for "/people/person/height_meters" and "/topic/en/arnold_schwarzenegger" and produces the following graphd query:
- the graph query language is template based: the syntax of the query parallels the structure of the desired result.
- Each parenthesized expression corresponds to a constrained set of primitives.
- Nested expressions are related to each other via one of the linkage fields. So, the outermost expression,
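Independent of the concrete syntax, the template evaluation described above can be sketched as a recursive matcher: each nesting level evaluates to the list of primitives satisfying it, with subconstraints checked against links whose left is the current candidate. The `links` key is a hypothetical encoding of one nesting level:

```python
def matches(prim, constraint, prims):
    """Does one primitive satisfy a constraint dict? Entries under the
    (illustrative) "links" key are subconstraints that must each match
    some link pointing at this primitive via its left."""
    for key, want in constraint.items():
        if key == "links":
            for sub in want:
                if not any(p.get("left") == prim["id"] and matches(p, sub, prims)
                           for p in prims):
                    return False
        elif prim.get(key) != want:
            return False
    return True

def query(constraint, prims):
    # each nesting level evaluates to the list of primitives matching it
    return [p for p in prims if matches(p, constraint, prims)]

prims = [
    {"id": 0},                                              # a bare node
    {"id": 1, "left": 0, "type": "name", "value": "Arnold"},
]
result = query({"links": [{"type": "name", "value": "Arnold"}]}, prims)
```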
- the traditional approach to create a database is to start with a set of tables into which data are stored.
- the schema of the database defines the structure of the tables.
- the tables correspond very closely to the underlying storage mechanisms, e.g. hard disks.
- One problem with this approach is that the design of the data structure has profound implications for the design of an application for accessing data in said data structure. Certain things get fast, certain things get slow, certain information is redundant, there must be carefully structured update routines to keep the database in sync, etc.
- One approach is to collect the data and then create the schema last. This makes it easy to change the tables in the data structure.
- if the system stores data denormalized in a certain way, it is simple to encode the fact that the data are denormalized in the structure of the database itself, such that the data can be broken into two tables at a later time without having to make that structure explicit.
- the first question concerns what the data are about.
- data are collected about identities, where an identity is a mathematical notion that corresponds to some identifiable entity. For example, there is a single thing that corresponds to Arnold Schwarzenegger and there are lots of data that are related to that identity.
- the identity in this example is the thing that, to the extent that anyone can agree to anything at all, is this person called Arnold Schwarzenegger who can be described in many different ways. There is not any further reduction to be made to an identity. It is the atomic particle of information.
- the data are related to that single identity to which one can add types, i.e.
- An embodiment therefore comprises an identity-based data store, i.e. a graph store.
- the graph itself is implemented in C code.
- the graph does not know about any of the types. It does not know about people or authors or publications or any of the types that contributors may come up with. All that the graph knows about are nodes that represent identities, and links that represent data about identities (see Figures 1A and 1 B).
- sets of identities e.g. everyone whose first name is Scott.
- the table has a column that represents the name Scott and corresponding row IDs, which comprise an array of numbers. Accordingly, there is a single column that comprises the most minimal representation for a set because the column comprises a linear collection of numbers. Representing sets in this way has several advantages. For example, it is easy to perform intersections and unions.
- an entity such as a person
- various bits of data concerning this person are represented by links to the node.
- the nodes and links are all primitives. When the primitives are written, each primitive has an identifier and, as the primitives are written, it is possible to build up an index that consists of the input sets that are used to drive an optimizer (discussed below).
- the indexes are typically tables, e.g. hash tables or dictionaries.
- a higher-level type has a significant amount of structure.
- the structure is a node in the graph.
- a node is encoded into tuples that represent the links.
- An ID is assigned to links as they are written. In an embodiment, the IDs increase sequentially. Information about each link is recorded in an index. The indexes detail, for example, all of the links of a particular type.
- the system can keep track of things, for example, which are names in a dictionary. Primitives are thus added in numerical order and the system appends them onto an array, which provides a sorted set representation that underlies the system.
- the ID is the manifestation of, for example, the identity of an individual. The only commonality through all the nodes is they all have unique identities. Thus, the only thing that is unique about anything in the system is the ID number.
- the node corresponds to an identity in the real world. It is an identity of something that is to be described in the system. Anything accessible in the system must have a node to represent it and, optionally, it can have links that add data about it, such as its relationship to other entities. Accordingly, there is a node, which comprises an identity of some entity, and the identity can have with it various links. These tuples, and each of the links, can define a type with which they can be associated.
- the object is called Arnold
- the Arnold column can be part of any set of relational information.
- the resulting graph allows the system to keep track in an index of links of type name.
- the system adds the ID for the primitive to all the sets that are appropriate. There is one set for each kind of linkage. As discussed above, the primitives have a left and a right, and can have a value and have a type. An object is a primitive that has a type but does not have a value. Every link, i.e. a left, has a type that says what sort of link it is. From the standpoint of the database, the system does not know anything about what the type means. All that is known is that the system is asked to constrain the type, e.g. find all of the spouses of the object, e.g. Arnold. Some of the links have values, which are literal strings stored in the link, e.g. a name.
- Some of the links do not have any values, such as an instance of link, which means that an identity is an instance of a particular type.
- Arnold is a person.
- with indexing in accordance with the invention it is possible to keep track of all of these instances in a simple, uniform manner.
- This typically comprises an array of IDs that may have a further property, such as being sorted in ascending order, e.g. because as they are created they are added to the index.
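Because index arrays are kept in ascending id order (ids are appended at creation time), intersecting two of them is a linear merge. A minimal sketch:

```python
def intersect_sorted(a, b):
    """Linear-merge intersection of two ascending id arrays, as produced
    by appending primitive ids to an index in creation order."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out
```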
- Graphd supports a subset of the traditional database ACID guarantees (see Figures 4 and 5): it is optimized for collaborative wiki-style access. We assume a long cycle of read/write interactions, and so provide no built-in read-write atomicity. Instead, we guarantee only that writes of connected subgraphs of primitives are atomic (and durable). A user's write may collide with someone else's, in the sense that it never becomes visible to another user. However, the user's input is always preserved, and, if the user desires, it can easily be re-instated by browsing the modification history.
- a processor is configured to capture and preserve user input during a long cycle of read/write interactions (300) and guarantee that writes of connected subgraphs of primitives are atomic and durable (310). In this way, the user's input is always preserved, and, if the user desires, it can easily be re-instated by browsing the modification history.
- the performing of callbacks includes providing a graph server having a time ordered output buffer chain for each connection to each of a plurality of clients comprising data waiting to be sent back in order to a client (400).
- a callback is callback to each individual output buffer in the output buffer chain (410).
- the callback comprises a function pointer to a function within the graph server. The callback indicates how up-to-date the data should be at this point in time.
- the database maintains a counter as it executes write operations (420). In coordination with the processor, the database executes the write operations, and updates the counter to a current state. A callback value is then compared to the counter (430), and the callback is satisfied when the counter indicates that the database is up-to-date.
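The counter-and-callback mechanism of steps (420)-(430) can be sketched as follows (class and method names are illustrative):

```python
class WriteCounter:
    """Callbacks carry the write-counter value they require, and fire
    once the database has caught up to that value."""
    def __init__(self):
        self.counter = 0
        self.pending = []   # (required_counter_value, callback_fn)

    def add_callback(self, required, fn):
        if self.counter >= required:
            fn()                       # already up to date
        else:
            self.pending.append((required, fn))

    def apply_write(self):
        self.counter += 1              # updated as each write executes
        ready = [(r, f) for r, f in self.pending if r <= self.counter]
        self.pending = [(r, f) for r, f in self.pending if r > self.counter]
        for _, f in ready:
            f()                        # callback satisfied

fired = []
wc = WriteCounter()
wc.add_callback(2, lambda: fired.append("sent"))
wc.apply_write()   # counter 1: callback still pending
wc.apply_write()   # counter 2: callback satisfied
```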
- a typical query is for a specific node. Initially, nothing is known about it. There is a link, which refers to the node with its left, and its type, which is its GUID, i.e. a number that corresponds to the node for the type. The node has other links and the type for such links may be, for example, such things as date of birth, etc. There is thus a query, which asks for a node with a link of a particular type whose value is, for example a name, such as Scott; and which asks for another link of another particular type whose value is, for example, a birth date, such as 1964.
- a query in one embodiment comprises a nested loop join.
- each link has a particular type that can be looked up.
- the type corresponds to the name.
- the result is a much, much smaller set.
- a separate search can be made to satisfy both of the links in the query, resulting in two lists or sets of candidates.
- the system then intersects these two sets or lists of candidates.
- These sets are candidates for the links, and it is necessary to get the left of the tuple.
- the sets for the first link are sorted, straight out of the index. Thus, it is possible to perform an intersection very rapidly for the first link.
- the result is a set of candidates that are represented by the left.
- the lists may comprise 1,000 Scotts for one link and 20,000 people who were born on that date for another link.
- One way to complete the query is to take a candidate from a first list of candidates and check the candidate against the other list of candidates.
- Another way to complete the query is to take a candidate from the other list of candidates and check the candidate against the first list of candidates. So how to pick? One could select the list having the smallest number. But, the main concern is minimizing calculations, for example by estimating the size of the result set. That is, if an intersection is performed, how big is it, really? Because, for example, if the results of the intersection are almost the same size either way, it is probably a waste of time to even do it. It would be just as fast to pick one list and check the candidates as they come up for the other list. Whereas if the results of the intersection are tiny, then it is really valuable to select one list over the other because this technique reduces the number of candidates which go into the nested join.
- a first step is to decide which of the lists containing sets of candidates to use as a producer and which of the lists to use as a checker.
- One way to do this is to compute, or try to compute, the first few items in each set of candidates and record how much this costs in terms of an abstract count. Based on the cost, it is possible to decide whether to use one list as a producer and check on the other list or vice versa. Because real queries are complex and go down many levels, the cost comparison is quite valuable.
- this aspect of the invention provides a cost-based optimization that is applied to a universe of nested sets of identity constraints.
- sets of identities represent a constraint having particular types and values that are combined into the sets.
- Each node is annotated in the query with a set of candidates, which is typically much smaller than everything that could be queried. In a sense, this embodiment of the invention runs a race and decides which list is more efficient as a producer than the other.
- this embodiment of the invention performs a cost- based optimization that is applied to a universe of nested sets of identity constraints, where sets of identities represent a constraint having particular types and values that are combined into the sets.
- a processor implemented step formulates and presents a database query in a template-based query language (500).
- the query terms comprise primitives which comprise any of nodes in a graph and links that express relationships between the primitives. Each primitive has a type and a value.
- a query results in a node returning corresponding first and second lists of candidates. The first list for a first link of a first type having a first value, and the second list for a second link of a second type having a second value.
- a separate search is performed to satisfy each of the links in the query (510) by taking a candidate from the first list of candidates and checking the candidate from the first list against the second list of candidates, and simultaneously taking a candidate from the second list of candidates and checking the candidate from the second list against the first list of candidates.
- At least a first few items in each of the first and second lists of candidates are computed (520), and how much this costs is recorded for each list in terms of an abstract count. It is then determined which of the first and second lists containing sets of candidates to use as a producer and which of the first and second lists to use as a checker based upon said abstract count (530).
- a resulting list of candidates is returned for each link (540), and the sets of candidates are intersected to resolve said query (550).
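A toy version of the producer/checker decision in steps (510)-(550), with a simplistic sampling heuristic standing in for graphd's abstract cost count:

```python
import bisect

def contains(sorted_ids, x):
    """Binary-search membership test against a sorted candidate list."""
    i = bisect.bisect_left(sorted_ids, x)
    return i < len(sorted_ids) and sorted_ids[i] == x

def join(list_a, list_b, sample=5):
    """Pick a producer and a checker by trying the first few candidates of
    each list, then run the nested join. The cost model here is a toy
    stand-in, not graphd's actual abstract count."""
    hits_a = sum(1 for x in list_a[:sample] if contains(list_b, x))
    hits_b = sum(1 for x in list_b[:sample] if contains(list_a, x))
    # the list whose sample survives checking less often is the better
    # producer: fewer candidates flow into the rest of the nested join
    producer, checker = ((list_a, list_b) if hits_a <= hits_b
                         else (list_b, list_a))
    return [x for x in producer if contains(checker, x)]

scotts = [3, 8, 15, 40, 77]                 # lefts of "Scott" name links
born_1964 = [2, 8, 9, 15, 33, 40, 90, 91]   # lefts of matching birth links
result = join(scotts, born_1964)
```

Either choice of producer yields the same intersection; the choice only affects how much checking work is done along the way.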
- Another embodiment of the invention concerns replication. As shown in Figures 7A and 7B, the system comprises a master database against which all writes occur (600). This embodiment of the invention assigns sequential ids to primitives as they are written.
- each replica machine establishes a connection with the master database as a replica.
- the master database sends every primitive that is written to it to the replicas (630).
- a key feature of the invention that allows this to work is that, where the write on the master may fail because some constraint is not satisfied, if the write succeeds on the master, then it is known to succeed on the replica.
- the replica only needs to write three primitives. It need not be concerned with whether the write would have failed halfway through a query.
- In a situation where it is desired to have the middle tier (640) talk directly to a replica to establish a connection, the question becomes: "What is done with the write?" In one embodiment of the invention, there are two things that could be done. One is to have the middle tier know when there is a write, and then establish a separate connection so that the write comes back down to the replicas. A disadvantage of this approach is that the middle tier then must know a lot more, such as where the master is, especially because there is a tendency to change masters from time to time. It is preferred to tell the middle tier to go find a database, where any one is as good as another. However, the middle tier is provided with one connection to whatever database it is using.
- When the middle tier sends a write to the replica, the replica cannot process it itself, but it has a second connection to the master.
- the middle tier forwards the write along to the master (650). Accordingly, all of the replicas have additional connections.
- the master does the writes (660) and then the resulting primitives come back out. If there is a replica of a replica, the replication protocol passes the address of the master along, so the replica of a replica also talks directly to the master. Thus, when primitive writes come in, they are written, a response comes back, and then the response goes back out to the middle tier.
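The forwarding arrangement of steps (600)-(660) can be sketched as a single-process simulation (real graphd uses network connections between servers):

```python
class Master:
    """All writes occur against the master; replicas receive its stream."""
    def __init__(self):
        self.log = []
        self.replicas = []

    def write(self, prim):
        # constraint checks would happen here: once a write succeeds on
        # the master, it is known to succeed on every replica
        self.log.append(prim)
        for r in self.replicas:
            r.apply(prim)
        return len(self.log) - 1

class Replica:
    def __init__(self, master):
        self.master = master       # second connection, back to the master
        self.log = []
        master.replicas.append(self)

    def apply(self, prim):         # replication stream from the master
        self.log.append(prim)

    def write(self, prim):
        # a replica cannot process writes itself; it forwards them along
        return self.master.write(prim)

m = Master()
r = Replica(m)
wid = r.write({"value": "Arnold"})  # middle tier talks only to the replica
```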
- That virtual time can be provided as a parameter that indicates it is desired to read something at least as new as the virtual time.
- the request waits if the replica is not at least caught up to that point. Thus, one could perform a write and get back an ID, e.g. 231.
- the system waits to answer the request until it gets the desired primitives, e.g. those with an ID of at least 231.
- a processor is configured to use sequential IDs as a virtual time that allows query results to be guaranteed consistent as of any desired virtual time (700), and a read request is pended until a replica is at least caught up to the entry point encapsulated into the virtual time (710).
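A sketch of pending reads against virtual time, simulated synchronously with callbacks rather than blocking threads (all names illustrative):

```python
from collections import deque

class Replica:
    """Sequential primitive ids double as a virtual time. A read asking
    for virtual time T is pended until the replica has applied id >= T."""
    def __init__(self):
        self.applied = -1          # highest primitive id applied so far
        self.pending = deque()     # (required_virtual_time, callback)
        self.log = []

    def read_at(self, min_vt, callback):
        if self.applied >= min_vt:
            callback(self.log)     # already caught up: answer immediately
        else:
            self.pending.append((min_vt, callback))

    def replicate(self, prim):
        self.log.append(prim)
        self.applied += 1
        # answer any reads whose virtual time has now been reached
        while self.pending and self.pending[0][0] <= self.applied:
            _, cb = self.pending.popleft()
            cb(self.log)

results = []
r = Replica()
r.read_at(1, lambda log: results.append(len(log)))  # wants vt >= 1: pends
r.replicate({"id": 0})                              # vt 0: still pending
r.replicate({"id": 1})                              # vt 1: read answered
```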
- the preferred embodiment is a write through arrangement, with nothing stored in a buffer.
- the virtual time mechanism described above comprises a virtual time technique that is associated with the fundamental structure of the data.
- An embodiment provides a unique entry point, i.e. touch, that pretends that the system is up to date without actually writing anything. All it does is ask for the current virtual time. Even if one goes all the way back to the master, the result is still out of date almost instantly as more writes come in because writes are happening all the time. If a write is performed to one replica and a touch is performed to another replica, there is some risk that the system would not catch up with itself, unless it was implemented by looking at the master.
- a processor is configured to create an index of primitives on the database (900).
- the primitives are immediately written to a log file of primitives, one after the other, without indexing, on the database (910).
- An up-to-date cache of primitives to be written to the index is maintained (920).
- Flush of the index from the cache to the database is delayed (930), and the index is written from the cache to the database (940).
- a series of callbacks is performed to update the index.
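The delayed-flush arrangement of steps (910)-(940) can be sketched as follows; the key invariant is that queries consult both the flushed index and the unflushed cache, so delaying the flush never loses data:

```python
class IndexedLog:
    """Primitives go straight to the log; index updates accumulate in an
    in-memory cache and are flushed to the stored index later."""
    def __init__(self):
        self.log = []          # written immediately, one after the other
        self.stored_index = {} # type guid -> list of ids (flushed index)
        self.cache = {}        # same shape, not yet flushed

    def write(self, prim):
        self.log.append(prim)  # no indexing on the write path
        if "type" in prim:
            self.cache.setdefault(prim["type"], []).append(prim["id"])

    def flush(self):
        # delayed: move cached index entries into the stored index
        for t, ids in self.cache.items():
            self.stored_index.setdefault(t, []).extend(ids)
        self.cache.clear()

    def lookup(self, t):
        # queries see the flushed index plus the up-to-date cache
        return self.stored_index.get(t, []) + self.cache.get(t, [])

db = IndexedLog()
db.write({"id": 0, "type": "name"})
db.write({"id": 1, "type": "name"})
before = db.lookup("name")   # served from the cache, pre-flush
db.flush()
after = db.lookup("name")    # served from the stored index, post-flush
```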
- the intermediate results are relatively bulky. They are subpieces of the table. What is typically done with the intermediate results is to read something out of the table, and look it up in another table. It is the looking up operation, i.e. the random access that hurts the locality of reference. Once the system has filtered through and got all of the authors, there actually is a set of authors. This comes out of the organization of the identity-organized database, and particularly the tuples are issued. Rather than having just the tuple indices, there are indices of all of the lefts of everything. Thus, the state of anything that could be the subject of a left is instantly known just from the identity. Everything is associated with the identity, so it is not necessary to look for other aspects of that identity elsewhere because, once the system hits one aspect of the identity, it hits everything about that identity. This depends on the specificity and the linkage.
- the individual data records are tuples consisting of a value, a name, and pointers to other records.
- the primitive has a value, i.e. it has a name that is not a text string, it has a data type that says something about the meaning of the thing that makes up the value; and there are four pointers, i.e. scope, type, left, and right, where left and right are meaningless and are used to construct the overall meaning of the records in the view of the application.
- Type and scope are more restrictive in that the type of a link between the left and right GUID describes the nature of the relationship.
- the indexing scheme herein described treats type plus left and type plus right as things that are desirable to index because people want to look them up. Thus there are three indexes, including a term lookup where a list of things that relate to something is found.
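The type-plus-left and type-plus-right indexes can be pictured as follows. This is an illustrative sketch with invented names; ids and the "has-name" type are assumptions for the example.

```python
from collections import defaultdict

type_left = defaultdict(list)    # (type, left GUID)  -> [primitive ids]
type_right = defaultdict(list)   # (type, right GUID) -> [primitive ids]

def index_link(pid, ptype, left, right):
    """Record a link primitive under both compound index keys."""
    type_left[(ptype, left)].append(pid)
    if right is not None:
        type_right[(ptype, right)].append(pid)

# two hypothetical "has-name" links hanging off nodes 100 and 200
index_link(7, "has-name", 100, 300)
index_link(8, "has-name", 200, 300)

# everything of type "has-name" whose left is node 100: a direct lookup
assert type_left[("has-name", 100)] == [7]
# everything of type "has-name" whose right is node 300:
assert type_right[("has-name", 300)] == [7, 8]
```

The point is that "which links of this type hang off this node?" never requires a scan; it is a single keyed lookup.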
- Arnold has a name of Arnold, which connects a featureless node that is identified with the person Arnold and with the language that it has been labeled in, e.g. English. It is known that this is the node Arnold because there is a primitive that points to Arnold with one end. The one side points to the namespace. Suppose the left points to the actual Arnold node, then the right points to English, and together this amounts to the fact that, on a featureless node, this is in fact named Arnold in English. There are three primitives to make up the assertion that Arnold's name in English is Arnold. This is similar to the primitives in the assertion that Arnold is, in fact, a person.
- the example above comprises a tuple in which Arnold has a type, which is the display name that is seen by the user; and in which there is a scope, which is the attribution to the contributor.
- On the right, it is shown to be in English, and on the left it points to the Arnold thing.
- the Arnold thing is another tuple, but the Arnold thing is almost completely featureless, other than the scope. It stands for the thing in the real world, which is Arnold. Even that Arnold's name is Arnold in English is an assertion of a fact. On a higher level, if there was another node that represented Maria Shriver, there would be another set of three of these things.
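The Arnold example above can be sketched in code. The field names follow the tuple described earlier (value, type, scope, left, right); the ids, the `write` helper, and the "has-display-name" type name are illustrative assumptions, not the patent's identifiers.

```python
primitives = {}
next_id = [1]

def write(value=None, ptype=None, scope=None, left=None, right=None):
    """Append one primitive tuple and return its id (hypothetical helper)."""
    pid = next_id[0]
    next_id[0] += 1
    primitives[pid] = dict(value=value, type=ptype, scope=scope,
                           left=left, right=right)
    return pid

contributor = write(ptype="scope")                  # attribution of the contributor
arnold = write(scope=contributor)                   # featureless node: the person
english = write(value="English", scope=contributor) # the language node
name = write(value="Arnold", ptype="has-display-name",
             scope=contributor, left=arnold, right=english)

# the name link connects the featureless node (left) with the language (right)
assert primitives[name]["left"] == arnold
assert primitives[name]["right"] == english
assert primitives[arnold]["value"] is None          # the node itself is featureless
```

Together these records assert that the otherwise featureless node is, in fact, named Arnold in English.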
- primitives also point to other primitives that they version (1100). If one wants to rename Arnold to Bob, another display name is written, and a link is put through the Arnold node in the same way that the Arnold tag was named. As soon as Bob is inserted, Arnold is no longer valid.
- the retrieve from the database can be positioned whether or not a node has been versioned. It can be asked: "Does anybody else function in the same way?" Then it can be asked whether anybody wants to be the left hand, or the right hand, or the scope, and so on. If a double database query is performed and one of the nodes that is about to be retrieved has been versioned, it is known that the system cannot return this value. It must find another one.
- this aspect of the invention provides an audit trail. If there are indexes pointing to something else with a previous pointer, that other thing, i.e. the something else, at some point asks: "Does anybody point to me?" That is quickly answerable because the system keeps an index of who points to whom that can be used to trace these relationships down in reverse. This is referred to as a reverse index. Take, for example, a sequence of three in time, with the second one pointing to the first, and the third one pointing to the second. The second one has a pointer that says: "Here is the previous." It is known not to go to that because it is previous, and the system does not do that unless an investigation is being performed.
- Figure 12 illustrates how this works underneath.
- the original is the one that has no previous pointer.
- the original can be polled to find a table that the system maintains that is not in the primitives. This is part of the index information that the system is compiling.
- This index lists all primitives in the lineage, i.e. every version (1210).
- Arnold, Bob, renamed to Charles would all point to the same lineage table, and Arnold would know that he is the original because there is no predecessor. There are no pointers to subsequent individuals.
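The previous-pointer and lineage-table mechanism of Figure 12 can be sketched as below. The dictionaries stand in for the index information that is kept outside the primitives themselves; the ids 1, 2, 3 correspond to Arnold, Bob, and Charles in the example.

```python
previous = {}          # primitive id -> the id it versions ("previous" pointer)
lineage = {}           # original id  -> all ids in the lineage (1210)

def version(old_id, new_id):
    """new_id versions old_id; record it in the lineage of the original."""
    previous[new_id] = old_id
    original = old_id
    while original in previous:            # walk previous pointers back
        original = previous[original]      # to the one with no predecessor
    lineage.setdefault(original, [original]).append(new_id)

version(1, 2)   # Bob versions Arnold
version(2, 3)   # Charles versions Bob

assert lineage == {1: [1, 2, 3]}   # all three point to the same lineage table
assert 1 not in previous           # the original has no previous pointer
```

A retrieval can then check membership in `previous` to see instantly whether a candidate primitive has been versioned and must be skipped.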
- a key point of the invention is that the primitives are nested, stacked, or combined to formulate a query.
- the example above starts with the uppermost piece, which in this case is Arnold, and the uppermost piece turns into the set of everything that matches it. Then, in each element of the set of the things connected to the specific element of the top set there is a respective instance of pointers.
- the upper primitive is Arnold.
- the system tags all of the Arnolds, and asks for Arnolds with an instance of Maria, for example, i.e. all the instances that do not have Maria get pruned out. Different constraints more effectively prune the set than do other constraints.
- the system tries two different strategies (1400): one strategy takes an Arnold instance and sees if it can be proven to be an instance of person (1410); and the other strategy takes the first instance and sees if it is easy to prove that it has the name Arnold (1420).
- the system can process sets of different sizes on both sides (1430), and one strategy might yield results faster than the other; for example, proving that Arnold is a person is going to be more successful than proving that a person is Arnold (1440). For example, the system may try for five matching results, where whichever strategy gets the first five wins and becomes the strategy for the whole query. In this example, the system processes by Arnold, i.e. it looks up the instance of person, because Arnold is a common name.
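The first-five-wins contest can be sketched as a race run in lock step. The strategy names and the bounded loop (standing in for a budget function) are illustrative assumptions; a `None` step models a candidate that fails its proof.

```python
def race(strategies, sample_size=5):
    """Advance candidate strategies in lock step; first to sample_size wins."""
    counts = {name: 0 for name, _ in strategies}
    for _ in range(10000):                    # bounded, in lieu of a real budget
        for name, it in strategies:
            if next(it, None) is not None:    # one proof attempt per candidate
                counts[name] += 1
                if counts[name] >= sample_size:
                    return name               # this strategy runs the whole query
    return None

# one strategy mostly succeeds, the other mostly fails its proofs
fast = ("prove-instance-of-person", iter([1, None, 2, 3, None, 4, 5]))
slow = ("prove-name-is-arnold", iter([None, None, 1, None, None, 2, None]))
assert race([fast, slow]) == "prove-instance-of-person"
```

Whichever ordering of constraints produces matches fastest on a small sample is then trusted for the remainder of the query.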
- the iterator trees notice such things as the fact that there is a table, or that there is a table of everything whose value is Arnold. It is necessary to identify something that has an entry in the table. It is known what this type is because there is a table of things that have that type, and it is necessary to find something that is in all tables returned by the query.
- One aspect of the invention provides an "and iterator" that intersects each of these tables. There are several strategies for doing this, depending on how big the tables are and how much they actually intersect. Thus, an iterator can be added to each constraint (1500). All of these iterators know about their particular piece of the hierarchy. They do not know everything.
- the iterators are used to build up larger branches until the tree is complicated and has copies of the iterators. They are not just copies; they are actually preserved links. If there is a copy of a branch and it learns about what the branch actually looks like, e.g. how much the query terms intersect or what is a good strategy for computing this particular branch, then the iterators remember this. They have a common background storage, e.g. a cache, that they all use to optimize processing. Once they do this again, e.g. a third time, one branch already knows how to resolve a term. The result is a complicated, multi-armed iterator.
- At the top level, there is an iterator that determines which of three sub-arms to lead with to produce IDs, and that then checks them against the other arms or branches.
- the iterator runs a race between three candidates in parallel. Each of these is, in turn, also trying to figure out how to do its production. Thus, each of the arms, in turn, also runs races in parallel, and so on. With all of this iterating going on, at some point there must be a resolution. Because the determinations get easier and easier as the system goes down the tree, eventually there are only, e.g. five values, or it is known that Winona Ryder only appeared in 20 movies, and the system returns a table of the 20 movies.
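A minimal "and iterator" might look like the generator below. A real implementation would race to choose which arm leads, as described above; this sketch simply leads with the first arm and checks its candidates against the others. The example id lists are invented.

```python
def and_iterator(lead, *checkers):
    """Yield ids produced by the lead arm that appear in every other arm."""
    check_sets = [set(c) for c in checkers]   # the arms we test candidates against
    for candidate in lead:                    # the lead arm produces ids
        if all(candidate in s for s in check_sets):
            yield candidate

acted_in_movie = [3, 5, 8, 13, 21]    # ids satisfying one constraint
is_person = [1, 2, 3, 5, 8]           # ids satisfying another
named_arnold = [5, 8, 34]             # and a third

assert list(and_iterator(acted_in_movie, is_person, named_arnold)) == [5, 8]
```

Choosing a small lead arm keeps the candidate stream short, which is exactly what the racing heuristic above is trying to discover.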
- this aspect of the invention tries to resolve the query terms at the lowest level first across the various branches where the iterators are resolving these iterations (1700).
- As the query terms get resolved in a positive way (1710), they get passed up to the next level, which then starts to run its test as well.
- the system can either call the search off or it could keep passing itself up, branch by branch, to the top of the tree until there is a resolution.
- there is a budget function (1730) that may determine that the system is putting too many units into the query. In such case, the system stalls the resources going into the search and does not search anymore.
- an embodiment of the invention comprises the use of a strategy that employs multiple optimizations at the same time.
- the result sets are a subset of all the possible results of a constraint.
- the system is not always asking for everything that matches a query term. For example, a query might ask for five movies that starred Winona Ryder. If there were 30 such movies, it is still important to find five quickly and then not look for any more. That is, the search hit the constraint of returning only five movies.
- Optimizers are data structures that are interpreted by code. They contain conduits to the database proper, i.e. onto the disk, which is mapped into memory. The data structures point into this memory-mapped disk. This means that if a read does not finish quickly, it blocks write access to the disk because the data structures cannot be changed while something is still pointing into them. It is not desirable for any one call to block access by everybody else.
- a scheme is employed to freeze the iterators. That is, the iterators are pickled into string form and then the string is stored somewhere (1800). Then, the system thaws the string later on, once it wants to resume the query (1810). The iterator is thus thawed from its string form back into the iterator shape, restoring all the pointers, and then the system begins running the operation again (1820).
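The freeze/thaw cycle can be sketched as serializing only the resumable position of a query into a string, so the string can outlive the in-memory data structures. The JSON encoding and the field names are assumptions for illustration; the patent does not specify the string format.

```python
import json

def freeze(query_key, position):
    """Pickle the iterator's resumable state into string form (1800)."""
    return json.dumps({"query": query_key, "pos": position})

def thaw(cursor_string, tables):
    """Rebuild an iterator from the string and resume the query (1810, 1820)."""
    state = json.loads(cursor_string)
    table = tables[state["query"]]
    return state["pos"], iter(table[state["pos"]:])

tables = {"people": ["ann", "bob", "cho", "dee"]}
cursor = freeze("people", 2)        # suspend after two results
pos, it = thaw(cursor, tables)      # later: restore the pointers and resume

assert pos == 2
assert list(it) == ["cho", "dee"]
```

While the query is frozen, nothing points into the memory-mapped data, so writers are not blocked.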
- Another aspect of the invention concerns the use of cursors.
- a client desires to get a very large result set from the database, for example all of the people, where there are half a million people in the database, the client would not want them all at once because the client would not want to digest that large an output. It would take too long to give the client the answer and digesting the answer would take too long and require too much memory.
- One feature in the database in this embodiment returns a desired portion of the total result, such as 100 at a time, or 1,000 at a time. If there are more results available, the system indicates this fact and the client can give the system a token, i.e. a cursor.
- the cursor mechanism is used to temporarily suspend long-lived accesses.
- the client pulls up 100 results from the result set, and then the query is suspended. During this period of suspension, users can keep writing to the database.
- the database is not waiting for the entire result, e.g. 3,000, to be returned to the client.
- the client gets everything that was in the database when the query was started. If there were 200 people and the client asked for 100 of the 200, then the client receives the 100 plus the cursor. If someone adds one to the database while outputting of the results is suspended, and then the client goes back to the database, e.g. five minutes later, with the cursor to get the other 100 results, the client gets the other 100 results, but only the original 200 results and not item number 201.
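The snapshot behavior in the 200-person example can be sketched as a cursor that records the database horizon at query time. The `page` helper and its field names are illustrative assumptions.

```python
def page(db, cursor=None, page_size=100):
    """Return one page of results plus a cursor pinned to the query-time snapshot."""
    if cursor is None:
        cursor = {"horizon": len(db), "offset": 0}   # snapshot taken at query time
    snapshot = db[:cursor["horizon"]]                # writes past the horizon hidden
    results = snapshot[cursor["offset"]:cursor["offset"] + page_size]
    cursor = {"horizon": cursor["horizon"],
              "offset": cursor["offset"] + len(results)}
    return results, cursor

db = list(range(200))              # 200 people at query time
first, cur = page(db)              # client takes the first 100
db.append(200)                     # someone writes person "201" meanwhile
second, cur = page(db, cur)        # five minutes later, same cursor

assert len(first) == len(second) == 100
assert 200 not in second           # only the original 200 are ever returned
```

Writes proceed freely during the suspension; they are simply invisible beyond the cursor's horizon.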
- This aspect of the invention provides a way of relaxing the consistency guarantee so that many people can use the database at the same time.
- a client can take a long time to finish retrieving query results without blocking database access by other people.
- this embodiment of the invention provides a load balancing technique that operates by relaxing the consistency guarantee in return for letting many people use the database at the same time.
- it is not desirable to block other people from writing. What is important to the client at the point that the query is posited is to get the results for the query at that time. If the client retrieves a result of all of the people in the database as of a certain point in time, it is possible to identify the point in time and then, later on, present a query to identify what has been added to the database since the previous query.
- a distributed system includes a write master (1900).
- the data from the write master is replicated out into multiple read servers and the clients connect to the read servers (1910).
- the clients do not write directly to the write master. They access the database and, if the client writes something, the read server that it is talking to forwards the write through to the write master, which then updates the database.
- One strategy provides a stateless cursor (1930), where all the needed information is encoded inside the cursor, and by sending that cursor to another read server, that other read server is given enough information to find the records that the querying client is paging through.
- the cursor in this case includes information about the chosen optimization. As discussed above, where resolution of a query involves an optimization, e.g. trying to intersect two sets, one for each query term, there is a contest.
- the optimization strategy is rolled up into the cursor so that the strategy can be sent to another read server that did not produce the cursor.
- the information is based on the database only, it is not based on some internal state of the read server.
- the other strategy stores a state (1940), shares that state or binds the user to one read server, and makes sure that the user always talks to that same read server.
- One strategy in a presently preferred embodiment of the invention is an opportunistic mix of stateless and stateful.
- the result set is too large to return to the client. It would take a long time to send it by TCP. It would be very large and difficult to return to the client in a URL or as an HTTP result.
- the set could be stored locally for a while, depending upon how much traffic there is. If the client returns to the results, and if not too much time has passed, the client can look the results up and restart the query much more quickly. But if the system loses the results, it takes too long to return them, or the client is talking to a different read server, then the system reconstructs the query, and then resumes it.
- the cursor starts at a certain location and thus indicates the position of the client in the table.
- Figure 21 provides an example of cursor paging, where a processor is configured to return a desired portion of a total query result which is less than the total query result (2100). A cursor is provided with the returned portion of the total query result (2110). The cursor indicates that additional results exist. The cursor temporarily suspends retrieval of the additional query results (2120) and execution of write operations is allowed during the temporary suspension (2130). Retrieval is resumed and additional results are returned upon presentation of the cursor (2140).
- Figure 22 provides an example of cursors and virtual time. As shown, a query is presented to identify primitives that have been added since a previous query which was accompanied by a cursor (2200).
- When the primitive is written, it is stored in an index sequential access database that has all the primitives (2300), and at that point the system can recover. Then the indexes are updated (2310). These updates may or may not go through. They may or may not actually have hit the disk by the time the power is turned off, but it is known how up to date the indexes are.
- the indexes are updated in kind of jerky movements - 100 things at a time, 200 things at a time - and the system remembers how far it was. For example, one machine may be up to primitive number, e.g. 512, and the primitive table that contains everything is up to 680 from the disk. In this case, it is necessary to take the primitives 513 to 680 and feed them through this mechanism again.
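The catch-up step can be sketched directly from the numbers in the example: the primitive log is durable up to 680, the index only up to 512, so primitives 513 through 680 are fed through the indexing mechanism again on restart. The `recover` helper is an illustrative assumption.

```python
def recover(primitive_log, index, indexed_up_to):
    """Replay primitives the index missed (here 513..680) and return the new mark."""
    for pid in range(indexed_up_to + 1, len(primitive_log) + 1):
        key = primitive_log[pid - 1]                 # re-feed through indexing
        index.setdefault(key, []).append(pid)
    return len(primitive_log)

# 680 primitives made it to the durable log; the index stopped at 512
log = ["k%d" % (i % 4) for i in range(1, 681)]
index = {}
for pid in range(1, 513):
    index.setdefault(log[pid - 1], []).append(pid)

new_mark = recover(log, index, 512)
assert new_mark == 680
assert sum(len(v) for v in index.values()) == 680    # 513..680 were replayed
```

Because indexing is idempotent from the log's point of view, it is safe for the index to lag in "jerky" batches.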
- the graph server as a process has an output buffer for each connection to each client process. This is not a single output buffer. It is an output buffer chain.
- the client process can have sent multiple quants; the system writes and then gets all the replies back. That is more efficient than doing it one at a time because it is not necessary to wait a whole roundtrip.
- Each connection has an output buffer chain of character data that waits to be sent back, in order, to the client.
- An embodiment of the invention, shown in Figure 24, attaches a callback to the individual output buffers that is a function pointer to a function within the graph (2400). The callback knows how up-to-date the data should be at this point.
- the code writes a reply indicating that the data are written to disk; if that "OK" goes out, it indicates that at least that much data was written to the disk.
- the process might understand that the system is up to primitive 480; while there is a process that knows it must be up to 512; and there is another process that knows it must be up to 514, and so on.
- the callback must be up-to-date up to 480. It goes to the database proper and it is the database itself that has a similar counter rollover as it writes to disk. The database knows that it is not up-to-date yet. It starts writing everything to disk, waiting for all the data to be identically encoded. Then, it updates its counter to its current state, which might be, for example, 530. At this point, the callback is satisfied because it is now up-to-date. Thus, these characters can get sent to the client. The system is not making false promises anymore, although it was making false promises when it wrote the "OK.” It is through the callback that it is possible to guarantee that what was written is, in fact, written.
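The callback gating can be sketched as holding each buffered reply until the database's durable counter reaches the primitive number the reply promises. The class and method names are illustrative; the numbers match the 480/512/514/530 example above.

```python
class Connection:
    """One client connection with an output buffer chain gated by callbacks."""

    def __init__(self):
        self.buffers = []     # chain of (required_primitive, reply), in order
        self.sent = []

    def queue_reply(self, reply, must_be_durable_to):
        self.buffers.append((must_be_durable_to, reply))

    def on_durable(self, durable_counter):
        # callback from the database: release replies whose promise is now true
        while self.buffers and self.buffers[0][0] <= durable_counter:
            self.sent.append(self.buffers.pop(0)[1])

conn = Connection()
conn.queue_reply("ok 480", 480)
conn.queue_reply("ok 512", 512)
conn.queue_reply("ok 514", 514)

conn.on_durable(480)
assert conn.sent == ["ok 480"]          # 512 and 514 are still held back
conn.on_durable(530)                    # the disk catches up to 530
assert conn.sent == ["ok 480", "ok 512", "ok 514"]
```

No "OK" reaches the client before its data is actually on disk, so the system makes no false promises.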
- the graph database is a collection of small fixed-sized structures we call “primitives.” Dealing with the graph database means creating and querying primitives.
- the graph repository protocol mainly alternates synchronous requests from client to server with synchronous replies from server to client.
- This grammar is part of interface requests that match networks of links and nodes in some manner. They're used with the "read” command to access primitives; a variation is used with the "write” command to create new primitives.
- a presently preferred template grammar is described in detail in Appendix "4" hereto.
- The uncopying tokenizer is written in such a way as to almost never copy the data in query strings. This makes things fast because it doesn't shuffle data structures about as much; it uses less memory; and it uses memory mostly in more efficient fixed block sizes, rather than allocating many small objects.
- Appendix "7" hereto addresses a presently preferred design for multiple "concentric" graphd's where one graphd instance contains a superset of the data in another running instance.
- a graph database contains another graph database (2500).
- a master database is provided (2510) and at least one replica database core comprising an identity of said master database is also provided (2520).
- a concentric replica database to said replica database core is provided (2530) to store any relatively long-lived data which relates to the replica database core. For each transaction in a replica stream, an equivalent set of writes is produced and applied to the concentric replica database (2540).
- the concentric replica database comprises a superset of the master database.
- a processor is configured for identifying poison queries by storing a hash of a currently executing query into a shared memory that is used to determine that a crash has occurred during a write (2610).
- the parent process reading the hash identifies a query that was running at the time of the crash (2620).
- a dictionary is provided which maps a hash version of a query to a death-count and date (2630), where the death-count is the number of crashes associated with the query represented by the hash and the date is the date of the most recent crash. The death-count and date are adjusted as poison queries arrive (2640).
- antidote messages are broadcast to all connections (2650). New connections receive antidote messages for all poison queries immediately after connecting (2660).
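The poison-query bookkeeping (2610-2640) can be sketched as below. The dictionary-backed "shared memory", the helper names, and the crash dates are illustrative assumptions; a real implementation would use actual shared memory read by the parent process after the crash.

```python
import hashlib

poison = {}                          # query hash -> [death_count, last_crash_date]
shared_memory = {"current": None}    # stands in for real shared memory (2610)

def start_query(query):
    """Record a hash of the currently executing query before running it."""
    shared_memory["current"] = hashlib.sha1(query.encode()).hexdigest()

def on_crash(date):
    """Parent process: read the hash left behind and update the dictionary (2620, 2640)."""
    h = shared_memory["current"]
    count, _ = poison.get(h, [0, None])
    poison[h] = [count + 1, date]
    return h                         # broadcast this hash as an antidote message

start_query('read (name="bad")')
antidote = on_crash("2009-01-20")
start_query('read (name="bad")')
on_crash("2009-01-21")

assert poison[antidote][0] == 2               # two crashes recorded
assert poison[antidote][1] == "2009-01-21"    # date of the most recent crash
```

Clients holding the antidote hash can refuse to resubmit the matching query.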
- Appendix "10" hereto addresses a presently preferred graphd design for a multiprocessor CPU.
- a parallel processing method for a database includes a database against which all writes occur (2700).
- a plurality of processors are provided, each processor establishing a connection with the database (2710), where each processor writes every primitive it processes to the database (2720).
- Figure 28 is a block schematic diagram of a machine in the exemplary form of a computer system 1600 within which a set of instructions may be programmed to cause the machine to execute the logic steps of the invention.
- the machine may comprise a network router, a network switch, a network bridge, a personal digital assistant (PDA), a cellular telephone, a Web appliance, or any machine capable of executing a sequence of instructions that specify actions to be taken by that machine.
- the computer system 1600 includes a processor 1602, a main memory 1604 and a static memory 1606, which communicate with each other via a bus 1608.
- the computer system 1600 may further include a display unit 1610, for example, a liquid crystal display (LCD) or a cathode ray tube (CRT).
- the computer system 1600 also includes an alphanumeric input device 1612, for example, a keyboard; a cursor control device 1614, for example, a mouse; a disk drive unit 1616, a signal generation device 1618, for example, a speaker, and a network interface device 1620.
- the disk drive unit 1616 includes a machine-readable medium 1624 on which is stored a set of executable instructions, i.e. software, 1626 embodying any one, or all, of the methodologies described herein below.
- the software 1626 is also shown to reside, completely or at least partially, within the main memory 1604 and/or within the processor 1602.
- the software 1626 may further be transmitted or received over a network 1628, 1630 by means of a network interface device 1620.
- a different embodiment uses logic circuitry instead of computer-executed instructions to implement processing entities.
- this logic may be implemented by constructing an application-specific integrated circuit (ASIC) having thousands of tiny integrated transistors.
- Such an ASIC may be implemented with CMOS (complementary metal oxide semiconductor), TTL (transistor-transistor logic), VLSI (very large scale integration), or another suitable construction.
- Other suitable constructions include a digital signal processing chip (DSP), a field programmable gate array (FPGA), a programmable logic array (PLA), and a programmable logic device (PLD).
- embodiments may be used as or to support software programs or software modules executed upon some form of processing core (such as the CPU of a computer) or otherwise implemented or realized upon or within a machine or computer readable medium.
- a machine-readable medium includes any mechanism for storing or transmitting information in a form readable by a machine, e.g. a computer.
- a machine readable medium includes read-only memory (ROM); random access memory (RAM); magnetic disk storage media; optical storage media; flash memory devices; electrical, optical, acoustical or other form of propagated signals, for example, carrier waves, infrared signals, digital signals, etc.; or any other type of media suitable for storing or transmitting information.
- a link is like a node, but has two more fields - pointers to a left and a right node.
- Each primitive has a nonzero, globally unique ID (GUID). A primitive that is modified implicitly changes its GUID. (In other words, a GUID completely identifies one "version" of a primitive.)
- a GUID has two parts:
- version-independent identity fields of objects are the same if and only if the application said so. (That is, those version-independent identity fields are unique, too.)
- the graph repository can easily (in constant time) find the current version of any GUID presented to it, even if the primitive owner of the GUID itself no longer exists in the repository.
- Each node or link created on one local system has a unique timestamp. Primitives created in sequential synchronous requests have timestamps as one would expect (the one that's created second has a timestamp that is higher than that of the one created first.)
- string: a UTF-8 string
- integer: a 64-bit signed integer
- float: a floating point value of unspecified precision
- timestamp: a 64-bit time stamp value, fit for comparison with primitives' timestamps and for use as a boundary in searches
- URL: a us-ascii string that encodes a URL, for example an internal URL pointing to a text or data fragment in a nearby repository
- GUID: a 128-bit internal GUID, or null
- bytestring: a string of 8-bit bytes that can contain any values, including \0
- Primitives don't have a value if and only if they also don't have a datatype.
- the rest of the information in the primitive encodes user input and should be archived. (This is the default.) If the flag is set to false, the information has been algorithmically derived from other information in the graph and should not, or need not, be archived.
- the rest of the information in the primitive is part of the current database state. (This is the default.) If the flag is set to false, the entry is a "tombstone", merely taking the place of an entry that was there previously and that has been deleted.
- Entries without the "live bit" are by default filtered out of queries, but when mirroring systems or subscribing to updates, they are useful in announcing to a listener that a primitive is no longer part of the application's idea of reality.
- Scopes can refer to other scopes in a strict hierarchy, but each primitive is in exactly one scope.
- links also contain the following: left, right: 128-bit integers
- the integers are GUIDs intended to refer to other primitives.
- the link is thought of as "connecting" those primitives.
- the left GUID must be non-zero.
- the right GUID may or may not be zero.
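The primitive record described in this appendix can be summarized as a small fixed shape. This is a sketch under the stated rules (value if and only if datatype; archival and live flags defaulting to true; for links, a non-zero left and an optional right); the class name and field defaults are illustrative.

```python
from dataclasses import dataclass
from typing import Optional

DATATYPES = {"string", "integer", "float", "timestamp",
             "url", "guid", "bytestring"}

@dataclass
class Primitive:
    guid: int                        # nonzero, globally unique
    scope: int                       # each primitive is in exactly one scope
    timestamp: int                   # unique per local system
    datatype: Optional[str] = None
    value: Optional[str] = None
    archival: bool = True            # default: user input, worth archiving
    live: bool = True                # False marks a "tombstone"
    left: Optional[int] = None       # links: must be non-zero
    right: Optional[int] = None      # links: may or may not be present

    def __post_init__(self):
        # no value if and only if no datatype
        assert (self.value is None) == (self.datatype is None)
        if self.left is not None:
            assert self.left != 0    # the left GUID must be non-zero

node = Primitive(guid=1, scope=9, timestamp=100)
link = Primitive(guid=2, scope=9, timestamp=101, datatype="string",
                 value="Arnold", left=1, right=3)

assert node.live and node.archival
assert link.left == 1 and link.right == 3
```

A link is thus literally a node with two extra fields, exactly as the text describes.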
- the protocol mainly alternates synchronous requests from client to server with synchronous replies from server to client.
- the server can also asynchronously notify a client about an error that leads to it shutting down, but that's going to be seen as an error response to the next command by the client.
- Requests and responses are terminated by a newline outside of a string literal or a parenthesized list.
- Requests can be pipelined. Unless the application semantics demand it, a request submitter does not have to wait to receive a reply before submitting the next request. Nevertheless, replies do arrive in the order of requests; to break up requests with large responses, use paging.
- Requests can span message boundaries of the underlying transport system.
- a request or reply can be larger than what one "write" call to a TCP connection can dispose of without blocking.
- the system can impose a maximum request or response size.
- Any request can result in an error on a couple of different levels.
- the request may have been syntactically incorrect; the user may have exceeded their allowance in storage time or space; the repository may be shutting down; etc.
- Errors are sent to the client in two parts: a machine-readable code, and a more detailed human-readable error message filling in details of the code.
- a virtual time parameter includes an instance ID that doesn't match that of the database.
- readonly: access to the server has been set to read-only.
- restore: access to the server has been set to restore-only.
- the client should retry its request after a few seconds.
- the client should disconnect and try to reconnect after a few seconds.
- the primitive size limit is about 32k, but details depend on the amount of compression for one primitive.
- Certain parameters can be set on any request.
- request-modifiers : request-modifier request-modifiers / request-modifier
- loglevels : loglevel loglevels / ""
- the virtual time can be specified in three ways:
- the read-access is executed as of the state just after the primitive had been added to the system.
- the read access is executed as if it were the time of that timestamp.
- the "virtual time” is a compound “odometer reading” that characterizes how up-to-date a server is. It is useful in ensuring consistency between "read” and "write” requests in a distributed system.
- While processing the request, the server loglevel is increased to the specified levels.
- the keywords for the log levels are the same as for the server "log-level" configuration parameter.
- the system makes a best effort to return a timeout error and partial results.
- reply-modifiers : reply-modifier / reply-modifier reply-modifiers
- the value of this modifier can be handed to a server to request that it be at least this up-to-date before attempting to fill a request.
- the results may or may not be partial.
- replies correspond synchronously to requests. (They come back in the order the requests were received.) If we ever introduce an asynchronous modifier that allows replies to overtake each other, this will be needed.
- the true cost of the request. The string contains a space-separated list of name-value pairs.
- pr: page reclaims; a benevolent form of page fault that doesn't actually do any work because the page is still in the local cache.
- pf: page faults; the thing we're trying to minimize.
- dr: primitive data reads; how many single primitive structs were read from disk (for example, as part of dismissing them as candidates for a qualified search).
- index size reads; how many indices were looked up with their starting address and size.
- ir: index element reads; how many times a single id was read from an index.
- iw: index element writes; how many times an element was added to an index.
- va: value allocations; how many times a (possibly temporary or transient) result data structure was allocated.
- mm: memory maximum (in bytes); the largest number of bytes allocated for the request at any one time.
- the data returned by the list request is a list of lists and scalars.
- tuples : tuple tuples / tuple
- the write request is similar to the read request, with certain restrictions and changes to the template.
- GUIDs of new primitives are returned as a tuple.
- the syntax used to write is a closely restricted subset of the syntax used to read.
- the default meta literal is "node"; it is meaningless and might as well be omitted.
- the valuetype defaults to "null”, or 1. If a value is specified in an insert request, but no valuetype, the valuetype defaults to "string”, or 2.
- value types do not influence the behavior of graphd with respect to the values in any way. They are not evaluated when comparing values or matching them. For all practical purposes, they are simply a small integer that is stored in the primitive for arbitrary use by the application.
- If a GUID constraint is present in a "write" request, the newly written primitive is intended to version, that is, to implicitly delete and take the place of, another primitive.
- the specified GUID may have been replaced by another primitive in the meantime (and that one, in turn, by others); whatever the most recent primitive in this lineage is, that is the one that is versioned.
- guid: GUID
- if no GUID is specified, the system generates a new GUID.
- Unspecified parts of the timestamp are set to the minimum allowed by the type. (I.e. 1 for months and days, 0 for hours, minutes, seconds.)
- the timezone is optional; if it is supplied, it must be "Z", standing for UTC.
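The fill-in rule for partial write timestamps can be sketched as follows; the regular expression and function are illustrative, not graphd's parser:

```python
import re

def complete_timestamp(ts):
    """Fill unspecified parts of a write timestamp with the minimum
    allowed by the type: 1 for months and days, 0 for hours, minutes,
    and seconds.  An optional trailing "Z" (UTC) is accepted."""
    m = re.match(
        r"(\d{4})(?:-(\d\d)(?:-(\d\d)(?:T(\d\d)(?::(\d\d)(?::(\d\d))?)?)?)?)?Z?$",
        ts)
    if not m:
        raise ValueError("bad timestamp: %r" % ts)
    y, mo, d, h, mi, s = m.groups()
    return (int(y), int(mo or 1), int(d or 1),
            int(h or 0), int(mi or 0), int(s or 0))

assert complete_timestamp("2009") == (2009, 1, 1, 0, 0, 0)
assert complete_timestamp("2009-01-21T10Z") == (2009, 1, 21, 10, 0, 0)
```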
- primitive-literals recursively specify groups of links or nodes that are implicitly connected to their containing expression.
- the linkage "->" or "<-" specifies how the containing and the contained primitive are connected. It can be omitted once if the outer primitive is tagged with "->" or "<-" (which implies a form of attachment between outer and inner primitive).
- Flag literals concern flags set for the primitive.
- by default, entries are considered archive-worthy, and are intended as data additions to the database state, not as markers for deleted records.
- the unique modifier has two effects: it turns on the existence-testing as a precondition for the write to succeed; and it specifies what it means for something to "already exist” - do the type and name have to match? Type, name, and value? Just the value?
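The existence test implied by the unique modifier can be sketched in Python. This is a simplification under stated assumptions: the database is a list of dicts, and the unique set is a tuple of field names; none of these names are graphd API:

```python
def already_exists(db, candidate, unique_fields):
    """Existence-testing precondition for a unique write: the write
    fails if a stored primitive matches the candidate on every field
    named in the unique set, e.g. ("type", "name") or ("value",)."""
    return any(all(p.get(f) == candidate.get(f) for f in unique_fields)
               for p in db)

db = [{"type": "person", "name": "Pat", "value": "x"}]
assert already_exists(db, {"type": "person", "name": "Pat"}, ("type", "name"))
assert not already_exists(db, {"type": "person", "name": "Kim"}, ("type", "name"))
```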
- Each namespace has a well-known headnode.
- Unique labels are implemented as links between the named object and the namespace headnode. When inserting the links, they are inserted with a unique qualifier that includes the namespace and the value; expressing that the value has to be unique within the namespace.
- Pat contains a single unique cluster, made up of the nodes tagged Pat and Kim. Pat and Kim both have unique tags, and the link between them is part of Pat's unique set.
- Pat and Kim must both be unique, but their connection - Pat's right - isn't part of Pat's unique set.
- key-instruction-items: key-instruction-item key-instruction-items / key-instruction-item
- the key literal designates those parts of a template that identify a location in the graph (an "lvalue"); the remainder of the template describes the state that the caller wants that location to have, with the understanding that insertion, versioning, and partial addition can be used to bring about that state.
- the "write" call makes the minimal set of versioning changes and additions needed to bring about the desired state, and returns the set of GUIDs from the finished mix of old, versioned, and new.
- Key primitives cluster just like unique primitives. Each cluster contains the largest possible connected group of primitives that each have keys and are connected with links that are covered by keys. If a read query built from the cluster parts matches, the cluster matches, and will either version or just reuse the primitive network. (Each member of the key cluster versions its own corresponding member of the matched set, if needed.)
- the comparator literal influences the way unique- or key comparators compare values while checking for existence of a value. Its syntax is exactly the same as the comparator constraint syntax.
- comparator-literal: comparator-constraint
- a "cancel" request is an announcement from the client that it will not parse (i.e., throw away, skip) the results.
- a "status" request asks various modules in the graphd interior for information about their status and operation. This is intended as a testing and monitoring tool; the operations should be fast enough to not significantly affect server speed or functioning.
- status-request-items: status-request-item status-request-items / ""
- status-reply-items: status-reply-item status-reply-items / ""
- reply items occur in the order of the corresponding keywords in the request.
- status-reply-item: status-access-reply / status-connection-reply
- the connection reply is a parenthesized list of per-connection data, one for each connection (or "session") to the server.
- status-connection-reply-items: status-connection-reply-item status-connection-reply-items / ""
- the database reply is a parenthesized list of name/value pairs. For their meaning, consult the gstatus.1 manual page.
- status-database-reply-items: status-database-reply-item status-database-reply-items / ""
- status-database-reply-item:
- the memory reply is just a list of strings; the strings are produced by the memory trace module of graphd, if graphd is started with the "-t" option. Otherwise, the memory reply is null.
- status-memory-reply-items: status-memory-reply-items status-memory-reply-item
- the strings internally have a form dictated by the memory trace module (cm_trace()); typically, something like
- the access reply lists the current global server access level, as set with the "set" command.
- the loglevel reply lists, as a parenthesized list, the loglevels and log-facilities currently in effect.
- a Boolean value indicating whether the server is currently inclined to dump core on programmer error or not.
- the value can be changed with the "set" command.
- the value can be changed with the "set" command.
- the names are the same as in the "cost" request parameter; where present, they define global limits for any request in the system.
- the value can be changed with the "set" command.
- a string value giving the "build version" of graphd.
- the detailed syntax depends on the build system; currently, it is "gd/trunk/graph/src:8439:8454M", that is, the significant parts of the svn URL, followed by the svnversion output for the graphd build root.
- an empty peer address string indicates that the corresponding connection does not exist.
- An empty address url string indicates that the corresponding address has not been configured.
- a dump request saves contents from the local database in a textual form, fit for use with a later "restore” command.
- dump-constraints: dump-constraint dump-constraints
- dump-tuples: dump-tuple dump-tuples / ""
- the restore request feeds state generated by the dump request back into the database.
- if a restore request starts at a startstate of zero, the existing database state in the server is destroyed prior to the restore. If a restore request doesn't start at zero, its startstate must be less than or equal to the highest endstate of the server into which the data is being restored. If the startstate is less than the highest endstate (the server's horizon), the restored values are compared to what is in the database, and the restore fails if they differ. This allows one to retry a failed incremental restore.
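The admission rules for a restore request can be sketched as follows; the function and its return labels are illustrative, not part of graphd:

```python
def check_restore(startstate, horizon):
    """Decide how a restore request is admitted.

    horizon is the server's highest endstate.  A zero startstate wipes
    the existing database first; an overlapping restore is verified
    against stored values (and fails later if they differ), which is
    what makes a failed incremental restore retryable."""
    if startstate == 0:
        return "destroy-and-restore"
    if startstate > horizon:
        raise ValueError("startstate beyond server horizon")
    if startstate < horizon:
        return "verify-then-append"
    return "append"

assert check_restore(0, 100) == "destroy-and-restore"
assert check_restore(50, 100) == "verify-then-append"
assert check_restore(100, 100) == "append"
```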
- (in this context, "zero" is the first startstate returned by a dump command; it doesn't matter how it is encoded.) restore-request:
- set-options: set-option set-options
- access-option: "restore" ; reject write, read / "replica" ; operate as a replica / "replica-sync" ; read-only replica / "read-only" ; reject write / "read-write" ; allow all
- when setting a loglevel, the command accepts either a single loglevel or a parenthesized list of loglevels.
- the server loglevel is set to the union of all the listed loglevels.
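The union rule for loglevels is simple enough to state as a one-line sketch (illustrative Python, not graphd code):

```python
def effective_loglevel(levels):
    """The server loglevel is the union of all listed loglevels; a
    single loglevel is treated as a one-element list."""
    if isinstance(levels, str):
        levels = [levels]
    return set(levels)

assert effective_loglevel("debug") == {"debug"}
assert effective_loglevel(["debug", "error", "debug"]) == {"debug", "error"}
```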
- loglevel-option-list: loglevel-option-list loglevel-option-item / ""
- Mixed in with the loglevels can be at most one session displayname, as listed e.g. in the response to status (conn).
- loglevel-option-session: ip-address ":" port ; identifying a session
- the debuglevel set is not that of the global server system, but only of that session. In other words, while that session serves its requests, the loglevel is temporarily set to the level indicated. To remove the special session loglevel, run a set loglevel command with only a session level, no specific loglevel.
- the "sync” option is a server-wide value that controls whether an "ok” response from the server immediately followed by a server shutdown or crash still guarantees that the data can be recovered from disk.
- the "core” option controls whether the server will dump core when crashing.
- a replica request starts the replication protocol.
- the request is sent by the replication client (a replica database) to the replication server (the master database).
- replica-options: replica-option replica-options / ""
- version-option: number ; the id of the first primitive to send
- a replica response consists of a protocol version and the url of the write-master:
- after a positive replica-request/response, the replication server will emit replica-write commands.
- Each replica write command contains the primitives written by one or more write commands.
- the first primitive in the dump tuples will always have the TXSTART bit set.
- the client sends the server the highest protocol version it is capable of understanding.
- the server responds with the protocol version it will use, or an error if it determines that it cannot talk with the client.
- Either side of the replication protocol may drop the connection at any time. If the replication connection is dropped, the client is expected to try to re-establish it at "sensible" intervals.
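The version negotiation described above can be sketched in Python; the function name and the set-of-versions representation are assumptions for illustration:

```python
def negotiate_version(client_max, server_supported):
    """Replication handshake: the client sends the highest protocol
    version it understands; the server answers with the version it
    will use, or an error if it cannot talk with the client."""
    usable = [v for v in server_supported if v <= client_max]
    if not usable:
        raise RuntimeError("no common replication protocol version")
    return max(usable)

assert negotiate_version(3, {1, 2, 3, 4}) == 3
assert negotiate_version(5, {1, 2}) == 2
```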
- Replica databases use the master url returned in the replica response to forward write requests on to the master graphd. If the connection between a replica and the master graphd fails, the replica is expected to drop the replication connection and re-establish it with the check-master option.
- a "sync" request performs a database checkpoint and returns the new horizon of the database:
- when coupled with the "replica-sync" access mode, the "sync" command produces a stable disk database that can be easily copied to a backup store.
- \a is read as a for all a other than "n" on input. There are no hex or octal escapes. (\0 evaluates to the digit 0, ascii 0x30, not to NUL, ascii 0x00.)
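The input escape rule can be sketched as a small decoder (illustrative Python, not graphd's tokenizer):

```python
def unescape(s):
    r"""Decode input escapes: a backslash followed by any character a
    reads as a, except \n which reads as a newline.  There are no hex
    or octal escapes, so \0 is the digit 0 (0x30), not NUL."""
    out, i = [], 0
    while i < len(s):
        if s[i] == "\\" and i + 1 < len(s):
            c = s[i + 1]
            out.append("\n" if c == "n" else c)
            i += 2
        else:
            out.append(s[i])
            i += 1
    return "".join(out)

assert unescape(r"\0") == "0"      # digit zero, not NUL
assert unescape("a\\nb") == "a\nb" # \n is the one real escape
assert unescape('\\"') == '"'
```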
- GUIDs are rendered as 32-byte hex strings: guid: 32hex
- a primitive type is rendered as a string. In modern notations, it is rendered as a GUID.
- the value is rendered as a string.
- on output, the timestamp is rendered with 16 bits of sub-second precision. On input, shorter timestamps are accepted. timestamp: y?yyyy ["-" mm ["-" dd ["T" hh [":" mm [":" ss ["." +n ]]]]]] ["Z"]
- the timezone is optional; if it is supplied, it must be "Z", standing for UTC.
- valuetype: a small integer in the range 1..255. If unspecified, the valuetype assigned to primitives with a value is 2; without a value, 1. valuetype: number
- Valuetype and datatype are the same underlying value.
- integers in range 1..9 are printed as the name of their corresponding datatype; datatypes outside that range are printed as numbers.
- as a valuetype, all values print as numbers.
- a template expression is a parenthesized, comma-separated list of zero or more constraints.
- template-expression: "(" [ constraints ] ")" constraints: constraints constraint / constraint
- each parenthesized nesting level of templates evaluates to a list of primitive components (GUID, name, value, and so on) that matched the query. If a constraint holds subconstraints, the subconstraints, too, evaluate to nested lists of the primitives that matched that particular subconstraint. To read more about that, see section 8, primitive constraints. Multiple constraints combine conjunctively: the combination matches primitives that match all of the subconstraints. constraint: or-constraint / constraint
- Meta constraints other than "->" and "<-" are deprecated. The words "node" and "any" are skipped without meaning. meta-constraint:
- Matching link primitives have in their "right” field the GUID of a primitive that matches the surrounding node expression, if any. Subordinate node primitive constraints must be matched by primitives whose GUIDs appear in the link's "left" field.
- a valuetype constraint restricts by the "valuetype”, a small integer in the range 1..255 that nominally refers to the data type of the value string, but actually has almost no semantics inside graphd.
- GUID CONSTRAINT A GUID constraint addresses the matching primitive directly: guid-constraint:
- queries only return the newest, existing version of an object in the repository. Therefore, this search will find either the newest version of the object that shares application-defined identity with 1234567890abcdef1234567890abcdef, or fail because the object has been deleted.
- a timestamp constraint matches primitives created at, before, or after a certain time.
- timestamp numop timestamp / "virtual time" diffop virtual time numop:
- Omitted timestamp parts default to the minimum value possible (0 for minutes, seconds, and hours; 1 for months and days.)
- the virtual time can be specified as a string or a GUID.
- a virtual time that is a GUID is the time just after the creation of the specified GUID.
- An explicit "newest” or “oldest” overrides that default.
- the "type" of a primitive is the unique label: either null or the string value of the link connecting the primitive's typeguid to the global type name space primitive (a node with name "metaweb:global-type-namespace").
- a value is equal to null if it is null (unset, or explicitly set to the unquoted atom null).
- a value is equal to a string if it compares equal to it using the comparator (by default, a case-insensitive comparison for US-ascii, bytewise comparison for everything else.)
- the globbing semantics implemented by the default comparator are described in detail in gr-simple-search.txt. Briefly, "foo" searches for the word "foo" anywhere in the text; anchoring, if needed, uses ^ to mean beginning and $ to mean end. An "*" at the end of a word means "starting with", so "foo*" finds all words starting with foo.
- the application can expect matches with a trailing * to be reasonably fast.
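The word-search part of the globbing semantics can be approximated with regular expressions. This sketch covers only the bare-word and trailing-* cases described above, not the full gr-simple-search.txt behavior (the ^ and $ anchors are omitted):

```python
import re

def match_word(pattern, text):
    """Approximate the default comparator's search: "foo" finds the
    whole word foo anywhere in the text, case-insensitively; "foo*"
    finds any word starting with foo."""
    if pattern.endswith("*"):
        rx = r"\b" + re.escape(pattern[:-1]) + r"\w*"
    else:
        rx = r"\b" + re.escape(pattern) + r"\b"
    return re.search(rx, text, re.IGNORECASE) is not None

assert match_word("foo", "A Foo here")
assert match_word("foo*", "foolish things")
assert not match_word("foo", "foolish things")
```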
- a "comparator constraint" can be specified to change the way strings are compared, both for a value comparison and for sorting; see 7.3 for details.
- the identity and matching constraints are controlled by a value-comparator; the sorting constraints are controlled by a sort-comparator.
- the sort-comparator is a list; each Nth element of the sort comparator list controls the matching for the Nth sort key. The last element of the list is used for all following keys.
- comparators have names; one changes the default by mentioning the name of the desired comparator.
- Comparators can be changed per-access, and do not affect the way primitives are stored in the database.
- comparator-constraint
- comparator names have two parts: an optional locale specifier (separated by a ";") and the comparator's locale-independent name.
- the locale-independent names are matched exactly; the locale-prefixes in a query can be more specific than the ones they match.
- an "en-UK;casemap” comparator requested by a user may match an "en;casemap” comparator installed in the system.
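The locale-prefix matching rule can be sketched in Python; the function is illustrative and the "locale;name" split is the only structure assumed from the text above:

```python
def comparator_matches(requested, installed):
    """A comparator name is an optional locale prefix, ";", and a
    locale-independent name.  Names must match exactly; the requested
    locale may be more specific than the installed one, so
    "en-UK;casemap" matches an installed "en;casemap"."""
    req_loc, _, req_name = requested.rpartition(";")
    ins_loc, _, ins_name = installed.rpartition(";")
    if req_name != ins_name:
        return False
    return (req_loc == ins_loc
            or ins_loc == ""                      # locale-less comparator
            or req_loc.startswith(ins_loc + "-")) # more specific request

assert comparator_matches("en-UK;casemap", "en;casemap")
assert comparator_matches("octet", "octet")
assert not comparator_matches("en;casemap", "en;octet")
```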
- the "octet" comparator is useful for strings that are not human readable, e.g. base-64 encoded byte strings.
8.3.2 Default Comparator
- the default comparator (which need not be set explicitly, but could be called “default” if it were) performs two services:
- Primitive constraints match against a list of primitives, specified either as GUIDs or as recursive template-expressions.
- primitive-constraint
- the subconstraint matches primitives that point to the outside in one of their four connecting fields.
- Flag constraints regulate flags set for the primitive; right now, there are two: "live" and "archival". In addition, there is a monadic constraint "false" that causes the read constraint it appears in not to match. For each flag, we can filter to allow only true, only false, or allow either of them ("dontcare"). flag-constraint: archival-constraint
- Records with the "live” flag set to "false” are markers that indicate the deletion of a previously existing record. By default, they are not returned in query replies.
- if the "live" constraint is set to "dontcare", the query response will include primitives that are deletion markers. If it is set to "false", only deletion markers will match. If it is set to "true" (the default), only live (non-deleted) primitives match the query.
- mirror owners can make a separate choice to mirror old versions, perhaps further constrained by their number or time range.
- Records with the "archival" flag set to "true” are primary sources of the information they carry. Records with the flag set to "false” are deemed ephemeral, and can either be regenerated or are just generally not considered important.
- next and previous constraints specify GUIDs of a predecessor or successor primitive.
- next-previous-constraint: "next" guidop guidset
- the "next" constraint is always null.
- types can also be specified as strings, using the "type" parameter.
- Instructions concern details of how the results of a match are sorted, paginated, and returned. They don't affect what matches, just what happens to it. (Nevertheless, they can be used for optimizations while the match is going on.)
- Each "match” in that list is one instance of a sub-graph that matches the template.
- the "result" instruction controls how that list of matches is mapped to the tuple returned by a READ request. (Result instructions are not allowed in any other kind of request.)
- a result list is nested at most two parentheses deep, and contains at most one such nested list.
- the outer list, or an outer scalar, is returned once per matching list.
- the inner list is repeated once for each matching list element.
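The outer-once, inner-per-match mapping can be sketched as follows. This is an illustrative model, not graphd's renderer: the instruction is a flat list whose items are either scalar names or one nested list, and matches are dicts:

```python
def render_result(matches, instruction):
    """Map a list of matches to a result tuple: scalar items (like
    "count") are rendered once for the whole match list; the single
    nested list is repeated once per matching element."""
    out = []
    for item in instruction:
        if isinstance(item, list):
            # inner list: one row per match
            out.append([[m[k] for k in item] for m in matches])
        elif item == "count":
            out.append(len(matches))
        else:
            out.append(matches[0][item] if matches else None)
    return out

matches = [{"guid": "g1", "value": "v1"}, {"guid": "g2", "value": "v2"}]
assert render_result(matches, ["count", ["guid"]]) == [2, [["g1"], ["g2"]]]
```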
- result-instruction-item: tuple-instruction-item / contents-instruction-item
- for each result-instruction-item, the corresponding result-response-item is rendered in order of appearance in the result instruction, as described below.
- the x-response-item appears in the response in a position corresponding to that of the x-instruction-item in the read request.
- the token "contents” is replaced with a comma-separated list of the results for dependent links or nodes that were matched by the query.
- Each element of the "contents-response-item” is the result for one potential query-satisfying subgraph, sorted and paged (if specified) according to the sorting and paging criteria in the subgraph.
- contents-instruction-item: "contents" contents-response-item: [result-responses] result-responses: result-response
- the token "estimate-count" is replaced with an internal iterator estimate of the number of available records. This estimate will usually exceed the actual count. estimate-count-instruction-item:
- estimate-count-response-item: count-response-item / "null". If no estimate is available, the estimate count may be returned as "null", meaning "we don't know".
- result parameters "value” and “count” operate on two different levels - the “count” affects the result list as a whole (just like sort and page do), while other result parameters determine individual, repeated elements of the list.
Abstract
According to the invention, a new database design is implemented. In the database, everything is modeled with primitives, including the links and nodes for graph tuple storage. A query syntax provides a nested tree of constraints with a single global schema. Various query optimization techniques and replication techniques are also described.
Applications Claiming Priority (8)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US14624009P | 2009-01-21 | 2009-01-21 | |
| US61/146,240 | 2009-01-21 | ||
| US12/690,696 | 2010-01-20 | ||
| US12/690,642 | 2010-01-20 | ||
| US12/690,727 US20100121839A1 (en) | 2007-03-15 | 2010-01-20 | Query optimization |
| US12/690,642 US20100174692A1 (en) | 2007-03-15 | 2010-01-20 | Graph store |
| US12/690,696 US8204856B2 (en) | 2007-03-15 | 2010-01-20 | Database replication |
| US12/690,727 | 2010-01-20 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2010085523A1 (fr) | 2010-07-29 |
Family
ID=42356195
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2010/021579 Ceased WO2010085523A1 (fr) | 2009-01-21 | 2010-01-21 | Stockage d'un graphique |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2010085523A1 (fr) |
Cited By (16)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US8204856B2 (en) | 2007-03-15 | 2012-06-19 | Google Inc. | Database replication |
| US8935299B2 (en) | 2012-07-19 | 2015-01-13 | Facebook, Inc. | Identifying relevant data for pages in a social networking system |
| US9141707B2 (en) * | 2012-07-19 | 2015-09-22 | Facebook, Inc. | Context-based object retrieval in a social networking system |
| US9224103B1 (en) | 2013-03-13 | 2015-12-29 | Google Inc. | Automatic annotation for training and evaluation of semantic analysis engines |
| US9235626B2 (en) | 2013-03-13 | 2016-01-12 | Google Inc. | Automatic generation of snippets based on context and user interest |
| US9235653B2 (en) | 2013-06-26 | 2016-01-12 | Google Inc. | Discovering entity actions for an entity graph |
| US9342622B2 (en) | 2013-06-27 | 2016-05-17 | Google Inc. | Two-phase construction of data graphs from disparate inputs |
| US9454599B2 (en) | 2013-10-09 | 2016-09-27 | Google Inc. | Automatic definition of entity collections |
| US9659056B1 (en) | 2013-12-30 | 2017-05-23 | Google Inc. | Providing an explanation of a missing fact estimate |
| US9785696B1 (en) | 2013-10-04 | 2017-10-10 | Google Inc. | Automatic discovery of new entities using graph reconciliation |
| US9798829B1 (en) | 2013-10-22 | 2017-10-24 | Google Inc. | Data graph interface |
| US10002117B1 (en) | 2013-10-24 | 2018-06-19 | Google Llc | Translating annotation tags into suggested markup |
| US10089408B2 (en) | 2014-10-16 | 2018-10-02 | Adp, Llc | Flexible graph system for accessing organization information |
| US10223637B1 (en) | 2013-05-30 | 2019-03-05 | Google Llc | Predicting accuracy of submitted data |
| US10713261B2 (en) | 2013-03-13 | 2020-07-14 | Google Llc | Generating insightful connections between graph entities |
| US10810193B1 (en) | 2013-03-13 | 2020-10-20 | Google Llc | Querying a data graph using natural language queries |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20030236795A1 (en) * | 2001-05-03 | 2003-12-25 | Kemp Thomas D. | Method and system for identifying objects |
| US20040225865A1 (en) * | 1999-09-03 | 2004-11-11 | Cox Richard D. | Integrated database indexing system |
| US20060212432A1 (en) * | 1999-01-05 | 2006-09-21 | Tsai Daniel E | Distributed database schema |
| US20060218123A1 (en) * | 2005-03-28 | 2006-09-28 | Sybase, Inc. | System and Methodology for Parallel Query Optimization Using Semantic-Based Partitioning |
2010
- 2010-01-21 WO PCT/US2010/021579 patent/WO2010085523A1/fr not_active Ceased
Cited By (24)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US8204856B2 (en) | 2007-03-15 | 2012-06-19 | Google Inc. | Database replication |
| US8935299B2 (en) | 2012-07-19 | 2015-01-13 | Facebook, Inc. | Identifying relevant data for pages in a social networking system |
| US9141707B2 (en) * | 2012-07-19 | 2015-09-22 | Facebook, Inc. | Context-based object retrieval in a social networking system |
| US10311063B2 (en) | 2012-07-19 | 2019-06-04 | Facebook, Inc. | Context-based object retrieval in a social networking system |
| US9720913B1 (en) | 2013-03-13 | 2017-08-01 | Google Inc. | Automatic generation of snippets based on context and user interest |
| US11403288B2 (en) | 2013-03-13 | 2022-08-02 | Google Llc | Querying a data graph using natural language queries |
| US10810193B1 (en) | 2013-03-13 | 2020-10-20 | Google Llc | Querying a data graph using natural language queries |
| US10713261B2 (en) | 2013-03-13 | 2020-07-14 | Google Llc | Generating insightful connections between graph entities |
| US9235626B2 (en) | 2013-03-13 | 2016-01-12 | Google Inc. | Automatic generation of snippets based on context and user interest |
| US9224103B1 (en) | 2013-03-13 | 2015-12-29 | Google Inc. | Automatic annotation for training and evaluation of semantic analysis engines |
| US11526773B1 (en) | 2013-05-30 | 2022-12-13 | Google Llc | Predicting accuracy of submitted data |
| US10223637B1 (en) | 2013-05-30 | 2019-03-05 | Google Llc | Predicting accuracy of submitted data |
| US9235653B2 (en) | 2013-06-26 | 2016-01-12 | Google Inc. | Discovering entity actions for an entity graph |
| US10614127B2 (en) | 2013-06-27 | 2020-04-07 | Google Llc | Two-phase construction of data graphs from disparate inputs |
| US9342622B2 (en) | 2013-06-27 | 2016-05-17 | Google Inc. | Two-phase construction of data graphs from disparate inputs |
| US10331706B1 (en) | 2013-10-04 | 2019-06-25 | Google Llc | Automatic discovery of new entities using graph reconciliation |
| US9785696B1 (en) | 2013-10-04 | 2017-10-10 | Google Inc. | Automatic discovery of new entities using graph reconciliation |
| US9454599B2 (en) | 2013-10-09 | 2016-09-27 | Google Inc. | Automatic definition of entity collections |
| US9798829B1 (en) | 2013-10-22 | 2017-10-24 | Google Inc. | Data graph interface |
| US10002117B1 (en) | 2013-10-24 | 2018-06-19 | Google Llc | Translating annotation tags into suggested markup |
| US10318540B1 (en) | 2013-12-30 | 2019-06-11 | Google Llc | Providing an explanation of a missing fact estimate |
| US9659056B1 (en) | 2013-12-30 | 2017-05-23 | Google Inc. | Providing an explanation of a missing fact estimate |
| US10089408B2 (en) | 2014-10-16 | 2018-10-02 | Adp, Llc | Flexible graph system for accessing organization information |
| US10783213B2 (en) | 2014-10-16 | 2020-09-22 | Adp, Llc | Flexible graph system for accessing organization information |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US8204856B2 (en) | Database replication | |
| US20100121839A1 (en) | Query optimization | |
| US20100174692A1 (en) | Graph store | |
| US20110093500A1 (en) | Query Optimization | |
| WO2010085523A1 (fr) | Stockage d'un graphique | |
| Diao et al. | Path sharing and predicate evaluation for high-performance XML filtering | |
| Cattell | Scalable SQL and NoSQL data stores | |
| Copeland | MongoDB Applied Design Patterns: Practical Use Cases with the Leading NoSQL Database | |
| CN101158958B (zh) | 基于MySQL存储引擎的融合查询方法 | |
| Khelil et al. | Combining graph exploration and fragmentation for scalable RDF query processing | |
| Valer et al. | XQuery processing over NoSQL stores. | |
| Pelgrin et al. | TrieDF: Efficient in-memory indexing for metadata-augmented RDF | |
| Frey | Indexing ajax web applications | |
| Mami | Strategies for a semantified uniform access to large and heterogeneous data sources | |
| US8745035B1 (en) | Multistage pipeline for feeding joined tables to a search system | |
| CN113032450A (zh) | 一种数据存储与检索方法、系统、存储介质、处理终端 | |
| WO2003042873A1 (fr) | Procede et systeme d'indexation et de recherche de donnees semi-structurees | |
| Faroult et al. | Refactoring SQL applications | |
| Membrey et al. | MongoDB Basics | |
| Kuć | Solr Cookbook | |
| Zheng | View Management for Graph Databases | |
| Pirzadeh | On the performance evaluation of big data systems | |
| Wang et al. | Improving the performance of precise query processing on large-scale nested data with uniHash index | |
| Karanasos et al. | The ViP2P platform: XML views in P2P | |
| Wang | COCALEREX: An engine for converting catalog-based and legacy relational databases into XML |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 10733833 Country of ref document: EP Kind code of ref document: A1 |
|
| 122 | Ep: pct application non-entry in european phase |
Ref document number: 10733833 Country of ref document: EP Kind code of ref document: A1 |
|
| NENP | Non-entry into the national phase |
Ref country code: DE |