WO2012052785A1 - Versioned data structure - Google Patents
Versioned data structure Download PDFInfo
- Publication number
- WO2012052785A1 WO2012052785A1 PCT/GB2011/052060 GB2011052060W WO2012052785A1 WO 2012052785 A1 WO2012052785 A1 WO 2012052785A1 GB 2011052060 W GB2011052060 W GB 2011052060W WO 2012052785 A1 WO2012052785 A1 WO 2012052785A1
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- node
- pointer
- version
- data
- key
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
- G06F16/2246—Trees, e.g. B+trees
Definitions
- This invention relates to methods, systems and media for data storage.
- this invention relates to methods, systems and media for data storage using fully persistent B-trees for storing the data in external memory.
- the problem of designing persistent data structures has a long and interesting history.
- the basic problem is to allow versioned updates to an "ephemeral" data structure (i.e. one without a notion of versions), such that queries can be issued to the data structure as it was at some earlier version.
- a versioned dictionary which stores keys and values with a version tree is provided.
- the version tree provided (either explicitly or implicitly) has nodes which represent versions and children of a version are "newer" versions. Updates are only permitted to leaf versions, that is, versions with no children.
- the data structure also supports range queries which return every key in a given range together with the value written in the closest ancestor to the specified version, and a clone operation that creates a new version as a child of the specified version, inheriting all its parents' keys and values, until they are overwritten in the child version.
- N be the total number of updates performed
- B be the block size of external memory.
- N v we define N v as total number of elements that are accessible in version v. Since each key is live in at least one version, we have ⁇ v N v ⁇ N ⁇ N W (for any version w).
- Partially-persistent data structures possess a version tree that is a line; that is, all versions except the leaf have exactly one child, and the version with no children is the unique "most recent" version, i.e. clone operations are only permitted on leaf nodes.
- Fully-persistent data structures have a version tree which is permitted to be any tree; that is, some versions may have more than one child, so there may be many "most recent” versions, one for each leaf. Handling the fully-persistent case is more difficult than the partially-persistent case.
- a versioned dictionary can be thought of as efficiently implementing the union of many dictionaries: the live keys at version v are the union of all the keys in ancestor versions, where if a key appears more than once, its closest ancestor takes precedence.
- the B-tree is a known data structure for storing a dictionary, i.e. a mapping from some ordered set of keys to associated data items. Bayer and McCreight describe such a data structure in Organization and maintenance of large ordered indexes, Acta Informatica, 1(3):173-189, 1972. It performs updates in 0(log B N) worst-case lOs, and queries (returning Z elements) in 0(log B N + Z/B) lOs. All of these lOs may potentially be random.
- B-trees are known to exist as a "partially-persistent" B-tree as described by Becker et al in An asymptotically optimal multiversion b-tree.
- the VLDB Journal, 5(4):264- 275, 1996 which solves the space problem. It achieves 0(log B Nv) lOs for updates and queries with O(N) space, but is only partially-versioned and does not support any other trade-offs.
- the CoW B-tree has several problems: It has poor space utilization in general, each update may cause a new path to be written, giving 0(NB log B N) space (typically, B is thousands in practice and log B N ⁇ 5), and it cannot offer fast inserts, in particular better than 0(log B N) 10s per update.
- the B+-tree, Lanka et al presented two schemes for a fully persistent B+-tree called "fat field and '"pure version block".
- the fat field scheme uses alternating layers of "version blocks” and "index blocks”.
- the pure version block scheme contains a B-tree of version blocks, and an additional leaf layer of blocks that resolve the desired versions of each block. Both of these schemes perform poorly for range queries (using approximately Q(r + log B n) lOs to access a range of r blocks in a given version), compared to the optimal time of 0(r I b + log b n) when n blocks are accessible in the version being queried.
- the present invention provides a system for storing one or more data records in a B-tree data structure comprising:
- the B-tree comprises one or more nodes
- each node comprises a version number, one or more keys and one or more pointers
- each of said pointers comprises either an internal pointer which links said nodes together or a pointer which associates the one or more keys with said one or more data records, and
- pointers are ordered lexicographically by key, then by version number in an order consistent with a depth-first search of the version tree from its root.
- the present invention provides a system for storing one or more data records in a B-tree data structure comprising: one or more nodes and an associated version tree,
- each node comprises one or more ordered entries, and each entry comprises a key, a version, and a data record or a pointer.
- the system comprises a B-tree which comprises the one or more nodes.
- the B-tree could be partially persistent, but preferably the B- tree is a fully persistent B-tree.
- the nodes comprise one or more internal nodes and one or more leaf nodes. Each version in each entry has a corresponding version tree node in the version tree.
- each entry in the one or more nodes comprises a plurality of keys. In one set of embodiments each entry in the one or more nodes comprises a plurality of versions. In one set of embodiments each entry in the one or more nodes comprises a plurality of data records. In one set of embodiments each entry in the one or more nodes comprises a plurality of pointers. Every entry need not comprise a plurality of keys, versions, data records and pointers, it is envisaged that in some embodiments some of the entries comprise any
- the key is stored only in the first entry.
- the pointer can comprise an internal pointer which links or associates said nodes together.
- the pointer can comprise a data pointer which directly links or associates said one or more keys with said one or more data records.
- the pointer can comprise a virtual pointer which indirectly links or associates said one or more keys with said one or more data records via another node.
- the virtual pointers point from an internal node to a leaf node and act as proxies for data pointers.
- the present invention provides a B-tree construction for storing data in a fully-persistent versioned storage system.
- This construction is asymptotically optimal in space, as well as for range queries and updates, compared to the existing fully-persistent constructions which are suboptimal in the disk access model (DAM) in either space, query (range query) time or update time.
- DAM disk access model
- Insertions, range queries and lookups in the tree are efficient due to an algorithm for efficiently "compiling" the nodes as the B-tree is traversed.
- Data blocks have a small number of incoming pointers, which allows data blocks to be efficiently migrated on-disk.
- Each of the pointers links a key to a data record.
- Each of said pointers could comprise a version number, but preferably only the pointers comprise a version number if it is different to the version number of the corresponding node, i.e. the pointer is a modified pointer.
- the nodes are labelled with a version number, i.e. a "canonical version" v
- only the internal pointers (k,v') with v' ⁇ v need to store the label v' with their representation.
- the internal pointers are labelled by (k,v) where k is the key and v is the version number from the chosen traversal of the version tree.
- the keys are ordered by a natural ordering, e.g. byte order or integer order, and also preferably, the versions are ordered by descending depth-first search rank.
- the entries in each node could be ordered lexicographically by key and then version, or they could be ordered lexicographically by version and then key.
- the version tree comprises one or more ordered version tree nodes labelled by user-defined labels, e.g. in order of their creating, and a mapping from the user-defined labels to the orderings of the version tree nodes.
- the ordering of the versions can be a depth-first ordering.
- Each pointer can be either an original pointer or a modified pointer, i.e. the data to which a modified pointer is pointing has been modified.
- Modified pointers are tagged with a version number from the version tree, and original pointers are assumed to have the same version as the node containing them.
- Pointers are generally labelled by the key and the version of the corresponding node, i.e. the one in which they point from.
- pointers are ordered lexicographically by key first, then by version number in an order consistent with a depth-first search of the version tree, e.g. from its root.
- the pointers being ordered allows an efficient search of the stored data.
- the ordering has the property that it is efficient to determine if one version is an ancestor of another or not.
- the pointers can comprise an internal pointer, a data pointer or a virtual pointer, depending on their position in the B-tree.
- the nodes can comprise an internal node or a leaf node depending on their position in the B-tree.
- Internal pointers link nodes together, with an internal node only being able to be linked by internal pointers.
- Internal pointers represent the structure of the B-tree, and point to other nodes strictly further away from the root than the node that contains them; data pointers point to data records; and virtual pointers, which are proxies for data pointers, point to leaf nodes. In the system of the present invention it is only necessary to have internal pointers linking nodes together and data pointers pointing from leaf nodes to data records.
- the nodes in the B-tree of the present invention form a directed acyclic graph; however in general there may be several root nodes. Root nodes are ones which are not pointed to by any other (i.e. with no 'parent'). Furthermore, any given node might be pointed to by more than one node (i.e. have multiple 'parents').
- the version tree associated with the B-tree can be mapped to the root node, and is known as a "root table". Any dictionary data structure (for example a standard B- tree) can be used to maintain this version tree, which itself does not need to be versioned.
- a root version 0 is chosen and an entry pointing to an empty root node is created; subsequently, when a new version v is created as a child of version w, a new entry in the table is created, mapping version v to the same root that version w is mapped to.
- the nodes comprise a list of modifications to the data records.
- the version in each entry comprises a version slice comprising a first version v and a second version w, wherein v ⁇ w, i.e. the version slice comprises a pair of versions, v and w, wherein w is either equal to, or is a child of, v.
- the version slice (v, w) represents a subset of the version tree, namely all those versions, y, that are weakly descendant from v and have depth first search (DFS) order greater than or equal to that of w.
- DFS depth first search
- version slices An important sub-class of version slices is the set of simple version slices, which are those of the form (v, v), i.e. the sub-tree of V rooted at v. It is easy to see that, for simple slices, (v, v) ⁇ (w, w) if, and only if, v ⁇ w. For this reason, we often refer to a simple version slice (v, v) as v.
- the "Version Slice B-tree” uses space 0(N + V), supports updates to a version v in 0(log B N v ) lOs, and supports range queries at version v returning Z elements in 0(log B Nv + Z/B) lOs. It can be seen it as a fully-versioned B-tree.
- Standard buffering techniques can be applied in order to improve the update performance.
- the version slices (v, w) are ordered lexicographically by the second version w, and then by the first version v, where versions are ordered by their rank in a depth-first search of the version tree from its root. For example the version slices are ordered such that ⁇ v,w) precedes ⁇ v',w') if and only if the depth- first search rank of w is greater than or equal to the depth-first search rank of w'.
- the invention provides a system for storing one or more data records in a B-tree data structure comprising:
- the B-tree comprises a plurality of nodes, said plurality of nodes including a plurality of root nodes,
- each node is labelled with a base version number, v n , and one or more keys and one or more pointers,
- the pointers comprise a first version number, w, a second version number, v, and wherein w ⁇ v,
- each of said pointers comprises either an internal pointer which links said nodes together or a pointer which associates the one or more keys with said one or more data records, and
- pointers are ordered lexicographically by key, then by the first version number, and then by the second version number, in an order consistent with a depth-first search of the version tree from its root.
- the invention provides a system for storing one or more data records in a B-tree data structure comprising:
- the B-tree comprises a plurality of nodes, said plurality of nodes including one or more root nodes,
- each node contains a plurality of entries (k,v,w,x), wherein each entry comprises one or more keys, k, a version slice, ⁇ v, w), and either a value, x, or a pointer, x, to another node, and
- each node is labelled with a base version slice (v, w) .
- version slices can be ordered as follows:
- the structure of the B-tree when including version slices is based on the same structure as for the first and second aspects of the invention, except that the nodes can additionally comprise a plurality of keys, k, which are associated with different version slices.
- the nodes comprise one or more root nodes, and preferably each version is associated to a root node, and in one example all the versions are associated to a unique root node.
- the tree has a unique root node, which is referred to as root.
- Every node n in a version slice B-tree can store up to B entries of the form (k,v,w,x), where k is a key, (v,w) is a version slice, and x is either a value (for leaf nodes), or a pointer to another node n' (written p(n')).
- each node is labelled with a base version, a base version slice (v, w) .
- v, w base version slice
- every entry that points to node n has version slice
- entries are ordered lexicographically by key and then by the second version slice coordinate, w.
- entry ordering is not quite total: for each given pair of values k,w there may be at most two entries: the first version coordinate v is constrained by the fact that (v, w) is a version slice, to be equal to either w or the parent of w. The ambiguity of ordering is resolved by always placing the former entry first (i.e. (k, w, w, x) before (k, v, w, x) for any v ⁇ w).
- the pointers are labelled by the key and the version slice of the corresponding entry.
- the pointers are ordered
- A is the depth-first search rank of the first version v
- B is the maximum depth-first search rank of the second version w which is a descendant of the first version v.
- the B-tree of the present invention could be used to provide data storage in any type of computer memory, e.g. internal memory, tertiary memory or offline storage, but in a preferred set of embodiments the data storage is in external memory, e.g. on a hard disk drive.
- the present invention also provides a computer readable data storage medium for storing one or more data records in a B-tree as set out in the first and second aspects of the invention.
- the data storage is provided on a database server, e.g. in data communication with a plurality of clients.
- the arrangement and organisation of the storage system is controlled by a storage engine loaded onto the CPU of the database server.
- the storage engine is capable of performing the operations as detailed below to control and organise the data records which are stored on the memory of the database server.
- the data storage system of the present invention supports the basic operations of "lookup" of a key in a particular version, and "insert" of data with a given key in a particular version.
- An "effective" node, e v (n), of node n in version v is defined as exactly the set of most recent versions of every key in node n, from the view of version v. More precisely, e v (n) is the collection of all pointers p from node n such that if pointer p has the label (k',v') (denoting the key and version) then v' ⁇ v and there is no other pointer in n with label (k',v") and v' ⁇ v" ⁇ v .
- the query first searches for a least upper bound for (k,v) and then scans right to find an entry whose version slice is ancestral to v.
- the scan terminates with failure when it encounters an entry with the wrong key or runs out of entries to examine; it terminates with success if an entry is found whose version slice coordinates are ancestral to v. Note that for internal nodes it is not insisted upon that the key is identical to k in order to follow a pointer. This is the same as for a normal B-tree: when following pointers only a least upper bound is needed.
- lookup(k,v) returns the value associated to the pointer (k,w) where version w is the closest ancestor version of version v for which the pointer (k,w) exists (n.b. ancestorship is reflexive, so (k,v) will be found if it exists). It should be noted that a variety of semantics for the lookup operation can be supported.
- the lookup operation is performed as follows. First, lookup in the "root table" the version v in order to determine the starting node n. Then repeat the following algorithm until the search terminates:
- n is an internal node, find (by binary search on key) the pointer in the effective node e v (n) whose key k' is a least upper bound for k. Repeat the search from the node to which this pointer points.
- n is a leaf node, find (by binary search on key) the pointer in the effective node e v (n) whose key is k. If there is no such pointer then the key does not exist in the dictionary and we report an error; if the pointer is a data pointer then we return the data; if it is a virtual pointer, we continue the search at the leaf node to which it points.
- the deepest path from root to leaf is no bigger than 0(log B (N)) , where N is the number of keys, and at most one virtual pointer must be followed.
- the present invention provides a method for performing a lookup operation in a B-tree as set out in the first or second aspect of the invention, the method comprising the steps of:
- the effective nodes are cached in memory, using a key that can uniquely identify it; for example the block number and the version number for which it is an effective node.
- the Applicant has found that it is not necessary to compute effective nodes while traversing the structure in a lookup.
- n is not a leaf node. If v' ⁇ v then the pointer is followed with label (k',v'); otherwise scan right (i.e. in the direction of decreasing DFS ordering on versions and increasing key) to find the first pointer whose label has a version w ⁇ v. If there are no values (k w) with w ⁇ v then the search continues to the next highest key, and so on until a suitable upper bound key is found, and the corresponding pointer followed.
- n is a leaf node. If k' ⁇ k then report an error: key k is not in the dictionary in version v. Otherwise, scan to the right as for internal nodes to find the first (k,w) such that w ⁇ v and follow that when it is found, or report an error if it is not found (note that this fails as soon as there are no more versions for key k).
- the present invention provides a method for performing a lookup operation in a B-tree as set out in the first or second aspect of the invention, the method comprising the steps of:
- node n is not a leaf node, following the pointer with label (k', v') if v' ⁇ v, otherwise scanning right to find the first pointer whose label has a version w ⁇ v, if node n is a leaf node, scanning right to find the first pointer with label (k,w) such that w ⁇ v and following the pointer when it is found.
- a method for a lookup operation for entry (k,v) in a B-tree comprises the steps of:
- the operation insert(k,v,data) will now be described.
- the operation insert(k,v,data) inserts the key k at version v and associates to it some data payload.
- the data payload is a pointer (and hence has fixed size). It can be extended to deal with variable length pointers using well- known techniques.
- v is a leaf node of the version tree; from now on it is assumed that v is indeed a leaf.
- An insertion proceeds as follows: first perform the operation lookup(k,v) to find an appropriate leaf node n at which to insert the data. If n does not contain a pointer (k,v) then a pointer (k,v) is added to n that points to data.
- n contains a pointer for the key k in version v; in this case either it can be replaced with the new data pointer (overwriting the previous version v), assigned a new version v'>v (v' is a descendent of v) to the inserted key, or an error reported; the precise choice is up to the application.
- the present invention provides a method for performing an insert operation in a B-tree as set out in the first or second aspect of the invention, the method comprising the steps of:
- Live an entry (k, v, w, x) in node n is live in ( ⁇ ', w') if (v, w) ⁇ ( ⁇ ', w') and there is no other entry (k, v", w", x) in n with (v, w) ⁇ (v", w") ⁇ ( ⁇ ', w').
- (v,w) is a nearest ancestor to (v',w').
- Live(v, w) is written for the number of entries in n live in (v,w). Note that liveness is monotone with ancestorship: if (v, w) ⁇ ( ⁇ ', w') then live(v, w) ⁇ live(v', w'). For a node n with base version slice (v, w) we write live(n) for live (v, w) .
- entries (k, v, w, x) which are neither live in, nor lead below, a given version slice ( ⁇ ', w'), either because the two version slices are incomparable, or because (v, w) is an ancestor of ( ⁇ ', w') but not the nearest ancestor for key k.
- an insert operation into node n of a B-tree which comprises version slices comprises the steps of: inserting a data record into the data structure at node n,
- an insert operation into node n of a B-tree as set out in the first or second aspect of the invention comprises the steps of:
- node n is a root node, creating one or more root nodes to contain the pointer records, or
- the step of splitting the entries in node n between node n and any new node n' comprises moving some of the entries from node n to node n', and copying some of the entries from node n to node n'.
- the update procedure is carried out at node n, pointed to by some entry in m, which is null at the root node.
- the procedure splits full nodes on the way down towards the root using the split procedure, inserting the entry when at a leaf node (this simply inserts the entry e at the appropriate location in the node n). Inserting the new pointer at node n may exceed the capacity of the node; in this case, a sequence of split operations is applied at the node, which may cause further insertions and splits recursively up the tree.
- the two split operations are described, following which the entire insertion procedure is described.
- the two split operations are denoted as a version split and a key split.
- a key split partitions the entries by key; a version split extracts and duplicates some set of entries on the basis of versions.
- a version split of node n at version v produces a new node containing exactly the most recent version for every key in n.
- a version split of a non-root node n in version v via a parent node n p produces a node n v , a copy of e v (n), the effective node of n in version v, and inserts into node n p a pointer to node n v with label (k',v).
- the present invention provides a method for performing a version split operation of node n at version v in a B-tree as set out in the first or second aspect of the invention, the method comprising the step of:
- any data pointers in node n v are replaced by virtual pointers to node n.
- Another possibility is to migrate a pointer in node n to point to node n v , which is discussed further below.
- a root node n can be version split using the same procedure except that rather than update the parent (which does not exist), the root table entry is updated instead for version v to point to node n v .
- a key split of node n at version v creates a new node n' and divides a constant fraction of the pointers between n and n', using a pivot element.
- a key split is essentially the same as a standard B-tree node split, which is well known in the art.
- the present invention provides a method for performing a key split operation of node n at version v in a B-tree as set out in the first or second aspect of the invention, the method comprising the steps of:
- node n creates a node n", a node n" a pointer (k",v) and a pointer (k',v), wherein the pointer (7c" v) points from a parent node n p to the node n" and the pointer (k',v) points from the parent node n p to the node n" and wherein node n p also comprises a pointer (k',v') which points to the node n,
- k" is equal to the largest key in n".
- n as close to half the keys in n as possible are moved into n" thus leading to as equitable as possible a distribution between the two nodes, n and n". It will be apparent to those skilled in the art that there are alternative methods for choosing the new key k", for choosing the number of keys to move into the new node, and for whether or not to re-use the existing node n. The present invention is not limited with respect to these choices.
- a root node n can be key split using the same procedure except that since there is no parent a new node r of version v needs to be created, which contains two pointers, one each to n'" and n", and the root table entry for v to point to r also has to be updated.
- a new node n' with base version slice (v, w) is created.
- m live(n)/2 and let k m be the key of the m-th smallest distinct key in n. All the entries having the key k ⁇ k m from are moved from node n to n', and a pointer to the node n' is returned.
- Key splitting is similar to the classical B-tree node split, except live key counts are considered, and for every key moved, all entries with that key are moved.
- node n is the root node associated to base version v, creating a new root node r containing the new entry (k m ,v,v) pointing to node n' and the new entry ( ⁇ ,v,v) pointing to node n, and replacing the association between base version v and node n with an association between base version v and node r.
- a method for performing a key split operation of node n with base version slice (v,w) in a B-tree comprising version slices comprises the steps of:
- live(n) is equal to the number of entries of node n live in the base version slice of node n
- a new node n' with base version slice (v,w) is created. All entries lead below (v, w) are moved from node n to n', and all entries that are live at (v, w) but not lead below (v, w) are copied. A pointer to the node n' is returned.
- version split has been used before in the multiversion B-tree (MVBT).
- MVBT multiversion B-tree
- an insert at version i on a full node triggers a version split that simply copies all entries in the node live at version i to a new node. This is sufficient to guarantee the space and query bounds for the case of partial persistence.
- full persistence a significantly more involved notion of version splitting is needed. In particular, it is not sufficient to simply copy entries live at some subtree of the version tree; this observation is what motivated the notion of 'version slices' in the present invention.
- a method for performing a version split operation of node n at version slice (v,w) in a B-tree comprises the steps of:
- an entry with version slice (v,w) is lead below version slice (v',w') if version v is a weak descendent of v' and the depth-first search order of version w is greater than or equal to the depth-first search order of version v.
- an entry with version slice (v,w) is live at version (v',w') if version slice (v,w) ⁇ v',w'), and there is no entry in the node n having version slice (v", w") such that (v,w) ⁇ (v", w") ⁇ (v',W).
- a method for performing a split operation of node n of a B-tree comprising a version slice comprises the steps of:
- live(n) is equal to the number of entries of node n live in the base version slice of node n
- version slice v,w
- the number of entries in node n that are live at version slice (v,w) is at least two-thirds of the number of total entries in the B-tree and such that the number of entries lead below version slice (v,w) is at least a ninth of the number of total entries in the B-tree, and, if such a version slice exists, then
- live(n') is equal to the number of entries of node n' live in the base version slice of node n and
- the version slice (v,w) minimises the maximum number of entries in node n or node n and returning a pointer to node n.
- the number of entries between the split node n and the new nodes ( ⁇ ', n", etc) is balanced.
- a version slice (v,v) is equivalent to version v.
- an entry (k,v) is equivalent to entry ⁇ k,v,v).
- the present invention provides a method for performing a full insertion operation in a B-tree as set out in the first or second aspect of the invention, the method comprising the steps of:
- copies of a data pointer d can be replaced by virtual pointers to a leaf node containing d.
- a data pointer is written into a node, e.g. node n.
- subsequent copies of data pointer d are made when version-splitting node n, these may be replaced by virtual pointers to node n.
- a search for the pointer d in a version v' descendant from v might arrive at the leaf node n' and then require one extra leaf lookup to reach the node n (and hence the d).
- the present invention is not limited in respect of which node should contain the real data pointer, and so in this example the data pointer and virtual pointer could be swapped around, should it prove better to do so.
- the pointer d would be written into the node n' and a virtual pointer to node n' written into n.
- searches for pointer d in version v take one extra lookup, but searches in version v' take one fewer.
- This switch of real for virtual pointers could be triggered by usage (i.e., learned adaptively), or by user policy.
- the above aspects of the present invention provide methods for optimising the locations of one or more real data pointers and other virtual pointers. In one set of embodiments this is applied to a single real data pointer, but in other embodiments it could be applied to more than one real data pointer. Other similar algorithms could also be written to achieve the same basic goal of moving data pointers around in response to usage patterns or user policies.
- n n 2 , n k , and virtual pointers p in node n, (for i ⁇ k) pointing to the node n i+1 , and a data pointer d in node n k .
- the migration operation of the pointer d to the node is defined to mean that the pointer p-i should be replaced by the pointer d, and that the pointer d in the node n k and all the virtual pointers p, should be replaced with virtual pointers pointing to the node n- ⁇ .
- the shortcutting operation of the virtual pointers is defined to mean that for each / ' , the pointer p, should be replaced with a virtual pointer to the node n k .
- the present invention provides a method for performing a migration operation in a B-tree as set out in the first or second aspect of the invention, the method comprising the steps of:
- the present invention provides a method for performing a shortcutting operation in a B-tree as set out in the first or second aspect of the invention, the method comprising the steps of:
- either migration or shortcutting is performed whenever a chain of nodes n-i, n 2 , n k as described above is encountered whose length k is greater than some threshold, e.g. 1 .
- some threshold e.g. 1 .
- the set of nodes traversed in order to reach the final data pointer d is kept track of. If the set contains a chain then shortcutting is applied to it with probability p (e.g. with 75% probability) otherwise we apply migration to it.
- this is enabled by requiring users to explicitly declare when creating a new version whether it is limited to only have at most one child or not.
- pointer p is migrated to node n' (and to shortcut the intermediate virtual pointers between p' and p) if and only if every version intermediate between v and v including the former but excluding the latter, is restricted to have at most one child.
- the present invention is also capable of dealing with the case where no explicit caps are placed on the number of children that a version can ever have, and in which clones may be made of any existing version.
- a virtual pointer p' for key k in version v' is found in a node n pointing (possibly via other virtual pointers) to a node n containing the data pointer p in version v, the following operation is performed:
- version v is an ancestor of version v then let version v" be the first version encountered on the path within the version tree from version v to version v' (including end-points) that does not have exactly one child.
- version v' is an ancestor of version v then let version v" be the first version encountered on the path within the version tree from version v' to version v
- the migration and shortcutting operations are performed at times of low load (such as at night) in order to avoid impacting performance with the update operations themselves.
- the leaves of the B-tree are scanned through systematically, looking for opportunities to migrate or shortcut, according to some policy such as migration to clone points.
- a particular version v is focused on, with the leaves of the B-tree being scanned through to identify all nodes of version v to apply migration or shortcutting to. Buffering techniques can be applied to achieve a range of query/update tradeoffs.
- Each buffer stores a set of elements in kv- order, in the same way that nodes store elements.
- the tree is rebalanced using the split operations from before - this requires a recursive insert into the parent node (rather than its buffer).
- the split operation Before applying the split operation to a node n, it is ensured that all buffers on the path from root to n are empty, by applying a 'full flush' operation to the buffers from n upwards (and doing so recursively if any of these full flushes triggers a split). Once all the buffers on the path to the root are empty, it is safe to insert the structural change into the parent node, and continue splitting as necessary.
- the flush process is simply a delayed version of the insert procedure in the unbuffered version slice tree.
- the structural part of the tree i.e.
- Fig. 1 shows an example of a version tree in accordance with the invention
- Fig. 2 shows an example of a B-tree in accordance with the invention
- Fig. 3 shows an example of version slices based on a version v and its children
- Fig. 4 shows an example of a computer network system.
- Fig. 1 shows a version tree 1 with a root version 0 (10) which is modified to get version 1 (12).
- Version 2 (14) and version 3 (16) are created each by modifying version 1 (12).
- Version 3 (16) does not incorporate the changes made in version 2 (14), instead it is started afresh by directly modifying version 1 (12).
- version 4 (18) is unaffected by any of the changes made in version 1 (12), version 2 (14), and version 3 (16), and is created by directly modifying version 0 (10).
- each of the versions 2, 3, 4 is "most-recent" in the sense that there are no other versions descendant from them.
- Fig. 2 shows a B-tree in accordance with the present invention in which the data records are stored.
- a root table A comprises a list of the versions of the data stored in the system, i.e. 0, 1 and 2, 0 being the original version and 1 and 2 being subsequent versions. Referring back to Fig. 1 we can see that version 1 (12) is a modification of version 0 (10) and that version 2 (14) is a subsequent modification of version 1 (12).
- the root table A also comprises three internal pointers D corresponding each of the versions 0, 1 , 2 in the root table A.
- the internal pointers D are not themselves labelled with a version but take the version from the root table A. This is also the case for others pointers from other nodes.
- the internal pointers D point from the root table A to internal nodes B, with two or more internal pointers D able to point to the same internal node B, i.e. the two internal pointers D which point from the root table A to the internal node B which contains versions 1 and 2.
- the internal nodes B contain a list of keys which reference data records, the associated version for these keys and the pointers which can be followed to access the data.
- the pointers from the internal nodes B are themselves internal pointers D and point to leaf nodes C.
- the leaf nodes C are similar to the internal nodes B in that they also contain a list of keys, version numbers and pointers, but they are different in that the pointers from the leaf nodes C are either data pointers E or virtual pointers F.
- a leaf node C can comprise one or more data pointers E and zero or more virtual pointers F.
- the data pointers E point from the leaf nodes C to the data records such that a data record for a certain key with a certain version number can be accessed.
- the virtual pointers F point from a leaf node C to another leaf node C, and as such they act as proxies for data pointers E, i.e. the virtual pointer F for a certain key and version number can be followed to access a leaf node C from where a data pointer E can be followed to access the appropriate data record.
- FIG. 3 shows a version v and its three children, w1 , w2 and w3.
- the other slices are as follows: (v,w3) encompasses all versions strictly descendant from v; (v,w2) encompasses all those strict descendants of v whose DFS order is at least that of w2; (v,w1 ) encompasses all versions strictly descendant from v whose DFS order is at least that of w1 , which is the same as (w1 ,w1 ) because all such versions are necessarily in the subtree rooted at w1.
- the system comprises a database server 100 which is in data communication with a number of clients 102.
- the database server 100 includes a CPU 104 on which a storage engine 106 driving the organisation of the system is loaded.
- the database server 100 also includes a number of data storage disks 108 on which the data records are stored.
- the CPU 104 is in data communication with the storage disks 108 to control the organisation of the data records in accordance with the details of the storage engine 106, e.g. the types of operations it is able to perform.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A system for storing one or more data records in a B-tree data structure comprises a fully persistent B-tree and an associated version tree (1). The B-tree comprises one or more nodes (B, C). Each node (B, C) comprises a version number, one or more keys and one or more pointers (D, E, F). Each of the pointers (D, E, F) comprises either an internal pointer (D) which links said nodes (B, C) together or a pointer (E) which associates the one or more keys with said one or more data records. The pointers (D, E, F) are ordered lexicographically by key, then by version number in an order consistent with a depth-first search of the version tree (1) from its root (10).
Description
VERSIONED DATA STRUCTURE
This invention relates to methods, systems and media for data storage. In particular, this invention relates to methods, systems and media for data storage using fully persistent B-trees for storing the data in external memory.
The problem of designing persistent data structures has a long and interesting history. The basic problem is to allow versioned updates to an "ephemeral" data structure (i.e. one without a notion of versions), such that queries can be issued to the data structure as it was at some earlier version. To this end, a versioned dictionary which stores keys and values with a version tree is provided. The version tree provided (either explicitly or implicitly) has nodes which represent versions and children of a version are "newer" versions. Updates are only permitted to leaf versions, that is, versions with no children. The data structure also supports range queries which return every key in a given range together with the value written in the closest ancestor to the specified version, and a clone operation that creates a new version as a child of the specified version, inheriting all its parents' keys and values, until they are overwritten in the child version.
Let N be the total number of updates performed, and B be the block size of external memory. For a version v, we define Nv as total number of elements that are accessible in version v. Since each key is live in at least one version, we have∑v Nv≥N≥NW (for any version w).
Partially-persistent data structures possess a version tree that is a line; that is, all versions except the leaf have exactly one child, and the version with no children is the unique "most recent" version, i.e. clone operations are only permitted on leaf nodes. Fully-persistent data structures have a version tree which is permitted to be any tree; that is, some versions may have more than one child, so there may be many "most recent" versions, one for each leaf. Handling the fully-persistent case is more difficult than the partially-persistent case. A versioned dictionary can be thought of as efficiently implementing the union of many dictionaries: the live keys at version v are the union of all the keys in ancestor versions, where if a key appears more than once, its closest ancestor takes precedence.
The B-tree is a known data structure for storing a dictionary, i.e. a mapping from some ordered set of keys to associated data items. Bayer and McCreight describe such a data structure in Organization and maintenance of large ordered indexes, Acta Informatica, 1(3):173-189, 1972. It performs updates in 0(logB N) worst-case lOs, and queries (returning Z elements) in 0(logB N + Z/B) lOs. All of these lOs may potentially be random.
B-trees are known to exist as a "partially-persistent" B-tree as described by Becker et al in An asymptotically optimal multiversion b-tree. The VLDB Journal, 5(4):264- 275, 1996, which solves the space problem. It achieves 0(logB Nv) lOs for updates and queries with O(N) space, but is only partially-versioned and does not support any other trade-offs. Other data constructions are known to exist as "fully- persistent" as described by Driscoll et al in Persistence, amortization and randomization, In SODA '91: Proceedings of the second annual ACM-SIAM symposium on Discrete algorithms, pages 78-88, Philadelphia, PA, USA, 1991, Society for Industrial and Applied Mathematics, and by Lanka et al in Fully persistent b+-trees, SIGMOD Rec, 20(2):426-435, 1991. Lanka et al. developed two fully-persistent B-tree variants based on variants of the techniques from Driscoll et al in "Making data structures persistent", STOC '86: Proceedings of the eighteenth annual ACM symposium on Theory of computing, pages 109-121, New York, NY, USA, 1986, ACM. For both variants, either there is a large space blowup, or range queries may be far from optimal. One common problem is that all of the previously known fully-persistent data constructions have poor space or time performance.
The classic versioned analogue of the B-tree is the copy-on-write (CoW) B-tree as described by Chutani et al in "The episode file system", USENIX Annual Technical Conference, pages 43-60, 1992, and by Rodeh in "B-trees, shadowing, and clones", Trans. Storage, 3:2:1-2:27, February 2008, which is based on the path- copying technique of Driscoll et al. for making internal-memory pointer-based data structures fully-persistent. Unfortunately, this technique does not directly give efficient external-memory algorithms, due to their batch nature. The CoW B-tree has several problems: It has poor space utilization in general, each update may cause a new path to be written, giving 0(NB logB N) space (typically, B is thousands
in practice and logB N < 5), and it cannot offer fast inserts, in particular better than 0(logB N) 10s per update.
Techniques are known, e.g. such as that described by Driscoll et al, to transform any ephemeral pointer-based data structure into a fully-persistent one. However, this technique, called "node splitting", is somewhat involved. A simplified description of it in the case of a partially-persistent data structure is described in MIT OCW lecture notes, available at courses. csail.mit.edu/6.854/06/scribe/s2- persistent.pdf. This technique requires storing a set of modification entries in every node and therefore quickly becomes unwieldy.
One of the problems with known fully-persistent data constructions, e.g. as described by Driscoll et al, is that they are targeted to pointer-based data structures in internal memory. Applying this scheme to an asymptotically optimal ephemeral B-tree scheme, i.e. in an attempt to create a fully-persistent B-tree, may not result in an optimal persistent B-tree scheme in the external memory model in which the present invention is primarily interested. The invention is also interested in the query/update tradeoff for fully-versioned structures. However, Becker et al, showed a B-tree (called the MVBT) which is asymptotically optimal for the partially-persistent case, for the operations of lookup, insert and range query, when operating in the external memory model. It is not clear if their scheme could be extended to the fully-persistent case, as alluded to by Vitter in External memory algorithms and data structures: dealing with massive data. ACM Comput. Surv., 33(2):209-271, 2001: "determining whether or not a B-tree can be made fully-persistent is an interesting open problem".
For an alternative data structure, the B+-tree, Lanka et al presented two schemes for a fully persistent B+-tree called "fat field and '"pure version block". The fat field scheme uses alternating layers of "version blocks" and "index blocks". The pure version block scheme contains a B-tree of version blocks, and an additional leaf layer of blocks that resolve the desired versions of each block. Both of these schemes perform poorly for range queries (using approximately Q(r + logB n) lOs
to access a range of r blocks in a given version), compared to the optimal time of 0(r I b + logb n) when n blocks are accessible in the version being queried.
Dietz et al in Persistence, amortization and randomization, in SODA '91:
Proceedings of the second annual ACM-SIAM symposium on Discrete algorithms, pages 78-88, Philadelphia, PA, USA, 1991. Society for Industrial and Applied Mathematics presented a result that describes how to transform some of the amortized 0(1 ) time bounds as presented in Driscoll et al into worst-case
0(log log n) bounds. These are not efficient in the external-memory model. Several solutions are known for offline (or batched) constructions of persistent B- trees. Arge et al, in "l/o-efficient point location using persistent B-trees", J. Exp. Algorithmics, 8, December 2003, use a modified MVBT structure to solve IO- efficient planar point location. Goodrich et al, in "External-memory computational geometry", Proceedings of the 1993 IEEE 34th Annual Foundations of Computer Science, pages 714-723, Washington, DC, USA, 1993, IEEE Computer Society, give an offline method for constructing a persistent B-tree. The construction uses optimal space, but is offline and supports only partial persistence. Hence, these do not address the same problem as the present invention. When viewed from a first aspect the present invention provides a system for storing one or more data records in a B-tree data structure comprising:
a fully persistent B-tree and an associated version tree,
wherein the B-tree comprises one or more nodes,
wherein each node comprises a version number, one or more keys and one or more pointers,
wherein each of said pointers comprises either an internal pointer which links said nodes together or a pointer which associates the one or more keys with said one or more data records, and
wherein the pointers are ordered lexicographically by key, then by version number in an order consistent with a depth-first search of the version tree from its root.
When viewed from a second aspect the present invention provides a system for storing one or more data records in a B-tree data structure comprising:
one or more nodes and an associated version tree,
wherein each node comprises one or more ordered entries, and each entry comprises a key, a version, and a data record or a pointer. In one set of embodiments, the system comprises a B-tree which comprises the one or more nodes. The B-tree could be partially persistent, but preferably the B- tree is a fully persistent B-tree. In one set of embodiments the nodes comprise one or more internal nodes and one or more leaf nodes. Each version in each entry has a corresponding version tree node in the version tree.
In one set of embodiments each entry in the one or more nodes comprises a plurality of keys. In one set of embodiments each entry in the one or more nodes comprises a plurality of versions. In one set of embodiments each entry in the one or more nodes comprises a plurality of data records. In one set of embodiments each entry in the one or more nodes comprises a plurality of pointers. Every entry need not comprise a plurality of keys, versions, data records and pointers, it is envisaged that in some embodiments some of the entries comprise any
combination of these attributes. For example where there is a plurality of contiguous entries in a node comprising the same key, the key is stored only in the first entry.
In one set of embodiments the pointer can comprise an internal pointer which links or associates said nodes together. In another set of embodiments, which is not mutually exclusive, the pointer can comprise a data pointer which directly links or associates said one or more keys with said one or more data records. In another set of embodiments, which is not mutually exclusive, the pointer can comprise a virtual pointer which indirectly links or associates said one or more keys with said one or more data records via another node. Preferably the virtual pointers point from an internal node to a leaf node and act as proxies for data pointers.
Thus it will be appreciated by those skilled in the art that the present invention provides a B-tree construction for storing data in a fully-persistent versioned storage system. This construction is asymptotically optimal in space, as well as for range queries and updates, compared to the existing fully-persistent constructions which
are suboptimal in the disk access model (DAM) in either space, query (range query) time or update time.
Insertions, range queries and lookups in the tree, as will be described below, are efficient due to an algorithm for efficiently "compiling" the nodes as the B-tree is traversed. Data blocks have a small number of incoming pointers, which allows data blocks to be efficiently migrated on-disk.
Each of the pointers links a key to a data record. Each of said pointers could comprise a version number, but preferably only the pointers comprise a version number if it is different to the version number of the corresponding node, i.e. the pointer is a modified pointer. As the nodes are labelled with a version number, i.e. a "canonical version" v, only the internal pointers (k,v') with v'≠ v need to store the label v' with their representation. However, in a preferred set of embodiments the internal pointers are labelled by (k,v) where k is the key and v is the version number from the chosen traversal of the version tree.
Preferably the keys are ordered by a natural ordering, e.g. byte order or integer order, and also preferably, the versions are ordered by descending depth-first search rank. The entries in each node could be ordered lexicographically by key and then version, or they could be ordered lexicographically by version and then key.
In one set of embodiments the version tree comprises one or more ordered version tree nodes labelled by user-defined labels, e.g. in order of their creating, and a mapping from the user-defined labels to the orderings of the version tree nodes. The ordering of the versions can be a depth-first ordering.
Each pointer can be either an original pointer or a modified pointer, i.e. the data to which a modified pointer is pointing has been modified. Modified pointers are tagged with a version number from the version tree, and original pointers are assumed to have the same version as the node containing them. Pointers are generally labelled by the key and the version of the corresponding node, i.e. the one in which they point from. In one set of embodiments pointers are ordered
lexicographically by key first, then by version number in an order consistent with a depth-first search of the version tree, e.g. from its root.
The pointers being ordered allows an efficient search of the stored data. The pointers, as well as being ordered lexicographically by key, then by version number in an order consistent with a depth-first search could, for example, be additionally ordered by the block number. In one set of embodiments in which the pointers are ordered by version number, the ordering has the property that it is efficient to determine if one version is an ancestor of another or not.
Furthermore, the pointers can comprise an internal pointer, a data pointer or a virtual pointer, depending on their position in the B-tree. The nodes can comprise an internal node or a leaf node depending on their position in the B-tree. Internal pointers link nodes together, with an internal node only being able to be linked by internal pointers. Internal pointers represent the structure of the B-tree, and point to other nodes strictly further away from the root than the node that contains them; data pointers point to data records; and virtual pointers, which are proxies for data pointers, point to leaf nodes. In the system of the present invention it is only necessary to have internal pointers linking nodes together and data pointers pointing from leaf nodes to data records.
Although they are not a necessary part of the system, virtual pointers are useful in that they enable the system to guarantee that there is exactly one data pointer pointing to any given data record. This in turn allows the impact on the B-tree of moving these data records around to be minimised. Thus it can be seen that if virtual pointers are not used, the bound on the number of pointers to each data record is lost. As for standard B-trees, the nodes in the B-tree of the present invention form a directed acyclic graph; however in general there may be several root nodes. Root nodes are ones which are not pointed to by any other (i.e. with no 'parent'). Furthermore, any given node might be pointed to by more than one node (i.e. have multiple 'parents').
The version tree associated with the B-tree can be mapped to the root node, and is known as a "root table". Any dictionary data structure (for example a standard B- tree) can be used to maintain this version tree, which itself does not need to be
versioned. When the system is first created a root version 0 is chosen and an entry pointing to an empty root node is created; subsequently, when a new version v is created as a child of version w, a new entry in the table is created, mapping version v to the same root that version w is mapped to. Preferably the nodes comprise a list of modifications to the data records.
An important construct is the version slice described by two versions (v, w), where w is either equal to, or is a child of, v. In one set of embodiments of the system, the version in each entry comprises a version slice comprising a first version v and a second version w, wherein v < w, i.e. the version slice comprises a pair of versions, v and w, wherein w is either equal to, or is a child of, v. The version slice (v, w) represents a subset of the version tree, namely all those versions, y, that are weakly descendant from v and have depth first search (DFS) order greater than or equal to that of w. An important sub-class of version slices is the set of simple version slices, which are those of the form (v, v), i.e. the sub-tree of V rooted at v. It is easy to see that, for simple slices, (v, v) < (w, w) if, and only if, v <w. For this reason, we often refer to a simple version slice (v, v) as v.
The "Version Slice B-tree" uses space 0(N + V), supports updates to a version v in 0(logB Nv) lOs, and supports range queries at version v returning Z elements in 0(logB Nv + Z/B) lOs. It can be seen it as a fully-versioned B-tree.
Standard buffering techniques can be applied in order to improve the update performance.
In one set of embodiments the version slices (v, w) are ordered lexicographically by the second version w, and then by the first version v, where versions are ordered by their rank in a depth-first search of the version tree from its root. For example the version slices are ordered such that {v,w) precedes {v',w') if and only if the depth- first search rank of w is greater than or equal to the depth-first search rank of w'.
From a further aspect the invention provides a system for storing one or more data records in a B-tree data structure comprising:
a fully persistent B-tree and an associated version tree,
wherein the B-tree comprises a plurality of nodes, said plurality of nodes including a plurality of root nodes,
wherein each node is labelled with a base version number, vn, and one or more keys and one or more pointers,
wherein the pointers comprise a first version number, w, a second version number, v, and wherein w < v,
wherein each of said pointers comprises either an internal pointer which links said nodes together or a pointer which associates the one or more keys with said one or more data records, and
wherein the pointers are ordered lexicographically by key, then by the first version number, and then by the second version number, in an order consistent with a depth-first search of the version tree from its root.
Also from a further aspect the invention provides a system for storing one or more data records in a B-tree data structure comprising:
a B-tree and an associated version tree,
wherein the B-tree comprises a plurality of nodes, said plurality of nodes including one or more root nodes,
wherein each node contains a plurality of entries (k,v,w,x), wherein each entry comprises one or more keys, k, a version slice, {v, w), and either a value, x, or a pointer, x, to another node, and
wherein each node is labelled with a base version slice (v, w) .
As with the ancestorship partial ordering on versions, version slices can be ordered as follows:
(v, w) < (ν', w') if and only if v < v' and DFS(w) < DFS(w').
Two version slices (v, w) and (ν', w') are said to be comparable if either (v, w) < (v',w), or vice-versa.
The structure of the B-tree when including version slices is based on the same structure as for the first and second aspects of the invention, except that the nodes can additionally comprise a plurality of keys, k, which are associated with different version slices. In one set of embodiments the nodes comprise one or more root
nodes, and preferably each version is associated to a root node, and in one example all the versions are associated to a unique root node.
In one set of embodiments the tree has a unique root node, which is referred to as root. Every node n in a version slice B-tree can store up to B entries of the form (k,v,w,x), where k is a key, (v,w) is a version slice, and x is either a value (for leaf nodes), or a pointer to another node n' (written p(n')).
In one set of embodiments each node is labelled with a base version, a base version slice (v, w) . Preferably every entry that points to node n has version slice
(v, w) . Entries in a node have version slice comparable to the base version of the node. Updates and queries at version v will reach node n only if (v, w) < v.
Within a node, in one set of embodiments, entries are ordered lexicographically by key and then by the second version slice coordinate, w. The canonical ordering on keys and, and the DFS ordering described above for versions; this lexicographic order is called the entry order. Note that entry ordering is not quite total: for each given pair of values k,w there may be at most two entries: the first version coordinate v is constrained by the fact that (v, w) is a version slice, to be equal to either w or the parent of w. The ambiguity of ordering is resolved by always placing the former entry first (i.e. (k, w, w, x) before (k, v, w, x) for any v≠ w).
In one set of embodiments, the pointers are labelled by the key and the version slice of the corresponding entry. Preferably the pointers are ordered
lexicographically by key, then by the first version number, and then by the second version number, in an order consistent with a depth-first search of the version tree from its root.
In one set of embodiments the data structure includes a mapping from the first version, v, to the interval l(v)=[A,B], wherein A is the depth-first search rank of the first version v, and B is the maximum depth-first search rank of the second version w which is a descendant of the first version v. This allows fast "ancestor" queries, though any other ordering that permits ancestor queries is allowed.
The B-tree of the present invention could be used to provide data storage in any type of computer memory, e.g. internal memory, tertiary memory or offline storage, but in a preferred set of embodiments the data storage is in external memory, e.g. on a hard disk drive.
Therefore the present invention also provides a computer readable data storage medium for storing one or more data records in a B-tree as set out in the first and second aspects of the invention. In one set of embodiments the data storage is provided on a database server, e.g. in data communication with a plurality of clients. The arrangement and organisation of the storage system is controlled by a storage engine loaded onto the CPU of the database server. For example, the storage engine is capable of performing the operations as detailed below to control and organise the data records which are stored on the memory of the database server.
The data storage system of the present invention supports the basic operations of "lookup" of a key in a particular version, and "insert" of data with a given key in a particular version. The following notation for versions will be used: v' < v if either v'=v, or v' is an ancestor of v, i.e. v is more recent than, or is equal to, v'.
An "effective" node, ev(n), of node n in version v is defined as exactly the set of most recent versions of every key in node n, from the view of version v. More precisely, ev(n) is the collection of all pointers p from node n such that if pointer p has the label (k',v') (denoting the key and version) then v' < v and there is no other pointer in n with label (k',v") and v' < v" < v .
If the pointers are ordered within a node lexicographically first by key then by version using the DFS ordering, then there is a linear time algorithm to compute the effective node ev(n): scan the pointers in node n in reverse order from the highest to lowest order; for each key k encountered take the first (i.e. highest) version w encountered such that tv < v, and ignore all subsequently encountered versions of that key. Clearly, if the ordering is lexicographic first on key then on version using reverse DFS, a forward scan produces the same results.
To search for an entry (k, v) in the data structure, we invoke a query based on the root node and the entry (k, v). The query first searches for a least upper bound for (k,v) and then scans right to find an entry whose version slice is ancestral to v. The scan terminates with failure when it encounters an entry with the wrong key or runs out of entries to examine; it terminates with success if an entry is found whose version slice coordinates are ancestral to v. Note that for internal nodes it is not insisted upon that the key is identical to k in order to follow a pointer. This is the same as for a normal B-tree: when following pointers only a least upper bound is needed.
With the result of this scan either the value found (if at a leaf node) is returned, or the next node (if at an internal node) is recursed down to. The operation lookup(k,v) will now be described. lookup(k,v) returns the value associated to the pointer (k,w) where version w is the closest ancestor version of version v for which the pointer (k,w) exists (n.b. ancestorship is reflexive, so (k,v) will be found if it exists). It should be noted that a variety of semantics for the lookup operation can be supported.
The lookup operation is performed as follows. First, lookup in the "root table" the version v in order to determine the starting node n. Then repeat the following algorithm until the search terminates:
If n is an internal node, find (by binary search on key) the pointer in the effective node ev(n) whose key k' is a least upper bound for k. Repeat the search from the node to which this pointer points.
If n is a leaf node, find (by binary search on key) the pointer in the effective node ev(n) whose key is k. If there is no such pointer then the key does not exist in the dictionary and we report an error; if the pointer is a data pointer then we return the data; if it is a virtual pointer, we continue the search at the leaf node to which it points.
Since the nodes form an acyclic graph, this process must terminate. In fact with the insertion method to be described below, the deepest path from root to leaf is no
bigger than 0(logB (N)) , where N is the number of keys, and at most one virtual pointer must be followed.
When viewed from a further aspect the present invention provides a method for performing a lookup operation in a B-tree as set out in the first or second aspect of the invention, the method comprising the steps of:
for a given version number, v, and key k,
computing an effective node, ev(n), that contains, for all entries ancestral to the given version, the pointer with the latest version not later than the given version number,
determining the starting node, n, from the version v in a root table, for each node, performing within the effective node ev(n) of n with respect to v, a binary search on key for at least an upper bound / 'for k, and
in the case where k' is associated to a pointer to another node, continuing the binary search in that node,
in the case where k' is associated to a data record, and k=k returning the data record;
otherwise reporting that key k is not found. In one set of embodiments, the effective nodes are cached in memory, using a key that can uniquely identify it; for example the block number and the version number for which it is an effective node.
In an alternative set of embodiments, the Applicant has found that it is not necessary to compute effective nodes while traversing the structure in a lookup.
Instead the following procedure can be used:
Find (by binary search) the key-version pair (k',v') which is a least upper bound for (k,v) with respect to the normal ordering on keys and reverse DFS ordering on versions. Clearly if k=k' and v=v' then the pointer is followed, meaning that either data item is returned if it is a data pointer, or the search is continued at the node to which it points.
Otherwise,
Suppose n is not a leaf node. If v' < v then the pointer is followed with label (k',v'); otherwise scan right (i.e. in the direction of decreasing DFS ordering on
versions and increasing key) to find the first pointer whose label has a version w < v. If there are no values (k w) with w < v then the search continues to the next highest key, and so on until a suitable upper bound key is found, and the corresponding pointer followed.
· Suppose n is a leaf node. If k'≠k then report an error: key k is not in the dictionary in version v. Otherwise, scan to the right as for internal nodes to find the first (k,w) such that w < v and follow that when it is found, or report an error if it is not found (note that this fails as soon as there are no more versions for key k). When viewed from a further aspect the present invention provides a method for performing a lookup operation in a B-tree as set out in the first or second aspect of the invention, the method comprising the steps of:
performing a binary search to find the key-version pair (k', v') which is a least upper bound for (k,v) with respect to the normal ordering on keys and reverse depth-first search ordering on versions,
following the pointer if k=k' and v=v and returning the data record if the pointer is a data pointer or if the pointer is not a data pointer, repeating the step of performing the binary search at the node to which the pointer points,
if node n is not a leaf node, following the pointer with label (k', v') if v' < v, otherwise scanning right to find the first pointer whose label has a version w < v, if node n is a leaf node, scanning right to find the first pointer with label (k,w) such that w < v and following the pointer when it is found.
The different orderings on the keys and their associated versions within a node will require different search procedures; the present invention is not limited with respect to such choices. In either of the two search paradigms described above, the complexity of the portion of a lookup corresponding to a given node is at worst equal to the number of pointers in the node.
In the set of embodiments which comprise version slices, a method for a lookup operation for entry (k,v) in a B-tree comprises the steps of:
performing a search from the root node corresponding to version v to find the first entry (k',v',w'), where (k,w)≤ (k',w'),
searching right to find the first entry (k', v',w') with v' < v, and
returning the value found if at a leaf node, or recursing down to the next node if at an internal node. Preferably the lookup operation further comprises following the pointer if k=k' and v=v and returning the data record if the pointer is a data pointer or if the pointer is not a data pointer, repeating the step of performing the binary search at the node to which the pointer points. The operation insert(k,v,data) will now be described. The operation insert(k,v,data) inserts the key k at version v and associates to it some data payload. For the purpose of this invention, we assume that the data payload is a pointer (and hence has fixed size). It can be extended to deal with variable length pointers using well- known techniques.
Only allow insertions at version v are allowed if v is a leaf node of the version tree; from now on it is assumed that v is indeed a leaf. An insertion proceeds as follows: first perform the operation lookup(k,v) to find an appropriate leaf node n at which to insert the data. If n does not contain a pointer (k,v) then a pointer (k,v) is added to n that points to data. Otherwise, n contains a pointer for the key k in version v; in this case either it can be replaced with the new data pointer (overwriting the previous version v), assigned a new version v'>v (v' is a descendent of v) to the inserted key, or an error reported; the precise choice is up to the application.
When viewed from a further aspect the present invention provides a method for performing an insert operation in a B-tree as set out in the first or second aspect of the invention, the method comprising the steps of:
inserting a data record into the data storage,
performing the lookup operation as set out above for a pointer (k,v) to find a leaf node n at which to insert a key,
inserting the pointer (k, v) to node n if node n does not contain the pointer (k,v) to point to the inserted data record,
replacing the pointer (k,v) in node n with a new data pointer if the node n contains the pointer (k,v
assigning a new version v' to the inserted key, wherein v'>v.
Only the cases of an insert will be discussed for the set of embodiments which comprise version slices; deletions can be achieved using a special tombstone value, and the corresponding space reclaimed in an amortized sense using standard global rebuilding techniques. As for a normal B-tree, the tree is navigated down, splitting full nodes until a leaf node is reached, at which the entry can simply be added. When a node n becomes full during an insert operation, it is split: a new node n' is created and some entries moved into it. For the purposes of choosing a 'good' node split, two useful quantities are defined: the number of entries live in a version slice (written live(v, w)), and the number that lead below a version slice (written LB(v, w)).
Lead below: an entry is lead below a given version slice if it is descendant from it in a version slice sense: (k, v, w, x) is lead below (ν', w') if, and only if, (ν', w') < (v, w).
Live: an entry (k, v, w, x) in node n is live in (ν', w') if (v, w) < (ν', w') and there is no other entry (k, v", w", x) in n with (v, w) < (v", w") < (ν', w'). In other words, (v,w) is a nearest ancestor to (v',w').
Live(v, w) is written for the number of entries in n live in (v,w). Note that liveness is monotone with ancestorship: if (v, w) < (ν', w') then live(v, w) < live(v', w'). For a node n with base version slice (v, w) we write live(n) for live (v, w) .
The concepts of lead and live are almost complementary: an entry (k, v, w, x) can be both lead below and live in (ν', w') only if (v, w) = (ν', w'). On the other hand there are entries (k, v, w, x) which are neither live in, nor lead below, a given version slice (ν', w'), either because the two version slices are incomparable, or because (v, w) is an ancestor of (ν', w') but not the nearest ancestor for key k.
In one set of embodiments an insert operation into node n of a B-tree which comprises version slices, comprises the steps of:
inserting a data record into the data structure at node n,
performing a search from the root node corresponding to version v to find the entry (k',v',w'), where (k, v)≤ (k',v') where the entries are ordered
lexicographically by key, then by the first version number, and then by the second version number, in an order consistent with a depth-first search of the version tree from its root,
searching right to find the first entry (k', v',w') with v' < v,
following the pointer if k=k' and v=v and returning the data record if the pointer is a data pointer or if the pointer is not a data pointer, repeating the step of performing the binary search at the node to which the pointer points,
inserting the pointer (k,v) to node n to point to the inserted data record.
In another set of embodiments an insert operation into node n of a B-tree as set out in the first or second aspect of the invention, comprises the steps of:
inserting a data record into the data structure at node n,
creating one or more new nodes (η', n", etc), if node n becomes full following the insertion of the data record, and splitting the entries in node n between node n and node n', n", etc, and
inserting records pointing to the one or more new nodes into any nodes pointing to node n, or
in the case that node n is a root node, creating one or more root nodes to contain the pointer records, or
optionally associating the new root nodes to one or more new versions. Preferably the step of splitting the entries in node n between node n and any new node n' comprises moving some of the entries from node n to node n', and copying some of the entries from node n to node n'.
Using the update procedure it can be demonstrated that enough free space in the nodes to give the desired update bound can be guaranteed. The update procedure is carried out at node n, pointed to by some entry in m, which is null at the root node. The procedure splits full nodes on the way down towards the root using the split procedure, inserting the entry when at a leaf node (this simply inserts the entry e at the appropriate location in the node n).
Inserting the new pointer at node n may exceed the capacity of the node; in this case, a sequence of split operations is applied at the node, which may cause further insertions and splits recursively up the tree. The two split operations are described, following which the entire insertion procedure is described. The two split operations are denoted as a version split and a key split. A key split partitions the entries by key; a version split extracts and duplicates some set of entries on the basis of versions. A version split of node n at version v produces a new node containing exactly the most recent version for every key in n. More precisely, a version split of a non-root node n in version v via a parent node np, where np contains a pointer to node n labeled with (k',v'), produces a node nv, a copy of ev(n), the effective node of n in version v, and inserts into node np a pointer to node nv with label (k',v).
When viewed from a further aspect the present invention provides a method for performing a version split operation of node n at version v in a B-tree as set out in the first or second aspect of the invention, the method comprising the step of:
creating a pointer (k',v) and a node nv, wherein the pointer (k',v) points from a parent node np to the node nv, wherein node np also comprises a pointer (k', v') which points to the node n, and wherein the node nv is a copy of the effective node ev(n) of the node n.
In one set of embodiments, any data pointers in node nv are replaced by virtual pointers to node n. Another possibility is to migrate a pointer in node n to point to node nv, which is discussed further below. A root node n can be version split using the same procedure except that rather than update the parent (which does not exist), the root table entry is updated instead for version v to point to node nv. A key split of node n at version v creates a new node n' and divides a constant fraction of the pointers between n and n', using a pivot element. A key split is essentially the same as a standard B-tree node split, which is well known in the art. A key split of the non-root node n via the parent node np, where np contains a pointer to n labelled with (k',v'), produces nodes n" and n"' in the same version as nodes n, and distributes the pointers of node n between node n" and node n'" such
that all the keys in n" are less than all the keys in n'". Furthermore, it replaces the pointer to n in np with pointers to n" and n'" labeled, respectively, with (7c" v) and (k v), where v is the version of n, and k" is greater than or equal to all keys in n" and less than all keys in n'".
When viewed from a further aspect the present invention provides a method for performing a key split operation of node n at version v in a B-tree as set out in the first or second aspect of the invention, the method comprising the steps of:
creating a node n", a node n" a pointer (k",v) and a pointer (k',v), wherein the pointer (7c" v) points from a parent node np to the node n" and the pointer (k',v) points from the parent node np to the node n" and wherein node np also comprises a pointer (k',v') which points to the node n,
distributing the pointers from node n between the node n" and the node n" such that all the keys in n" are less than all the keys in n'".
In one set of embodiments, node n is re-used by letting n"'=n so that some set of keys is moved out from n to n". In one set of embodiments k" is equal to the largest key in n". In one set of embodiments as close to half the keys in n as possible are moved into n" thus leading to as equitable as possible a distribution between the two nodes, n and n". It will be apparent to those skilled in the art that there are alternative methods for choosing the new key k", for choosing the number of keys to move into the new node, and for whether or not to re-use the existing node n. The present invention is not limited with respect to these choices. A root node n can be key split using the same procedure except that since there is no parent a new node r of version v needs to be created, which contains two pointers, one each to n'" and n", and the root table entry for v to point to r also has to be updated.
In the set of embodiments which comprise version slices, to key split a node n with base version slice (v, w), a new node n' with base version slice (v, w) is created. Let m = live(n)/2 and let km be the key of the m-th smallest distinct key in n. All the entries having the key k < km from are moved from node n to n', and a pointer to the node n' is returned. Key splitting is similar to the classical B-tree node split, except live key counts are considered, and for every key moved, all entries with that key are moved.
ln one set of embodiments a method for performing a key split operation of node n with base version v in a B-tree comprising version slices comprises the steps of: creating a node n' with base version v,
moving all entries with k≤ km, where km is the mt distinct key in node n, where m = live(n,v)/2,
recursively inserting a pointer to node n' in node n,
if there exists a node p containing a pointer entry (k,v,w) pointing to node n, inserting into node p a new entry (km,v,w) pointing to node n',
otherwise if node n is the root node associated to base version v, creating a new root node r containing the new entry (km,v,v) pointing to node n' and the new entry (∞,v,v) pointing to node n, and replacing the association between base version v and node n with an association between base version v and node r.
In another set of embodiments a method for performing a key split operation of node n with base version slice (v,w) in a B-tree comprising version slices comprises the steps of:
creating a node n' with base version slice (v,w),
moving all entries having a key k≤ km, where km is the mt distinct key value in node n, where m = live(n)/2,
wherein live(n) is equal to the number of entries of node n live in the base version slice of node n, and
returning a pointer to node n'.
To version split a node n at a version slice (v,w), a new node n' with base version slice (v,w) is created. All entries lead below (v, w) are moved from node n to n', and all entries that are live at (v, w) but not lead below (v, w) are copied. A pointer to the node n' is returned.
The notion of version split has been used before in the multiversion B-tree (MVBT). In the MVBT, an insert at version i on a full node triggers a version split that simply copies all entries in the node live at version i to a new node. This is sufficient to guarantee the space and query bounds for the case of partial persistence. For full persistence, a significantly more involved notion of version splitting is needed. In particular, it is not sufficient to simply copy entries live at some subtree of the
version tree; this observation is what motivated the notion of 'version slices' in the present invention.
In another set of embodiments a method for performing a version split operation of node n at version slice (v,w) in a B-tree comprises the steps of:
creating a new node n' with base version slice (v,w),
moving all "lead below" entries at version slice (v,w) from node n into node n
copying all "live" entries at version slice (v,w) which are not lead below entries from node n into node n', and
returning a pointer to node n'.
Preferably an entry with version slice (v,w) is lead below version slice (v',w') if version v is a weak descendent of v' and the depth-first search order of version w is greater than or equal to the depth-first search order of version v. Preferably an entry with version slice (v,w) is live at version (v',w') if version slice (v,w)≤v',w'), and there is no entry in the node n having version slice (v", w") such that (v,w)<(v", w")≤ (v',W). In one set of embodiments a method for performing a split operation of node n of a B-tree comprising a version slice comprises the steps of:
if the number of entries of node n live in the base version slice of node n is at least two-thirds of the maximum capacity of node n, then
creating a node n' with base version slice (v,w),
moving all entries having a key k≤ km, where km is the mt distinct key value in node n, where m = live(n)/2,
wherein live(n) is equal to the number of entries of node n live in the base version slice of node n, and
returning a pointer to node n',
otherwise searching for a version slice (v,w) such that the number of entries in node n that are live at version slice (v,w) is at least two-thirds of the number of total entries in the B-tree and such that the number of entries lead below version slice (v,w) is at least a ninth of the number of total entries in the B-tree, and, if such a version slice exists, then
creating a new node n' with base version slice (v,w),
moving all "lead below" entries at version slice (v,w) from node n into node n',
copying all "live" entries at version slice (v,w) which are not lead below entries from node n into node n
returning a pointer to node n followed by
creating a node n" with base version slice (v',w
moving all entries having a key k≤ km, where km is the mt distinct key value in node n where m = live(n')/2,
wherein live(n') is equal to the number of entries of node n' live in the base version slice of node n and
returning a pointer to nodes n' and n".
otherwise creating a new node n' with base version slice (v,w),
moving all "lead below" entries at version slice (v,w) from node n into node n
copying all "live" entries at version slice (v,w) which are not lead below entries from node n into node n
returning a pointer to node n wherein the version slice (v,w) minimises the maximum number of entries in node n or node n and returning a pointer to node n. Preferably the number of entries between the split node n and the new nodes (η', n", etc) is balanced.
A version slice (v,v) is equivalent to version v. Hence an entry (k,v) is equivalent to entry {k,v,v).
The above described operations can be combined in various ways to produce a correct modlist B-tree whose lookups still give the right answer. The following operation is a full insertion procedure, which depends on a constant B (the capacity of a node) and a parameter a (preferably with a = 2/3):
· Find the leaf node n for pointer (k,v), keeping track of the set of nodes traversed between the root node and node n. First insert or replace the data pointer as appropriate in node n.
Suppose that the procedure is at node n of version v which has the parent node np, and it has just inserted some number of new pointers. If the number of pointers in node n is less than constant B then the procedure finishes and exits.
Otherwise, if v'≠v
the procedure performs a version split of node n in version v and in the following step consider n'=ev(n); if v -v then consider n'=n in the following:
If the number of pointers in node n' is larger than α*β then a key split of node n' is performed to give a new node n" and new key k".
The above two splitting processes give rise to one or two new pointers that must be inserted into the parent node np. These are inserted and then the above step of a key split of the node np is performed if the necessary condition is satisfied. When viewed from a further aspect the present invention provides a method for performing a full insertion operation in a B-tree as set out in the first or second aspect of the invention, the method comprising the steps of:
performing the insert operation as set out above in leaf node n,
performing a version split as set out above of node n in version v if the number of pointers in node n is less than a constant B and if v'≠v,
performing a key split as set out above for node n' to give a new node n" and new key k" if the number of pointers in node n' is larger than αχβ,
inserting the new pointers into a parent node np,
performing a key split as set out above for the node np if the number of pointers in node np is larger than α*β,
A person skilled in the art of algorithm design will appreciate that there are various modifications one can make to the above method and algorithm. For example, instead of first inserting at the leaf node and then recursively splitting up the tree, the split could be made on the way down the tree, by first version splitting the root node, if it is fuller than B-2, then key-splitting if the result is itself bigger than axB, then moving down the tree towards the root node (and so on). This has the advantage of pre-allocating space for any new key-version pairs so that a split of the node n cannot trigger the split of its parent node; this in turn allows the set of nodes locked when doing an insert to be limited, which is important in order to serve multiple threads concurrently.
In one set of embodiments as described above copies of a data pointer d can be replaced by virtual pointers to a leaf node containing d. In particular, when data is
inserted into the modlist B-tree for the first time (e.g. in version v), a data pointer is written into a node, e.g. node n. Then when subsequent copies of data pointer d are made when version-splitting node n, these may be replaced by virtual pointers to node n. In this way, a search for the pointer d in a version v' descendant from v might arrive at the leaf node n' and then require one extra leaf lookup to reach the node n (and hence the d). However, the present invention is not limited in respect of which node should contain the real data pointer, and so in this example the data pointer and virtual pointer could be swapped around, should it prove better to do so. In this case the pointer d would be written into the node n' and a virtual pointer to node n' written into n. Now searches for pointer d in version v take one extra lookup, but searches in version v' take one fewer. This switch of real for virtual pointers could be triggered by usage (i.e., learned adaptively), or by user policy.
Therefore it can be seen that the above aspects of the present invention provide methods for optimising the locations of one or more real data pointers and other virtual pointers. In one set of embodiments this is applied to a single real data pointer, but in other embodiments it could be applied to more than one real data pointer. Other similar algorithms could also be written to achieve the same basic goal of moving data pointers around in response to usage patterns or user policies.
In one set of embodiments, suppose there is a sequence of nodes n n2, nk, and virtual pointers p, in node n, (for i<k) pointing to the node ni+1, and a data pointer d in node nk. The migration operation of the pointer d to the node is defined to mean that the pointer p-i should be replaced by the pointer d, and that the pointer d in the node nk and all the virtual pointers p, should be replaced with virtual pointers pointing to the node n-ι. The shortcutting operation of the virtual pointers is defined to mean that for each /', the pointer p, should be replaced with a virtual pointer to the node nk. When viewed from a further aspect the present invention provides a method for performing a migration operation in a B-tree as set out in the first or second aspect of the invention, the method comprising the steps of:
replacing a virtual pointer p^ in a node which points to a node n2 with a data pointer d,
replacing the data pointer d in the node nk and all the virtual pointers p, with virtual pointers pointing to the node n^.
Also from a further aspect the present invention provides a method for performing a shortcutting operation in a B-tree as set out in the first or second aspect of the invention, the method comprising the steps of:
replacing the virtual pointer p, in a node n, with a virtual pointer to a node nk.
The operation of randomised migration will now be described. In this method, either migration or shortcutting is performed whenever a chain of nodes n-i, n2, nk as described above is encountered whose length k is greater than some threshold, e.g. 1 . For example, when performing a lookup for data the set of nodes traversed in order to reach the final data pointer d is kept track of. If the set contains a chain then shortcutting is applied to it with probability p (e.g. with 75% probability) otherwise we apply migration to it.
For the sake of efficient range queries (queries for contiguous sets of keys in a particular version), it is desirable to minimise the fragmentation of ranges of data pointers that can occur when data pointers are replaced by virtual pointers, for example in the above described operations. If a data pointer is first created in version v, then the present invention is able to migrate that pointer to the first descendant of v having a number of children other than zero: 0 children
corresponding to a 'head' version; >1 children corresponding to a 'clone point'. In one set of embodiments this is enabled by requiring users to explicitly declare when creating a new version whether it is limited to only have at most one child or not. In this embodiment, if a virtual pointer p' for key k in version v' is found in a node n' pointing (possibly via other virtual pointers) to a node n containing the data pointer p in version v, then pointer p is migrated to node n' (and to shortcut the intermediate virtual pointers between p' and p) if and only if every version intermediate between v and v including the former but excluding the latter, is restricted to have at most one child.
The present invention is also capable of dealing with the case where no explicit caps are placed on the number of children that a version can ever have, and in
which clones may be made of any existing version. In this case, when a virtual pointer p' for key k in version v' is found in a node n pointing (possibly via other virtual pointers) to a node n containing the data pointer p in version v, the following operation is performed:
· If version v is an ancestor of version v then let version v" be the first version encountered on the path within the version tree from version v to version v' (including end-points) that does not have exactly one child.
If version v' is an ancestor of version v then let version v" be the first version encountered on the path within the version tree from version v' to version v
(including end-points) that does not have exactly one child.
If neither version is an ancestor of the other, then let version v" be the deepest common ancestor of both.
Find the node n" containing a virtual pointer p" to n for version v" of key k; if n"=n' then do nothing, otherwise migrate p to the node n" and shortcut all relevant pointers (at least at n and n1) to point to n".
For any of the above methods and embodiments of migration and shortcutting, there are several options about when and how frequently the migration is done. The present invention covers any and all policies regarding how this is done. In one set of embodiments, the migration and shortcutting operations are performed at times of low load (such as at night) in order to avoid impacting performance with the update operations themselves. In one set of embodiments, the leaves of the B-tree are scanned through systematically, looking for opportunities to migrate or shortcut, according to some policy such as migration to clone points. In one set of embodiments, a particular version v is focused on, with the leaves of the B-tree being scanned through to identify all nodes of version v to apply migration or shortcutting to. Buffering techniques can be applied to achieve a range of query/update tradeoffs.
By adding buffers and varying the degree of nodes, significantly better amortized 10 bounds can be achieved for updates without greatly sacrificing the asymptotic query time.
Fix a constant 0 < ε < 1 , and consider a B-tree where internal nodes have degree 1 + Βε. Every internal node has a buffer of size 0(1 ) blocks, i.e. enough to store 0(B) buffered updates to its subtree. Delayed updates in buffers are propagated down ('flushed') to their child buffers whenever a buffer overflows. When the buffer of a node overflows, there exists a child such that Ω(Β1_ε) elements can be flushed from the buffer to the child buffer. The remaining elements stay in the buffer. When updates reach a leaf, the tree is rebalanced.
This scheme can be modified as follows. Each buffer stores a set of elements in kv- order, in the same way that nodes store elements. When updates reach a leaf node, the tree is rebalanced using the split operations from before - this requires a recursive insert into the parent node (rather than its buffer). Before applying the split operation to a node n, it is ensured that all buffers on the path from root to n are empty, by applying a 'full flush' operation to the buffers from n upwards (and doing so recursively if any of these full flushes triggers a split). Once all the buffers on the path to the root are empty, it is safe to insert the structural change into the parent node, and continue splitting as necessary.
The flush process is simply a delayed version of the insert procedure in the unbuffered version slice tree. In particular, the structural part of the tree (i.e.
everything except the buffer contents) is exactly a version slice tree. For a child ni of a node n, we say an element is pending for ni if a lookup for the element in the structural part of the tree would have proceeded to ni. When the buffer of a node n is flushed (because it is overfull), choose a child ni of n that has the largest number of pending elements, rewrite the buffer at n to remove these elements, and insert the elements into the buffer of ni, recursively flushing if necessary. Upon a split, the buffer contents need to be rewritten and divided between the new nodes based on the structural contents of the new nodes. As for the unbuffered versioned B-tree, it can be guaranteed that each non-root leaf node reachable at version v has at least B/3 elements live at v. This implies that the height of the tree reachable for version v is hv = 0(logB£ Nv).
Some embodiments of the invention will now be described by way of example and with reference to the accompanying drawings, which are an outline of a system in accordance with the invention. Fig. 1 shows an example of a version tree in accordance with the invention;
Fig. 2 shows an example of a B-tree in accordance with the invention; Fig. 3 shows an example of version slices based on a version v and its children; and
Fig. 4 shows an example of a computer network system.
Fig. 1 shows a version tree 1 with a root version 0 (10) which is modified to get version 1 (12). Version 2 (14) and version 3 (16) are created each by modifying version 1 (12). Version 3 (16) does not incorporate the changes made in version 2 (14), instead it is started afresh by directly modifying version 1 (12). Likewise, version 4 (18) is unaffected by any of the changes made in version 1 (12), version 2 (14), and version 3 (16), and is created by directly modifying version 0 (10). In this case each of the versions 2, 3, 4 is "most-recent" in the sense that there are no other versions descendant from them. Fig. 2 shows a B-tree in accordance with the present invention in which the data records are stored. A root table A comprises a list of the versions of the data stored in the system, i.e. 0, 1 and 2, 0 being the original version and 1 and 2 being subsequent versions. Referring back to Fig. 1 we can see that version 1 (12) is a modification of version 0 (10) and that version 2 (14) is a subsequent modification of version 1 (12). The root table A also comprises three internal pointers D corresponding each of the versions 0, 1 , 2 in the root table A.
The internal pointers D are not themselves labelled with a version but take the version from the root table A. This is also the case for others pointers from other nodes. The internal pointers D point from the root table A to internal nodes B, with two or more internal pointers D able to point to the same internal node B, i.e. the two internal pointers D which point from the root table A to the internal node B which contains versions 1 and 2.
The internal nodes B contain a list of keys which reference data records, the associated version for these keys and the pointers which can be followed to access the data. The pointers from the internal nodes B are themselves internal pointers D and point to leaf nodes C.
The leaf nodes C are similar to the internal nodes B in that they also contain a list of keys, version numbers and pointers, but they are different in that the pointers from the leaf nodes C are either data pointers E or virtual pointers F. A leaf node C can comprise one or more data pointers E and zero or more virtual pointers F.
The data pointers E point from the leaf nodes C to the data records such that a data record for a certain key with a certain version number can be accessed. The virtual pointers F point from a leaf node C to another leaf node C, and as such they act as proxies for data pointers E, i.e. the virtual pointer F for a certain key and version number can be followed to access a leaf node C from where a data pointer E can be followed to access the appropriate data record.
As an example of version tree slicing, consider Figure 3, which shows a version v and its three children, w1 , w2 and w3. In addition to the simple slices (v,v), (wi,wi), the other slices are as follows: (v,w3) encompasses all versions strictly descendant from v; (v,w2) encompasses all those strict descendants of v whose DFS order is at least that of w2; (v,w1 ) encompasses all versions strictly descendant from v whose DFS order is at least that of w1 , which is the same as (w1 ,w1 ) because all such versions are necessarily in the subtree rooted at w1.
A example of a computer network system on which the present invention could be implemented is shown in Fig. 1. The system comprises a database server 100 which is in data communication with a number of clients 102. The database server 100 includes a CPU 104 on which a storage engine 106 driving the organisation of the system is loaded. The database server 100 also includes a number of data storage disks 108 on which the data records are stored. The CPU 104 is in data communication with the storage disks 108 to control the organisation of the data records in accordance with the details of the storage engine 106, e.g. the types of operations it is able to perform.
Claims
1. A system for storing one or more data records in a B-tree data structure comprising:
a fully persistent B-tree and an associated version tree,
wherein the B-tree comprises one or more nodes,
wherein each node comprises a version number, one or more keys and one or more pointers,
wherein each of said pointers comprises either an internal pointer which links said nodes together or a pointer which associates the one or more keys with said one or more data records, and
wherein the pointers are ordered lexicographically by key, then by version number in an order consistent with a depth-first search of the version tree from its root.
2. A system for storing one or more data records in a B-tree data structure comprising:
one or more nodes and an associated version tree,
wherein each node comprises one or more ordered entries, and each entry comprises a key, a version, and a data record or a pointer.
3. A system as claimed in claim 2, comprising a B-tree which comprises the one or more nodes.
4. A system as claimed in claim 3, wherein the B-tree is a fully persistent B- tree.
5. A system as claimed in claim any preceding claim, wherein each entry comprises a plurality of keys and/or a plurality of versions and/or a plurality of data records and/or a plurality of pointers.
6. A system as claimed in any preceding claim, wherein each of said pointers comprises either an internal pointer which links said nodes together, a data pointer which associates the one or more keys with said one or more data records or a virtual pointer which indirectly links said one or more keys with said one or more data records via another node.
7. A system as claimed in any preceding claim, wherein the keys are ordered by a natural ordering, e.g. byte order or integer order.
8. A system as claimed any preceding claim, wherein the versions are ordered by descending depth-first search rank.
9. A system as claimed in any preceding claim, wherein the entries in each node are ordered lexicographically by key and then version.
10. A system as claimed in any of claims 1 to 8, wherein the entries in each node are ordered lexicographically by version and then key.
1 1 . A system as claimed in any preceding claim, wherein the version tree comprises one or more ordered version tree nodes labelled by user-defined labels, and a mapping from the user-defined labels to the orderings of the version tree nodes.
12. A system as claimed in any preceding claim, wherein the pointers are labelled by the key and the version of the corresponding node.
13. A system as claimed in claim 12, wherein the pointers are ordered lexicographically by key, then by version number in an order consistent with a depth-first search of the version tree.
14. A system as claimed in any preceding claim wherein the nodes comprise a list of modifications to the data records.
15. A system as claimed in any preceding claim, wherein the nodes comprise one or more internal nodes and one or more leaf nodes.
16. A system as claimed in any preceding claim, wherein for a plurality of contiguous entries in a node comprising the same key, the key is stored only in the first entry.
17. A system as claimed in any preceding claim, the version in each entry comprises a version slice comprising a first version v and a second version w, wherein \ < w.
18. A computer readable data storage medium for storing one or more data records in a system as claimed in any preceding claim.
19. A method for performing a lookup operation in a data storage system as claimed in any of claims 1 to 17, the method comprising the steps of:
for a given version number, v, and key k,
computing an effective node, ev(n), that contains, for all entries ancestral to the given version, the pointer with the latest version not later than the given version number,
determining the starting node, n, from the version v in a root table, for each node, performing within the effective node ev(n) of n with respect to v, a binary search on key for at least an upper bound / 'for k, and
in the case where k' is associated to a pointer to another node, continuing the binary search in that node,
in the case where k' is associated to a data record, and k=k returning the data record;
otherwise reporting that key k is not found.
20. A method for performing a lookup operation in a data storage system as claimed in any of claims 1 to 17, the method comprising the steps of:
performing a binary search to find the key-version pair (k',v') which is a least upper bound for (k,v) with respect to the normal ordering on keys and reverse depth-first search ordering on versions,
following the pointer if k=k' and v=v and returning the data record if the pointer is a data pointer or if the pointer is not a data pointer, repeating the step of performing the binary search at the node to which the pointer points, if node n is not a leaf node, following the pointer with label (k', v') if v' < v, otherwise scanning right to find the first pointer whose label has a version w < v, if node n is a leaf node, scanning right to find the first pointer with label (k,w) such that w < i/and following the pointer when it is found.
21 . A method for performing an insert operation in a data storage system as claimed in any of claims 1 to 17, the method comprising the steps of:
inserting a data record into the data storage,
performing a lookup operation a pointer (k,v) to find a leaf node n at which to insert a key comprising the steps of:
performing a binary search to find the key-version pair (k', v') which is a least upper bound for (k,v) with respect to the normal ordering on keys and reverse depth-first search ordering on versions,
following the pointer if k=k' and v=v and returning the data record if the pointer is a data pointer or if the pointer is not a data pointer, repeating the step of performing the binary search at the node to which the pointer points,
if node n is not a leaf node, following the pointer with label (k', v') if v' < v, otherwise scanning right to find the first pointer whose label has a version w < v, if node n is a leaf node, scanning right to find the first pointer with label (k,w) such that w < v and following the pointer when it is found;
the method further comprising the steps of:
inserting the pointer (k, v) to node n if node n does not contain the pointer (k,v) to point to the inserted data record,
replacing the pointer (k,v) in node n with a new data pointer if the node n contains the pointer (k,v
assigning a new version v' to the inserted key, wherein v'>v.
22. A method for performing a version split operation of node n at version v in a data storage system as claimed in any of claims 1 to 17, the method comprising the step of:
creating a pointer (k',v) and a node nv, wherein the pointer (k',v) points from a parent node np to the node nv, wherein node np also comprises a pointer (k', v') which points to the node n, and wherein the node nv is a copy of the effective node ev(n) of the node n.
23. A method for performing a key split operation of node n at version v in a data storage system as claimed in any of claims 1 to 17, the method comprising the steps of:
creating a node n", a node n" a pointer (7c" v) and a pointer (k', v), wherein the pointer (7c" v) points from a parent node np to the node n" and the pointer (k',v) points from the parent node np to the node n" and wherein node np also comprises a pointer (k',v') which points to the node n,
distributing the pointers from node n between the node n" and the node n" such that all the keys in n" are less than all the keys in n'".
24. A method for performing a full insertion operation in a data storage system as claimed in any of claims 1 to 17, the method comprising the steps of:
performing an insert operation in leaf node n comprising the steps of:
inserting a data record into the data storage,
performing a lookup operation a pointer (k,v) to find a leaf node n at which to insert a key comprising the steps of:
performing a binary search to find the key-version pair (k',v') which is a least upper bound for (k,v) with respect to the normal ordering on keys and reverse depth-first search ordering on versions,
following the pointer if k=k' and v=v and returning the data record if the pointer is a data pointer or if the pointer is not a data pointer, repeating the step of performing the binary search at the node to which the pointer points,
if node n is not a leaf node, following the pointer with label (k',v') if v' < v, otherwise scanning right to find the first pointer whose label has a version w < v, if node n is a leaf node, scanning right to find the first pointer with label (k,w) such that w < v and following the pointer when it is found;
the insert operation further comprising the steps of:
inserting the pointer (k, v) to node n if node n does not contain the pointer
(k,v) to point to the inserted data record,
replacing the pointer (k, v) in node n with a new data pointer if the node n contains the pointer (k,v assigning a new version v' to the inserted key, wherein v'>v;
the method further comprising the step of:
performing a version split of node n in version v if the number of pointers in node n is less than a constant B, wherein B is the capacity of a node, and if v'≠v, wherein the version split comprises the step of:
creating a pointer (k',v) and a node nv, wherein the pointer (k',v) points from a parent node np to the node nv, wherein node np also comprises a pointer (k', v') which points to the node n, and wherein the node nv is a copy of the effective node ev(n) of the node n;
the method further comprising the step of:
performing a key split for node n' to give a new node n" and new key k" if the number of pointers in node n' is larger than αχβ, wherein a is a constant, and wherein the key split comprises the steps of:
creating a node n", a node n" a pointer (7c", v) and a pointer ( c', v), wherein the pointer (7c" points from a parent node np to the node n" and the pointer (7c', points from the parent node np to the node n" and wherein node np also comprises a pointer (k',v') which points to the node n,
distributing the pointers from node n between the node n" and the node n" such that all the keys in n" are less than all the keys in n"'
the method further comprising the steps of:
inserting the new pointers into a parent node np,
performing a key split as for the node np if the number of pointers in node np is larger than σχβ, the key split comprises the steps of:
creating a node n", a node n" a pointer (k", v) and a pointer (k', v), wherein the pointer (7c", v) points from a parent node np to the node n" and the pointer (7c', v) points from the parent node np to the node n" and wherein node np also comprises a pointer (k',v') which points to the node n,
distributing the pointers from node n between the node n" and the node n" such that all the keys in n" are less than all the keys in n"'
25. A method as claimed in claim 24, wherein a = 2/3.
26. A method for performing a migration operation in a data storage system as claimed in any of claims 1 to 17, the method comprising the steps of: replacing a virtual pointer p-, in a node n-, which points to a node n2 with a data pointer d,
replacing the data pointer d in the node nk and all the virtual pointers p, with virtual pointers pointing to the node n^.
27. A method for performing a shortcutting operation in a data storage system as claimed in any of claims 1 to 17, the method comprising the steps of:
replacing the virtual pointer p, in a node n, with a virtual pointer to a node nk.
28. A method for performing an insert operation into node n of a B-tree as claimed in any of claims 1 to 17, the method comprising the steps of:
inserting a data record into the data structure at node n,
creating one or more new nodes (η', n", etc), if node n becomes full following the insertion of the data record, and splitting the entries in node n between node n and node n', n", etc, and
inserting records pointing to the one or more new nodes into any nodes pointing to node n, or
in the case that node n is a root node, creating one or more root nodes to contain the pointer records, or
optionally associating the new root nodes to one or more new versions.
29. A method as claimed in claim 28, wherein the step of splitting the entries in node n between node n and any new node n' comprises moving some of the entries from node n to node n', and copying some of the entries from node n to node n'.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/GB2012/050341 WO2012110812A1 (en) | 2011-02-15 | 2012-02-15 | Data storage system |
Applications Claiming Priority (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| GB1017867.1 | 2010-10-22 | ||
| GBGB1017867.1A GB201017867D0 (en) | 2010-10-22 | 2010-10-22 | Data storage system |
| GB1102606.9 | 2011-02-15 | ||
| GBGB1102606.9A GB201102606D0 (en) | 2011-02-15 | 2011-02-15 | Data storage system |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2012052785A1 true WO2012052785A1 (en) | 2012-04-26 |
Family
ID=44872430
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/GB2011/052060 Ceased WO2012052785A1 (en) | 2010-10-22 | 2011-10-24 | Versioned data structure |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2012052785A1 (en) |
Cited By (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP3091447A4 (en) * | 2013-12-31 | 2016-12-21 | Huawei Tech Co Ltd | METHOD FOR MODIFYING ROOT N UDS AND MODIFICATION APPARATUS |
| CN106575306A (en) * | 2014-08-29 | 2017-04-19 | Netapp股份有限公司 | Method and device for persisting data on non-volatile memory for fast update and instant recovery |
| CN110262826A (en) * | 2019-03-05 | 2019-09-20 | 上海博泰悦臻网络技术服务有限公司 | Vehicle-mounted software configuration updating management method, server-side, client and processing end |
| CN110377306A (en) * | 2019-07-18 | 2019-10-25 | 上海擎感智能科技有限公司 | For the management method and device of mobile unit upgrade package, medium, server |
| DE102021108455B4 (en) | 2020-06-30 | 2024-05-08 | Hewlett Packard Enterprise Development Lp | Creating snapshots of a key-value index |
-
2011
- 2011-10-24 WO PCT/GB2011/052060 patent/WO2012052785A1/en not_active Ceased
Non-Patent Citations (11)
| Title |
|---|
| ARGE ET AL.: "1/o-efficient point location using persistent B-trees", J. EXP. ALGORITHMICS, 8 December 2003 (2003-12-08) |
| CHUTANI ET AL., THE EPISODE FILE SYSTEM USENIX ANNUAL TECHNICAL CONFERENCE, 1992, pages 43 - 60 |
| DRISCOLL ET AL.: "Making data structures persistent", STOC '86: PROCEEDINGS OF THE EIGHTEENTH ANNUAL A CM SYMPOSIUM ON THEORY OF COMPUTING, 1986, pages 109 - 121 |
| DRISCOLL ET AL.: "Persistence, amortization and randomization", SODA '91: PROCEEDINGS OF THE SECOND ANNUAL ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS, 1991, pages 78 - 88 |
| GOODRICH ET AL.: "External-memory computational geometry", PROCEEDINGS OF THE 1993 IEEE 34TH ANNUAL FOUNDATIONS OF COMPUTER SCIENCE, 1993, pages 714 - 723, XP010125726, DOI: doi:10.1109/SFCS.1993.366816 |
| LANKA ET AL., FULLY PERSISTENT B+-TREES, SIGMOD REC., vol. 20, no. 2, 1991, pages 426 - 435 |
| LANKA S ET AL: "FULLY PERSISTENT B+-TREES", SIGMOD RECORD, ACM, NEW YORK, NY, US, vol. 20, no. 2, 1 June 1991 (1991-06-01), pages 426 - 435, XP000364657, ISSN: 0163-5808, DOI: 10.1145/119995.115861 * |
| ORGANIZATION AND MAINTENANCE OF LARGE ORDERED INDEXES, ACTA INFORMATICA, vol. 1, no. 3, 1972, pages 173 - 189 |
| RODEH: "B-trees, shadowing, and clones", TRANS. STORAGE, February 2008 (2008-02-01), pages 3 - 2,1-2,27 |
| THE VLDB JOURNAL, vol. 5, no. 4, 1996, pages 264 - 275 |
| VITTER: "External memory algorithms and data structures: dealing with massive data", ACM COMPUT. SURV., vol. 33, no. 2, 2001, pages 209 - 271, XP002492352, DOI: doi:10.1145/384192.384193 |
Cited By (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP3091447A4 (en) * | 2013-12-31 | 2016-12-21 | Huawei Tech Co Ltd | METHOD FOR MODIFYING ROOT N UDS AND MODIFICATION APPARATUS |
| US10289710B2 (en) | 2013-12-31 | 2019-05-14 | Huawei Technologies Co., Ltd. | Method for modifying root node, and modification apparatus |
| CN106575306A (en) * | 2014-08-29 | 2017-04-19 | Netapp股份有限公司 | Method and device for persisting data on non-volatile memory for fast update and instant recovery |
| CN106575306B (en) * | 2014-08-29 | 2020-07-07 | Netapp股份有限公司 | Method and apparatus for persisting data on non-volatile memory for fast update and instant recovery |
| CN110262826A (en) * | 2019-03-05 | 2019-09-20 | 上海博泰悦臻网络技术服务有限公司 | Vehicle-mounted software configuration updating management method, server-side, client and processing end |
| CN110377306A (en) * | 2019-07-18 | 2019-10-25 | 上海擎感智能科技有限公司 | For the management method and device of mobile unit upgrade package, medium, server |
| DE102021108455B4 (en) | 2020-06-30 | 2024-05-08 | Hewlett Packard Enterprise Development Lp | Creating snapshots of a key-value index |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US7734714B2 (en) | Spatial Sieve Tree | |
| Beckmann et al. | A revised R*-tree in comparison with related index structures | |
| Whang et al. | The Multilevel Grid File-A Dynamic Hierarchical Multidimensional File Structure. | |
| US10102253B2 (en) | Minimizing index maintenance costs for database storage regions using hybrid zone maps and indices | |
| US7917474B2 (en) | Systems and methods for accessing and updating distributed data | |
| US7103588B2 (en) | Range-clustered tables in a database management system | |
| WO2020234719A1 (en) | Indexing for evolving large-scale datasets in multi-master hybrid transactional and analytical processing systems | |
| CN110083601A (en) | Index tree constructing method and system towards key assignments storage system | |
| JPH07191891A (en) | Computer method and storage structure for storage of, and access to, multidimensional data | |
| KR20110010736A (en) | Paging Hierarchical Data | |
| Hadian | Interpolation-friendly B-trees: Bridging the gap between algorithmic and learned indexes | |
| US20080033907A1 (en) | Business object search using multi-join indexes and extended join indexes | |
| US20180165332A1 (en) | Indexing dynamic hierarchical data | |
| WO2012052785A1 (en) | Versioned data structure | |
| Eslami et al. | Memento Filter: A Fast, Dynamic, and Robust Range Filter | |
| Petrov | Algorithms behind modern storage systems: Different uses for read-optimized b-trees and write-optimized lsm-trees | |
| Freeston | A well-behaved file structure for the storage of spatial objects | |
| CN117131012B (en) | Sustainable and extensible lightweight multi-version ordered key value storage system | |
| US7693850B2 (en) | Method and apparatus for adding supplemental information to PATRICIA tries | |
| Tzouramanis et al. | Benchmarking access methods for time-evolving regional data | |
| WO2012110812A1 (en) | Data storage system | |
| Stenbacka | Extending a data management system with multiversion indexing, branching, and snapshots | |
| Di Pasquale et al. | Fully Dynamic Balanced and Distributed Search Trees with Logarithmic Costs. | |
| Kriegel et al. | A flexible and extensible index manager for spatial database systems | |
| US20250378115A1 (en) | Modeling graphs on distributed storage |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 11776010 Country of ref document: EP Kind code of ref document: A1 |
|
| NENP | Non-entry into the national phase |
Ref country code: DE |
|
| 122 | Ep: pct application non-entry in european phase |
Ref document number: 11776010 Country of ref document: EP Kind code of ref document: A1 |