US20170060924A1

US20170060924A1 - B-Tree Based Data Model for File Systems

Info

Publication number: US20170060924A1
Application number: US15/084,401
Authority: US
Inventors: Jeremy Fitzhardinge
Original assignee: Exablox Corp
Current assignee: Exablox Corp
Priority date: 2015-08-26
Filing date: 2016-03-29
Publication date: 2017-03-02
Also published as: US20170063990A1; US10474654B2

Abstract

Methods and systems for organizing data are provided. An example method includes providing an object store to store objects. Each of the objects represents fragments of the data and is associated with an address. The method further allows associating a B-tree with the object store. The B-tree includes nodes, wherein each of the nodes includes keys, and wherein each of the keys is associated with at least one object from the object store. Values for each of the keys are generated based at least partially on objects from the object store. If the size of an object from the object store is less than a pre-determined size, a value of the object is stored in a particular node of the B-tree, with the particular node including a particular key associated with the object. Otherwise, the method includes storing the address associated with the object in the particular node of the B-tree.

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application claims benefit of U.S. provisional application No. 62/210,385 filed on Aug. 26, 2015. The disclosure of the aforementioned application is incorporated herein by reference for all purposes.

TECHNICAL FIELD

This disclosure relates generally to data processing and, more specifically, to methods and systems for providing a B-tree based data model for organizing file systems.

BACKGROUND

The approaches described in this section could be pursued but are not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.
In computer systems, user data are typically organized as file systems. In general, a file system may be viewed as a directed graph of objects, wherein nodes and leaves of the directed graph represent files and directories. Directories may further include subdirectories and files. In a multi-user computer system, each file or directory is assigned attributes that regulate user permissions for viewing, editing, and creation of files and directories. Attributes of directories and files are kept in the directed graph as objects. Large files in a file system may be split into a chain of blocks. Therefore, during a lifetime of a file system, the directed graph may develop very long paths or chains of referring objects from the root of the directed graph to the leaves. Any modification to the file system, such as modifying an existing file or directory or creating a new one, then requires traversing the path through nodes to find the place for adding, creating, or modifying a node. Long and unbalanced paths from the root to the leaves may require excessive input/output operations that are time consuming and lead to unnecessarily redundant storage consumption.

SUMMARY

This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
The technology described herein includes methods for organizing data. An example embodiment can provide a B-tree based data model constructed on top of an object store. The object store may include immutable, content-addressable, distributed objects representing fragments of user content.
According to the example embodiment, a method includes providing an object store to store objects. The objects can represent fragments of the data. Each object can be associated with an address. The method can further include associating a B-tree with the object store. The B-tree can include nodes, with each of the nodes including keys. Each key can be associated with at least one object from the object store. The method allows generating, based at least partially on objects from the object store, values for each key.
In some embodiments, the address of an object in the object store is based on content of the fragment of the data. In some embodiments, the method determines whether a size of an object from the object store is less than a pre-determined size. If the result of the determination is positive, a value of the object is stored in a particular node of the B-tree. The particular node includes a particular key associated with the object. If the result of the determination is negative, the address of the object is stored in the particular node of the B-tree.
In some embodiments, each object includes one of a metadata object or a data object. Metadata objects store at least a number of references to further objects from the data objects and an identification number associated with a file or a directory. The data object represents a continuous fragment of the file or a directory entry.
In some embodiments, the key associated with the metadata object includes at least three fields. The first field includes an indication of the metadata object. The second field includes the identification number. The third field includes a metadata index representing a distinct type of the metadata object.
In some embodiments, the type of the metadata object includes one of: attributes associated with the file or the directory, a symbolic link to the file or the directory, and an extended file attribute associated with the file or the directory. In some embodiments, the key associated with the fragment of the file includes at least three fields. The first field includes an indication of the data object. The second field includes the identification number. The third field includes an offset of the continuous fragment of the file from the beginning of the file.
In some embodiments, the key associated with the directory entry includes at least three fields. The first field includes an indication of the data object. The second field includes the identification number. The third field includes a hash calculated based on a literal name of the directory entry.
In some embodiments, calculating the hash includes applying a hash function to the literal name to obtain a preliminary hash. The hash function can include, for example, crc32c, SipHash, or SipHash-order3. The preliminary hash can be shifted by a pre-determined base number to obtain the hash. If the hash matches an existing hash, then the hash is incremented by 1.
According to another example embodiment of the present disclosure, the steps of the method for organizing data are stored on a machine-readable medium comprising instructions, which, when implemented by one or more processors, perform the recited steps.

BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments are illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements.

FIG. 1 illustrates an example system for organizing computer data.

FIG. 2 illustrates an example of two collections sharing mutual data objects.

FIG. 3 illustrates an example B-tree.

FIG. 4 displays a table containing types of variables.

FIG. 5A illustrates attributes describing a superblock entry in a data model.

FIG. 5B illustrates attributes of a key used in a B-tree based data model.

FIG. 6 displays attributes of a metadata, attributes of a file entry, and additional attributes of a directory entry in a B-tree based data model.

FIG. 7 illustrates internal and external forms for extended attributes used in a B-tree data model.

FIG. 8 illustrates a key, internal form, and external form for values for a file chunk.

FIG. 9 illustrates a key and forms for values for a directory entry.

FIG. 10 is a process flow diagram showing a method for organizing data.

FIG. 11 shows a diagrammatic representation of a computing device for a machine in the example electronic form of a computer system, within which a set of instructions for causing the machine to perform any one or more of the methodologies discussed herein can be executed.

DETAILED DESCRIPTION

The following detailed description includes references to the accompanying drawings, which form a part of the detailed description. The drawings illustrate exemplary embodiments. These exemplary embodiments, which are also referred to herein as “examples,” are described in enough detail to enable those skilled in the art to practice the present subject matter. The embodiments can be combined, other embodiments can be utilized, or structural, logical and electrical changes can be made without departing from the scope of what is claimed. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope is defined by the appended claims and their equivalents.
The technology described herein allows organizing data using a B-tree built on top of an object store. In some embodiments, the object store includes immutable content-addressable distributed objects. The objects may represent fragments of data (for example, fragments of files and directories and metadata including attributes of files and directories).
According to an example embodiment, the method for organizing data includes providing an object store to store objects. The objects represent fragments of the data. Each object is associated with an address. The method can further include associating a B-tree with the object store. The B-tree includes nodes, with each node including keys. Each key can be associated with at least one object from the object store. The method allows generating, based at least partially on objects from the object store, values for each of the keys.
FIG. 1 illustrates an example system 100 for organizing data, according to various example embodiments. The illustrated system 100 includes applications 110, a directed graph 120, and an object store 130. In various embodiments, applications may include operating systems, user applications, graphical user interfaces, and other software for creating, managing, and presenting the data.
In various embodiments, the object store 130 includes key-value entries (objects). Each of the key-value entries includes an identifier (a key) and a payload (also referred to as a value or an object content) representing a chunk of the data (for example, a chunk of user content). In various embodiments, objects are designated as either “data” or “metadata.” Payloads of “data” objects include only uninterpreted bytes. Payloads of “metadata” objects have an internal structure and may refer to other objects. In some embodiments, the payloads of “metadata” objects include keys of the objects to which the “metadata” objects refer.
In some embodiments, objects in the object store 130 are organized by a graph structure 120 (also referred to as a directed graph or a collection). The graph structure 120 is a directed graph in which each node is an immutable content-addressable object. The collection includes a specific object designated as a root object.
In various embodiments, objects are at least partially content-addressable. This implies that some portions of the identifiers of objects are functions of object contents. A key for the object representing a chunk of data can be calculated using a smart hash (Smash) which is a function of bytes in the chunk. In some embodiments, each version of data (a snapshot) corresponds to a new graph structure. Two graph structures representing two different snapshots can share mutual objects.
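The content-addressing idea described above can be sketched as a toy in-memory store. This is a minimal illustration only: the `ObjectStore` class and its methods are hypothetical names, and SHA-256 stands in for the Smash function, whose details are not given in the text.

```python
import hashlib


class ObjectStore:
    """Toy in-memory content-addressable store (illustrative only)."""

    def __init__(self):
        self._objects = {}

    def put(self, payload: bytes) -> str:
        # Assumption: SHA-256 stands in for the Smash function.
        address = hashlib.sha256(payload).hexdigest()
        # Identical payloads map to the same address, so they are stored once.
        self._objects.setdefault(address, payload)
        return address

    def get(self, address: str) -> bytes:
        return self._objects[address]

    def __len__(self) -> int:
        return len(self._objects)


store = ObjectStore()
a1 = store.put(b"hello")
a2 = store.put(b"hello")  # same content, same address
a3 = store.put(b"world")
```

Because the address is a function of the bytes, storing the same chunk twice yields one object, which is the property that later allows snapshots to share objects.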
FIG. 2 is a block diagram showing two collections 100 and 200 as having mutual objects. Collection 100 includes root object 110 and objects 122, 124, 126, 128, 130, and 230. The collection 200 includes root object 210 and objects 222, 224, 226, 228, 230, 126, 130, and 232. The collections 200 and 100 share at least objects 126, 130, and 230. In the example of FIG. 2, the collection 200 is an increment of the collection 100. Each of the root objects is the root of a specific unique immutable graph. In order to get the effect of a modification or mutation, a new graph is constructed that includes the required changes. If the new graph is an incremental change from a previous graph, then the new graph is likely to share a large number of its objects with the previous graph. The only change required by the new objects is a new root.
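The sharing of objects between two immutable graphs can be sketched as follows. This is a hedged illustration, not the layout used by the described system: the one-byte `D`/`M` type prefix, the JSON list of child addresses, and SHA-256 as the address function are all assumptions made for the example.

```python
import hashlib
import json

objects = {}  # address -> payload


def put(payload: bytes) -> str:
    address = hashlib.sha256(payload).hexdigest()  # stand-in for Smash
    objects[address] = payload
    return address


def put_data(data: bytes) -> str:
    return put(b"D" + data)  # "data" object: uninterpreted bytes


def put_meta(children: list) -> str:
    # "metadata" object: payload holds the keys of the objects it refers to
    return put(b"M" + json.dumps(children).encode())


def reachable(root: str) -> set:
    """Collect the addresses of all objects reachable from a root object."""
    seen, stack = set(), [root]
    while stack:
        address = stack.pop()
        if address in seen:
            continue
        seen.add(address)
        payload = objects[address]
        if payload[:1] == b"M":
            stack.extend(json.loads(payload[1:]))
    return seen


a = put_data(b"chunk A")
b = put_data(b"chunk B")
root_v1 = put_meta([a, b])

# A modification builds a new graph; the unchanged chunk A is reused as-is.
b2 = put_data(b"chunk B, edited")
root_v2 = put_meta([a, b2])

shared = reachable(root_v1) & reachable(root_v2)
```

Both roots remain valid after the change, so the earlier snapshot is still readable while the new one shares the untouched object.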
In some embodiments, the objects in object store 130 include entities such as an identifier node (“inode”), extended attributes (“xattr”), a symbolic link (“symlink”), and chunks of files and directories. In a collection (for example, collection 100 or 200), each file or directory is represented by a chain of at least three objects: an “inode” object, an “xattr” object, and a “data” object. An “inode” object refers to the “xattr” object and the “data” object. An “inode” object includes an identification number (“inum”) associated with the file or directory, number of links, file type and permission, creation and modification times of the file or directory, and other attributes. An “xattr” object includes extended file attributes depending on a type of a file system, such as information concerning an author of a text, a checksum, encoding, and so forth. An extended attribute may include a name of the attribute and a value associated with the attribute's name. The “data” object is an object containing a chunk of data (for example, a chunk of a file). In some embodiments, an address of an object in object store 130 is a Smash, which is a hash function of the content of the object.
In some embodiments, the graph structure 120 includes a B-tree (also referred to as a B-tree model). In various embodiments, using a B-tree model for referencing metadata objects and data objects may allow accessing files by “inode” number in order to support persistent file handles. The B-tree may provide tools for efficiently managing large directories having up to 1,000,000 entries, hard links, and a large number of small files. The B-tree model allows flexible data layouts for non-streaming workloads.
The result is that B-tree collections tend to have fewer, larger objects with many smaller file system entities such as “inodes,” “symlinks,” “xattrs,” and small amounts of data packed together. The downside includes an increased conceptual complexity resulting in a structure that is unlike the user-visible file system namespace, and in some cases, results in increased input/output (I/O) resource consumption due to additional read-modify-write operations to repack updates within an otherwise unchanged object.
All leaves in a B-tree are at the same distance from the root, and nodes have wide fanouts. The leaf-to-root path length is bounded to a small number, typically no more than 4-5 levels, even for bigger file systems. This property results in efficient access patterns for looking data up and minimal I/O amplification when writing.
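A back-of-the-envelope calculation shows why wide fanout keeps the path short. The sketch below uses a simplified model in which a tree with `levels` levels and a uniform fanout can index `fanout ** levels` entries; the fanout values are illustrative assumptions, not figures from the text.

```python
def levels_needed(num_entries: int, fanout: int) -> int:
    """Smallest number of levels whose capacity covers num_entries."""
    levels = 1
    capacity = fanout
    while capacity < num_entries:
        capacity *= fanout
        levels += 1
    return levels


# Assumed fanouts of 100 and 200 keys per node, for illustration.
million = levels_needed(1_000_000, 100)      # 100**3 covers one million
billion = levels_needed(1_000_000_000, 200)  # 200**4 covers one billion
```

Under these assumptions a million entries fit in 3 levels and a billion in 4, which is consistent with the 4-5 level bound stated above.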
In some embodiments, a “libebtree” library is used for the B-tree implementation. In some embodiments, keys in B-tree nodes have a fixed size. In other embodiments, values in B-tree nodes have variable sizes. In some embodiments, much of the overall structure of a file system is placed into the design of the key space. In various embodiments, values associated with keys are then used to store small items efficiently packed into larger storage objects while larger objects are directly stored in object store 130.
Packing small items such as “inode” attributes, extended attributes, “symlinks,” and small file content into B-tree nodes can significantly reduce the number of objects required. For example, a small file with an “xattr” can be represented by 3 objects (“inode,” “xattr,” and “data”) in a collection described in FIG. 2. In the B-tree model, the small file can be packed into objects along with many other “inodes/xattrs/data” objects, consuming less than one object for the entire file.
In some embodiments, data storage is not limited by a fixed block size. Block sizes can be chosen to match the workload. Non-sequential I/O patterns can be matched to small blocks in order to avoid excess I/O amplification, whereas larger streaming writes are better served by large blocks. The B-tree model design allows using file blocks from 1 kilobyte to 1 Megabyte.
An example B-tree 300 is shown in FIG. 3. The example B-tree 300 includes at least a root 302, nodes 304, 306, and 308, and a leaf 310. The root 302 includes at least a key 308 (k1). The node 304 includes at least keys 312 (k2), 314 (k3), and 316 (k4). The node 306 includes at least keys 318 (k5) and 320 (k6). The node 308 includes at least keys 322 (k7) and 324 (k8). The example leaf 310 includes at least key 326 (k9). The nodes 304 and 306 are referred to by the root 302 and refer to further nodes (for example, node 308). The key k1 of root 302 divides all further nodes into two subtrees. The first subtree starts with the node 304. Keys k2, k3, and k4 are all less than key k1. The second subtree starts with the node 306. Keys k5 and k6 are greater than key k1. Node 304 may further divide B-tree 300 and refer to at least four further subtrees, while node 306 may further divide the B-tree into at least three further subtrees. The example node 304 refers to at least node 308. Keys k7 and k8 in node 308 are greater than key k3 and less than key k4 of the node 304. The leaf 310 does not refer to any further nodes. Each node and leaf of the B-tree may also include value fields (also referred to as payloads) for each key in the node or the leaf to store data inline, that is, inside the B-tree.
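The way keys divide a B-tree into subtrees can be illustrated with a small lookup routine. This is a generic sketch, not the “libebtree” implementation: the `Node` class, the integer keys, and the string values are hypothetical, and the tree only mirrors the shape described above (a root with one key, a left node with three keys, a right node with two keys).

```python
from bisect import bisect_left
from dataclasses import dataclass, field


@dataclass
class Node:
    keys: list                      # sorted keys
    values: list                    # one inline value (payload) per key
    children: list = field(default_factory=list)  # len(keys) + 1, or empty for a leaf


def search(node, key):
    """Descend from the root, choosing the subtree between neighbouring keys."""
    while node is not None:
        i = bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return node.values[i]
        node = node.children[i] if node.children else None
    return None


def leaf(*keys):
    return Node(list(keys), [f"v{k}" for k in keys])


# Left node holds keys below 50; right node holds keys above 50.
left = Node([10, 20, 30], ["v10", "v20", "v30"],
            [leaf(5), leaf(15), leaf(22, 25), leaf(40)])
right = Node([60, 70], ["v60", "v70"], [leaf(55), leaf(65), leaf(80)])
root = Node([50], ["v50"], [left, right])

found = search(root, 25)   # lives in the leaf between keys 20 and 30
missing = search(root, 99)
```

A node with three keys refers to four subtrees and a node with two keys refers to three, matching the description of nodes 304 and 306.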
FIG. 4 is a table 400 of types of variables that can be used for generating keys of B-trees, according to various example embodiments. FIG. 5A is a table showing superblock 510 containing attributes associated with an example file system modeled via a B-tree. In some embodiments, the superblock 510 can be implemented as a single, specific object in object store 130. The superblock 510 includes a magic number that identifies the superblock in the object store 130, unique identifier for the file system, update sequence of the file system, key of the root object in a B-tree, maximum object size in a B-tree, maximum object size of data in object store 130, and maximum size of data that can be included in a B-tree. In some embodiments, the identifier of file system and the update sequence can be used as references for debugging, as forensics to identify root objects for a given collection, and to determine the chronology of the root objects.
FIG. 5B shows fields of a key 520 located at a node in a B-tree, according to various example embodiments. The key 520 includes three fields: a “data” field 522, an “inum” field 524, and an “index/offset” field 526. The “data” field identifies the type of object: data or metadata. In some embodiments, “data” field 522 is equal to 0 for metadata and 1 for data. “Inum” field 524 indicates which file, directory, or metadata entry the key 520 represents. “Index/offset” field 526 relates to a particular chunk of data or an index of metadata. The field 526 is an index if key 520 is associated with metadata (“data” field 522 is 0) and an offset of a particular file chunk if key 520 relates to a file (“data” field 522 is 1).
In some embodiments, each distinct type of metadata has its own metadata index. A list of metadata indexes is shown in table 610 in FIG. 6. The metadata index 612 in table 610 is the value placed in “index/offset” field 526 of key 520 if the key 520 relates to metadata (that is, when “data” field 522 is 0). In some embodiments, metadata index 612 (and corresponding “index/offset” field 526) is 1 if the metadata are POSIX file attributes and 2 if the metadata is a “symlink” target path. The target data for a “symlink” is the literal bytes of the target. In some embodiments, the literal bytes of the target are stored in a value associated with a key 520 inside a node of a B-tree. If key 520 relates to metadata containing an extended attribute of a file, index 612 is a hash of a name and value of the extended attribute.
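The three-field key can be modelled as a tuple whose natural ordering groups entries in the key space. The sketch below is illustrative: the `Key` type, the `pack_key` helper, and the field widths (1 byte, 8 bytes, 8 bytes) are assumptions, since the text states only that keys may have a fixed size.

```python
import struct
from typing import NamedTuple

METADATA, DATA = 0, 1  # values of the "data" field


class Key(NamedTuple):
    data: int             # 0 for metadata, 1 for data
    inum: int             # which file or directory the key belongs to
    index_or_offset: int  # metadata index, or byte offset of a file chunk


def pack_key(key: Key) -> bytes:
    # Assumed fixed-size big-endian layout, so byte order equals tuple order.
    return struct.pack(">BQQ", key.data, key.inum, key.index_or_offset)


keys = [
    Key(DATA, 7, 4096),
    Key(METADATA, 7, 1),   # index 1: POSIX attributes
    Key(DATA, 7, 0),
    Key(METADATA, 3, 1),
    Key(DATA, 3, 0),
]
ordered = sorted(keys)
```

With this ordering, all metadata keys sort ahead of data keys, and the chunks of one file sit next to each other in offset order, so a range scan over one “inum” visits them sequentially.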
FIG. 6 also illustrates table 620 containing attributes 614 of a file or a directory and table 630 containing additional attributes 616 for a directory. In various embodiments, the file attributes 614 include regular POSIX attributes. Additional attributes 616 for directories include a hash function (“hashnf”) and a key for the hash function (“hashkey”). In some embodiments, “hashnf” is 0 if the hash function is cyclic redundancy check “crc32c,” 1 if the hash function is SipHash, and 2 if the hash function is SipHash with 3 characters of ordering.
In some embodiments, the root directory has a constant “Inode” number of 1000. The root directory has no parent. The root includes a “..” entry pointing back to the root directory. All new directories, including the root directory, start with attribute “nlink” equal to 2.
Referring back to table 610, in some embodiments, “xattrs” are indexed by name using a “namehash” scheme which is further described in FIG. 9 and “crc32c” as the hash function. The value of a key associated with “xattrs” can be in one of the two forms: internal or external.
FIG. 7 illustrates internal form 710 and external form 720 for a value of a key associated with an extended attribute. Both forms include an “inline” field 712 and a second field 714. In internal form 710, the “inline” field 712 is 1 and the second field 714 includes a literal name (and a value) of an extended attribute. In external form 720, the “inline” field 712 is equal to 0 and the second field 714 includes a smart hash of an object in an object store, wherein the object includes the extended attribute. If the size of the extended attribute is larger than “maxinlinesize” (described in FIG. 5A), then the external form 720 is stored in a B-tree. If the extended attribute does not exceed “maxinlinesize,” then the value for the extended attribute is stored inline in a B-tree using form 710. The size of the value for an extended attribute may be as large as “maxdataobjsize” (described in FIG. 5A), so that extended attribute values never become more than one object in size.
FIG. 8 shows a key 810 for a chunk of file, an internal form 820 for value of the file chunk, and an external form 830 for value for the file chunk, according to some example embodiments. Key 810 includes a “data” field 522 set to 1, an “Inum” field 524, and “offset” field 526. “Inum” field is the metadata identifier of the file. “Offset” field is the offset of the file chunk from the beginning of file. Chunk (also referred to as a fragment or a continuous fragment) represents a byte extent of file content. Two chunks of the same file are not allowed to overlap. Gaps between chunks and gaps between the end of the last chunk and file size are implicitly zero-filled.
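The rule that gaps are implicitly zero-filled can be illustrated with a small reader. This is a sketch under stated assumptions: the chunks of one file are modelled as a hypothetical mapping from offset to bytes, and `read_file` is an illustrative helper rather than part of the described system.

```python
def read_file(chunks: dict, file_size: int) -> bytes:
    """Assemble file content from non-overlapping chunks keyed by offset."""
    buf = bytearray(file_size)  # starts zero-filled, which covers every gap
    for offset, data in chunks.items():
        if offset >= file_size:
            continue  # chunk lies entirely past the end of the file
        end = min(offset + len(data), file_size)
        buf[offset:end] = data[: end - offset]
    return bytes(buf)


# Two chunks with a gap between them and a gap before the end of the file.
content = read_file({0: b"abc", 8: b"xyz"}, file_size=14)
```

Only the bytes actually written are stored; the gap at offsets 3 to 7 and the tail at offsets 11 to 13 read back as zeros.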
In some embodiments, the value of the file chunk can be stored inline in a B-tree or externally in object store 130. Internal form 820 is used when the chunk is stored directly in a B-tree. Internal form 820 includes a “BTC_DATA_INLINE” field 822 set to 1 and a second field 824 containing the literal byte data of the chunk. The file chunk is stored inline in a node of a B-tree if the size of the chunk is less than “maxinlinesize” (described in FIG. 5A).
External form 830 is used when the file chunk is stored externally in object store 130. External form 830 includes a “BTC_DATA_EXTERN” field 832 set to 2, a “size” field 834 representing the size of the file chunk, and a Smash of the object in object store 130 that corresponds to the file chunk. The external form is stored in a node of a B-tree if the size of the chunk is larger than “maxinlinesize” (described in FIG. 5A).
In some embodiments, dividing files into chunks (also referred to herein as chunking) is performed in accordance with one or more policies. While introducing a policy for chunking, one may consider the following facts:
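The choice between the internal and external forms can be sketched as a single size check. This is a minimal illustration: `make_chunk_value` is a hypothetical helper, a plain dictionary stands in for object store 130, SHA-256 stands in for Smash, and a chunk whose size exactly equals the limit is treated as external, a case the text leaves open.

```python
import hashlib

BTC_DATA_INLINE = 1
BTC_DATA_EXTERN = 2
MAX_INLINE_SIZE = 4096  # assumed value of "maxinlinesize"


def make_chunk_value(chunk: bytes, object_store: dict,
                     max_inline: int = MAX_INLINE_SIZE) -> tuple:
    """Build the B-tree value for a file chunk."""
    if len(chunk) < max_inline:
        # Internal form: tag plus the literal bytes of the chunk.
        return (BTC_DATA_INLINE, chunk)
    # External form: tag, size, and the address of the stored object.
    address = hashlib.sha256(chunk).hexdigest()  # stand-in for Smash
    object_store[address] = chunk
    return (BTC_DATA_EXTERN, len(chunk), address)


demo_store = {}
small_value = make_chunk_value(b"tiny file", demo_store)
large_value = make_chunk_value(b"x" * 10_000, demo_store)
```

Small chunks never touch the object store, while large chunks leave only a fixed-size reference in the B-tree node.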

- Large chunks may amortize per-chunk or per-object overhead at the cost of increased I/O amplification if the whole object is not needed.
- The object store may perform deduplication of objects based on object ID. Boundaries of an object determine the object ID. Therefore, if two identical byte sequences are chunked differently, then these two byte sequences may not deduplicate against each other.
- Compression of data in a file may cause a wide variance between the logical file content and the resulting objects if the file data is highly compressible. For example, for common patterns like “all zeros” the difference between the logical file content and the resulting objects can be a hundredfold.

In some embodiments, a “block-aligned chunking” policy is applied. The block-aligned policy causes a chunk structure of a file to reflect write patterns to the file. If the file is written with streaming sequential writes then the chunks end up being as large as possible, which minimizes the per-object overhead and metadata. If the file is written non-sequentially, the chunks reflect the IO sizes of the writes to minimize read-modify-write overhead of updating a portion of a chunk's object.
The following two parameters are relevant for “block-aligned chunking” policy.

- maxdataobjsize parameter defined for a collection. The maxdataobjsize parameter constrains the maximum size of the object used to store data; and
- block size parameter defined per file. The block size parameter controls how chunks are split and aggregated. The block size parameter is a power of 2.

In some embodiments, when data is written to a file, the data is accumulated in memory waiting for a subsequent write. Adjacent writes are merged into an object up to the maxdataobjsize.
In some embodiments, existing data for non-sequential writes, if present, is removed and replaced with the new data. The block size logically subdivides the file into a sequence of block sized and aligned segments. A chunk can span multiple blocks, but not more than one chunk can be located within a block. Therefore, if a write is not block-aligned, any existing chunk is split at the block boundaries, and a new chunk is inserted. The new chunk contains a combination of the old and new data. If the overwrite is already block aligned, then the new data is simply written. If the overwrite aligns with an existing chunk then the chunk is simply replaced without affecting the surrounding chunks.
In some embodiments, if it is determined that a write results in more than one chunk within a single block region, the chunks are merged to maintain the invariant of no more than one chunk per block.
In some embodiments, an additional constraint is introduced that forces chunks to be always split at “maxdataobjsize” boundaries in the file. This constraint may allow two files which are substantially similar to deduplicate against each other by allowing chunking to resynchronize at “maxdataobjsize” boundaries.
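Splitting at “maxdataobjsize” boundaries can be sketched as simple offset arithmetic. The helper `split_at_boundaries` and the 512-byte limit are hypothetical and chosen only to keep the numbers small.

```python
def split_at_boundaries(offset: int, length: int, max_obj: int) -> list:
    """Split a byte range so that no piece crosses a multiple of max_obj."""
    pieces = []
    end = offset + length
    while offset < end:
        next_boundary = (offset // max_obj + 1) * max_obj
        piece_end = min(end, next_boundary)
        pieces.append((offset, piece_end - offset))  # (offset, size)
        offset = piece_end
    return pieces


# A 1000-byte write starting at offset 700 with an assumed 512-byte limit.
pieces = split_at_boundaries(700, 1000, 512)
```

Here the write is cut at offsets 1024 and 1536, so any other file containing the same aligned 512-byte span would produce an identical middle chunk.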
In some embodiments, a “fingerprint chunking” policy can be applied. The “fingerprint chunking” policy is intended to maximize the opportunities for deduplication by making the chunk structure a function of the file content rather than the write patterns. When “fingerprint chunking” is applied, the same byte sequence may result in the same chunk structure, and, therefore, the same object IDs. In some embodiments, Rabin fingerprint algorithm can be used to select content-dependent chunk division points.
The “fingerprint chunking” policy may work best for streaming sequential writes. Non-sequential overwrites are especially expensive, as the new writes need to be merged with existing data, and the new data re-chunked according to the fingerprinting.
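Content-dependent division points can be illustrated with a rolling hash over a sliding window. The sketch below is not the Rabin fingerprint algorithm: it swaps in a simple polynomial rolling hash, and the window size, base, and mask are arbitrary assumptions. It only shows the general property that boundaries depend on content rather than on position.

```python
import hashlib

WINDOW = 16             # bytes covered by the rolling hash (assumed)
BASE = 257
MOD = 1 << 32
MASK = (1 << 6) - 1     # cut when the low 6 bits are zero (assumed)
TOP = pow(BASE, WINDOW - 1, MOD)


def chunk_by_content(data: bytes) -> list:
    """Cut after every position whose trailing window hashes to a marker value."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * TOP) % MOD  # drop the oldest byte
        h = (h * BASE + byte) % MOD                 # add the newest byte
        if i >= WINDOW - 1 and (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


# Deterministic pseudo-random test data, and the same data shifted by 3 bytes.
sample = b"".join(hashlib.sha256(i.to_bytes(4, "big")).digest() for i in range(200))
original_chunks = chunk_by_content(sample)
shifted_chunks = chunk_by_content(b"XYZ" + sample)
```

Because each cut decision depends only on the last `WINDOW` bytes, inserting bytes at the front disturbs only the first chunk; later chunks come out identical and would therefore receive the same object IDs.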
In some embodiments, a compression can be applied while chunking the file data. In some embodiments, the user data (as large as possible) are fed into a compression algorithm at write time in order to create a specific output size. The compression can reduce not only the data size in bytes, but also in objects. Like “fingerprinting”, compression is a content-dependent transformation, and is easiest to apply to streaming writes. Using compression for non-sequential writes, and, especially, overwrites, is expensive because the read-modify-write cycle also requires decompression and recompression.
In some embodiments, an “inlining” policy is applied. The inlining may allow small data to be directly embedded within a B-tree, rather than requiring small external objects in the object store. The inlining precludes deduplication for small files and results in a minimum amount of space savings from deduping small files.
A B-tree collection possesses a global “maxinlinesize” setting. Maxinlinesize is defined at creation time of the B-tree collection to set the upper bound on the largest chunk that may be inlined. Typically, maxinlinesize is about 4k. As mentioned above, each file is also associated with an “inlinesize” which defines the chunk size that is inlined. The inlining is useful for small files or large sparse files with lots of small spans. The default “inlinesize” is 1k.
FIG. 9 shows a key 910 for a directory entry and value 920 for the directory entry. The directories are indexed by name using a “namehash” scheme similar to extended attributes “xattr.” In some embodiments, a region of the B-tree key space is reserved starting from a “base” for a certain number of slots. Slots are populated with entries as the entries are added. In some embodiments, it is assumed that the number of slots is very large with respect to the expected number of entries, so the likelihood of collisions is low. B-tree key 910 includes a “data” field 522 equal to 1, an “Inum” field 524 representing a metadata identifier of the directory, and an “offset” field 526 which includes a “diroffset” value. “Diroffset” is a hash function of a literal name of a directory entry (also referred to as a directory chunk). The value 920 of the directory entry includes a “BTC_DIRENT” field 922 set to 0, a name field 924 holding the name of the directory entry, a “direnttype” field 926 holding the type of entry, and an “inum” field 928 holding a metadata identifier of the directory entry.
Each entry in slots has a form 930 as shown in FIG. 9. The “tag” field is used by directories for the chunk type field. An entry with a zero-length name (the name field is a single 0x00 byte) is a tombstone 940. A tombstone 940 is a deleted entry, which is required so that a lookup algorithm can find later names with a colliding hash.
In some embodiments, when a new directory or extended attribute entry is inserted in the B-tree, the name of the entry is hashed with a hash function to find a corresponding key, which is a slot in the reserved slots. If the located slot is occupied by another name due to a hash collision, the key is incremented until an available slot is found, which can be either a completely unused slot or a slot occupied by a tombstone.
In some embodiments, when an entry is looked up in the B-tree, the name is hashed to determine the corresponding slot. If the determined slot is unused, the entry does not exist. If the determined slot is occupied and the name does not match, the key is incremented until the name is found or an unused slot is found. Any tombstones encountered are then ignored and skipped over.
In some embodiments, when an entry is deleted, the lookup algorithm is used to find the slot for the name in the B-tree. If the slot exists and the next slot is unused, then the entry is deleted. If the next slot is occupied, then it is assumed that a hash collision has occurred and the entry is replaced by a tombstone (a zero-length name, and no payload). If there is a series of tombstones followed by an empty entry, the tombstones can be deleted.
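The insert, lookup, and delete rules described above amount to linear probing with tombstones, and can be sketched as follows. This is an illustration under assumptions: the `NameHashDir` class is hypothetical, a dictionary stands in for the reserved region of the B-tree key space, and `zlib.crc32` (which uses a different polynomial than crc32c) is used only as a convenient default hash.

```python
import zlib

TOMBSTONE = ("", None)  # zero-length name, no payload


class NameHashDir:
    def __init__(self, hash_fn=None, base=0):
        self.slots = {}  # slot key -> (name, inum)
        self.base = base
        self.hash_fn = hash_fn or (lambda name: zlib.crc32(name.encode("utf-8")))

    def _find(self, name):
        slot = self.base + self.hash_fn(name)
        while slot in self.slots:            # an unused slot ends the search
            if self.slots[slot][0] == name:  # tombstones never match a real name
                return slot
            slot += 1
        return None

    def lookup(self, name):
        slot = self._find(name)
        return None if slot is None else self.slots[slot][1]

    def insert(self, name, inum):
        slot = self._find(name)
        if slot is None:
            slot = self.base + self.hash_fn(name)
            # Probe until a slot is unused or holds a tombstone.
            while slot in self.slots and self.slots[slot] != TOMBSTONE:
                slot += 1
        self.slots[slot] = (name, inum)

    def delete(self, name):
        slot = self._find(name)
        if slot is None:
            return False
        if slot + 1 in self.slots:
            self.slots[slot] = TOMBSTONE     # keep later colliding names reachable
        else:
            del self.slots[slot]
            prev = slot - 1
            while self.slots.get(prev) == TOMBSTONE:
                del self.slots[prev]         # trailing tombstones are no longer needed
                prev -= 1
        return True


# Force collisions with a constant hash to exercise the probing path.
d = NameHashDir(hash_fn=lambda name: 100)
d.insert("a", 1)
d.insert("b", 2)
d.insert("c", 3)
d.delete("a")  # slot 100 becomes a tombstone because slot 101 is occupied
```

After the deletion, `d.lookup("b")` still succeeds because the probe skips the tombstone instead of stopping at it.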
In some embodiments, three types of hash function can be used:
1) crc32c. It is a computationally inexpensive 32-bit hash. crc32c is neither strong nor collision resistant. The resulting keys have no apparent order, and an attacker can craft names that collide, which can result in a denial-of-service attack. crc32c is used for “xattrs” since “xattr” is not generally writable from untrusted sources, and the number of “xattr” is not large.
2) SipHash. It is an efficient 64-bit keyed hash. Each directory has its own randomly generated key which is not exposed outside of the filesystem. Any attacker would need to know the key to be able to cause collisions at a higher rate than a random chance.
3) SipHash-order3. It is a variant of SipHash that truncates the hash to 4 bytes and prepends 3 bytes of the name to the start of the hash. With a “normal” mix of names in a directory, this may result in a directory that is nearly lexically sorted, with entries having a common prefix being mixed randomly. This may help to accommodate applications which tend to operate on directories in an alphabetical order.
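The following Python sketch illustrates two of the ideas in the list above. The `crc32c` function is a straightforward bitwise implementation of the standard CRC-32C (Castagnoli) checksum. The `order3_key` function shows the “prefix plus truncated keyed hash” construction described for SipHash-order3, but it is an approximation: Python's standard library does not expose a keyed SipHash, so a keyed BLAKE2b digest is substituted here purely as a stand-in. The zero padding of names shorter than 3 bytes and the function names are assumptions as well.

```python
import hashlib


def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def keyed_hash64(name: bytes, key: bytes) -> bytes:
    """64-bit keyed hash. BLAKE2b is a stand-in for SipHash in this sketch."""
    return hashlib.blake2b(name, digest_size=8, key=key).digest()


def order3_key(name: bytes, key: bytes) -> int:
    """Prepend the first 3 bytes of the name to a hash truncated to 4 bytes."""
    prefix = name[:3].ljust(3, b"\x00")  # zero padding is an assumption
    return int.from_bytes(prefix + keyed_hash64(name, key)[:4], "big")


# A per-directory secret key; all zeros here only to keep the example short.
directory_key = bytes(16)
names = [b"cherry", b"apple", b"banana"]
nearly_sorted = sorted(names, key=lambda n: order3_key(n, directory_key))
```

Names with different 3-byte prefixes come out in lexical order of those prefixes, while names sharing a prefix are ordered by the truncated hash, which matches the “nearly lexically sorted” behavior described above.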
In some embodiments, a case-insensitive variant of a hash function can be used to support case-insensitive names of directories. For example, the hash function may be marked as “CICP” (that is, “case-insensitive, case-preserving”). Using a case-insensitive hash function allows names to be looked up without regard to the case of their characters while the original capitalization of the names is preserved. A case-insensitive hash function also allows filesystem operations on large directories to be performed efficiently. For example, without case-insensitive lookups, the Samba package needs to scan the whole directory to perform case-insensitive matching. Names of directories are required to be properly formed UTF-8.
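A minimal Python sketch of the case-insensitive, case-preserving idea follows. It assumes that case folding is done with `str.casefold`, that plain CRC-32 stands in for the directory hash, and that a dictionary of hash buckets stands in for the B-tree. The class name `CicpDirectory` is hypothetical. The point is only that the hash is computed over the folded name while the stored entry keeps the original spelling.

```python
import zlib


def cicp_hash(name: str) -> int:
    """Hash the case-folded name so that spellings differing only in case collide."""
    return zlib.crc32(name.casefold().encode("utf-8"))


class CicpDirectory:
    def __init__(self):
        self._buckets = {}  # hash -> list of names in their original case

    def add(self, raw_name: bytes) -> None:
        # Decoding raises UnicodeDecodeError for names that are not valid UTF-8.
        name = raw_name.decode("utf-8")
        self._buckets.setdefault(cicp_hash(name), []).append(name)

    def lookup(self, name: str):
        """Return the stored spelling matching `name` regardless of case, or None."""
        folded = name.casefold()
        for stored in self._buckets.get(cicp_hash(name), []):
            if stored.casefold() == folded:
                return stored
        return None


d = CicpDirectory()
d.add(b"ReadMe.TXT")
found = d.lookup("readme.txt")
```

The lookup touches only the bucket selected by the folded hash, so no scan of the whole directory is needed to match a name case-insensitively.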
In some embodiments, when an Inode loses its last name (that is, nlink=0 or the Inode is “deleted”), the Inode may still be in use. While the file is still open, it functions normally. This means that the bulk of the work related to deleting the file is deferred until the file is closed.
If a crash occurs before the close happens, the Inode could be left as stray garbage, that is, an Inode with no names. Such an Inode is effectively inaccessible. At the same time, the Inode can still be present within the B-tree, so garbage collection keeps the Inode data alive.
In some embodiments, to solve the issue of an Inode that has lost its last name, an “orphan directory” is provided. The “orphan directory” is a nameless directory with Inum=1. When a file is unlinked and the file's “nlink” count goes to zero, an entry is added to the orphan directory and the file's nlink remains 0. When the filesystem is mounted, the entries in the orphan directory are traversed and all associated data is released.
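The orphan directory mechanism can be modeled with a short Python sketch. Everything here is a simplification made for illustration: the class `TinyFs`, its in-memory dictionaries, and the `crash` method are hypothetical, and releasing an unopened file immediately on unlink is an assumption that the description does not spell out. The set `orphans` plays the role of the nameless directory with Inum=1.

```python
class TinyFs:
    def __init__(self):
        self.inodes = {}         # inum -> {"nlink": int, "data": bytes}; persistent
        self.orphans = set()     # entries of the orphan directory; persistent
        self.open_files = set()  # in-memory only, lost on a crash

    def create(self, inum, data):
        self.inodes[inum] = {"nlink": 1, "data": data}

    def open(self, inum):
        self.open_files.add(inum)

    def unlink(self, inum):
        inode = self.inodes[inum]
        inode["nlink"] -= 1
        if inode["nlink"] == 0:
            # Record the nameless Inode so that it can be found after a crash.
            self.orphans.add(inum)
            if inum not in self.open_files:
                self._release(inum)

    def close(self, inum):
        self.open_files.discard(inum)
        if inum in self.orphans:
            self._release(inum)  # the deferred deletion work happens here

    def _release(self, inum):
        del self.inodes[inum]
        self.orphans.discard(inum)

    def crash(self):
        """Simulate a crash: open-file state is lost, persistent state survives."""
        self.open_files.clear()

    def mount(self):
        """Traverse the orphan directory and release all associated data."""
        for inum in sorted(self.orphans):
            self._release(inum)


fs = TinyFs()
fs.create(5, b"payload")
fs.open(5)
fs.unlink(5)   # nlink is 0, but the file is still open
fs.crash()     # close() never runs
orphans_after_crash = set(fs.orphans)
fs.mount()     # cleanup on the next mount
```

Without the orphan entry, the Inode with inum 5 would survive the crash with no name pointing to it and no record that it should be released.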
FIG. 10 is a process flow diagram showing a method 1000 for organizing data, according to an example embodiment. The method 1000 can be implemented using a computer system. The example computer system is shown in FIG. 11.
The method 1000 may commence with providing an object store to store objects in block 1010. The objects represent fragments of the data and can be assigned addresses. In block 1020, method 1000 can proceed with associating a B-tree with the object store. The B-tree includes nodes. Each of the nodes includes keys. Each of the keys is associated with at least one object from the object store. In block 1030, method 1000 generates, based at least partially on objects from the object store, values for each of the keys. In block 1040, method 1000 allows determining that a size of an object from the object store is less than a pre-determined size. In block 1050, if the result of the determination is positive, a value of the object is stored in a particular node of the B-tree, wherein the particular node includes a particular key associated with the object. In block 1060, if the result of the determination is negative, the address of the object is stored in the particular node of the B-tree.
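The decision made in blocks 1040 through 1060 can be sketched in Python as follows. The sketch rests on several assumptions: the pre-determined size `INLINE_LIMIT` of 64 bytes is arbitrary, the object store is content-addressed with SHA-256 (one possible reading of an address based on content), and a plain dictionary stands in for a B-tree node. The names `ObjectStore`, `index_object`, and `load_value` are hypothetical.

```python
import hashlib

INLINE_LIMIT = 64  # hypothetical pre-determined size, in bytes


class ObjectStore:
    """Content-addressed store: the address is a digest of the object."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        address = hashlib.sha256(data).hexdigest()
        self._objects[address] = data
        return address

    def get(self, address: str) -> bytes:
        return self._objects[address]


def index_object(node: dict, key, address: str, store: ObjectStore) -> None:
    """Store either the object's value or its address under `key` in the node."""
    data = store.get(address)
    if len(data) < INLINE_LIMIT:
        node[key] = ("inline", data)       # block 1050: small object, keep the value
    else:
        node[key] = ("address", address)   # block 1060: large object, keep the address


def load_value(node: dict, key, store: ObjectStore) -> bytes:
    """Read a value back, following the address when the value is not inline."""
    kind, payload = node[key]
    return payload if kind == "inline" else store.get(payload)


store = ObjectStore()
node = {}
small, large = b"tiny", b"x" * 100
index_object(node, "k1", store.put(small), store)
index_object(node, "k2", store.put(large), store)
```

Reading `k1` needs no access to the object store, while reading `k2` costs one extra fetch by address. That trade-off is what the size threshold controls.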
FIG. 11 shows a diagrammatic representation of a computing device for a machine in the exemplary electronic form of a computer system 1100, within which a set of instructions for causing the machine to perform any one or more of the methodologies discussed herein can be executed. In various exemplary embodiments, the machine operates as a standalone device or can be connected (e.g., networked) to other machines. In a networked deployment, the machine can operate in the capacity of a server or a client machine in a server-client network environment, or as a peer machine in a peer-to-peer (or distributed) network environment. The machine can be a server, a personal computer (PC), a tablet PC, a set-top box (STB), a personal digital assistant (PDA), a cellular telephone, a digital camera, a portable music player (e.g., a portable hard drive audio device, such as a Moving Picture Experts Group Audio Layer 3 (MP3) player), a web appliance, a network router, a switch, a bridge, or any machine capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine. Further, while only a single machine is illustrated, the term “machine” shall also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein.
The example computer system 1100 includes a processor or multiple processors 1102, a hard disk drive 1104, a main memory 1106, and a static memory 1108, which communicate with each other via a bus 1110. The computer system 1100 may also include a network interface device 1112. The hard disk drive 1104 may include a computer-readable medium 1120, which stores one or more sets of instructions 1122 embodying or utilized by any one or more of the methodologies or functions described herein. The instructions 1122 can also reside, completely or at least partially, within the main memory 1106 and/or within the processors 1102 during execution thereof by the computer system 1100. The main memory 1106 and the processors 1102 also constitute machine-readable media.
While the computer-readable medium 1120 is shown in an exemplary embodiment to be a single medium, the term “computer-readable medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the one or more sets of instructions. The term “computer-readable medium” shall also be taken to include any medium that is capable of storing, encoding, or carrying a set of instructions for execution by the machine and that causes the machine to perform any one or more of the methodologies of the present application, or that is capable of storing, encoding, or carrying data structures utilized by or associated with such a set of instructions. The term “computer-readable medium” shall accordingly be taken to include, but not be limited to, solid-state memories, and optical and magnetic media. Such media can also include, without limitation, hard disks, floppy disks, NAND or NOR flash memory, digital video disks, RAM, ROM, and the like.
The exemplary embodiments described herein can be implemented in an operating environment comprising computer-executable instructions (e.g., software) installed on a computer, in hardware, or in a combination of software and hardware. The computer-executable instructions can be written in a computer programming language or can be embodied in firmware logic. If written in a programming language conforming to a recognized standard, such instructions can be executed on a variety of hardware platforms and for interfaces to a variety of operating systems. Although not limited thereto, computer software programs for implementing the present method can be written in any number of suitable programming languages such as, for example, C, Python, JavaScript, Go, or other compilers, assemblers, interpreters or other computer languages or platforms.
Thus, systems and methods for organizing data are disclosed. Although embodiments have been described with reference to specific example embodiments, it may be evident that various modifications and changes can be made to these example embodiments without departing from the broader spirit and scope of the present application. Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense.

Claims

What is claimed is:

1. A computer-implemented method for organizing data, the method comprising:

providing an object store to store objects, each of the objects representing a fragment of the data and being associated with an address;

associating a B-tree with the object store, the B-tree including nodes, wherein each of the nodes includes keys, wherein each of the keys is associated with at least one object from the object store; and

generating, based at least partially on objects from the object store, values for each of the keys.

2. The method of claim 1, wherein the address of an object in the object store is based on content of the fragment of the data.

3. The method of claim 1, further comprising:

determining that a size of an object from the object store is less than a pre-determined size;

if a result of the determination is positive, storing a value of the object in a particular node of the B-tree, the particular node including a particular key associated with the object; and

if the result of the determination is negative, storing the address of the object in the particular node of the B-tree.

4. The method of claim 1, wherein each of the objects includes at least one of the following:

a metadata object storing at least a number of references to further objects from the data objects and an identification number associated with a file or a directory; and

a data object representing at least one of the following: a continuous fragment of the file or a directory entry.

5. The method of claim 4, wherein the key associated with the metadata object includes at least a first field including an indication of the metadata object, a second field including the identification number, and a third field including a metadata index representing a distinct type of the metadata object.

6. The method of claim 5, wherein the type of the metadata object includes at least one of the following: attributes associated with the file or the directory, a symbolic link to the file or the directory, and an extended file attribute associated with the file or the directory.

7. The method of claim 4, wherein the key associated with the fragment of the file includes at least a first field including an indication of the data object, a second field including the identification number, and a third field including an offset of the continuous fragment of the file from a beginning of the file.

8. The method of claim 4, wherein the key associated with the directory entry includes at least a first field including an indication of the data object, a second field including the identification number, and a third field including a hash calculated based on a literal name of the directory entry.

9. The method of claim 8, wherein calculating the hash includes:

applying a hash function to the literal name to obtain a preliminary hash, the hash function including at least one of the following: crc32c, SipHash, and SipHash-order3; and

shifting the preliminary hash by a pre-determined base number to obtain the hash.

10. The method of claim 9, further comprising:

determining that the hash matches an existing hash; and

based on the determination, incrementing the hash by 1.

11. A system for organizing data, the system comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor, the memory storing instructions, which, when executed by the at least one processor, perform a method comprising:

providing an object store to store objects, each of the objects representing a fragment of the data and associated with an address;

generating, based at least partially on objects from the object store, values for each of the keys.

12. The system of claim 11, wherein the address of an object in the object store is based on content of the fragment of the data.

13. The system of claim 11, wherein the method further comprising:

14. The system of claim 11, wherein each of the objects includes at least one of the following:

15. The system of claim 14, wherein the key associated with the metadata object includes at least a first field including an indication of the metadata object, a second field including the identification number, and a third field including a metadata index representing a distinct type of the metadata object.

16. The system of claim 15, wherein the type of the metadata object includes at least one of the following: attributes associated with the file or the directory, a symbolic link to the file or the directory, and an extended file attribute associated with the file or the directory.

17. The system of claim 14, wherein the key associated with the fragment of the file includes at least a first field including an indication of the data object, a second field including the identification number, and a third field including an offset of the continuous fragment of the file from beginning of the file.

18. The system of claim 14, wherein the key associated with the directory entry includes at least a first field including an indication of the data object, a second field including the identification number, and a third field including a hash calculated based on a literal name of the directory entry.

19. The system of claim 18, wherein calculating the hash includes:

applying a hash function to the literal name to obtain a preliminary hash, the hash function including at least one of the following: crc32c, SipHash, and SipHash-order3;

shifting the preliminary hash by a pre-determined base number to obtain the hash;

determining that the hash matches an existing hash; and

based on the determination, incrementing the hash by 1.

20. A non-transitory computer-readable storage medium having embodied thereon instructions, which, when executed by one or more processors, perform a method for organizing data, the method comprising:

associating a B-tree with the object store, the B-tree including nodes, wherein each of the nodes includes keys, wherein each of the keys is associated with at least one object from the object store;

generating, based at least partially on objects from the object store, values for each of the keys;

if a result of the determination is negative, storing the address of the object in the particular node of the B-tree.