Decentralized storage with Camlistore

By Nathan Willis
April 23, 2014

Reducing reliance on proprietary web services has been a major target of free-software developers for years now. But it has taken on increased importance in the wake of Edward Snowden's disclosures about service providers cooperating with government mass-surveillance programs—not to mention the vulnerability that many providers have to surveillance techniques whether they cooperate or not. While some projects (such as Mailpile, ownCloud, or Diaspora) set out to create a full-blown service that users can be in complete control of, others, such as the Tahoe Least-Authority Filesystem, focus on more general functionality like decentralized data storage. Camlistore is a relative newcomer to the space; like Tahoe-LAFS it implements a storage system, but its creators are particularly interested in its use as a storage layer for blogs, content-management systems (CMSes), filesharing, and other web services.

Camlistore is a content-addressable storage (CAS) system with an emphasis on decentralized data storage. Specifically, the rationale for the project notes that it should be usable on a variety of storage back-ends, including Amazon's S3, local disk, Google Drive, or even mobile devices, with full replication of content between different locations.

Content addressability means that objects can be stored without assigning them explicit file names or placing them in a directory hierarchy. Instead, the "identity" of each object is a hash or digest calculated over the content of the object itself; subsequent references to the object are made by looking up the object's digest—where it is stored is irrelevant. As the rationale document notes, this property is a perfect fit for a good many objects used in web services today: photos, blog comments, bookmarks, "likes," and so on. These objects are increasing created in large numbers, and rarely does a file name or storage location come into play. Rather, they are accessed through a search interface or a visual browsing feature.

The Camlistore project produces both an implementation of such a decentralized storage system and a schema for representing various types of content. The schema would primarily be of interest to those wishing to use Camlistore as a storage layer for other applications.

The project's most recent release is version 0.7, from February 27. The storage server (with several available back-ends) is included in the release, as are a web-based interface, a Filesystem in Userspace (FUSE) module for accessing Camlistore as a filesystem, several tools for interoperating with existing web services, and mobile clients for Android and iOS.

The architecture of a Camlistore repository includes storage nodes (referred to by the charming name "blob servers") and indexing/search nodes, which index uploaded items by their digests and provide a basic search interface. The various front-end applications (including the mobile and web interfaces) handle both connecting to a blob server for object upload and retrieval and connecting to a search server for finding objects.

There can be several blob servers that fully synchronize with one another by automatically mirroring all data; the existing implementations can use hard disk storage or any of several online storage services. At the blob-server level, the only items that are tracked are blobs: immutable byte sequences that are uploaded to the service. Each blob is indexed by its digest (also called a blobref); Camlistore supports SHA1, MD5, and SHA256 as digest functions. Blobs themselves are encrypted (currently with AES-128, although other ciphers may be added in the future).

Semantically speaking, a blob does not contain any metadata—it is just a bunch of bytes. Metadata is attached to a blob by associating the blob with a data type from the schema, then cryptographically signing the result. Subsequently, an application can alter the attributes of a blob by creating a new signed schema blob (called a "claim"). For any blob, then, all of the claims on it are saved in the data store and can be replayed or backed up at will. That way, stored objects are mutable, but the changes to them are non-destructive. The current state of an object is the application of all of the claims associated with a blob, applied in the order of their timestamps.

This storage architecture allows for, potentially, a wide variety of front-end clients. Index servers already exist that use SQLite, LevelDB, MySQL, PostgreSQL, MongoDB, and Google App Engine's data store to manage the indexed blobs. Since an index server is logically separate from the blob servers that it indexes, it is possible to run an index on a portable device that sports little built-in storage, and still be able to transparently access all of the content maintained in the remote storage locations. In addition, Camlistore has the concept of a "graph sync," in which only a subset of the total blob storage is synchronized to a particular device. While full synchronization is useful to preserve the data in the event that a web service like Amazon S3 unexpectedly becomes unreachable, there are certainly many scenarios when it makes sense to keep only some of the data on hand.

As far as using the blob storage is concerned, at present Camlistore only implements two models: the basic storage/search/retrieval approach one would use to manage the entire collection, and directly sharing a particular item with another user. By default, each Camlistore server is private to a single user; users can share an object by generating a signed assertion that another user is permitted to access the object. This signed assertion is just one more type of claim for the underlying blob in the database. Several user-authentication options are supported, but for now the recipient of the share needs to have an account on the originating Camlistore system.

It may be a while before Camlistore is capable of serving as a storage layer for a blog, photo-hosting site, or other web service, but when it is ready, it will bring some interesting security properties with it. As mentioned, all claims on items in the database are signed—using GPG keys. That not only allows for verification of important operations (like altering the metadata of a blob), but it means it would be possible to perform identity checks for common operations like leaving comments. Camlistore does have some significant competition from other decentralized storage projects, Tahoe-LAFS included, but it will be an interesting project to watch.

Index entries for this article
Security	Networking/Filesystems