Content-addressable storage transformer (v3 protocol extension)

alimanfoo commented 4 years ago

This issue describes a concept for zarr v3 protocol extension which enables content-addressable storage to be layered on top of any underlying store. It is a thought experiment only, not a concrete proposal. Elaboration of suggestions at #76.

Goals:

Enable verification of store contents integrity
Enable reading from store state as at a specific point in time (time travel)

The protocol extension introduces a layer of indirection to the storage protocol. This can be thought of as a transformation layer which sits above the store and modifies the key/value store operations.

Screenshot from 2020-06-26 13-34-56

When attempting to store a given (key, value) pair, the storage transformer hashes the value, and the hash is used to obtain a content-addressable key.

E.g., if storing an encoded chunk value under key 'data/foo/bar/0.0', if the hash for the value is 'abcdefghijklmnopqrstuvwxyz', then the content-addressable key would be something like 'content/a/b/c/d/efghijklmnopqrstuvwxyz'. The way that the transformer generates content-addressable keys could be configured with depth, width and hash algorithm.

The transformer would then issue a request to the underlying store to set (content-addressable-key, value).

To keep track of the location of the content, a content metadata document would be created, which records the content-addressable key, together with any other metadata such as timestamp of creation. This could be a JSON document, and could record multiple versions of the content. E.g.:

[
    {
        "address": "content/a/b/c/d/efghijklmnopqrstuvwxyz",
        "timestamp": 1593171939
    },
    {
        "address": "content/z/yxwvutsrqponmlkjihgfedcba",
        "timestamp": 32503680000
    },
    ...
]

This JSON document could be stored under the original key, in this case 'data/foo/bar/0.0'.

In other words, when the transformer receives a request set(key, value), it does the following:

Hash the value and generate a content-addressable-key
Call set(content-addressable-key, value) on the underlying store
Call get(key, value) to retrieve the content metadata document (if it exists, otherwise create a new one)
Update the content metadata document to append a new entry with the address and timestamp.
Call set(key, content-metadata-document) on the underlying store

When the transformation layer receives a request to get(key, value), it does the following:

Call get(key) on the underlying store to retrieve the content metadata document
Parse the content metadata document and identify the most recent content addressable key for the value
Call get(content-addressable-key) on the underlying store to retrieve the value
Return the value

The transformation layer could also expose an API to set the time state for reading. E.g., if time state is set to time T, then when the transformation layer receives a request to get(key, value), it does the following:

Call get(key) on the underlying store to retrieve the content metadata document
Parse the content metadata document and identify the most recent content addressable key for the value with a timestamp not later than T
Call get(content-addressable-key) on the underlying store to retrieve the value
Return the value

In order to discover that content-addressable storage transformer extension is in use, this could be declared in the zarr entry point metadata document, e.g., zarr.json like:

{
    "zarr_format": "https://purl.org/zarr/spec/protocol/core/3.0",
    "metadata_encoding": "application/json",
    "extensions": [
        {
            "extension": "http://example.org/zarr/extension/content-addressable-storage-transformer",
            "must_understand": true,
            "configuration": {
                "algorithm": "sha256",
                "depth": 4,
                "width": 1
            }
        }
    ]
}

When a zarr v3 implementation opened a hierarchy using this extension, it could recognise that when parsing the entry point metadata document, and insert the appropriate store transformer if supported. I.e., a user opening such a hierarchy would not need to know that content-addressable storage was used, the implementation would discover that for itself.

There are several potential advantages of this scheme:

The group and array metadata keys would exist in the store as normal, and so any functionality inspecting the keys to infer which groups and arrays are present in the hierarchy would still work unchanged. I.e., the transformation layer could just pass through the list, list_pre and list_dir operations to the underlying store and everything should work as normal.
The metadata that tracks the content locations would be decentralised, allowing all of the same degrees of parallelism that the normal store provides. I.e., chunks could be written in parallel, and arrays could be created in parallel, without requiring any locking or synchronisation.
Because the extension is a transformation layer, any type of underlying store could be used. It would also be fine to migrate data between different types of storage, e.g., create data on a local filesystem, then copy up to object storage.
This extension does provide a slightly stronger guarantee against the underlying store getting corrupted by partially-successful writes, because the content metadata documents would only ever get updated after a successful content write operation.

Notes:

This extension does not provide any additional support for coping with situations where two writers may be in contention for the same chunk. If two writers attempt to write the same chunk in parallel, both chunk values will end up getting stored separately under different content addresses, but the content metadata document could get overwritten by one of the writers, i.e., one of the content versions could fail to be recorded in the content metadata.

alimanfoo commented 4 years ago

Hi @jakirkham, is this the kind of thing you were thinking of?

cc @carreau

Carreau commented 4 years ago

Why not create meta-keys that are /path-to-chunk-timestamp which content is chunk-hash, or do you want to avoid non-listable stores ? If the store is listable, you can list('path-to-chunk-*') and retrieve the most recent.

agstephens commented 3 years ago

@alimanfoo: your proposal for a "Content-addressable storage transformer" is really interesting.

In the global management of ESGF climate model data, we have the need to provide checksums for sets of netCDF files with version timestamp. Around the world different nodes may take a copy of a data set (set of netCDF files) and verify the content against a manifest of checksums.

An alternative world-view would be to put the data sets into Zarr, and to manage the versioning internally. Your approach looks like it would cope with that use case.

There are 3 potential advantages:

Efficiency: less duplication of storage (because you only have to change chunks that need to change); reduced bandwidth requirements to replicate the entire data set when changes occur.
In-place fixing: an xarray/zarr interface could allow updates to be applied; new versions could be encoded in some code required to update the Zarr.
Interoperability: content can be managed on disk or object store without modification.

Have you had any interest from others to take this forward?

Carreau commented 3 years ago

We are currently still focusing on spec v3, I don't believe doing a content addressable storage will be too hard to do on top of spec v3.

jakirkham commented 3 years ago

Thanks for writing this up @alimanfoo! 😄 Yeah this is the kind of thing I was thinking about.

Agree Matthias. This would be more intended as something on top of v3.

Also to your subpoint on listable stores, yeah the idea was to capture the full listing under one JSON object. So it tries to capture the use case that motivated things like consolidated metadata and avoid listing as a requirement. Though maybe there are some wrinkles we would need to work out here.

@agstephens, I think you have more-or-less captured the motivations behind such an extension. Though I proposed the idea, clearly have not had time to push it forward myself. If this is something you'd be interested in exploring, that would be very helpful. 😄

alimanfoo commented 3 years ago

Thanks @agstephens for your comment, it's great to know this is potentially useful. As John and Matthias have said this will be easier to develop as an extension to the v3 core protocol, so we should probably stay focused on the v3 spec and implementations for the time being so we have a solid foundation to build on. But please do let us know if this extension is something you'd be interested in helping with at any point, either writing a proper spec or doing a prototype implementation.

alimanfoo commented 3 years ago

Also to your subpoint on listable stores, yeah the idea was to capture the full listing under one JSON object. So it tries to capture the use case that motivated things like consolidated metadata and avoid listing as a requirement.

Hi @jakirkham, just to say that this proposal doesn't create a full listing of all metadata objects in a single JSON document, i.e., it doesn't replace consolidated metadata. The metadata would still be scattered into lots of separate objects, one for each node in the hierarchy. All it does is create a layer of indirection when writing or reading any object in the store, that allows you to read objects as at a given time point, and also verify objects haven't got corrupted. Hope that makes sense.

jakirkham commented 3 years ago

Also learned about zchunk recently, which may be relevant here.

shoyer commented 3 years ago

A content addressable storage layer like this was part of Mandoline, a precursor to Zarr that I used years ago: https://github.com/TheClimateCorporation/mandoline

Use cases like "time-travel" were indeed the main motivation for this feature.

So I think this could be useful, but it's worth keeping in mind that the size of this metadata can add up for arrays with lots of chunks. For such cases, a separate database layer for keeping track of metadata might make sense -- the metadata is too small to make sense to put in the object stores typically used for array chunks, which are focused on throughput rather than latency.

agstephens commented 3 years ago

@shoyer your suggestion about separating out a database layer from the objects is really interesting. I have heard others (@rabernat) make similar suggestions about some files staying on POSIX.

In an idealised future where everything lives in the cloud you would need to make sure the database is as accessible as the objects themselves. So you would need an API that queried the DB, and I suppose the content could be consolidated into a single "metadata" construct to avoid time wasted in excessive queries.

joshmoore commented 3 years ago

FWIW, I've started some investigating with http://datalad.org which is based on git-annex. From my limited experience, that means you can choose whether you publish the data and/or the history to each remote.

d70-t commented 3 years ago

I am playing around a bit with zarr on top of IPFS and have a use case similar to what @agstephens wrote: distributing datasets globally where only some nodes store some of the data. I just wanted to add some thoughts here. However, I don't know if it really fits here, as it's a slightly different approach: the content addressable part is part of the filesystem and not a functionality within zarr.

IPFS basically implements a global content addressable file system. Blocks are hashed, then compiled to files (which are a list of blocks). The files are hashed again and are compiled to directories, thus forming a Merkle tree. This structure maps very well to the tree structure of a zarr dataset. In the end, the whole dataset is identified by its hash and if anything is changed, the hash of the whole thing changes. If a dataset (or a variable) would be updated, it would be possible to record a reference back to the old version within the datasets (or variables) metadata.

So by adding a content addressable layer below zarr in stead of in the middle, I think the implementation could be a bit simpler. In fact, I think it already works quite ok with the current, unmodified version of zarr. The downside of using a content addressable layer below probably is that we won't benefit from all the different store implementations which are already there. We'd essentially have to use what IPFS (or whatever content addressable filesystem) can run on. So for now, I don't see which variant would turn out to be the better one in the end.

There's however one thing which I think is missing and which would be beneficial for both variants (storage transformer and zarr on content addressable storage): it might be very useful to find a way to expose a globally unique content-id (for datasets, variables and chunks) to the user API if such a thing is present in the underlying store. This would create some possibilities for optimization:

If I have two datasets defined on some coordinate grids and want to compute e.g. the difference of some data variables, I want to check if those variables are defined on the same grid, but I don't need the actual coordinate values to compute the difference. For the end result, I'd want to add the coordinate values to the dataset again. All these operations could be done without even downloading the coordinate values from a remote server, if I know the content-id.
If I want to enhance a dataset by adding a more variables and then write a new dataset, I don't want to load and store the other, already existing variables.
If I want to apply fixes to the metadata, I still want to create a new dataset, but I don't want to copy all the data.
(this probably requires more flexible chunking) If I want to subset a dataset or concatenate multiple datasets into a larger one, I shouldn't need to read an write all the data, many operations should be possible using only content-ids of the data chunks.
(probably not directly related) Theoretically, if two datasets share some variables, it should be possible to only load them once into memory (like a read only memory map of the same file into multiple processes would work).

The end result of all of those operations shouldn't depend on the content-id being visible to higher level APIs or the user due to automatic deduplication. But in the cases listed above, a large amount of re-hashing and data transfers could be avoided by providing this information.

rabernat commented 3 years ago

We have just discovered IPFS and Filecoin and are now very interested in this conversation. Tagging @jbusecke, @cisaacstern. Perhaps some folks from Protocol Labs might be able to weight in on this and help think about the Zarr / IPFS implementation.

Stefaan-V commented 3 years ago

Hey team, the Filecoin/IPFS community is more than happy to help here. @d70-t can you send me an email at collab@protocol.ai please and we can discuss over the phone. We can report back our findings to the team here.

martindurant commented 2 years ago

Sorry to be very late to this discussion. I wonder if ReferenceFileSystem provides the amount of indirection being talked about here. It is an fsspec implementation, so already works with zarr (v2!), and each file is either a short binary embedded in the reference structure (for metadata) or a link to some bytes range of some URL. The list of keys is, at its simplest, just a dictionary.

Similarly, name hashing and write mode could be implemented as an fsspec backend without need for explicit code in zarr or even an extension. Of course, you may argue that codifying this process as an extension is exactly the point of this thread.

d70-t commented 2 years ago

I just had another thought about the idea of making a content id available to higher level APIs and tracking it through computations: If there would be a mechanism for say xarray to track that some array (or some part of an array) was moved unchanged from open_zarr through an arbitrary computation until to_zarr but potentially became part of another dataset / group or obtained a different set of metadata. Then to_zarr could potentially create copy calls in stead of write calls to the underlying filesystem. Those copy calls could then become either more efficient local copies (i.e. within a datacenter) or could become e.g. reflink (Copy on Write) copies on btrfs, XFS, Lustre, APFS etc... or could be content-links on content addressable storage systems.

So this wouldn't go as far as making the CID available to users (which would still be very good), but would already provide some of the benefits and that even for a larger set of underlying filesystems.

yarikoptic commented 2 years ago

@martindurant :

Sorry to be very late to this discussion. I wonder if ReferenceFileSystem provides the amount of indirection ...

Sounds interesting and viable -- I wonder if you or someone else tried to come up with some prototypical implementation following that idea?

martindurant commented 2 years ago

We certainly are using ReferenceFileSystem and zarr to create virtual datasets consisting of binary blobs inside other files - and it works great! That's a mapping of path->(other path, offset, size), so similar but different to the discussion here - but adapting or making a new driver for content addressing inspired by that implementation should not be too hard.

zarr-developers / zarr-specs

Content-addressable storage transformer (v3 protocol extension) #82