alimanfoo opened this issue 8 years ago
Hi @shoyer, can you explain this one a little more? Not fully getting it.
Let me clarify the use-case first.
In weather and climate science, it's common to have large array datasets that are continuously updated. For example, every hour we might combine data from weather stations, satellites and physical models to estimate air temperature at every location on a grid. This quickly adds up to a large amount of data, but each incremental update is small. Imagine a movie being written frame-by-frame.
If it were only ever a matter of writing new chunks, we could simply write them to new files and then overwrite the array metadata. But it's common to want to overwrite existing chunks, too, either because of another incremental update (e.g., a separate quality control check) or because we need to use another chunking scheme for efficient access (e.g., to get a weather history for one location, we don't want to use one chunk per hour). Hence, indirection is useful, to enable atomic writes and/or "time travel" (like git) to view the state of the database at some prior point in time.
Thanks @shoyer. So, trying to make this concrete, correct me if any of the following is wrong. You have an array of air temperature estimates e. The shape of this array is something like (gridx, gridy, time), where gridx and gridy are the number of divisions in some 2D spatial grid, and time is in hours. The size of the first two dimensions is fixed, but the size of the third (time) dimension grows as more data are added. Every hour you append a new array of shape (gridx, gridy) to the third (time) dimension.
When designing the chunk shape for this array, you want to be able to view the state of the whole grid at some specific time, but you also want to be able to view a time series for a specific grid square or some region of the spatial grid. To make this possible, you would naturally chunk all three dimensions, e.g., chunks might be (gridx//100, gridy//100, 1) - each chunk covers 1/10000 of the spatial grid and a single time point.
If this were all there was to it, then querying the state of the grid at some point in time could be done by indexing the third (time) dimension, e.g., e[:, :, t], and querying the history of some region could be done by indexing the first and second dimensions, e.g., e[x1:x2, y1:y2, :]. However (and this is where I'm a bit uncertain), querying the state of the grid at some point in time is not as simple as indexing the third (time) dimension, because data may get modified after it is initially appended, e.g., to correct an error or improve an estimate. In this case, as well as indexing the array on one or more dimensions, you also want to ask what the state of the array was at some previous point in time.
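To make that concrete, here's a minimal sketch (toy grid sizes, assuming the zarr-python API for open/append/indexing) of creating such an array, appending hourly frames, and running both kinds of query:

```python
import numpy as np
import zarr

# Toy sizes for illustration: a 100 x 100 spatial grid, hourly frames.
gridx, gridy = 100, 100
z = zarr.open('air_temp.zarr', mode='w',
              shape=(gridx, gridy, 0),
              chunks=(gridx // 10, gridy // 10, 1),
              dtype='f4')

# Hourly incremental update: append one (gridx, gridy) frame along time.
frame = np.random.rand(gridx, gridy).astype('f4')
z.append(frame[:, :, np.newaxis], axis=2)

# State of the whole grid at hour t, and a time series for a spatial region.
t = 0
snapshot = z[:, :, t]
series = z[10:20, 10:20, :]
```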
Have I got this right? cc @benjeffery
In this case, as well as indexing the array on one or more dimensions, you also want to ask what the state of the array was at some previous point in time.
Yes, I think you have this right.
To be clear, while the ability to view previous versions of an array (like git) is important for some use cases (like reproducibility), it's also essential for atomic updates.
In practice, you do not want to use chunking like (gridx//100, gridy//100, 1), because that makes it very expensive to do queries at one location across all times (which is a major use case in climate/weather science). Instead, you want to use a larger chunk size along time, e.g., (gridx//1000, gridy//1000, 100), which means incremental updates involve overwriting chunks. It's nice to be able to swap in a new version of an array all at once, instead of piecemeal.
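For example (toy sizes again, assuming zarr-python), with a chunk length of 100 along time every hourly append lands in a chunk that already exists, so that chunk is rewritten in place rather than new files simply being added:

```python
import numpy as np
import zarr

z = zarr.open('air_temp_coarse.zarr', mode='w',
              shape=(100, 100, 0), chunks=(10, 10, 100), dtype='f4')

for hour in range(3):
    frame = np.random.rand(100, 100).astype('f4')
    # Hours 0-99 all fall in the same chunk along time, so each append
    # rewrites existing chunk files instead of only creating new ones.
    z.append(frame[:, :, np.newaxis], axis=2)
```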
We made use of both these features in Mandoline (https://github.com/TheClimateCorporation/mandoline), an array database written at my former employer.
Ok, thanks. So do you think the requirement for versioning lives at the storage layer? I.e., does it make sense to think of this in terms of store classes that implement the basic MutableMapping interface for reading and writing chunks, but also provide some further methods related to versioning, e.g., committing versions, checking out versions, etc.? E.g., you do some writes on an array, then call store.commit(), do some more writes, call store.commit() again, and then if you want to view a previous state, call store.checkout(version), etc.?
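Roughly something like the following sketch of that store-level interface; the class and method names are hypothetical (not an existing zarr API), and snapshots are held in memory purely to illustrate the shape of the API:

```python
from collections.abc import MutableMapping

class VersionedStore(MutableMapping):
    """Hypothetical versioned store: the plain MutableMapping interface zarr
    uses for chunk/metadata I/O, plus commit/checkout for snapshots."""

    def __init__(self):
        self._live = {}      # current working state
        self._versions = []  # committed snapshots

    def commit(self):
        # Snapshot the current state; return a version identifier.
        self._versions.append(dict(self._live))
        return len(self._versions) - 1

    def checkout(self, version):
        # Replace the working state with a previously committed snapshot.
        self._live = dict(self._versions[version])

    # MutableMapping interface used by zarr when reading/writing keys.
    def __getitem__(self, key):
        return self._live[key]

    def __setitem__(self, key, value):
        self._live[key] = value

    def __delitem__(self, key):
        del self._live[key]

    def __iter__(self):
        return iter(self._live)

    def __len__(self):
        return len(self._live)
```

A real implementation would of course persist snapshots rather than hold them in memory, and would need to decide how concurrent writers interact with commits.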
@alimanfoo Yes, I think it would make sense to implement this at the storage level.
One simple way to implement this that immediately comes to mind would be to extend the DirectoryStore class and use git to provide versioning. Could even use GitPython to provide some convenience, like initialising the git repo when the store is first created, and providing a Python API to make commits, check out old versions, etc.
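A very rough sketch of that idea, assuming GitPython is installed; GitDirectoryStore and its commit/checkout methods are invented here for illustration, not part of zarr:

```python
import os
import git    # GitPython
import zarr

class GitDirectoryStore(zarr.DirectoryStore):
    """Illustrative only: a DirectoryStore whose root directory is a git
    repository, so committed versions of the array are git commits."""

    def __init__(self, path):
        super().__init__(path)
        os.makedirs(path, exist_ok=True)
        try:
            self.repo = git.Repo(path)                 # reuse an existing repo
        except git.InvalidGitRepositoryError:
            self.repo = git.Repo.init(path)            # initialise on first use

    def commit(self, message='zarr update'):
        self.repo.git.add(A=True)                      # stage all chunk/metadata files
        return self.repo.index.commit(message).hexsha  # version id = commit sha

    def checkout(self, rev):
        self.repo.git.checkout(rev)                    # view a previous state
```

One caveat is that git keeps whole blobs per version, so large binary chunks would grow the repository quickly unless paired with something like git-annex or LFS.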
Hi, my organisation works with climate simulations and they get very big (i.e. tera- to peta-scale). The data is all stored in netCDF format at present, and a single dataset is typically a set of netCDF files within a directory with a version label, such as v20190101.
A major issue for our data centres is the re-publication of data that has changed. However, some of the changes are very minor. They could be, for example, a change to:
In any of the above cases, it is very costly for our community because:
@alimanfoo and @shoyer: I am interested in the discussion you have had because I could see Zarr being an excellent solution for resolving this issue. It could enable:
Do you know if anyone else is doing work looking at this issue with Zarr? Thanks
@agstephens I think that would likely be possible using zarr; it also depends on whether you want zarr itself to be aware of versioned data or not. Even in its current state, only changed files/keys will be synchronized by smart sync tools, but you do need to scan the whole archive.
I can imagine a store keeping a transaction log of all writes, allowing efficient queries of what has changed, so that only those keys need to be synced. Depending on how users access the data, it might be more or less difficult to implement this in a way that keeps the log consistent with the modifications.
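As a sketch of what such a transaction log might look like (purely illustrative names, not an existing zarr class), a wrapper around any MutableMapping store could record writes so that a sync tool can ask for the keys changed since a given time:

```python
import time
from collections.abc import MutableMapping

class WriteLogStore(MutableMapping):
    """Illustrative wrapper (not an existing zarr class) that records every
    write to an underlying store, so only changed keys need to be synced."""

    def __init__(self, inner):
        self.inner = inner
        self.log = []  # list of (timestamp, operation, key)

    def __setitem__(self, key, value):
        self.inner[key] = value
        self.log.append((time.time(), 'set', key))

    def __delitem__(self, key):
        del self.inner[key]
        self.log.append((time.time(), 'del', key))

    def __getitem__(self, key):
        return self.inner[key]

    def __iter__(self):
        return iter(self.inner)

    def __len__(self):
        return len(self.inner)

    def changed_since(self, t):
        # Keys touched after time t, without scanning the whole archive.
        return {key for ts, op, key in self.log if ts >= t}
```

The hard part, as noted above, is keeping the log consistent if writers bypass the wrapper or fail partway through an update.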
I will also point out https://github.com/Quansight/versioned-hdf5
In any of the above cases, it is very costly for our community because: [...]
How do you store the dataset(s): on a filesystem or an object store? I thought filesystems like ZFS had block deduplication, and that snapshot transfer was smart enough to limit itself to changed blocks?
I also seem to remember that https://docs.datproject.org/ was trying to solve some of those issues.
My 2 cents: the work on temporal databases may be relevant here. I am sure there are any number of papers in ACM Transactions on Database Systems. In terms of atomic updates (probably not possible for S3), file system journaling may also be relevant.
I suppose one way to address this would be to design a store to include timestamps as part of paths (perhaps underneath chunks and .z* files). This would allow the changed parts to be written under a new timestamp without altering previous ones. When reading, we can just read the latest timestamp and ignore older ones.
Though there should be the possibility of exploring revision history. This could be done by filtering timestamps based on the revision someone wants to revisit and then selecting the latest amongst those (these steps can be fused for simplicity and performance).
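As a rough illustration of that layout (hypothetical helpers, not a zarr convention), keys could carry a trailing timestamp component, with reads picking the newest entry at or before the revision of interest:

```python
import time

# Hypothetical key layout: '<original key>/<timestamp>', e.g. 'temp/0.0.0/1700000000'.
def versioned_key(key, ts=None):
    ts = time.time() if ts is None else ts
    return f"{key}/{ts:.0f}"

def latest(store, key, at=None):
    """Read the newest timestamped entry for `key`, optionally as of time `at`
    so that older revisions can be revisited (filter, then take the max)."""
    prefix = key + '/'
    stamps = [float(k[len(prefix):]) for k in store if k.startswith(prefix)]
    if at is not None:
        stamps = [s for s in stamps if s <= at]
    return store[prefix + f"{max(stamps):.0f}"]

# Example with a plain dict standing in for a store:
store = {}
store[versioned_key('temp/0.0.0', ts=100)] = b'old'
store[versioned_key('temp/0.0.0', ts=200)] = b'new'
assert latest(store, 'temp/0.0.0') == b'new'
assert latest(store, 'temp/0.0.0', at=150) == b'old'
```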
@Carreau , @DennisHeimbigner , @jakirkham: thanks for your responses regarding the versioning discussion.
In response to @Carreau: the data is currently all stored in NetCDF4 on a POSIX file system (although we are experimenting with an object store).
I'll do some more reading this week following the leads that you have sent. My instinct is that the approach suggested by @jakirkham is worth exploring, i.e. that the DirectoryStore class manages an extra layer that includes version directories.
@agstephens, if you do explore this versioning idea with, say, a subclass of DirectoryStore, it would be interesting to hear what you discover. Is that a viable path? What issues do you encounter in implementation? Are there any consequences from that approach we should be aware of or reflect on?
Oddly enough, our team started discussing the same requirement Friday, too. The initial use case would be for metadata, but I can certainly also see the benefit of the functionality originally described by @shoyer. In our case, one goal would be to have global public URLs for multiple versions of the same dataset available without needing another tool to ensure API stability of existing URLs. (An example of another tool would be git cloning and then changing the branch.)
On a filesystem this could be done fairly trivially, without extra storage, using symlinks; there's no general solution I know of for cloud storage, though. Following the above discussion, I too could see a way forward via Stores, but I'd like to raise a concern that this feels quite similar to the Nested and Consolidated stores, and that the issues in those two cases both led us to include those discriminators in the v3 spec.
Perhaps it's natural to start with an initial implementation and then choose to include that in the spec later, but I expect composition of the stores will be an almost immediate issue. (cf. https://github.com/zarr-developers/zarr-python/issues/540)
From @shoyer: What about indirection for chunks? The use case here is incrementally updated and/or versioned arrays (e.g., each day you update the last chunk with the latest data, without altering the rest of the data). In the typical way this is done, you hash each chunk and use hashes as keys in another store. The map from chunk indices to hashes is also something that you might want to store differently (e.g., in memory). I guess this would also be pretty easy to implement on top of the Storage abstraction.
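A minimal sketch of that indirection, with invented helper names and any MutableMapping standing in for the backing stores:

```python
import hashlib

def write_chunk(chunk_store, manifest, chunk_index, data):
    """Store chunk bytes under their content hash and record the mapping from
    chunk index to hash in a small, separately stored manifest."""
    digest = hashlib.sha256(data).hexdigest()
    chunk_store[digest] = data          # immutable, deduplicated chunk data
    manifest[chunk_index] = digest      # tiny mutable mapping, cheap to version

def read_chunk(chunk_store, manifest, chunk_index):
    return chunk_store[manifest[chunk_index]]

# Example with plain dicts standing in for the two stores:
chunks, manifest = {}, {}
write_chunk(chunks, manifest, (0, 0, 5), b'latest hour of data')
assert read_chunk(chunks, manifest, (0, 0, 5)) == b'latest hour of data'
# Updating the array then means writing new chunks and swapping in a new
# manifest; keeping old manifests gives git-like "time travel".
```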