Closed: mrocklin closed this issue 8 years ago
In principle it sounds great.
For the non-synchronized array/chunk implementations this sounds relatively straightforward, i.e., zarr really does just need a MutableMapping where it can store and retrieve bytes. For the synchronized implementations I'm not sure how to handle the locking. I guess the locking is at a level above the MutableMapping interface, because executing __setitem__() on a chunk includes both a retrieval operation (get existing bytes) and a store operation (put modified bytes). Currently for the persistent synchronized chunks zarr uses a fasteners inter-process lock, which depends on access to a file system.
Very happy to discuss.
Perhaps we supply both a MutableMapping and a Lock-like object? Current pairs might be (dict, threading.Lock) or (zict.File, fasteners.InterProcessLock)?
When dealing with distributed storage/computation we rarely care about storing in place. It's far more common to make a completely new dataset as output.
Right, but when you are storing the output, what happens depends on how your storage operations are aligned with chunk boundaries. If they are not exactly aligned then two storage operations may need to store data within the same chunk. So you need to sync access to the chunk, even when writing new outputs.
Yes, maybe a MutableMapping plus some other interface that allows acquiring and releasing a lock for a given key?
Alistair Miles Head of Epidemiological Informatics Centre for Genomics and Global Health http://cggh.org The Wellcome Trust Centre for Human Genetics Roosevelt Drive Oxford OX3 7BN United Kingdom Email: alimanfoo@googlemail.com alimanfoo@gmail.com Web: http://purl.org/net/aliman Twitter: https://twitter.com/alimanfoo Tel: +44 (0)1865 287721
Hrm, yes I see. For distributed computing locks are hard. Short term I see two cheap options:

- Only store along chunk boundaries. In the special case of dask.array I'm completely comfortable guaranteeing this.
- Only ensure safe writing in the single machine case where our processes have access to a shared file system. Accessing S3, Zip files, or other storage solutions still has value outside of distributed computing.

I guess if the storage (MutableMapping) and locking interfaces were decoupled then there would be flexibility to accommodate these options. I.e., a user could decide to use S3 for storage with no locking, because they know writes will always be aligned with chunk boundaries. Or a user could choose S3 for storage plus thread-based locking, because they know they will be executing within a multi-threaded context. Or a user could choose S3 for storage and locking based on access to a shared file system. Any combination of storage and locking would be possible, although some would not be advisable depending on what the user wants to do and how they want to run it.

There would be plenty of scope for users to blow their own feet off here, so maybe a lower-level API provides the full flexibility, and a higher-level API provides convenience functions based on sensible combinations of storage and locking for some common usage patterns?
I'm not particularly picky about the API here, I'm quite willing to jump through a couple of hoops. However I'm also not too worried about people shooting off their own feet. Specifying MutableMappings and Locks is probably a bit of a hurdle and unlikely to be attempted by the casual user.
I think this could be very elegant. It would require some fairly substantial restructuring of the existing code, but I like the decoupling of concerns and think it would be worth it. I'm sorry I won't have any time to work on this myself until after the summer at least, but I'd be more than happy to discuss ideas and review code.
Do we know of any students or other folks looking for a project? One of my goals for using zarr for this is that I wouldn't have to load the array handling bits into my head, but could restrict myself to creating MutableMappings.
It would take some work to get to that point, I'm afraid. Many aspects of the array-handling logic could be left untouched, though; it is more a matter of refactoring.
:+1: I came here to suggest something exactly like this. I was going to suggest some sort of generic filesystem API, but MutableMappings sounds cleaner.
Concurrent attempts to write to the same chunk do seem hard to guard against for arbitrary storage systems, but I think a little bit less safety is probably OK.
@shoyer is there anyone in the xarray community who might be willing to take this on?
Also, what are your thoughts regarding the NetCDF data model and zarr? How do we manage related variables and coordinates?
@alimanfoo can you provide a more detailed list of steps that would be necessary to accomplish this? How would you go about implementing this?
> @shoyer is there anyone in the xarray community who might be willing to take this on?
Well, you can always try @pwolfram :).
> Also, what are your thoughts regarding the NetCDF data model and zarr? How do we manage related variables and coordinates?
We basically need two things to make this happen, both very straightforward:

1. zhdf: a hierarchical storage system for zarrays that supports metadata. The obvious approach is to use the filesystem, with __zattr__ JSON files to hold metadata.
2. znetcdf: a mapping from the netCDF data model to zhdf. h5netcdf is a good example of what this entails. This should be even easier, since we don't have HDF5 dimension scales to worry about. We just need to settle on a convention for how to name dimensions in __zattr__ files.

It would also be interesting to have a tar file backend. Having things in a single file can be convenient for moving data around.
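As a sketch of what this convention could look like: the __zattr__ filename comes from the comment above, while the "_dimensions" key below is purely a hypothetical illustration of how dimension names might be recorded.

```python
import json
import os
import tempfile

# Hypothetical layout: each array's directory carries a __zattr__ JSON file.
# "_dimensions" is an invented placeholder for whatever dimension-naming
# convention is eventually agreed; nothing here is a settled format.
array_dir = tempfile.mkdtemp()
attrs = {"_dimensions": ["time", "lat", "lon"], "units": "kelvin"}

with open(os.path.join(array_dir, "__zattr__"), "w") as f:
    json.dump(attrs, f)

# Reading the metadata back is a plain JSON load
with open(os.path.join(array_dir, "__zattr__")) as f:
    loaded = json.load(f)
```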
I agree that single file storage is convenient, especially for sharing datasets. I would recommend Zip over Tar though. Tar doesn't support random access while Zip does. There is, fortunately, already a Zip(MutableMapping) within the zict library. It was actually the first implementation in there and the reason for the name zict.
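As a rough illustration of the idea (a sketch, not zict's actual Zip implementation), a zip-backed MutableMapping might look like this. Note that zip archives are append-only, so rewriting a key leaves a stale duplicate entry behind:

```python
import zipfile
from collections.abc import MutableMapping


class ZipStore(MutableMapping):
    """Sketch of a zip-file-backed key -> bytes mapping.

    Illustrative only. Zip archives are append-only: writing the same
    key twice leaves a stale duplicate entry (the last one wins on read).
    """

    def __init__(self, path):
        self.path = path
        # mode 'a' creates the archive if it doesn't exist yet
        with zipfile.ZipFile(self.path, mode='a'):
            pass

    def __getitem__(self, key):
        # ZipFile.read raises KeyError for a missing name
        with zipfile.ZipFile(self.path, mode='r') as zf:
            return zf.read(key)

    def __setitem__(self, key, value):
        with zipfile.ZipFile(self.path, mode='a') as zf:
            zf.writestr(key, value)

    def __delitem__(self, key):
        # zip archives don't support in-place deletion
        raise NotImplementedError("zip entries cannot be deleted in place")

    def __iter__(self):
        with zipfile.ZipFile(self.path, mode='r') as zf:
            seen = set()
            for name in zf.namelist():
                if name not in seen:   # skip stale duplicate entries
                    seen.add(name)
                    yield name

    def __len__(self):
        return sum(1 for _ in self)
```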
I am interested in this although time is very limited right now and I have to make sure the scope is in "needed to get science done". However, it would be great to have access to a clean bundling capability via tar/zip file and this may be useful/easy within the context of a distributed/dask/xarray integration. See also https://github.com/pydata/xarray/issues/798
I'll give this some thought over the next few days and try to write down some ideas on the API and how to do the refactoring.
I've given this some thought over the weekend and have an initial sketch for how to design the API and refactor the existing code. The good news is I think this will solve several problems in one go, and make the internal architecture much simpler with clear separation of concerns. So I'm convinced it's well worth the effort. That said, I do think the code and tests need to be completely restructured, so it's not a trivial piece of work. I wish I'd had the foresight to do it this way first time, but hey, that's why open source is good.
I've pushed my initial API sketch up to a new "refactor" branch. I will give some notes and discussion below. Please note that this is just an initial sketch and I'm sure will need to be modified/refined.
The first step is to separate out the storage layer. I propose the zarr.store.base.ArrayStore abstract class, defining the interface to storage for a single array. This class has a meta property, holding a MutableMapping for storing essential configuration metadata for the array such as shape, chunk shape, dtype, compression options, etc. The data property is a MutableMapping that maps chunk indices to bytes objects holding compressed chunk data; so, e.g., data[0, 0] returns compressed bytes for the chunk at index (0, 0). The attrs property is a MutableMapping storing user-defined attributes. The cbytes property returns the total size of compressed data held for the array. It may also be useful to implement an initialized property returning the number of chunks that have been initialised with some data.
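As a sketch only (the actual interface on the refactor branch may differ in names and signatures), the proposed abstract class might look like:

```python
from abc import ABC, abstractmethod
from collections.abc import MutableMapping


class ArrayStore(ABC):
    """Sketch of the proposed storage interface for a single array."""

    @property
    @abstractmethod
    def meta(self) -> MutableMapping:
        """Essential configuration: shape, chunk shape, dtype, compression..."""

    @property
    @abstractmethod
    def data(self) -> MutableMapping:
        """Maps chunk indices (tuples) to compressed bytes."""

    @property
    @abstractmethod
    def attrs(self) -> MutableMapping:
        """User-defined attributes."""

    @property
    def cbytes(self) -> int:
        """Total size of compressed data held for the array."""
        return sum(len(v) for v in self.data.values())

    @property
    def initialized(self) -> int:
        """Number of chunks that have been initialised with some data."""
        return len(self.data)
```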
Existing code for storing arrays in memory and on disk would be refactored into the zarr.store.memory.MemoryStore and zarr.store.directory.DirectoryStore classes respectively. New implementations of storage layers, e.g., using a ZipFile or S3, would live alongside these as separate sub-modules. Implementation of the MemoryStore class could be very simple, with the meta, data and attrs properties each being a dict.
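A MemoryStore along those lines could be as simple as the following sketch (in practice it would implement the proposed ArrayStore interface; plain attributes stand in for the properties here):

```python
class MemoryStore:
    """Sketch of an in-memory array store: all three mappings are dicts."""

    def __init__(self):
        self.meta = {}   # shape, chunk shape, dtype, compression options...
        self.data = {}   # chunk index tuple -> compressed bytes
        self.attrs = {}  # user-defined attributes

    @property
    def cbytes(self):
        # total size of compressed data held for the array
        return sum(len(v) for v in self.data.values())

    @property
    def initialized(self):
        # number of chunks that have been initialised with data
        return len(self.data)
```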
The second step is to separate out the synchronization (i.e., locking) functionality. I propose the zarr.sync.ArraySynchronizer abstract class, defining the interface to synchronization operations for a single array. The most important method is lock_chunk, which implements a lock on a single chunk.

Existing code for doing thread-based locking and inter-process locking would be refactored into the classes zarr.sync.ThreadSynchronizer and zarr.sync.ProcessSynchronizer respectively.
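A thread-based synchronizer exposing a lock_chunk method could be sketched as follows (a minimal illustration, not the code on the refactor branch):

```python
import threading
from contextlib import contextmanager


class ThreadSynchronizer:
    """Sketch: one threading.Lock per chunk index, created lazily."""

    def __init__(self):
        self._mutex = threading.Lock()   # guards the lock registry
        self._locks = {}

    @contextmanager
    def lock_chunk(self, cidx):
        # look up (or create) the lock for this chunk index
        with self._mutex:
            lock = self._locks.setdefault(cidx, threading.Lock())
        with lock:
            yield
```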
Once APIs for storage and synchronization are defined, we could implement two classes, Array and SynchronizedArray. These two classes would replace all of the existing array classes. When instantiating an Array, an ArrayStore is provided as the only constructor argument. When instantiating a SynchronizedArray, both an ArrayStore and an ArraySynchronizer are provided.
The last issue is how to deal with the operations to get data from a chunk and set or modify data in a chunk, and the lowest-level operations to compress and decompress data using blosc. Previously I had implemented a set of Chunk classes to encapsulate all of this, but in doing this refactoring I realised that these classes are unnecessary, i.e., all the chunk classes can be deleted. This not only simplifies the code, it also removes a source of overhead, because no state needs to be maintained for any chunk, other than holding the compressed data for each chunk in a store.
To work this last part of the API through, I've implemented the __getitem__ and __setitem__ methods on the Array class. All blosc-related operations could now be encapsulated within a very simple cython extension, which I've put at blosc.pyx. Hopefully this provides enough information to figure out how to begin reorganising the existing code. Details of the blosc interface might need to be modified when this actually hits the ground.
Given this API, all of the existing tests would also need to be refactored. Again the good news is that the cleaner separation of concerns should also simplify the internal architecture of the tests, although this too is a fairly substantial piece of work.
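To make the chunk read/modify/write cycle concrete, here is a deliberately simplified, one-dimensional sketch of an Array over a MutableMapping store, with zlib standing in for blosc (the real design is N-dimensional and uses compiled compression; everything below is illustrative only):

```python
import zlib

import numpy as np


class Array:
    """Minimal 1-D sketch of a chunked array over a MutableMapping.

    Uses zlib instead of blosc to stay dependency-free. Whole-chunk
    read/modify/write only, and no synchronization.
    """

    def __init__(self, store, shape, chunks, dtype='f8', fill_value=0):
        self.store = store          # MutableMapping: chunk index -> bytes
        self.shape = shape          # total length (1-D)
        self.chunks = chunks        # chunk length
        self.dtype = np.dtype(dtype)
        self.fill_value = fill_value

    def _get_chunk(self, i):
        # retrieval operation: get existing bytes (or a fresh fill chunk)
        try:
            compressed = self.store[i]
        except KeyError:
            return np.full(self.chunks, self.fill_value, dtype=self.dtype)
        return np.frombuffer(zlib.decompress(compressed),
                             dtype=self.dtype).copy()

    def _put_chunk(self, i, arr):
        # store operation: put modified bytes
        self.store[i] = zlib.compress(arr.tobytes())

    def __getitem__(self, sl):
        start, stop, _ = sl.indices(self.shape)
        out = np.empty(stop - start, dtype=self.dtype)
        for i in range(start // self.chunks, (stop - 1) // self.chunks + 1):
            chunk = self._get_chunk(i)
            c0 = i * self.chunks
            lo, hi = max(start, c0), min(stop, c0 + self.chunks)
            out[lo - start:hi - start] = chunk[lo - c0:hi - c0]
        return out

    def __setitem__(self, sl, value):
        start, stop, _ = sl.indices(self.shape)
        value = np.asarray(value, dtype=self.dtype)
        for i in range(start // self.chunks, (stop - 1) // self.chunks + 1):
            chunk = self._get_chunk(i)     # read existing chunk
            c0 = i * self.chunks
            lo, hi = max(start, c0), min(stop, c0 + self.chunks)
            chunk[lo - c0:hi - c0] = value[lo - start:hi - start]
            self._put_chunk(i, chunk)      # write modified chunk back
```

Note how a store that is not aligned with chunk boundaries touches partial chunks at each end, which is exactly where the synchronization discussed earlier becomes necessary.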
Any comments or thoughts on this very welcome.
Looks like you've put a lot of thought into this, and this separation of the storage layer seems a good idea to me too.
Just wanted to point out that blosc_compress_ctx() and blosc_decompress_ctx() will be removed in Blosc2 and replaced by the newer blosc2_compress_ctx() and blosc2_decompress_ctx(). This is because contexts will be fully supported in Blosc2, which will allow for faster operation (i.e. contexts can be reused). More info on these new Blosc2 functions: https://github.com/Blosc/c-blosc2/blob/master/blosc/blosc.h#L535
Thanks Francesc. Look forward to blosc2.
Just to say #22 has work in progress on refactoring to address this issue. I'm pretty excited about this and am going to try and use bits and pieces of free time over the coming weeks to push this forward, but progress may be slow so if anyone else would like to contribute please feel free to jump in.
A note regarding the possibility of using a zip file as storage. It looks like it is not possible to update an entry in a zip file. Calling writestr('foo', somebytes) more than once will create two 'foo' entries within the zip file. Therefore using a zip file to store a zarr array would only work under the limited circumstances that each chunk of the array is only ever written once. This would mean that calls to __setitem__ on the array would have to be perfectly aligned with chunk boundaries.
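The append-only behaviour is easy to demonstrate with the standard library (Python also emits a "Duplicate name" UserWarning on the second write):

```python
import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, mode='a') as zf:
    zf.writestr('foo', b'first')
    zf.writestr('foo', b'second')   # appends a second 'foo' entry

with zipfile.ZipFile(buf) as zf:
    names = zf.namelist()   # ['foo', 'foo'] -- the stale entry remains
    data = zf.read('foo')   # b'second' -- lookups resolve to the last entry
```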
Yeah, zip files don't support clean in-place updates. Semantically everything works fine but you'll get a lot of entries in the file that are no longer useful. There are other single-file archive formats, it's tricky to find one that does everything. I still think that Zip is a good choice for sending datasets around, though probably not for workflows that involve a great deal of mutation.
It's also possible to do a sort of garbage collection on the Zip file to eliminate the zombie entries. This requires a full read/write.
I want something very similar to zarr on S3 and I'm pondering the easiest way to get there. One approach is to generalize zarr to accept pluggable byte storage solutions.

Currently, I believe that zarr effectively treats the file system as a MutableMapping into which it can deposit and retrieve bytes. If this is the case, then what are your thoughts on actually using the MutableMapping interface instead of touching files directly? That way I could provide MutableMappings that use file systems, zip files, s3, hdfs, etc. This nicely isolates a lot of the "where do I put this block of bytes" logic from the array slicing and compression logic.

For concreteness, here is a MutableMapping that loads/stores data in a directory on the file system: https://github.com/mrocklin/zict/blob/master/zict/file.py
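For reference, a minimal directory-backed MutableMapping in the spirit of zict.File (a sketch, not zict's actual code) fits in a few lines:

```python
import os
from collections.abc import MutableMapping


class FileStore(MutableMapping):
    """Sketch of a directory-backed key -> bytes mapping."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def _path(self, key):
        return os.path.join(self.directory, key)

    def __getitem__(self, key):
        try:
            with open(self._path(key), 'rb') as f:
                return f.read()
        except FileNotFoundError:
            raise KeyError(key)

    def __setitem__(self, key, value):
        with open(self._path(key), 'wb') as f:
            f.write(value)

    def __delitem__(self, key):
        try:
            os.remove(self._path(key))
        except FileNotFoundError:
            raise KeyError(key)

    def __iter__(self):
        return iter(os.listdir(self.directory))

    def __len__(self):
        return len(os.listdir(self.directory))
```

Swapping this for a zip-, S3- or HDFS-backed mapping changes only where the bytes live, which is exactly the isolation being proposed.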