Closed: mrocklin closed this issue 8 years ago
In principle it sounds great.
For the non-synchronized array/chunk implementations this sounds relatively straightforward, i.e., zarr really does just need a MutableMapping where it can store and retrieve bytes. For the synchronized implementations I'm not sure how to handle the locking. I guess the locking is at a level above the MutableMapping interface, because executing __setitem__() on a chunk includes both a retrieval operation (get existing bytes) and a store operation (put modified bytes). Currently for the persistent synchronized chunks zarr uses a fasteners inter-process lock, which depends on access to a file system.
Very happy to discuss.
Perhaps we supply both a MutableMapping and a Lock-like object? Current pairs might be (dict, threading.Lock) or (zict.File, fasteners.InterProcessLock)?
When dealing with distributed storage/computation we rarely care about storing in place. It's far more common to make a completely new dataset as output.
Right, but when you are storing the output, what happens depends on how your storage operations are aligned with chunk boundaries. If they are not exactly aligned then two storage operations may need to store data within the same chunk. So you need to sync access to the chunk, even when writing new outputs.
Yes, maybe a MutableMapping plus some other interface that allows acquiring and releasing a lock for a given key?
Alistair Miles Head of Epidemiological Informatics Centre for Genomics and Global Health http://cggh.org The Wellcome Trust Centre for Human Genetics Roosevelt Drive Oxford OX3 7BN United Kingdom Email: alimanfoo@googlemail.com alimanfoo@gmail.com Web: http://purl.org/net/aliman Twitter: https://twitter.com/alimanfoo Tel: +44 (0)1865 287721
Hrm, yes I see. For distributed computing locks are hard. Short term I see two cheap options:

- Only store along chunk boundaries. In the special case of dask.array I'm completely comfortable guaranteeing this.
- Only ensure safe writing in the single machine case where our processes have access to a shared file system. Accessing S3, Zip files, or other storage solutions still has value outside of distributed computing.

I guess if the storage (MutableMapping) and locking interfaces were decoupled then there would be flexibility to accommodate these options. I.e., a user could decide to use S3 for storage with no locking, because they know writes will always be aligned with chunk boundaries. Or a user could choose S3 for storage plus thread-based locking, because they know they will be executing within a multi-threaded context. Or a user could choose S3 for storage and locking based on access to a shared file system. Any combination of storage and locking would be possible, although some would not be advisable depending on what the user wants to do and how they want to run it.

There would be plenty of scope for users to blow their own feet off here, so maybe a lower-level API provides the full flexibility, and a higher-level API provides convenience functions based on sensible combinations of storage and locking for some common usage patterns?
I'm not particularly picky about the API here, I'm quite willing to jump through a couple of hoops. However I'm also not too worried about people shooting off their own feet. Specifying MutableMappings and Locks is probably a bit of a hurdle and unlikely to be attempted by the casual user.
I think this could be very elegant. It would require some fairly substantial restructuring of the existing code, but I like the decoupling of concerns and think it would be worth it. I'm sorry I won't have any time to work on this myself until after the summer at least, but I'd be more than happy to discuss ideas and review code.
Do we know of any students or other folks looking for a project? One of my goals for using zarr for this is that I wouldn't have to load the array handling bits into my head, but could restrict myself to creating MutableMappings.
It would take some work to get to that point, I'm afraid. Many aspects of the array-handling logic could be left untouched, though; it is more a matter of refactoring.
:+1: I came here to suggest something exactly like this. I was going to suggest some sort of generic filesystem API, but MutableMappings sounds cleaner.
Concurrent attempts to write to the same chunk do seem hard to guard against for arbitrary storage systems, but I think a little bit less safety is probably OK.
@shoyer is there anyone in the xarray community who might be willing to take this on?
Also, what are your thoughts regarding the NetCDF data model and zarr? How do we manage related variables and coordinates?
@alimanfoo can you provide a more detailed list of steps that would be necessary to accomplish this? How would you go about implementing this?
> @shoyer is there anyone in the xarray community who might be willing to take this on?
Well, you can always try @pwolfram :).
> Also, what are your thoughts regarding the NetCDF data model and zarr? How do we manage related variables and coordinates?
We basically need two things to make this happen, both very straightforward:

1. zhdf: a hierarchical storage system for zarrays that supports metadata. The obvious approach is to use the filesystem, with __zattr__ JSON files to hold metadata.
2. znetcdf: a mapping from the netCDF data model to zhdf. h5netcdf is a good example of what this entails. This should be even easier, since we don't have HDF5 dimension scales to worry about. We just need to settle on a convention for how to name dimensions in __zattr__ files.

It would also be interesting to have a tar file backend. Having things in a single file can be convenient for moving data around.
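As a sketch of what this convention could look like: the __zattr__ filename comes from the comment above, while the "_dimensions" key below is purely a hypothetical illustration of how dimension names might be recorded.

```python
import json
import os
import tempfile

# Hypothetical layout: each array's directory carries a __zattr__ JSON file.
# "_dimensions" is an invented placeholder for whatever dimension-naming
# convention is eventually agreed; nothing here is a settled format.
array_dir = tempfile.mkdtemp()
attrs = {"_dimensions": ["time", "lat", "lon"], "units": "kelvin"}

with open(os.path.join(array_dir, "__zattr__"), "w") as f:
    json.dump(attrs, f)

# Reading the metadata back is a plain JSON load
with open(os.path.join(array_dir, "__zattr__")) as f:
    loaded = json.load(f)
```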
I agree that single file storage is convenient, especially for sharing datasets. I would recommend Zip over Tar though. Tar doesn't support random access while Zip does. There is, fortunately, already a Zip(MutableMapping) within the zict library. It was actually the first implementation in there and the reason for the name zict.
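As a rough illustration of the idea (a sketch, not zict's actual Zip implementation), a zip-backed MutableMapping might look like this. Note that zip archives are append-only, so rewriting a key leaves a stale duplicate entry behind:

```python
import zipfile
from collections.abc import MutableMapping


class ZipStore(MutableMapping):
    """Sketch of a zip-file-backed key -> bytes mapping.

    Illustrative only. Zip archives are append-only: writing the same
    key twice leaves a stale duplicate entry (the last one wins on read).
    """

    def __init__(self, path):
        self.path = path
        # mode 'a' creates the archive if it doesn't exist yet
        with zipfile.ZipFile(self.path, mode='a'):
            pass

    def __getitem__(self, key):
        # ZipFile.read raises KeyError for a missing name
        with zipfile.ZipFile(self.path, mode='r') as zf:
            return zf.read(key)

    def __setitem__(self, key, value):
        with zipfile.ZipFile(self.path, mode='a') as zf:
            zf.writestr(key, value)

    def __delitem__(self, key):
        # zip archives don't support in-place deletion
        raise NotImplementedError("zip entries cannot be deleted in place")

    def __iter__(self):
        with zipfile.ZipFile(self.path, mode='r') as zf:
            seen = set()
            for name in zf.namelist():
                if name not in seen:   # skip stale duplicate entries
                    seen.add(name)
                    yield name

    def __len__(self):
        return sum(1 for _ in self)
```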
I am interested in this although time is very limited right now and I have to make sure the scope is in "needed to get science done". However, it would be great to have access to a clean bundling capability via tar/zip file and this may be useful/easy within the context of a distributed/dask/xarray integration. See also https://github.com/pydata/xarray/issues/798
I'll give this some thought over the next few days and try to write down some ideas on the API and how to do the refactoring.
I've given this some thought over the weekend and have an initial sketch for how to design the API and refactor the existing code. The good news is I think this will solve several problems in one go, and make the internal architecture much simpler with clear separation of concerns. So I'm convinced it's well worth the effort. That said, I do think the code and tests need to be completely restructured, so it's not a trivial piece of work. I wish I'd had the foresight to do it this way first time, but hey, that's why open source is good.
I've pushed my initial API sketch up to a new "refactor" branch. I will give some notes and discussion below. Please note that this is just an initial sketch and I'm sure will need to be modified/refined.
The first step is to separate out the storage layer. I propose the zarr.store.base.ArrayStore abstract class, defining the interface to storage for a single array. This class has a meta property, holding a MutableMapping for storing essential configuration metadata for the array such as shape, chunk shape, dtype, compression options, etc. The data property is a MutableMapping that maps chunk indices to bytes objects holding compressed chunk data; so, e.g., data[0, 0] returns compressed bytes for the chunk at index (0, 0). The attrs property is a MutableMapping storing user-defined attributes. The cbytes property returns the total size of compressed data held for the array. It may also be useful to implement an initialized property returning the number of chunks that have been initialised with some data.
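As a sketch only (the actual interface on the refactor branch may differ in names and signatures), the proposed abstract class might look like:

```python
from abc import ABC, abstractmethod
from collections.abc import MutableMapping


class ArrayStore(ABC):
    """Sketch of the proposed storage interface for a single array."""

    @property
    @abstractmethod
    def meta(self) -> MutableMapping:
        """Essential configuration: shape, chunk shape, dtype, compression..."""

    @property
    @abstractmethod
    def data(self) -> MutableMapping:
        """Maps chunk indices (tuples) to compressed bytes."""

    @property
    @abstractmethod
    def attrs(self) -> MutableMapping:
        """User-defined attributes."""

    @property
    def cbytes(self) -> int:
        """Total size of compressed data held for the array."""
        return sum(len(v) for v in self.data.values())

    @property
    def initialized(self) -> int:
        """Number of chunks that have been initialised with some data."""
        return len(self.data)
```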
Existing code for storing arrays in memory and on disk would be refactored into the zarr.store.memory.MemoryStore and zarr.store.directory.DirectoryStore classes respectively. New implementations of storage layers, e.g., using a ZipFile or S3, would live alongside these as separate sub-modules. Implementation of the MemoryStore class could be very simple, with the meta, data and attrs properties each being a dict.
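A MemoryStore along those lines could be as simple as the following sketch (in practice it would implement the proposed ArrayStore interface; plain attributes stand in for the properties here):

```python
class MemoryStore:
    """Sketch of an in-memory array store: all three mappings are dicts."""

    def __init__(self):
        self.meta = {}   # shape, chunk shape, dtype, compression options...
        self.data = {}   # chunk index tuple -> compressed bytes
        self.attrs = {}  # user-defined attributes

    @property
    def cbytes(self):
        # total size of compressed data held for the array
        return sum(len(v) for v in self.data.values())

    @property
    def initialized(self):
        # number of chunks that have been initialised with data
        return len(self.data)
```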
The second step is to separate out the synchronization (i.e., locking) functionality. I propose the zarr.sync.ArraySynchronizer abstract class, defining the interface to synchronization operations for a single array. The most important method is lock_chunk, which implements a lock on a single chunk.

Existing code for doing thread-based locking and inter-process locking would be refactored into the classes zarr.sync.ThreadSynchronizer and zarr.sync.ProcessSynchronizer respectively.
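A thread-based synchronizer exposing a lock_chunk method could be sketched as follows (a minimal illustration, not the code on the refactor branch):

```python
import threading
from contextlib import contextmanager


class ThreadSynchronizer:
    """Sketch: one threading.Lock per chunk index, created lazily."""

    def __init__(self):
        self._mutex = threading.Lock()   # guards the lock registry
        self._locks = {}

    @contextmanager
    def lock_chunk(self, cidx):
        # look up (or create) the lock for this chunk index
        with self._mutex:
            lock = self._locks.setdefault(cidx, threading.Lock())
        with lock:
            yield
```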
Once APIs for storage and synchronization are defined, we could implement two classes, Array and SynchronizedArray. These two classes would replace all of the existing array classes. When instantiating an Array, an ArrayStore is provided as the only constructor argument. When instantiating a SynchronizedArray, both an ArrayStore and an ArraySynchronizer are provided.
The last issue is how to deal with the operations to get data from a chunk and set or modify data in a chunk, and the lowest-level operations to compress and decompress data using blosc. Previously I had implemented a set of Chunk classes to encapsulate all of this, but in doing this refactoring I realised that these classes are unnecessary, i.e., all the chunk classes can be deleted. This not only simplifies the code, it also removes a source of overhead, because no state needs to be maintained for any chunk, other than holding the compressed data for each chunk in a store.
To work this last part of the API through, I've implemented the __getitem__ and __setitem__ methods on the Array class. All blosc-related operations could now be encapsulated within a very simple cython extension, which I've put at blosc.pyx. Hopefully this provides enough information to figure out how to begin reorganising the existing code. Details of the blosc interface might need to be modified when this actually hits the ground.
Given this API, all of the existing tests would also need to be refactored. Again the good news is that the cleaner separation of concerns should also simplify the internal architecture of the tests, although this too is a fairly substantial piece of work.
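To make the chunk read/modify/write cycle concrete, here is a deliberately simplified, one-dimensional sketch of an Array over a MutableMapping store, with zlib standing in for blosc (the real design is N-dimensional and uses compiled compression; everything below is illustrative only):

```python
import zlib

import numpy as np


class Array:
    """Minimal 1-D sketch of a chunked array over a MutableMapping.

    Uses zlib instead of blosc to stay dependency-free. Whole-chunk
    read/modify/write only, and no synchronization.
    """

    def __init__(self, store, shape, chunks, dtype='f8', fill_value=0):
        self.store = store          # MutableMapping: chunk index -> bytes
        self.shape = shape          # total length (1-D)
        self.chunks = chunks        # chunk length
        self.dtype = np.dtype(dtype)
        self.fill_value = fill_value

    def _get_chunk(self, i):
        # retrieval operation: get existing bytes (or a fresh fill chunk)
        try:
            compressed = self.store[i]
        except KeyError:
            return np.full(self.chunks, self.fill_value, dtype=self.dtype)
        return np.frombuffer(zlib.decompress(compressed),
                             dtype=self.dtype).copy()

    def _put_chunk(self, i, arr):
        # store operation: put modified bytes
        self.store[i] = zlib.compress(arr.tobytes())

    def __getitem__(self, sl):
        start, stop, _ = sl.indices(self.shape)
        out = np.empty(stop - start, dtype=self.dtype)
        for i in range(start // self.chunks, (stop - 1) // self.chunks + 1):
            chunk = self._get_chunk(i)
            c0 = i * self.chunks
            lo, hi = max(start, c0), min(stop, c0 + self.chunks)
            out[lo - start:hi - start] = chunk[lo - c0:hi - c0]
        return out

    def __setitem__(self, sl, value):
        start, stop, _ = sl.indices(self.shape)
        value = np.asarray(value, dtype=self.dtype)
        for i in range(start // self.chunks, (stop - 1) // self.chunks + 1):
            chunk = self._get_chunk(i)     # read existing chunk
            c0 = i * self.chunks
            lo, hi = max(start, c0), min(stop, c0 + self.chunks)
            chunk[lo - c0:hi - c0] = value[lo - start:hi - start]
            self._put_chunk(i, chunk)      # write modified chunk back
```

Note how a store that is not aligned with chunk boundaries touches partial chunks at each end, which is exactly where the synchronization discussed earlier becomes necessary.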
Any comments or thoughts on this very welcome.
Looks like you've put a lot of thought into this, and this separation of the storage layer seems a good idea to me too.
Just wanted to point out that blosc_compress_ctx() and blosc_decompress_ctx() will be removed in Blosc2 and replaced by the newer blosc2_compress_ctx() and blosc2_decompress_ctx(). This is because contexts will be fully supported in Blosc2, which will allow for faster operation (i.e. contexts can be reused). More info on these new Blosc2 functions: https://github.com/Blosc/c-blosc2/blob/master/blosc/blosc.h#L535
Thanks Francesc. Look forward to blosc2.
Just to say #22 has work in progress on refactoring to address this issue. I'm pretty excited about this and am going to try and use bits and pieces of free time over the coming weeks to push this forward, but progress may be slow so if anyone else would like to contribute please feel free to jump in.
A note regarding the possibility of using a zip file as storage. It looks like it is not possible to update an entry in a zip file. Calling writestr('foo', somebytes) more than once will create two 'foo' entries within the zip file. Therefore using a zip file to store a zarr array would only work under the limited circumstances that each chunk of the array is only ever written once. This would mean that calls to __setitem__ on the array would have to be perfectly aligned with chunk boundaries.
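The append-only behaviour is easy to demonstrate with the standard library (Python also emits a "Duplicate name" UserWarning on the second write):

```python
import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, mode='a') as zf:
    zf.writestr('foo', b'first')
    zf.writestr('foo', b'second')   # appends a second 'foo' entry

with zipfile.ZipFile(buf) as zf:
    names = zf.namelist()   # ['foo', 'foo'] -- the stale entry remains
    data = zf.read('foo')   # b'second' -- lookups resolve to the last entry
```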
Yeah, zip files don't support clean in-place updates. Semantically everything works fine but you'll get a lot of entries in the file that are no longer useful. There are other single-file archive formats, it's tricky to find one that does everything. I still think that Zip is a good choice for sending datasets around, though probably not for workflows that involve a great deal of mutation.
It's also possible to do a sort of garbage collection on the Zip file to eliminate the zombie entries. This requires a full read/write.
I want something very similar to zarr on S3 and I'm pondering the easiest way to get there. One approach is to generalize zarr to accept pluggable byte storage solutions.

Currently, I believe that zarr effectively treats the file system as a MutableMapping into which it can deposit and retrieve bytes. If this is the case, then what are your thoughts on actually using the MutableMapping interface instead of touching files directly? That way I could provide MutableMappings that use file systems, zip files, s3, hdfs, etc. This nicely isolates a lot of the "where do I put this block of bytes" logic from the array slicing and compression logic.

For concreteness, here is a MutableMapping that loads/stores data in a directory on the file system: https://github.com/mrocklin/zict/blob/master/zict/file.py
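For reference, a minimal directory-backed MutableMapping in the spirit of zict.File (a sketch, not zict's actual code) fits in a few lines:

```python
import os
from collections.abc import MutableMapping


class FileStore(MutableMapping):
    """Sketch of a directory-backed key -> bytes mapping."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def _path(self, key):
        return os.path.join(self.directory, key)

    def __getitem__(self, key):
        try:
            with open(self._path(key), 'rb') as f:
                return f.read()
        except FileNotFoundError:
            raise KeyError(key)

    def __setitem__(self, key, value):
        with open(self._path(key), 'wb') as f:
            f.write(value)

    def __delitem__(self, key):
        try:
            os.remove(self._path(key))
        except FileNotFoundError:
            raise KeyError(key)

    def __iter__(self):
        return iter(os.listdir(self.directory))

    def __len__(self):
        return len(os.listdir(self.directory))
```

Swapping this for a zip-, S3- or HDFS-backed mapping changes only where the bytes live, which is exactly the isolation being proposed.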