zarr-developers / zarr-python

An implementation of chunked, compressed, N-dimensional arrays for Python.
https://zarr.readthedocs.io
MIT License

File Chunk Store #556

Closed ajelenak closed 11 months ago

ajelenak commented 4 years ago

Hello!

I want to propose adding a new Zarr store type for the case where all array chunks are located in a single binary file. A prototype implementation, named file chunk store, is described in this Medium post. In this approach, Zarr metadata (.zgroup, .zarray, .zattrs, or .zmetadata) are stored in one of the current Zarr store types while the array chunks are in a binary file. The file chunk store translates array chunk keys into file seek and read operations and therefore only provides read access to the chunk data.

The file chunk store requires a mapping between array chunk keys and their file locations. The prototype implementation put this information for every Zarr array in JSON files named .zchunkstore. An example is below:

{
    "BEAM0001/tx_pulseflag/0": {
        "offset": 94854560,
        "size": 120
    },
    "BEAM0001/tx_pulseflag/1": {
        "offset": 94854680,
        "size": 120
    },
    "BEAM0001/tx_pulseflag/2": {
        "offset": 94854800,
        "size": 120
    },
    "BEAM0001/tx_pulseflag/3": {
        "offset": 94854920,
        "size": 120
    },
    "BEAM0001/tx_pulseflag/4": {
        "offset": 96634038,
        "size": 120
    },
    "BEAM0001/tx_pulseflag/5": {
        "offset": 96634158,
        "size": 123
    },
    "source": {
        "array_name": "/BEAM0001/tx_pulseflag",
        "uri": "https://e4ftl01.cr.usgs.gov/GEDI/GEDI01_B.001/2019.05.26/GEDI01_B_2019146164739_O02560_T04067_02_003_01.h5"
    }
}

Array chunk file location is described with the starting byte (offset) in the file and the number of bytes to read (size). Also included is the file information (source) to enable verification of chunk data provenance. The file chunk store prototype uses file-like Python objects, delegating to users the responsibility to arrange access to correct files.
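The lookup described above can be sketched in a few lines. This is a minimal illustration, not the prototype's code: `read_chunk` is a hypothetical helper, the in-memory file stands in for a real source file, and the `var/...` keys and the `uri` are made up.

```python
import io
import json


def read_chunk(fobj, chunk_map, key):
    """Return the raw (still compressed) bytes of one chunk."""
    loc = chunk_map[key]  # raises KeyError for unknown chunk keys
    fobj.seek(loc["offset"])
    data = fobj.read(loc["size"])
    if len(data) != loc["size"]:
        raise OSError(f"short read for {key!r}")
    return data


# Stand-in for a real source file: 10 filler bytes, then two chunks.
fobj = io.BytesIO(b"\x00" * 10 + b"AAAA" + b"BBBBBB")

# Same layout as the .zchunkstore example, with hypothetical values.
chunk_map = json.loads("""
{
    "var/0": {"offset": 10, "size": 4},
    "var/1": {"offset": 14, "size": 6},
    "source": {"array_name": "/var", "uri": "file:///hypothetical.h5"}
}
""")

first = read_chunk(fobj, chunk_map, "var/0")
second = read_chunk(fobj, chunk_map, "var/1")
```

Any seekable file-like object works in place of the `BytesIO`, which is what leaves the choice of storage system to the caller.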

We can discuss specific implementation details if there is enough interest in this new store type.

Thanks!

alimanfoo commented 4 years ago

Hi @ajelenak, sorry for the slow response here. Would this be exactly the implementation that @rsignell-usgs described in the Medium blog post? I.e., this would enable reading of HDF5 files (as well as other file types storing chunks)?

ajelenak commented 4 years ago

Hi, thanks for looking into this.

Yes, the approach is the same as in the Medium post. Actual implementation details are open for discussion. Producing mappings between Zarr chunk keys and their file locations is not included since it differs based on the source file format.

alimanfoo commented 4 years ago

Is this something you'd like to be able to layer over different types of storage? I.e., would you like to be able to use this over cloud object stores, as well as possibly local files?

ajelenak commented 4 years ago

Yes. The file chunk store takes a file-like object, so it does not depend on the file's actual storage system. I used it on local files with objects from the open() function.

rsignell-usgs commented 4 years ago

@alimanfoo, I'm wondering what folks think about this proposal. I'd like to get the conversation going again now that xarray has merged https://github.com/pydata/xarray/pull/3804

rabernat commented 4 years ago

Together with pydata/xarray#3804, the idea proposed here could unlock an amazing capability: accessing a big HDF5 file using Zarr very efficiently. I think it's important to find a way to move forward with it.

manzt commented 4 years ago

+1 here! We have been generating a complementary offsets.json file for retrieving tiles from OME-TIFF images on the web. Having something like this could make reading open image formats as Zarr much easier.

jakirkham commented 4 years ago

I'm curious if people here have tried existing single file stores like ZipStore, DBMStore, LMDBStore or SQLiteStore? I should add that these probably contain optimizations in the underlying formats that we may not have considered or would miss out on (at least initially) when implementing something new.

pbranson commented 4 years ago

I have used ZipStore quite extensively to assist in packaging (reducing inodes) and archival storage of DirectoryStores on HPC. I am interested in trying a ZipStore on an HTTP storage system (like S3) using fsspec but have yet to do so; I would be interested to know if others have tried that.

However, I think the salient point here is that there is already a significant latent investment in netCDF/HDF that is stored in cloud storage and this unlocks significant performance gains and to a degree software stack simplification if these can be accessed via Zarr store without needing to invest considerable overhead to convert the data format.

manzt commented 4 years ago

I'm curious if people here have tried existing single file stores like ZipStore, DBMStore, LMDBStore or SQLiteStore?

I've experimented with ZipStore, but reading a ZipStore remotely via HTTP is not very performant and not well supported. Traversing the central directory at the end of the zip file to find the chunk byte offsets takes a long time (and many requests) for large stores, making the DirectoryStore most ideal for archival storage.

As a side note, I wrote a small python package to serve the underlying store for any zarr-python zarr.Array or zarr.Group over HTTP (simple-zarr-server). It works by mapping HTTP requests to the underlying store.__getitem__ and store.__setitem__, making any store accessible by a python client with fsspec.HTTPFileSystem. Not ideal for archival storage, again, but at least a way to access a non-DirectoryStore remotely via fsspec.

However, I think the salient point here is that there is already a significant latent investment in netCDF/HDF that is stored in cloud storage and this unlocks significant performance gains and to a degree software stack simplification if these can be accessed via Zarr store without needing to invest considerable overhead to convert the data format.

Agreed. Something like a "File Chunk Store" might offer a more standardized way to read other tiled/chunked formats without requiring conversion (despite not being as performant as the built-in stores).

joshmoore commented 4 years ago

https://github.com/zarr-developers/zarr-python/issues/556#issuecomment-683770452 We have been generating a complementary offsets.json file for retrieving tiles from OME-TIFF images on the web. Having something like this could make reading open image formats as Zarr much easier.

@manzt, you haven't tried writing a Zarr store implementation in front of OME-TIFF yet have you? That might allow unifying versions of the offset file.

manzt commented 4 years ago

@joshmoore I haven't found the time yet but was planning to investigate ... I experimented with a store implementation for OME-TIFF but it just used tifffile. We use the offsets file outside of zarr at the moment with geotiff.js, so it doesn't have a mapping of zarr-specific keys.

The thing with a "File Chunk Store" is that it would be much easier to enable cross-language support for many different types of files. Rather than requiring an extra runtime dependency (in each zarr implementation) to parse the file format and find chunk offsets, this can be done ahead of time, in whatever language is used to create the offsets file.

manzt commented 4 years ago

@joshmoore

TL;DR: An offsets file enables the explicit retrofitting of non-Zarr, array-like data as Zarr and simplifies accessing remote single-file stores via byte-range requests.

Ok, so I experimented with this today, unifying a version of the "offsets" file for both a zarr.ZipStore and a multiscale OME-TIFF image. It seems a "File Chunk Store" could be implemented as some type of extension for Zarr, where an offsets file is generated to complement some archival/read-only store.

Details about the actual file format are necessary for updating (writing) chunks, which is why the store is read-only, but having the offsets in a separate file makes accessing the bytes remotely straightforward (e.g. using HTTP range requests).
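To make the range-request idea concrete, here is a minimal sketch of how an offset/size pair maps onto an HTTP `Range` header. `range_header` and `fetch_chunk` are hypothetical helper names; `fetch_chunk` is shown only for shape and is not exercised, since it needs a real server. The numbers are those of the "0.0.0" entry in the ZipStore offsets listing in this comment.

```python
import urllib.request


def range_header(offset, size):
    """Build the Range header value for `size` bytes starting at `offset`.

    HTTP byte ranges are inclusive at both ends, hence the `- 1`.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    return f"bytes={offset}-{offset + size - 1}"


def fetch_chunk(url, offset, size):
    """Fetch one chunk over HTTP (illustrative; requires a live server)."""
    req = urllib.request.Request(
        url, headers={"Range": range_header(offset, size)}
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()


header = range_header(433, 89446)
```

A server that ignores `Range` answers with status 200 and the whole file instead of 206, so a real client would probably want to check the status before trusting the body.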

ZipStore: https://observablehq.com/@manzt/zarr-js-file-chunk-store
OME-TIFF: https://observablehq.com/@manzt/ome-tiff-file-chunk-store

For the zarr.ZipStore example, it was just a matter of traversing the underlying ZipFile directory and gathering the byte offsets and sizes for all the data:

ZipStore "file chunk store" metadata

```python
def get_offsets(zstore):
    offsets = {}
    # Traverse the file directory and compute the offsets and chunks
    for i in zstore.zf.infolist():
        # File contents in zip file start 30 + n bytes after the file
        # header offset, where n is the number of bytes of the filename.
        name_bytes = len(i.filename.encode("utf-8"))
        offsets[i.filename] = dict(
            offset=i.header_offset + 30 + name_bytes,
            size=i.compress_size,
        )
    return offsets
```

```
{
  ".zarray": { "offset": 37, "size": 361 },
  "0.0.0": { "offset": 433, "size": 89446 },
  "0.1.0": { "offset": 89914, "size": 100028 },
  "1.0.0": { "offset": 189977, "size": 88337 },
  "1.1.0": { "offset": 278349, "size": 87346 }
}
```
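The `30 + n` rule holds when the zip local file header carries no "extra" field. A slightly more defensive variant reads both lengths from the header itself. This is a sketch under that reading of the zip format, with `data_offset` as a hypothetical helper and a tiny in-memory archive as test data.

```python
import io
import struct
import zipfile


def data_offset(fobj, info):
    """Return the byte offset at which a zip member's data begins.

    The local file header is 30 fixed bytes followed by the file name and
    an optional "extra" field; both lengths are stored at bytes 26-29.
    """
    fobj.seek(info.header_offset)
    header = fobj.read(30)
    if header[:4] != b"PK\x03\x04":
        raise ValueError("not a local file header")
    name_len, extra_len = struct.unpack("<HH", header[26:30])
    return info.header_offset + 30 + name_len + extra_len


# Build a small uncompressed zip in memory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
    zf.writestr(".zarray", b'{"zarr_format": 2}')
    zf.writestr("0.0.0", b"chunk-bytes")

# Look up one member, then read its bytes straight from the buffer.
with zipfile.ZipFile(buf) as zf:
    info = zf.getinfo("0.0.0")

start = data_offset(buf, info)
buf.seek(start)
payload = buf.read(info.compress_size)
```

For archives like this one, with no extra field, the result matches the simpler `header_offset + 30 + name_bytes` calculation.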

For the OME-TIFF, it was a bit more involved to generate the "offsets" file. The key was mapping compressed byte-ranges (chunks) in the OME-TIFF to the same schema, and filling in .zarray/.zattrs/.zgroup metadata where necessary. Hence, the offsets files were the same structure in both cases, except in the OME-TIFF the metadata keys contain the actual JSON metadata rather than offset and size.

OME-TIFF "file chunk store" metadata

```
{
  ".zgroup": { "zarr_format": 2 },
  "0/.zarray": {
    "chunks": [ 1, 1024, 1024 ],
    "compressor": { "id": "zlib", "level": 8 },
    "dtype": "|u1",
    "fill_value": 0,
    "filters": null,
    "order": "C",
    "shape": [ 5, 34560, 24960 ],
    "zarr_format": 2
  },
  "0/0.0.0": { "offset": 1055, "size": 1039 },
  "1/0.0.0": { "offset": 1452747820, "size": 9542 },
  ...
}
```

manzt commented 4 years ago

A super naive implementation of the file chunk store for local FS. You should use zarr.ZipStore to read the zip store in this case, but the FileChunkStore can be used to read both the ZipStore and the OME-TIFF above. Something very similar could be implemented for fsspec that uses byte-range requests for HTTP for example.

from collections.abc import MutableMapping

class FileChunkStore(MutableMapping):

    def __init__(self, path, offsets):
        self.path = path
        self.offsets = offsets

    def __getitem__(self, key):
        res = self.offsets[key]
        if "offset" not in res:
            # metadata not byte offsets
            return res
        with open(self.path, 'rb') as f:
            f.seek(res["offset"])
            cbytes = f.read(res["size"])
        return cbytes

    def __setitem__(self, key, value):
        raise NotImplementedError

    def __delitem__(self, key):
        raise NotImplementedError

    def __contains__(self, key):
        return key in self.offsets

    def __len__(self):
        return len(self.offsets)

    def __iter__(self):
        # __iter__ must return an iterator, not a keys view
        return iter(self.offsets)

    def keys(self):
        return self.offsets.keys()

in napari:

[Two screenshots, captured 2020-09-21 at 7:10 PM and 7:11 PM]

joshmoore commented 4 years ago

Nice, @manzt! My plan is to work my way through your OME-TIFF example. It would be interesting to hear from others if there are other examples where more complex metadata is needed in the chunk store.

manzt commented 4 years ago

Apologies for the overload of information; let me know if you have any questions. There is certainly room to add more metadata but these examples highlight the simplest case. The OME-TIFF above works for two reasons:

Unfortunately, the "padding" for a chunk would need to occur after decoding, so the store can't actually handle doing this. Perhaps edge chunks could be handled more flexibly, but I don't know enough here...
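To see where padding would come into play, the following sketch computes how much real data sits inside a given chunk, using the shape and chunk size from the OME-TIFF `.zarray` above. It assumes the Zarr v2 convention that every decoded chunk has the full chunk shape, edge chunks included; `chunk_grid` and `valid_extent` are hypothetical helper names.

```python
import math


def chunk_grid(shape, chunks):
    """Number of chunks along each dimension."""
    return tuple(math.ceil(s / c) for s, c in zip(shape, chunks))


def valid_extent(shape, chunks, index):
    """Extent of real data inside the chunk at `index`, per dimension."""
    return tuple(min(c, s - i * c) for s, c, i in zip(shape, chunks, index))


shape = (5, 34560, 24960)
chunks = (1, 1024, 1024)

grid = chunk_grid(shape, chunks)                         # chunks per axis
interior = valid_extent(shape, chunks, (0, 0, 0))        # a full chunk
corner = valid_extent(shape, chunks, (0, 33, 24))        # last row and column
```

Any chunk whose valid extent is smaller than `chunks` is an edge chunk. If the source format stores it at its reduced size, something after decoding has to pad it back out, which a store that only hands over raw bytes cannot do.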

rabernat commented 4 years ago

I have reread this thread and am very excited by the opportunities here. This sort of "hackability" is exactly what we love about zarr. My understanding is that the file chunk store proposed here is not covered by the Zarr spec; it's a hack we can do to expose other legacy storage formats via Zarr's API.

It would be good to sketch out a path forward for folks to efficiently collaborate on this without getting too bogged down in edge cases. My proposal would be that we start developing the file chunk store outside of zarr-python. This will allow us to iterate quickly and prototype the different scenarios discussed above. The resulting store could be used like this:

from zarr_filechunkstore import FileChunkStore
store = FileChunkStore(**options)
import zarr
array = zarr.open(store)

Since @ajelenak already has a working implementation, perhaps we could start from that?

manzt commented 4 years ago

Totally agree! I'd just like to be mindful of trying to unify a "chunkstore" so that other zarr implementations can build support.

joshmoore commented 4 years ago

My understanding is that the file chunk store proposed here is not covered by the Zarr spec; it's a hack we can do to expose other legacy storage formats via Zarr's API.

I was thinking of going a step beyond a hack and having this as a community convention (if not extension) assuming the interface can be somewhat nailed down.

ajelenak commented 4 years ago

Thanks @manzt for validating the FileChunkStore idea, and thanks to @rabernat for support.

I do have a prototype implementation of FileChunkStore here.

This is how I see the path forward:

  1. FileChunkStore is based on the assumption that all chunks logically associated with one Zarr array are of the same shape.
  2. Specify the .zchunkstore content (and name). My implementation includes references to the source file and its array variables (example).
  3. Specify the FileChunkStore design starting from its current prototype implementations. One issue I did not have time to improve in my prototype was handling of consolidated Zarr metadata.
  4. Decide where to store all file chunk store translators (file format/chunk location to Zarr array metadata). @rabernat's idea of something like zarr_filechunkstore package would work well, I think.

I think the above covers all the main points but I may have overlooked some important edge cases.

matthewhanson commented 4 years ago

Hey everyone,

We've also got an implementation like this, but made specifically for reading NASA EOSDIS NetCDF/HDF data. Based on ConsolidatedMetadataStore, it's called ConsolidatedChunkStore. The library has a few features:

I've been testing on some MODIS SST data, but it's not clear there's much benefit over downloading the entire file. There's still too much overhead in opening the file and doing the reads, and the files aren't that big (<25MB). If you need to read a single array then it might be faster, but if you need to read the coordinate arrays as well for subsetting....it takes too long.

Working on testing it out on some larger datasets, like GEDI and IceSat2.

Unfortunately none of this is public yet, it's got to go through the NASA approval process.

ajelenak commented 4 years ago

Hi @matthewhanson,

I participated in the NASA-funded study that prototyped DMR++. Async chunk reading is something worth considering but for now FileChunkStore relies on the zarr-python machinery for chunk reading. The DMR++ to Zarr translation code should be added as another file format translator when available.

rabernat commented 4 years ago

Experimental async support has just been added to zarr-python in #606. (See #536 for more discussion of async.) It should be possible to support async from an external store by implementing the getitems method.
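As a rough illustration of the shape such a method could take, the sketch below adds a batch `getitems` to a read-only, offsets-based store. It is synchronous and only shows the "many keys in, dict of bytes out" contract. The class name `BatchFileChunkStore`, the keyword arguments zarr-python would pass, and the choice to silently skip missing keys are all assumptions rather than the zarr-python API.

```python
import os
import tempfile
from collections.abc import Mapping


class BatchFileChunkStore(Mapping):
    """Read-only store with a hypothetical batch `getitems` method."""

    def __init__(self, path, offsets):
        self.path = path
        self.offsets = offsets

    def __getitem__(self, key):
        # Missing keys raise KeyError, as for any mapping.
        return self.getitems([key])[key]

    def __iter__(self):
        return iter(self.offsets)

    def __len__(self):
        return len(self.offsets)

    def getitems(self, keys, **kwargs):
        """Return {key: bytes} for every known key, opening the file once."""
        wanted = [k for k in keys if k in self.offsets]
        out = {}
        with open(self.path, "rb") as f:
            # Visit chunks in file order so seeks only move forward.
            for key in sorted(wanted, key=lambda k: self.offsets[k]["offset"]):
                loc = self.offsets[key]
                f.seek(loc["offset"])
                out[key] = f.read(loc["size"])
        return out


# Demo on a throwaway file: a 6-byte header followed by two chunks.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"headerAAAABBBBBB")
path = tmp.name

store = BatchFileChunkStore(path, {
    "0": {"offset": 6, "size": 4},
    "1": {"offset": 10, "size": 6},
})
result = store.getitems(["1", "0", "missing"])
os.remove(path)
```

An asynchronous version would presumably issue the reads (or HTTP range requests) concurrently inside `getitems` and gather the results into the same dictionary.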

martindurant commented 4 years ago

fsspec's reference file system allows addressing specific byte ranges (chunks) of files as if they were a file system of their own. This allows mapping of HDF5 chunks to a zarr-like tree, as discussed here, but without the need for a storage class in zarr. Indeed, this is something that should be useful to fsspec in general. @rsignell-usgs has examples of using this on the "ike" dataset with or without Intake - it only needs zarr 2.5 and fsspec master.

I suggest that this thread can be used to discuss the best implementation-agnostic way to store the offsets and where to let the hdf-walking code live. Note that the fsspec implementation has a function specifically to read the current format.

rabernat commented 4 years ago

The evidence from this thread is that there are multiple successful implementations of this concept. Martin's fsspec-based solution is an elegant one, because it uses the fsspec abstraction of a filesystem to hide the concept of the offsets from zarr.

However, just like zarr itself, I hope that the implementations can be separate from the specification. @joshmoore made a very good point up thread:

I was thinking of going a step beyond a hack and having this as a community convention (if not extension)

What we need to do now is align on a specification. It sounds like @ajelenak, @manzt, and @matthewhanson have all defined different ad hoc specifications for how to encode these offsets within a binary file and expose the internal chunks to zarr. I would go further and propose we do this as a formal extension to the zarr v3 spec.

The concept of extensions is rather new. Do we have a template for what an extension looks like? If so, I would be happy to help coordinate this process.

martindurant commented 4 years ago

In addition to @rabernat's comment, in the zarr group we were just discussing where the snippets of code to go from a given file format (such as the HDF5 example) to a set of offsets might live, and thought that it merited its own repo, which would then host the spec for the offsets file format and probably some CLI frontend implementation. Given that this doesn't need zarr intervention, I'm not certain what a zarr v3 extension would contain in this case, but it probably ought to all live in the same repo either way.

rabernat commented 4 years ago

:+1: to the idea of a standalone utility + CLI for generation of chunk offset metadata. Given a clearly defined specification for the offsets, that utility need not depend on zarr-python.

rsignell-usgs commented 4 years ago

Just chiming in here that if people want to try the Hurricane Ike demo @martindurant mentioned above, accessing HDF5 in a cloud-friendly way using the existing zarr and fsspec libraries, you can run it on binder:

[Binder launch badge]

rabernat commented 3 years ago

So @martindurant has put together a simple specification for what we are calling a "reference filesystem." The goal is to provide a simple data structure to map the locations of zarr-readable chunks within other binary formats. Our hope is that this can cover both the HDF5 use case as well as the OME-TIFF use case provided by @manzt. We would greatly appreciate it if those interested in this feature could help us iterate to agree on the spec, which is a key step to move this feature forward.
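As an illustration of what such a data structure might look like, the sketch below converts the `.zchunkstore` layout from the top of this thread into a flat mapping of key to `[uri, offset, size]`. That triple layout is an assumption made for this example, and the spec in the linked repository is the authority on the actual format. `to_references` is a hypothetical helper and the `uri` is a placeholder.

```python
def to_references(chunk_map):
    """Convert a .zchunkstore-style dict to {key: [uri, offset, size]}."""
    uri = chunk_map["source"]["uri"]
    refs = {}
    for key, loc in chunk_map.items():
        if key == "source":
            continue  # provenance record, not a chunk
        refs[key] = [uri, loc["offset"], loc["size"]]
    return refs


chunk_map = {
    "BEAM0001/tx_pulseflag/0": {"offset": 94854560, "size": 120},
    "BEAM0001/tx_pulseflag/1": {"offset": 94854680, "size": 120},
    "source": {
        "array_name": "/BEAM0001/tx_pulseflag",
        "uri": "https://example.invalid/hypothetical.h5",  # placeholder
    },
}
refs = to_references(chunk_map)
```

Carrying the target location in every entry means a single mapping can point into several source files, which a per-array `source` block cannot express.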

Right now the spec and a script to generate examples live here: https://github.com/intake/fsspec-reference-maker

Going forward, some broader questions are:

rabernat commented 3 years ago

The process by which we would like feedback on this is via issues on https://github.com/intake/fsspec-reference-maker.

rabernat commented 3 years ago

A great blog post by @rsignell-usgs has now been published which describes the fsspec reference filesystem solution to this problem: https://medium.com/pangeo/cloud-performant-netcdf4-hdf5-with-zarr-fsspec-and-intake-3d3a3e7cb935

jhamman commented 11 months ago

This can be closed now that fsspec has the reference file system.