zarr-developers / zarr-specs

Zarr core protocol for storage and retrieval of N-dimensional typed arrays
https://zarr-specs.readthedocs.io/
Creative Commons Attribution 4.0 International

Feature request - variable chunking #138

Open alex-s-gardner opened 2 years ago

alex-s-gardner commented 2 years ago

For many applications there exist large archives of data that are continuously added to as time passes. Good examples are climate reanalysis, remote sensing data, ocean records, etc.

The problem: Right now the chunking of a Zarr dataset can be optimized for appending time-slice layers, which makes it efficient to extend the dataset as it grows with time, or it can be optimized for extracting time series from the cube... but not both. Thus, to maintain efficient time-series access, the Zarr cube might need to be rewritten entirely each time a new time layer is added. I have spent far too long trying to find a chunking that is a good compromise between appending and access, but no acceptable compromise exists for large datasets.
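To make the tradeoff concrete, here is a small back-of-the-envelope sketch. The cube dimensions and both chunk shapes below are illustrative assumptions, not taken from any real dataset, and `chunks_touched` is a hypothetical helper:

```python
# Hypothetical cube: 10,000 time steps over a 2,000 x 2,000 spatial grid.
T, Y, X = 10_000, 2_000, 2_000

def chunks_touched(shape, chunk, region):
    """Count the chunks a hyper-rectangular region (list of (start, stop)
    per dimension) intersects, for a regular chunk grid."""
    n = 1
    for (lo, hi), c in zip(region, chunk):
        n *= (hi - 1) // c - lo // c + 1
    return n

append_friendly = (1, 500, 500)       # one time slice per chunk
series_friendly = (10_000, 10, 10)    # whole time axis inside each chunk

# Appending a single new time slice writes/rewrites this many chunks:
new_slice = [(T - 1, T), (0, Y), (0, X)]
print(chunks_touched((T, Y, X), append_friendly, new_slice))   # 16 new chunks
print(chunks_touched((T, Y, X), series_friendly, new_slice))   # 40,000 chunks rewritten

# Reading one pixel's full time series touches this many chunks:
one_pixel_series = [(0, T), (0, 1), (0, 1)]
print(chunks_touched((T, Y, X), append_friendly, one_pixel_series))  # 10,000 chunks
print(chunks_touched((T, Y, X), series_friendly, one_pixel_series))  # 1 chunk
```

Whichever fixed chunking you pick, one of the two operations pays a four-orders-of-magnitude penalty.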

Potential solution: If Zarr allowed variable chunking it could overcome many of these issues. I can envision how I would do this with two separate Zarr files, so I suspect it could be implemented in a single file. Here's the two-file approach:

  1. write a large Zarr with chunking optimized for time series extraction containing a large archive of all existing data [let's call this the "base cube"]
  2. write a smaller Zarr with chunking optimized for appending and append new time slices as they become available. [let's call this the "surface cube"]

This would allow one to easily append to the data cube without taking a big hit on time-series access and would not require a full rewrite of the data. Once a "surface cube" becomes large enough it could be merged into the "base cube" to keep the cube performant.
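The two-file approach above can be sketched with NumPy arrays standing in for the two Zarr arrays (real code would use `zarr.open` with time-series-friendly chunks for the base and slice-friendly chunks for the surface; `append_slice` and `read_series` are hypothetical helpers):

```python
import numpy as np

# Stand-ins for the two stores: a time-series-optimized "base cube" and an
# append-optimized "surface cube" that starts empty.
base = np.arange(8 * 4 * 4, dtype=np.int64).reshape(8, 4, 4)
surface = np.empty((0, 4, 4), dtype=base.dtype)

def append_slice(surface, new_slice):
    # Cheap append: only the small surface cube is touched; the base cube
    # is never rewritten.
    return np.concatenate([surface, new_slice[None]], axis=0)

def read_series(base, surface, y, x):
    # A reader stitches the two cubes together along the time axis.
    return np.concatenate([base[:, y, x], surface[:, y, x]])

surface = append_slice(surface, np.full((4, 4), 99, dtype=base.dtype))
series = read_series(base, surface, 0, 0)
print(series.shape)   # (9,): 8 base steps + 1 appended step
```

The cost of the hack is visible even in the sketch: every reader has to know about both arrays and do the stitching itself.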

jakirkham commented 2 years ago

(transferred to the spec repo where these kinds of things are typically discussed)

jakirkham commented 2 years ago

Thanks for the feedback!

For this use case, has using separate Arrays been considered? If so, what are the strengths/weaknesses of that approach?

alex-s-gardner commented 2 years ago

@jakirkham without another option we are looking at writing 2 arrays (base + surface). The benefit is to maximize both time-series read efficiency and time-slice appends. The disadvantage is that all of our Zarr readers need to be written with checks for both "base" and "surface" files, something that is OK at the project level but not great for our users, who won't understand what or why. Also, writing a second Array results in a large duplication of metadata. I see the 2-Array solution as a hack that could ideally be handled internally by Zarr... and I guess we would need to split the time dimension into 2 arrays, plus any other variables that contain information about the time slices. For us that's information like "image acquisition date", "image processed date", "sensor", etc.

jstriebel commented 1 year ago

This is the relevant ZEP 3 that proposes variable sized chunks: https://zarr.dev/zeps/draft/ZEP0003.html. Would this potentially solve this use case? Then we should also link this issue there. cc @martindurant
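Under ZEP0003's rectilinear grid, each dimension carries a *list* of chunk sizes instead of a single size, so resolving which chunk holds a given index becomes a search over cumulative chunk edges. A minimal sketch, with illustrative sizes and a hypothetical `chunk_of` helper:

```python
import numpy as np

# Three full time chunks plus a small trailing "append" chunk.
time_chunk_sizes = [1000, 1000, 1000, 7]
edges = np.cumsum([0] + time_chunk_sizes)   # array([0, 1000, 2000, 3000, 3007])

def chunk_of(index):
    """Return (chunk number, offset within that chunk) for a global index
    along this axis, using binary search over the chunk edges."""
    c = int(np.searchsorted(edges, index, side="right")) - 1
    return c, index - int(edges[c])

print(chunk_of(2500))   # → (2, 500)
print(chunk_of(3003))   # → (3, 3), inside the small trailing chunk
```

The main cost relative to regular grids is this lookup: a division becomes a binary search, and the chunk-size lists must live in the array metadata.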

martindurant commented 1 year ago

The idea of an accumulation area, ready to be turned into a full-sized chunk, is not really covered in ZEP0003, but it is another interesting thing the proposal might enable; it may be worth adding words to the ZEP explaining this workflow. In general, if variable chunks are allowed, then having full chunks plus an append "surface" chunk is fine, but the convenience of consolidating the surface when it is full needs to be implemented somewhere.
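That consolidation step could look something like the following sketch, again with NumPy arrays standing in for Zarr arrays (`consolidate` and `chunk_len` are assumptions, not an existing API; with ZEP0003-style variable chunks this would ideally be a rewrite of only the consolidated span plus a metadata update, not a full rewrite):

```python
import numpy as np

def consolidate(base, surface, chunk_len):
    """Fold any full base-sized time chunks from the surface into the base.

    chunk_len is the base cube's time-chunk length. Only whole chunks move;
    the remainder stays in the surface as the accumulation area.
    """
    n_full = (surface.shape[0] // chunk_len) * chunk_len
    if n_full:
        base = np.concatenate([base, surface[:n_full]], axis=0)
        surface = surface[n_full:]
    return base, surface

# Example: base holds two full chunks of 4; surface has accumulated 6 slices.
base = np.zeros((8, 2, 2))
surface = np.ones((6, 2, 2))
base, surface = consolidate(base, surface, chunk_len=4)
print(base.shape, surface.shape)   # (12, 2, 2) (2, 2, 2)
```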

alex-s-gardner commented 1 year ago

Has this issue been addressed? I'm wondering if it also needs to be raised for the geozarr spec https://github.com/zarr-developers/geozarr-spec?

martindurant commented 1 year ago

There has been no progress on getting this spec in, or any proposed changes to it either. Maybe it's up to me to propose an implementation in zarr-python to move forward, but I don't know when I can find the time for that.

From brief reading, geozarr-spec seems to assume the zarr spec, and I don't see anything talking about the chunking strategy explicitly. I suppose variable-chunking may be immediately incompatible with multiscale pyramids (at least, you cannot simply downsample by a constant factor).

alex-s-gardner commented 1 year ago

@martindurant currently this is the biggest complication in our workflow, as Zarr stores optimized for time-series access become very expensive to update. If you don't have time, would you be able to provide guidance/advice on the steps needed to move towards a solution?

rabernat commented 1 year ago

I personally think that seeing a draft implementation of ZEP-3 would go a long way towards building momentum around it. My conclusion from the recent Zarr spec activities is that it doesn't make sense to develop and approve specs before any implementation has started.

meggart commented 1 year ago

@alex-s-gardner I have some interest in that use case as well and think it should be possible to implement ZEP-3 in Zarr.jl within a reasonable amount of time, so you could start experimenting with this, at least in your Julia workflows. However, the data produced by this would then of course not be readable by other zarr implementations, including zarr-python. Don't know if this would be of any help...

rabernat commented 1 year ago

> However, the data produced by this would then of course not be readable by other zarr implementations, including zarr-python.

I think this is a fine place to start. Once we have a working implementation in any language, other languages can use that as a reference for their own implementation.

martindurant commented 1 year ago

On steps, I see two main things:

alex-s-gardner commented 1 year ago

@martindurant if we did this would it also accommodate kerchunking of files with variable chunk size?

alex-s-gardner commented 1 year ago

Not to get too far off topic, but is it possible to build a Zarr of Zarrs? For our specific use case the ideal implementation would be a Zarr of Zarrs. The master Zarr would include the re-chunked, time-series-optimized data and a kerchunk Zarr pointing to those files in the original catalogue that had not yet been re-chunked... I can see this as the optimal approach for any "living" dataset.

martindurant commented 1 year ago

> would it also accommodate kerchunking of files with variable chunk size

Yes, that's exactly the origin of this from my point of view. Of course there are other reasons too, detailed in the doc.

> is it possible to build a Zarr of Zarrs

Yes! You could mix zarrs and non-zarrs too. There is already a ZarrToZarr in kerchunk.

alex-s-gardner commented 1 year ago

@agoodm we can start from this thread

okz commented 1 year ago

Coming from the real-time data and instruments side of the software, I wanted to note here that Zarr ticks a lot of boxes as a direct instrument data output format, but the fixed-chunking requirement makes a few things difficult.

An instrument's configuration, software implementation, or hardware may change throughout its lifecycle, leading to variable data lengths. Fixed chunk sizes mean the data will likely need pre-processing, and likely duplication, to avoid touching the original data.

Looking forward to variable length chunking support.