alex-s-gardner opened 2 years ago
(transferred to the spec repo where these kinds of things are typically discussed)
Thanks for the feedback!
For this use case, has using separate Arrays been considered? If so, what are the strengths/weaknesses of that approach?
@jakirkham without another option we are looking at writing 2 arrays (base + surface). The benefit is to maximize both time-series read efficiency and time-slice append. The disadvantage is that all of our Zarr readers need to be written with checks for both "base" and "surface" files, something that is OK at the project level but not great for our users, who won't understand what or why. Also, writing a second Array results in large duplication of metadata. I see the 2-Array solution as a hack that could ideally be handled internally to Zarr.... and I guess we would need to split the time dimension into 2 arrays... plus any other variables that contain information about the time slices... for us that's information like "image acquisition date", "image processed date", "sensor", etc.
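The base + surface workaround described above can be sketched with a toy 1-D model (all names here are hypothetical illustrations; a real store would hold Zarr chunks, not Python lists):

```python
# Toy 1-D model of the base + surface workaround (hypothetical layout):
# "base" holds full, time-series-optimized chunks; "surface" accumulates
# freshly appended time slices that have not been consolidated yet.
base = [[0, 1, 2, 3], [4, 5, 6, 7]]  # two full time chunks
surface = [8, 9]                     # recent slices, append-optimized

def append_slice(value):
    # Appending only touches the small surface store.
    surface.append(value)

def read_timeseries():
    # Every reader must know to stitch base and surface together --
    # the extra complication described in the comment above.
    out = [v for chunk in base for v in chunk]
    out.extend(surface)
    return out

append_slice(10)
print(read_timeseries())  # [0, 1, 2, ..., 10]
```

The read path is where the cost lands: every consumer of the data has to repeat the stitching logic, which is what makes handling this inside Zarr itself attractive.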
This is the relevant ZEP 3 that proposes variable sized chunks: https://zarr.dev/zeps/draft/ZEP0003.html. Would this potentially solve this use case? Then we should also link this issue there. cc @martindurant
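For reference, the core mechanism variable-sized chunks require -- mapping an array index onto per-dimension chunk sizes -- is a cumulative-offset lookup. A minimal sketch along one dimension (the chunk sizes are made up for illustration):

```python
from bisect import bisect_right
from itertools import accumulate

# Hypothetical variable chunk sizes along one dimension, in the spirit
# of ZEP0003: three full chunks plus a small trailing "surface" chunk.
chunk_sizes = [100, 100, 100, 7]
offsets = [0] + list(accumulate(chunk_sizes))  # [0, 100, 200, 300, 307]

def locate(i):
    """Map an array index to (chunk_number, offset_within_chunk)."""
    c = bisect_right(offsets, i) - 1
    return c, i - offsets[c]

print(locate(0))    # (0, 0)
print(locate(150))  # (1, 50)
print(locate(305))  # (3, 5)
```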
The idea of an accumulation area, ready for turning into a full sized chunk, is not really covered in ZEP0003, but it is another interesting thing that the proposal might enable. It might be worth adding words to the ZEP to explain this workflow there. In general, if variable chunks are allowed, then having full chunks and an append "surface" chunk is fine, but the convenience of consolidating the surface when it is full needs to be implemented somewhere.
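The consolidation step mentioned here could look roughly like the following toy sketch (a pure-Python stand-in for chunk writes; `CHUNK_T` and the store layout are assumptions, not anything specified in ZEP0003):

```python
CHUNK_T = 4          # assumed target full-chunk length along time
base, surface = [], []

def append_slice(value):
    surface.append(value)
    if len(surface) == CHUNK_T:
        # Surface is full: promote it to one new base chunk.
        # This rewrites a single chunk instead of the whole cube.
        base.append(list(surface))
        surface.clear()

for t in range(10):
    append_slice(t)

print(base)     # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(surface)  # [8, 9]
```

The open question raised above is exactly where this promotion logic should live: in the spec, in each implementation, or left to applications.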
Has this issue been addressed? I'm wondering if it also needs to be raised for the geozarr spec https://github.com/zarr-developers/geozarr-spec?
There has been no progress on getting this spec in, or any proposed changes to it either. Maybe it's up to me to propose an implementation in zarr-python to move forward, but I don't know when I can find the time for that.
From brief reading, geozarr-spec seems to assume the zarr spec, and I don't see anything talking about the chunking strategy explicitly. I suppose variable-chunking may be immediately incompatible with multiscale pyramids (at least, you cannot simply downsample by a constant factor).
@martindurant currently this is the biggest complication in our workflow, as Zarr stores optimized for time-series access become very expensive to update. If you don't have time, would you be able to provide guidance/advice on the steps needed to move towards a solution?
I personally think that seeing a draft implementation of ZEP-3 would go a long way towards building momentum around it. My conclusion from the recent Zarr spec activities is that it doesn't make sense to develop and approve specs before any implementation has started.
@alex-s-gardner I have some interest in that use case as well and think it should be possible to implement ZEP-3 in Zarr.jl within a reasonable amount of time, so you could start experimenting with this, at least in your Julia workflows. However, the data produced by this would then of course not be readable by other zarr implementations, including zarr-python. Don't know if this would be of any help...
However, the data produced by this would then of course not be readable by other zarr implementations, including zarr-python.
I think this is a fine place to start. Once we have a working implementation in any language, other languages can use that as a reference for their own implementation.
On steps, I see two main things:
@martindurant if we did this would it also accommodate kerchunking of files with variable chunk size?
Not to get too far off topic, but is it possible to build a Zarr of Zarrs? For our specific use case the ideal implementation would be a Zarr of Zarrs. The master Zarr would include the re-chunked, time-series-optimized data and a kerchunk Zarr pointing to those files in the original catalogue that had not yet been re-chunked.... I can see this as the optimal approach for any "living" dataset
would it also accommodate kerchunking of files with variable chunk size
Yes, that's exactly the origin of this from my point of view. Of course there are other reasons too, detailed in the doc.
is it possible to build a Zarr of Zarrs
Yes! You could mix zarrs and non-zarrs too. There is already a ZarrToZarr in kerchunk.
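At its core, a "Zarr of Zarrs" built with kerchunk is one reference set whose keys point into different underlying stores. A toy model of that idea (the URLs, byte offsets, and key names below are all made up for illustration):

```python
# Toy model of kerchunk-style reference sets: each key maps a chunk to
# (url, byte_offset, byte_length) inside some original file.
rechunked = {
    # Chunks of the re-chunked, time-series-optimized archive.
    "var/0.0": ("s3://archive/rechunked.zarr/var/0.0", 0, 4096),
}
unconsolidated = {
    # A chunk still living in an original catalogue file.
    "var/1.0": ("s3://catalogue/scene_2024.nc", 512, 4096),
}

# The "Zarr of Zarrs" is, in effect, one merged reference set that
# spans both stores; readers see a single logical array.
combined = {**rechunked, **unconsolidated}
print(sorted(combined))  # ['var/0.0', 'var/1.0']
```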
@agoodm we can start from this thread
Coming from the real-time data and instruments side of the software, I wanted to note here that Zarr ticks a lot of boxes as a direct instrument data output format, but the fixed chunking requirement makes a few things difficult.
An instrument's configuration, software implementation, or hardware may change over its lifecycle, leading to variable data lengths, and fixed chunk sizes mean the data will likely need pre-processing, and likely duplication to avoid touching the original data.
Looking forward to variable length chunking support.
For many applications there exist large archives of data that are continuously added to as time passes. Good examples are climate reanalysis, remote sensing data, ocean records, etc.
The problem: Right now the chunking of a Zarr dataset can be optimized for appending time-slice layers, which makes it efficient to grow the dataset over time, or it can be optimized for accessing time series from the cube... but not both. Thus, to maintain efficient time-series access, the Zarr cube might need to be rewritten entirely each time a new time layer is added. I have spent way too long trying to find a chunking that could be a good compromise between appending and access, but no acceptable compromise exists for large datasets.
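The append-vs-access tension can be quantified by counting how many chunks each operation touches; a small sketch (the cube size and chunk shapes below are hypothetical):

```python
def chunks_touched(chunks, sel):
    """Count how many chunks a hyperslab selection touches.
    sel is a list of (start, stop) index ranges, one per dimension."""
    n = 1
    for (start, stop), c in zip(sel, chunks):
        n *= (stop - 1) // c - start // c + 1
    return n

nt, ny, nx = 1000, 1000, 1000  # hypothetical cube: time, y, x

ts_chunks = (1000, 10, 10)     # time-series-optimized: long in time
ap_chunks = (1, 1000, 1000)    # append-optimized: one slice per chunk

# Appending one new time slice (the last time index, full spatial extent):
append_ts = chunks_touched(ts_chunks, [(nt - 1, nt), (0, ny), (0, nx)])
append_ap = chunks_touched(ap_chunks, [(nt - 1, nt), (0, ny), (0, nx)])

# Reading one full time series at a single pixel:
read_ts = chunks_touched(ts_chunks, [(0, nt), (0, 1), (0, 1)])
read_ap = chunks_touched(ap_chunks, [(0, nt), (0, 1), (0, 1)])

print(append_ts, read_ts)  # 10000 chunks rewritten per append, 1 per read
print(append_ap, read_ap)  # 1 chunk per append, 1000 chunks per read
```

Whichever layout is chosen, one of the two operations pays a four-orders-of-magnitude penalty, which is exactly the compromise the comment above says cannot be found.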
Potential solution: If Zarr allowed variable chunking it could overcome many of these issues. I can envision how I would do this with two separate Zarr files, so I suspect it could be implemented in a single file. Here's the 2-file approach:
This would allow one to easily append to the data cube without taking a big hit on time-series access and would not require a full rewrite of the data. Once a "surface cube" becomes large enough it could be consolidated into the "base cube" to keep the cube performant.