zarr-developers / zarr-specs

Zarr core protocol for storage and retrieval of N-dimensional typed arrays
https://zarr-specs.readthedocs.io/
Creative Commons Attribution 4.0 International

ZEP0002 Review #254

Closed normanrz closed 8 months ago

normanrz commented 1 year ago

Hi everyone,

I hope you’re doing well.

Thank you all for taking the time to review the ZEP0001 and V3 specification. The V3 specification is approved and accepted by the ZSC, ZIC and the Zarr community.

The initial discussion on sharding dates back to 11/2021; please see https://github.com/zarr-developers/zarr-python/issues/877. There have been major developments since the proposal of sharding, some of which are:

Now, we want to put forth the ZEP0002 - Sharding Codec for voting.

We have created this issue to track the approvals from the ZSC, ZIC and the broader Zarr community. Specific technical feedback on sharding should be made via narrowly scoped issues on the zarr-specs repository that link to this issue.

Now, according to the section 'How does a ZEP become accepted' in ZEP0000, a ZEP must satisfy three conditions for approval:

As an implementation council member, you have three options for your vote:

We request you, the ZIC and the ZSC, to review ZEP0002 and let us know your thoughts. We've listed the steps for reading and fully understanding the sharding proposal. They are as follows:

We understand that the whole process takes time, so we’ve decided to have a relaxed timeline for ZEP0002 voting. We’d appreciate your vote by 31 October 2023, 23:59:59 AoE.

Example implementations

Please let us know if there are any questions. Thank you for your time.

Voting status:

| GitHub user | Project | Vote |
| --- | --- | --- |
| @joshmoore | ZSC | YES |
| @ryan-williams | ZSC | YES |
| @alimanfoo | ZSC | YES |
| @rabernat | ZSC | YES |
| @jakirkham | ZSC | YES |
| @andersy005 | freeman-lab/zarr-js | YES |
| @axtimwalde | saalfeldlab/n5-zarr | YES |
| @aschampion | sci-rs/zarr | YES |
| @meggart | JuliaIO/Zarr.jl | YES |
| @jbms | google/tensorstore | YES |
| @constantinpape | constantinpape/z5 | ABSTAIN |
| @WardF | Unidata/netcdf-c + Unidata/netcdf-java | YES |
| @davidbrochart | xtensor-stack/xtensor-zarr | ABSTAIN |
| @grlee77 | zarr-developers/zarr-python | YES |
| @manzt | gzuidhof/zarr.js | YES |

normanrz commented 1 year ago

CC: @zarr-developers/implementation-council @zarr-developers/steering-council

clbarnes commented 1 year ago

Just a quick sanity check - this spec depends heavily on storage backends supporting Range requests; suffixes in particular (for getting the shard location footer). Over on https://github.com/apache/arrow-rs/issues/4611, it's been suggested that common stores like S3 don't support suffixes. I don't have a store/account I could use to test - could anyone give it a go? If not, it's obviously a HEAD then a non-suffix GET Range request in serial, which isn't ideal. But possibly not that big an issue when accounting for the lack of multipart range support.

From https://github.com/apache/arrow-rs/issues/4612 it looks like multipart ranges aren't supported by any cloud provider (I have yet to interact with a server implementation which does, tbh, and you need to jump through a lot of hoops to interpret the response since, according to the HTTP spec, they can basically send you whatever they want), and that particular library gets around it by making a bunch of different requests in parallel. Presumably that isn't a blocker in and of itself, although making hundreds or thousands of requests to pull down a region of a single shard is hardly ideal, especially as any given backend could just decide to send you the entire shard, or indeed just about any subset of it, with each request!

normanrz commented 1 year ago

> Just a quick sanity check - this spec depends heavily on storage backends supporting Range requests; suffixes in particular (for getting the shard location footer). Over on this issue apache/arrow-rs#4611 , it's been suggested that common stores like S3 don't support suffixes. I don't have a store/ account I could use to test - could anyone give it a go? If not it's obviously a HEAD then a non-suffix GET Range request in serial, which isn't ideal. But possibly not that big an issue when accounting for the lack of multipart range support.

All major cloud providers (including S3, GCS, Azure) and static-file HTTP servers support requesting suffixes. Here is an example against S3:

```
curl -H 'Range: bytes=-524292' https://static.webknossos.org/data/zarr_v3/l4_sample/color/1/c/0/3/3/1 | wc -c
```

> From apache/arrow-rs#4612 it looks like multipart ranges aren't supported by any cloud provider (have yet to interact with a server implementation which does tbh, and you need to jump through a lot of hoops to interpret the response as, according to the HTTP spec, they can basically send you whatever they want) and that particular library gets around it by making a bunch of different requests in parallel. Presumably that isn't a blocker in and of itself, although making hundreds or thousands of requests to pull down a region of a single shard is hardly ideal, especially as any given backend could just decide to send you the entire shard, or indeed just about any subset of it, with each request!

Multipart ranges are not as widely supported; I know that S3 doesn't support them. Issuing single range requests per inner chunk is equivalent to using Zarr without sharding, so I would argue sharding doesn't make the situation worse. On the contrary, implementations can choose to coalesce the byte ranges of multiple chunks into single requests to reduce the number of requests. This works especially well if the chunks are laid out in an order that matches the common access pattern (e.g. Z-ordering). Implementations can also download entire shards. In our standard configuration that means reducing the number of requests by a factor of ~32,000.
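To make the coalescing idea concrete, here is a minimal sketch (not taken from any particular implementation; the function name and the `max_gap` threshold are made up for illustration) of merging the byte ranges of requested inner chunks, as read from a decoded shard index, into fewer range requests:

```python
from typing import List, Tuple

def coalesce_ranges(
    ranges: List[Tuple[int, int]],  # (offset, nbytes) per requested inner chunk, from the shard index
    max_gap: int = 1 << 20,         # tolerate up to 1 MiB of unneeded bytes between merged chunks
) -> List[Tuple[int, int]]:
    """Merge byte ranges that lie close together into fewer, larger ranges."""
    merged: List[Tuple[int, int]] = []
    for offset, nbytes in sorted(ranges):
        if merged:
            prev_offset, prev_nbytes = merged[-1]
            prev_end = prev_offset + prev_nbytes
            if offset - prev_end <= max_gap:
                # Grow the previous range so that it also covers this chunk.
                merged[-1] = (prev_offset, max(prev_nbytes, offset + nbytes - prev_offset))
                continue
        merged.append((offset, nbytes))
    return merged
```

For reference, with the configuration mentioned later in this thread (inner chunks of [1, 32, 32, 32] inside shards of [1, 1024, 1024, 1024]), a shard holds 32 × 32 × 32 = 32768 inner chunks, which is where the ~32,000 figure comes from.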

clbarnes commented 1 year ago

> Issuing single range requests per inner chunk is equivalent to using Zarr without sharding.

Yes, so long as the existence of sharding doesn't change users' heuristics on (sub)chunk layout. I suppose it comes down to whether you see sharding as a way to coalesce your small chunks into single files, or as a way to access inner regions of your large chunks!

> In our standard configuration that means reducing the number of requests by a factor of ~32,000.

The "worst case" I was thinking about was when you need about half the chunk - clients may want some internal strategy trading off making thousands of small requests to download just what you need, or downloading considerably more than you need in 1 request to then slice the result.

JackKelly commented 1 year ago

> clients may want some internal strategy trading off making thousands of small requests to download just what you need, or downloading considerably more than you need in 1 request to then slice the result

Yes, and perhaps the Zarr implementation would allow the user to specify the threshold. e.g. "I'm reading data from a RAID6 array of HDDs, which has very high bandwidth but also terrifyingly high latencies for random reads, so it's faster to read sequentially, even if I throw away 90% of the chunks after reading them from disk".
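As a rough sketch of such a threshold (purely illustrative; the parameter name and the 50% default are not from any implementation), a client could compare the bytes it actually needs against the shard size and fall back to one whole-shard read when per-chunk requests would not save much:

```python
from typing import List, Tuple

def plan_reads(
    needed_ranges: List[Tuple[int, int]],  # (offset, nbytes) of the inner chunks we need
    shard_nbytes: int,                     # total size of the shard object
    max_waste_fraction: float = 0.5,       # hypothetical user-tunable knob
) -> List[Tuple[int, int]]:
    """Choose between one whole-shard read and many per-chunk range reads."""
    needed_bytes = sum(nbytes for _, nbytes in needed_ranges)
    if needed_bytes >= (1.0 - max_waste_fraction) * shard_nbytes:
        return [(0, shard_nbytes)]   # one large request, some wasted bytes
    return sorted(needed_ranges)     # many small requests, no wasted bytes
```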

constantinpape commented 1 year ago

Hi everyone, I will go on vacation next week, and don't have time to really read up on the changes here. Overall I am very excited about sharding and its adoption in zarr, as this is a feature that I would extensively use myself / in our group down the line. I just have one high-level question before a formal vote: How do the changes here affect zarr without sharding? I assume the default will still be to use normal chunking, and libraries that don't support sharding yet can just use the rest of the zarr spec as usual? (And would need to be adapted to raise an appropriate exception if it encounters a sharded zarr array)

normanrz commented 1 year ago

Sharding is specified as a new codec. That means that Zarr without sharding is not affected. Libraries that don't support it can still use the normal chunking. Of course, libraries that don't support sharding will not be able to open/read arrays that have been created with sharding.
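For illustration, this is roughly what the codec entry in an array's zarr.json looks like when sharding is used (the shapes and inner codecs here are made-up examples; see the ZEP0002 spec text for the normative layout):

```python
# Sketch of the sharding codec entry within an array's "codecs" list.
sharding_codec = {
    "name": "sharding_indexed",
    "configuration": {
        # Shape of the inner chunks inside each shard; the shard shape itself
        # is the array's regular chunk shape from the chunk grid.
        "chunk_shape": [32, 32, 32],
        # Codec pipeline applied to every inner chunk.
        "codecs": [
            {"name": "bytes", "configuration": {"endian": "little"}},
            {"name": "gzip", "configuration": {"level": 5}},
        ],
        # Codec pipeline applied to the shard index.
        "index_codecs": [
            {"name": "bytes", "configuration": {"endian": "little"}},
            {"name": "crc32c"},
        ],
    },
}
```

A library that does not recognize `sharding_indexed` would fail on this codec entry, which is the behavior described above.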

constantinpape commented 1 year ago

Thanks for the clarification @normanrz! Then I vote ABSTAIN (as I don't have the capacity to implement it in z5 myself), but with full support for going in this direction!

manzt commented 1 year ago

I vote YES! Currently migrating zarrita.js (the future of zarr.js) to ZEP0001 and I hope to have the capacity to support ZEP0002.

jbms commented 1 year ago

I vote yes for tensorstore and neuroglancer!

meggart commented 1 year ago

I vote YES for Zarr.jl. It is definitely something I would need and use, but I cannot promise an implementation timeline.

manzt commented 1 year ago

I've implemented the sharding codec in manzt/zarrita.js v0.3.2. It is compatible with the latest from scalableminds/zarrita.

jbms commented 1 year ago

tensorstore now supports zarr v3 with sharding.

andersy005 commented 12 months ago

I vote yes for zarr-js. We recently added experimental support for Zarr v3, including support for the sharding codec.

jbms commented 12 months ago

Neuroglancer also now supports zarr v3 with sharding.

jbms commented 12 months ago

Is there by any chance an interesting public v3 sharded image dataset that could be used as an example?

normanrz commented 12 months ago

We have a public EM dataset at https://static.webknossos.org/data/zarr_v3/l4dense_motta_et_al_demo. Here is the data in Webknossos: https://webknossos.org/links/yo8nbTpjtSm_F210

There is no multiscale metadata for v3, yet. So this is what the hierarchy looks like:

├── color
│   ├── 1
│   ├── 2-2-1
│   ├── 4-4-2
│   ├── 8-8-4
│   ├── 16-16-8
│   ├── 32-32-16
│   ├── 64-64-32
│   ├── 128-128-64
│   ├── 256-256-128
│   ├── 512-512-256
│   └── 1024-1024-512
└── segmentation
    ├── 1
    ├── 2-2-1
    ├── 4-4-2
    ├── 8-8-4
    ├── 16-16-8
    ├── 32-32-16
    ├── 64-64-32
    ├── 128-128-64
    ├── 256-256-128
    ├── 512-512-256
    └── 1024-1024-512

The arrays use chunk shape [1, 32, 32, 32] and shard shape [1, 1024, 1024, 1024]. The EM (color) data is 195G and the segmentation is 10G in total.

EM data by Motta et al., Science 2019, segmentation by scalable minds.

jbms commented 11 months ago

Thanks. I was able to access that dataset in Neuroglancer.

Note though that there appears to be a misconfiguration with your server for e.g. https://static.webknossos.org/data/zarr_v3/l4dense_motta_et_al_demo/color/1/zarr.json (which seems to involve CloudFront and S3). Specifically, the issue is that CloudFront is caching the responses without regard to the Origin request header, but the response (namely the Access-Control-Allow-Origin header) depends on the Origin request header. In particular, if you make a request with Origin: https://whatever.com, it will return Access-Control-Allow-Origin: https://whatever.com. However, that response will be cached, and a subsequent request with a different origin will still receive Access-Control-Allow-Origin: https://whatever.com, which won't have the desired effect. Since these requests do not require cookies to be sent, it would also work to return Access-Control-Allow-Origin: *, which would not need to depend on the Origin.
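A quick way to observe the behavior described here (a hypothetical check, not part of any Zarr tooling) is to request the same object with two different Origin headers and compare the Access-Control-Allow-Origin that comes back; with a cache that ignores Origin, the second response can still echo the first origin:

```python
import requests

url = "https://static.webknossos.org/data/zarr_v3/l4dense_motta_et_al_demo/color/1/zarr.json"
for origin in ("https://first.example", "https://second.example"):  # made-up origins
    resp = requests.get(url, headers={"Origin": origin})
    print(origin, "->", resp.headers.get("Access-Control-Allow-Origin"))
```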

As far as OME goes, I know you are working on an update to the OME-zarr spec for zarr v3. Neuroglancer also supports the existing OME-zarr metadata for zarr v3, exactly as it is supported for zarr v2.

jbms commented 11 months ago

Note: If you have a strong objection to Neuroglancer supporting the existing OME-zarr metadata with zarr v3, I am open to changing that.

normanrz commented 11 months ago

Thanks! I fixed the header caching and added OME v0.4 metadata to color and segmentation.

jbms commented 11 months ago

Thanks, I'm able to view it in neuroglancer now. Regarding the OME metadata, I think you need to add a half-voxel translation to each scale to account for the fact that OME assumes the origin is in the center of a voxel, in order to properly align the scales.

WardF commented 11 months ago

I vote YES on behalf of the Unidata seat at the table. Thanks!

normanrz commented 11 months ago

Thanks to everybody who already voted! Everybody else, I am looking forward to your votes. Please note that the deadline for voting is in less than a week. Thanks!

cc @zarr-developers/implementation-council @zarr-developers/steering-council

rabernat commented 11 months ago

I vote in favor.

axtimwalde commented 11 months ago

I vote in favor to not further block this extension.

I find some aspects of the current extension proposal unfortunate and hope for an improved sharding extension in the future. Here are my opinions:

  1. The line of argument for introducing shards is not compelling and lacks quantifiable justification: e.g. on a filesystem that can store 2.4TB, 10 million files are not too many files. Cloud storage providers, in our experience, do not currently impose limits on the number of objects and keys.
  2. The specification as a codec for arrays of dtype causes two issues:
    1. Since a shard is now a chunk of an array of dtype, implementers of extensible codec readers could fall back to use-case scenario 1, i.e. read and decompress the entire shard. In that scenario the shard is a chunk, and existing codecs often already create internal compression blocks, so shards are obsolete. In any implementation of a zarr library that does not require copy-conversion of chunks after decompression, reading shards completely is the same as reading chunks, i.e. shards are obsolete.
    2. The fact that the chunk size in the shard has to align with the shard size by spec leaves room for complication and error. It would be preferable to store the size of the grids in less fragile ways, e.g. storing an array of shards of dtype as (1) an array of chunks of chunks of dtype or (2) an array of arrays of chunks of dtype would simplify and generalize this. Both ideas can be stacked, so shards of shards would be possible. I understand that this is difficult with the v3 spec (in particular the use of numcodec strings for dtypes) at this time and recognize this difficulty by voting in favor of this extension.
  3. Storing the fixed-size index at the end of the shard instead of at the beginning feels like the worst of two possible decisions. It requires that the size of the stream is known to read just the index, and it requires that the entire index must be rewritten if a chunk grows (instead of just all the following chunks).
  4. The documentation about checksums is incomplete and confusing.

jbms commented 11 months ago

> I vote in favor to not further block this extension.
>
> I find some aspects of the current extension proposal unfortunate and hope for an improved sharding extension in the future. Here are my opinions:
>
> 1. The line of arguments for introducing shards is not compelling and lacks quantifiable justification: e.g. on a filesystem that can store 2.4TB, 10mio files are not too many files. Cloud storage providers, in our experience, do not currently impose limits on the number of objects and keys.

While there is no limit on the number of objects, write throughput is higher when writing larger objects, operation costs are lower, and larger objects also help avoid limits on requests per second. Additionally, by storing multiple sub-chunks within a single file, it is possible to coalesce reads of multiple sub-chunks if they are stored next to each other, which improves read throughput, avoids running into limits on requests per second, and also reduces operation costs.

> 2. The specification as a codec for arrays of dtype causes two issues:
>    1. Since a shard is now a chunk of an array of dtype, implementers of extensible codec readers could fall back to use case scenario 1, i.e. read and decompress the entire shard. In that scenario, the shard is a chunk, existing codecs often create internal compression blocks already, so shards are obsolete. In any implementation of a zarr-library that does not require copy-conversion of chunks after decompression reading shards completely is the same as reading chunks, i.e. shards are obsolete.

It is true that if partial reads from the underlying kvstore are not supported, then this sharding codec is not particularly useful. But the solution is to implement partial reads. We can't force any particular implementation to be high quality. As far as a different codec that does sub-chunking, do you have a particular one in mind? blosc and the various blosc2 formats do some forms of sub-chunking, but then we lose the ability to use arbitrary sub-chunk codecs (e.g. image formats, segmentation-specific compression, etc.), and instead are limited to what blosc supports.

> The fact that the chunk-size in the shard has to align with the shard-size by spec leaves room for complication and error. It'd be preferable to store the size of the grids in less fragile ways, e.g. storing an array of shards of dtype as (1) an array of chunks of chunks of dtype or (2) as an array of arrays of chunks of dtype would simplify and generalize this. Both ideas can be stacked, so shards of shards would be possible. I understand that this is difficult with the v3 spec (in particular the use of numcodec strings for dtypes) at this time and recognize this difficulty by voting in favor of this extension.

Note that we are no longer using the numpy type string syntax to specify data types in zarr v3.

The sharding_indexed codec introduced by this ZEP can be stacked, and while stacking may be rare, it can be useful to do so to avoid an excessively large shard index.

I think you are suggesting that we could support nested arrays, where the nested arrays may not all be the same size (if the nested arrays are all the same size, then it would be equivalent to just adding additional dimensions). I can see that in combination with:

> Storing the fix-size index at the end of the shard instead of at the beginning feels like the worst of two possible decision. It requires that the size of the stream is known to read just the index, and it requires that the entire index must be rewritten if a chunk grows (instead of just all the following chunks).

By writing it at the end, you can begin streaming chunks before you know the size of all chunks, and then write the index at the end. In order to write the index at the beginning you need to either buffer the entire shard, or reserve space at the beginning and update it afterwards (which has some additional challenges). The assumption was that sharding would be used in cases where the underlying store supports random access, as otherwise partial reads would not be possible. Note that with HTTP and cloud storage, for example, you can request the last N bytes without knowing the size.

> The documentation about checksums is incomplete and confusing.

Perhaps we can clarify this. Note that the sharding_indexed codec itself no longer does anything with checksums. If you want checksums of the index and/or sub-chunks, you can layer in a checksum codec (currently just crc32c).

mkitti commented 11 months ago

> As far as a different codec that does sub-chunking, do you have a particular one in mind?

Zstandard, for one, has internal blocks (https://github.com/facebook/zstd/blob/dev/doc/zstd_compression_format.md#blocks) which could possibly be decompressed with a dictionary.

In the larger picture, our experience with neuroglancer precomputed has already demonstrated many of @jbms's points in practice. That said, I would very much like to see benchmark comparisons to provide guidance about when this becomes useful in both the cloud and file system contexts.

axtimwalde commented 11 months ago

> By writing it at the end, you can begin streaming chunks before you know the size of all chunks, and then write the index at the end. In order to write the index at the beginning you need to either buffer the entire shard, or reserve space at the beginning and update afterwards (which has some additional challenges). The assumption was that sharding would be used in cases where the underlying store supports random access, as otherwise partial reads would not be possible. Note that with HTTP and Cloud Storage, for example, you can request the last N bytes without knowing the size.

I understand that it can be done; it feels wrong though because it optimizes(?) for the hopefully rare case of writing shards in full. Since the index is fixed length, reserving the space in the front is not complicated (fill with -1) and it can be randomly accessed after streaming the chunks. Reading any part of the shard (the most frequent operation) with the index at the front would be easier because you read a fixed-size field first, then seek forward. Writing an arbitrary subset of the shard with the index in the front means reading the index, then writing the subset (could be just one chunk if it is smaller than or equal to the currently stored chunk, or all chunks after it), then updating the index where necessary. One could even wastefully add new chunks at the end and update only one index field until an explicit re-packing is requested. With the index at the end, you always have to write the entire index.

jbms commented 11 months ago

I have generally worked with storage systems like GCS and S3 that are mostly write-once, and with append-only distributed filesystems, and have always just written the entire shard at once.

However, I can see that for storage systems that permit random writes, the index at the start may be advantageous.

We could add an "index_location" configuration option that may be either "start" or "end", defaulting to "end". Would that address your concerns?

jbms commented 11 months ago

> > As far as a different codec that does sub-chunking, do you have a particular one in mind?
>
> Zstandard for one has internal blocks https://github.com/facebook/zstd/blob/dev/doc/zstd_compression_format.md#blocks which could be possibly decompressed with a dictionary.
>
> In the larger picture, our experience with neuroglancer precomputed has already demonstrated many of @jbms's points in practice. That said I would very much like to see benchmark comparisons to provide guidance about when this becomes useful in both the cloud and file system contexts.

Zstd blocks only accomplish the same thing as sharding if you accept some additional constraints: you can only sub-chunk along the single outermost dimension, and the encoding of each element must be fixed-size (no variable-length strings).

mkitti commented 11 months ago

I've had some similar questions as this ZEP was developed. Maybe one of the more important aspects of this proposal is that the sharding described here is implemented as a codec. There will likely be other sharding codecs; for example, a Zip file codec is likely forthcoming. I've also laid some groundwork for reusing HDF5 files.

> hopefully rare case of writing shards in full

A relatively common case of writing shards in full is at acquisition, where tiling or n-dimensional chunking is a deterministic step that can be done synchronously with acquisition. The Keller and Betzig Labs, as well as parts of the Allen Institute, can currently produce chunked data layouts in sharded formats as data is acquired.

If one is in full control of the acquisition process, writing the chunk index first is easier. The more common case is where we lack control of acquisition and proprietary formats may be involved. In this case, this sharding codec allows for the conversion of arbitrary acquisition files into Zarr shards without rewriting the primary data by appending the chunk index.

Another observation I made is that a modern HDF5 file (1.10+) containing a chunked dataset contains a chunk index in a very similar format. Creating a compatible Zarr shard from such an HDF5 file is just a matter of copying or moving that chunk index to the end of the file.

> Reading any part of the shard (the most frequent operation) with an index at the front would be easier because you read a fix size field first, then seek forward.

Jeremy alluded to HTTP range requests. A range request can take the following form, where providing a single negative integer retrieves that many bytes from the end of the resource:

Range: <unit>=-<suffix-length>

suffix-length: An integer indicating the number of (given) units at the end of the file to return.

This means that the last N bytes of a file containing the chunk index can be retrieved in a single request without first having to query the file length. In the cloud case, retrieving the last N bytes of a file seems to be as easy as retrieving the first N bytes of a file. Seeking to retrieve the last N bytes of a local file also does not seem particularly onerous.
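A minimal sketch of that read path (assuming the ZEP0002 index layout of two little-endian uint64 values, offset and length, per inner chunk, with the index at the end of the shard; the function name and URL handling are illustrative):

```python
import numpy as np
import requests

def read_shard_index(shard_url: str, chunks_per_shard: int) -> np.ndarray:
    """Fetch the shard index with a single suffix range request."""
    index_nbytes = 16 * chunks_per_shard  # (uint64 offset, uint64 nbytes) per inner chunk
    # If the index is additionally protected by the crc32c codec, 4 more
    # trailing bytes would need to be requested and verified/stripped here.
    resp = requests.get(shard_url, headers={"Range": f"bytes=-{index_nbytes}"})
    resp.raise_for_status()
    index = np.frombuffer(resp.content, dtype="<u8").reshape(chunks_per_shard, 2)
    # Entries of 2**64 - 1 mark inner chunks that are not present in the shard.
    return index
```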

mkitti commented 11 months ago

> We could add an "index_location" configuration option that may be either "start" or "end", defaulting to "end". Would that address your concerns?

Why not just allow for an arbitrary byte offset where 0 indicates the "start"?

jbms commented 11 months ago

> > We could add an "index_location" configuration option that may be either "start" or "end", defaulting to "end". Would that address your concerns?
>
> Why not just allow for an arbitrary byte offset where 0 indicates the "start"?

Since it is a configuration option, it would presumably have to be the same for all shards, which means that anything other than "start" or "end" is unlikely to be useful, and "end" couldn't be indicated by a fixed byte offset.

mkitti commented 11 months ago

Could "end" be the additive inverse of the byte length of the index? Basically, you would just forward that negative value as the Range field of the request. We could just make it the exact HTTP Range request value that one would use to obtain the index.

clbarnes commented 11 months ago

> The line of arguments for introducing shards is not compelling and lacks quantifiable justification: e.g. on a filesystem that can store 2.4TB, 10mio files are not too many files

Our institute uses CephFS for its large-scale cluster-accessible storage, which gets grumpy if you have too many files. Strictly speaking, I think that file counts only impact performance when there are too many in one directory (which is much more likely with V2-style dot-separated chunk indices), but in practice CephFS's administration tools impose quotas on the total number of files. (LMB example here - not sure how modern this setup is or how common similar types of storage are elsewhere)

normanrz commented 11 months ago

> > > We could add an "index_location" configuration option that may be either "start" or "end", defaulting to "end". Would that address your concerns?
> >
> > Why not just allow for an arbitrary byte offset where 0 indicates the "start"?
>
> Since it is a configuration option, it would presumably have to be the same for all shards, which means that anything other than "start" or "end" is unlikely to be useful, and "end" couldn't be indicated by a fixed byte offset.

I like the idea of having an index_location option. That would also be helpful for some of our use cases. @mkitti What would be your use case for a numeric value instead of start/end?

> 1. The line of arguments for introducing shards is not compelling and lacks quantifiable justification: e.g. on a filesystem that can store 2.4TB, 10mio files are not too many files. Cloud storage providers, in our experience, do not currently impose limits on the number of objects and keys.

Our original motivation: some file systems that hold multiple petabytes are configured with larger block sizes (e.g. 2MB). For the chunk sizes that we need for interactive visualization (e.g. 64×64×64) and assuming a uint8 data type, storing each chunk as one file (=262kB) would be wasteful. Also, prior to sharding we were constantly running out of inodes on our file system.

normanrz commented 11 months ago

I added a proof-of-concept for the index_location option to zarrita: https://github.com/scalableminds/zarrita/compare/sharding-index-location
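For readers following along, the option boils down to something like this when locating the index inside a shard (a sketch only; the option values mirror the discussion above, but the function name is illustrative, not the actual zarrita API):

```python
def index_byte_range(index_location: str, index_nbytes: int, shard_nbytes: int) -> tuple:
    """Return the (start, stop) byte range of the shard index within the shard."""
    if index_location == "start":
        return (0, index_nbytes)
    if index_location == "end":
        return (shard_nbytes - index_nbytes, shard_nbytes)
    raise ValueError(f"unsupported index_location: {index_location!r}")
```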

mkitti commented 11 months ago

> @mkitti What would be your use case for a numeric value instead of start/end?

  1. We could reference a chunk index in the middle of the shard.
  2. We could append a chunk to the end of the file without rewriting the chunk index.
  3. We could have more than one chunk index in a shard.
  4. For an HDF5 file, we may be able to access multiple datasets (arrays) within the file with just the sharding codec.

normanrz commented 11 months ago

> @mkitti What would be your use case for a numeric value instead of start/end?
>
> 1. We could reference a chunk index in the middle of the shard.
> 2. We could append a chunk to the end of the file without rewriting the chunk index.
> 3. We could have more than one chunk index in a shard.
> 4. For a HDF5 file, we may be able to access multiple datasets (arrays) within the file with just the sharding codec.

Since the codec configuration, including the index_location, is per array, these seem to me to be practical only for single-shard arrays?

jbms commented 11 months ago

Another consideration is that both start and end are relatively easy to support when writing but an arbitrary byte offset would be kind of tricky --- you may have to add padding bytes.

mkitti commented 11 months ago

Yes, I understand. I'm imagining a scenario where there is an existing HDF5 file that is symlinked and used as a shard in multiple arrays.

That said a few of Stephan's scenarios include the index somewhere in the middle of the file, so having the index at an arbitrary offset would help address those.

Implementations may need to calculate the length of the chunk index anyway, so is checking whether the location is the negative length really that different?

I think the main burden is having to validate another parameter.

My proposal is that we accept anything that is a valid value for an HTTP Range request.

mkitti commented 11 months ago

> Another consideration is that both start and end are relatively easy to support when writing but an arbitrary byte offset would be kind of tricky --- you may have to add padding bytes.

I'm confused. Wouldn't the writer be the one setting the indexLocation?

jbms commented 11 months ago

You might want to write to an existing array created by a different implementation.

Which of Stephan's points relates to having an index in the middle?

axtimwalde commented 11 months ago

@mkitti I agree that doing this as a codec enables other shard codecs in the future; this is great. Sorry if I wasn't clear and caused confusion. None of my scenarios demand an index in the middle, and I agree with everybody else that this sounds complicated and does not generalize across shards. Start and end as options sound great.

Thanks for the updates on the many-files argument. It'd be great if specific cases where this is relevant could be listed. For our use of AWS S3 and the institute-managed file system, we try to reach 1MB blocks and consult with the providers and administrators to make sure that this is ok. The streaming-speed argument holds and is most important for the streaming case (which can be lessened by parallel asynchronous access); random access of single chunks would not be accelerated by shards. I haven't found strict rules about the number of files/keys in the S3 or GC docs, but I probably haven't looked carefully enough. Generally, I believe it would be great to have concrete real-world examples where this is useful. @clbarnes's hostile-admin example may be a good one.

mkitti commented 11 months ago

There seems to be a general preference for the index to be at the end from the others, so if we ever needed to add chunks to the end of the file, then perhaps having the index in the middle is at least a temporary scenario before a rewrite.

My abstractions generally do not depend on the index anywhere in particular. Kerchunk exports chunk information to an external file. I also have a program to move HDF5's chunk index to an arbitrary location in the file.

My understanding is that the Betzig lab, or perhaps more specifically the Advanced Bioimaging Center at UC Berkeley, also encountered some issues with file limits on Lustre based clusters.

The practical limits probably should be directory based, but from a file system perspective.

The other complaint I have heard from storage admins is a relatively high amount of metadata IOPS hitting the file systems, basically stat calls. This problem is avoided on object storage systems due to the use of buckets rather than per-file access control.

mkitti commented 11 months ago

Searching around, I found another anecdote of someone running out of inodes when using Zarr: https://github.com/pangeo-data/pangeo/issues/659

The solution there was to use ZipStore.

Incidentally, zip files also have a central directory at the end of a file.

I'm also reminded here of @rabernat's benchmarks of the initial implementation in zarr-python. I would be interested in hearing how the other implementations of sharding have avoided the pitfalls that Ryan encountered.

davidbrochart commented 11 months ago

Hi everyone, I vote ABSTAIN, since I don't have time to work on xtensor-zarr anymore.

aschampion commented 11 months ago

Just to amplify @normanrz's example, since that was exactly what motivated my and @clbarnes's original interest in sharding. The block-allocation and inode limits of our various network filesystems meant that 2-4MB files were the best common denominator, but this made chunks far larger than optimal for remote visualization. I also see this as somewhat of an amelioration for getting in-memory chunk size to be comparable across multiple datasets for computational purposes, when compressed size may differ greatly for efficient storage purposes (e.g., raw microscopy vs. segmentation), or more generally, some accommodation for memory-awareness/striding.

I vote YES on behalf of sci-rs/zarr.

joshmoore commented 11 months ago

Happy 🎃, everyone.

I vote YES for the ZSC.

I also assume that there will be a follow-on PR to introduce the index location. The exact process for those adjustments is still a bit up in the air. I'd propose, as with https://github.com/zarr-developers/zarr-specs/pull/263, that we ping the ZIC for votes or vetoes there.

And in general, as with ZEP1, please keep any further clarifications and questions coming as implementations are written. But I think we’ll all be quite enthused to have another ZEP signed off on. Thanks, all!

> jbms: Note: If you have a strong objection to Neuroglancer supporting the existing OME-zarr metadata with zarr v3, I am open to changing that.

Though this is more a discussion for elsewhere, I personally don’t see any issue with having support for that combination, but I’d highly suggest we not expose the community to that mix until Norman’s NGFF spec is decided on (i.e. let’s not write or publish them)

joshmoore commented 11 months ago

> I vote YES for the ZSC.

Sorry for the confusion, I should have said, "as a member of the ZSC". I've updated the description but also pinged all of the remaining voters.