zarr-developers / zarr-specs

Zarr core protocol for storage and retrieval of N-dimensional typed arrays
https://zarr-specs.readthedocs.io/
Creative Commons Attribution 4.0 International
spec: add a motivation for the c prefix on chunk keys #271

Closed d-v-b closed 11 months ago

d-v-b commented 1 year ago

I added a motivation for the c prefix on chunk keys. I'm happy to modify this if the reported motivation is factually or stylistically incorrect :)

LDeakin commented 1 year ago

Just checking, doesn't your note only apply when the separator is /?

d-v-b commented 1 year ago

@LDeakin you're right. The spec indicates that for the dimension separator ".", chunk keys look like c.1.23.45, which offers no benefit when listing the contents of a directory (and the "." separator also puts the most objects under a single prefix). Either I have misunderstood the purpose of the "c" prefix, or the spec should be altered to match the purpose indicated in my note, i.e. by stipulating that chunk keys always start with "c/" instead of "c".

Can you clarify, @jbms?

jbms commented 1 year ago

If "c/" were used as the prefix in all cases, then on storage systems that have a concept of directories, creating a zarr array would require a minimum of 2 directories, rather than 1 as in zarr v2. That may be a significant added cost when creating a large number of arrays on storage systems with a high per-directory cost.

If you are using "/" as the separator, then you will already have to create a lot of directories, and one more would probably not be significant.
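
To make the directory count concrete, here is a minimal sketch that lists the directories a directory-based store would have to create for a handful of keys. The helper name `directories_needed` and the sample keys are illustrative assumptions, not part of the spec or of any zarr library; the `always_c_slash` layout is the hypothetical "always use c/" variant discussed above.

```python
from pathlib import PurePosixPath


def directories_needed(keys):
    """Return the directories (relative to the array root) needed to hold `keys`.

    Hypothetical helper for illustration only.
    """
    dirs = set()
    for key in keys:
        # The parents of "c/0/1" are "c/0", "c" and "." (the array root itself).
        dirs.update(str(parent) for parent in PurePosixPath(key).parents)
    return dirs


# Keys for a small 2-D array, written relative to the array root.
dot_keys = ["zarr.json", "c.0.0", "c.0.1"]        # separator "."
slash_keys = ["zarr.json", "c/0/0", "c/0/1"]      # separator "/"
always_c_slash = ["zarr.json", "c/0.0", "c/0.1"]  # hypothetical "c/" prefix with "." separator

print(len(directories_needed(dot_keys)))        # 1: only the array root
print(len(directories_needed(slash_keys)))      # 3: root, "c", "c/0"
print(len(directories_needed(always_c_slash)))  # 2: root and "c"
```

With the "." separator, the "c" prefix keeps the array in a single directory, which is the cost argument made above.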

d-v-b commented 1 year ago

@jbms how would you describe the overall purpose of the "c" prefix then?

jbms commented 1 year ago

I wasn't strongly for or against it. But there was a desire to have a separate prefix for chunks to keep the key space more organized and easier for users to browse when using directories. That is accomplished when using a separator of "/". When using a separator of ".", there is still a separate prefix, although for directory-based stores it won't be too helpful. That way, we have the option to avoid the overhead of an additional directory without introducing an additional option. Using a prefix of "c" also means we don't need a special case for zero dimensions: the single chunk is just "c". In zarr v2 we use "0" as a special case.
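
As a rough illustration of the two encodings described here, the sketch below builds chunk keys from chunk grid coordinates. The function names `default_chunk_key` and `v2_chunk_key` are made up for this example and are not taken from any zarr implementation; the behavior follows the descriptions in this thread.

```python
def default_chunk_key(coords, separator="/"):
    """Chunk key with the "c" prefix: "c", then each coordinate, joined by the separator."""
    if separator not in ("/", "."):
        raise ValueError('separator must be "/" or "."')
    # With zero dimensions, coords is empty and the key is simply "c".
    return separator.join(["c", *map(str, coords)])


def v2_chunk_key(coords, separator="."):
    """Zarr v2 style key: coordinates joined by the separator, with "0" for zero dimensions."""
    if not coords:
        return "0"  # the special case mentioned above
    return separator.join(map(str, coords))


print(default_chunk_key((1, 23, 45), "."))  # c.1.23.45
print(default_chunk_key((1, 23, 45), "/"))  # c/1/23/45
print(default_chunk_key(()))                # c
print(v2_chunk_key(()))                     # 0
```

Note that the prefixed version needs no branch for the zero-dimensional case, while the v2 style version does.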

d-v-b commented 1 year ago

I have to confess I find this confusing: when the dimension_separator is "/" (i.e., when chunks are stored in separate directories, so the top-level directory contains no chunks at all), the prefix of "c" is used to facilitate browsing directories, but when the dimension_separator is ".", i.e., when all the chunks are in the top-level directory, the prefix of "c" has a different purpose? It seems like the "." case is exactly when directory browsing is most inconvenient. Am I missing something here?

jbms commented 1 year ago

Even with "/" you can still have a lot of entries under "c/" if the first dimension has a lot of chunks. In any case, the "c" prefix still serves to separate the key space, even if it doesn't result in a separate directory (and on storage systems like s3 or gcs there are no directories anyway). In the future, we will likely want to add new key encodings to make certain queries on chunks more efficient, and those key encodings may use characters other than 0-9. Having a separate prefix of "c" helps avoid conflicts with other extension metadata keys.
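
One way to picture the key space partition is a simple test that separates chunk keys from everything else by prefix. This is only a sketch under assumptions: `is_chunk_key` is a hypothetical helper, "custom.json" stands in for some future extension metadata key, and the check assumes that no metadata key is exactly "c" or begins with "c/" or "c.".

```python
def is_chunk_key(key):
    """Return True if `key` falls in the chunk part of the key space.

    Hypothetical helper: a key counts as a chunk key if it is exactly "c"
    (zero dimensions) or starts with "c" followed by a separator.
    """
    return key == "c" or key.startswith(("c/", "c."))


# "custom.json" is a made-up stand-in for an extension metadata key.
keys = ["zarr.json", "c/0/1", "c", "c.3.4", "custom.json"]
chunk_keys = [key for key in keys if is_chunk_key(key)]
other_keys = [key for key in keys if not is_chunk_key(key)]

print(chunk_keys)  # ['c/0/1', 'c', 'c.3.4']
print(other_keys)  # ['zarr.json', 'custom.json']
```

Because the test does not inspect anything after the prefix, it would keep working for key encodings that use characters other than 0-9.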

d-v-b commented 1 year ago

That makes sense; I will edit this PR to just state that the purpose of the "c" prefix is to partition the key space.