d-v-b closed 11 months ago
Just checking, doesn't your note only apply when the separator is "/"?
@LDeakin you're right, the spec indicates that for the dimension separator ".", we should have chunk keys like c.1.23.45, which offers no benefit when listing the contents of a directory (and the "." separator will lead to the most objects under one prefix, too). I think either I have misunderstood the purpose of the "c" prefix, or the spec should be altered to be true to the purpose indicated in my note, i.e. by stipulating that chunk keys always start with "c/" instead of "c
Can you clarify, @jbms?
If "c/" were used as the prefix in all cases, then on storage systems that have a concept of directories, creating a zarr array would require a minimum of 2 directories, rather than 1 as in zarr v2. That may be a significant added cost when creating a large number of arrays on storage systems with a high per-directory cost.
If you are using "/" as the separator, then you will already have to create a lot of directories, and one more would probably not be significant.
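To make the directory counts above concrete, here is a minimal sketch that counts the directories a directory-based store would need for a few key layouts. The helper `directories_for`, the array root name `array`, and the metadata key `zarr.json` are assumptions made for illustration, not something taken from this thread or from a zarr library API.

```python
import posixpath


def directories_for(keys, array_root="array"):
    """Return the set of directories a directory-based store would need.

    Hypothetical helper: the array root counts as one directory, and every
    "/" in a key implies one more nested directory.
    """
    dirs = {array_root}
    for key in keys:
        parent = posixpath.dirname(key)
        # Walk up through every intermediate directory of the key.
        while parent:
            dirs.add(posixpath.join(array_root, parent))
            parent = posixpath.dirname(parent)
    return dirs


# "." separator with the plain "c" prefix: everything sits in the array root.
dot_keys = ["zarr.json", "c.0.0", "c.0.1"]
# "." separator, but with "c/" forced as the prefix in all cases.
forced_slash_prefix = ["zarr.json", "c/0.0", "c/0.1"]
# "/" separator: one directory per leading chunk coordinate, plus "c".
slash_keys = ["zarr.json", "c/0/0", "c/0/1"]

print(len(directories_for(dot_keys)))             # 1
print(len(directories_for(forced_slash_prefix)))  # 2
print(len(directories_for(slash_keys)))           # 3
```

Under these assumptions, forcing "c/" doubles the directory count for a "." separated array, while for a "/" separated array the "c" directory is one among many.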
@jbms how would you describe the overall purpose of the "c" prefix then?
I wasn't strongly for or against it. But there was a desire to have a separate prefix for chunks to keep the key space more organized and easier for users to browse when using directories. That is accomplished when using a separator of "/". When using a separator of ".", there is still a separate prefix. It won't be too helpful for directory-based stores, but that way we have the option to avoid the overhead of an additional directory without introducing an additional option. Using a prefix of "c" also means we don't need a special case for zero dimensions: the single chunk is just "c". In zarr v2 we use "0" as a special case.
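The key layouts described above can be sketched in a few lines of Python. The function names `default_chunk_key` and `v2_chunk_key` are hypothetical helpers written for this example, not part of any zarr library API; they only mirror the behavior described in this thread.

```python
def default_chunk_key(coords, separator="/"):
    """Build a chunk key that always carries the "c" prefix."""
    # The prefix is always present, so a zero-dimensional array needs no
    # special case: joining just ["c"] yields "c".
    return separator.join(["c", *map(str, coords)])


def v2_chunk_key(coords, separator="."):
    """Build a zarr v2 style chunk key, which has no prefix."""
    # Without a prefix, zero dimensions need the special-case key "0".
    if len(coords) == 0:
        return "0"
    return separator.join(map(str, coords))


print(default_chunk_key((1, 23, 45), separator="."))  # c.1.23.45
print(default_chunk_key((1, 23, 45), separator="/"))  # c/1/23/45
print(default_chunk_key(()))                          # c
print(v2_chunk_key(()))                               # 0
```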
I have to confess I find this confusing: when the dimension_separator is "/" (i.e., when chunks are stored in separate directories, so the top-level directory contains no chunks at all), the prefix of "c" is used to facilitate browsing directories, but when the dimension_separator is ".", i.e., when all the chunks are in the top-level directory, the prefix of "c" has a different purpose? It seems like the "." case is exactly when directory browsing is most inconvenient. Am I missing something here?
Even with "/" you can still have a lot of entries under "c/" if the first dimension has a lot of chunks. In any case, the "c" prefix still serves to separate the key space, even if it doesn't result in a separate directory (and on storage systems like s3 or gcs there are no directories anyway). In the future, we may well want to add new key encodings to make certain queries on chunks more efficient, and those key encodings may use characters other than 0-9. Having a separate prefix of "c" helps avoid conflicts with other extension metadata keys.
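One way to picture the key-space partitioning is a check that classifies keys by prefix alone. This is only a sketch: `is_chunk_key` is a hypothetical helper, and `c/a7f` and `config.json` are made-up keys standing in for a future non-numeric chunk encoding and an extension metadata key.

```python
def is_chunk_key(key, separator="/"):
    """Treat a key as a chunk key if it is "c" or starts with "c" + separator."""
    return key == "c" or key.startswith("c" + separator)


# Made-up keys: "c/a7f" imitates a chunk key that uses characters other
# than 0-9, and "config.json" imitates a non-chunk metadata key.
keys = ["zarr.json", "c", "c/0/1", "c/a7f", "config.json"]
chunk_keys = [k for k in keys if is_chunk_key(k)]
print(chunk_keys)  # ['c', 'c/0/1', 'c/a7f']
```

Because the check depends only on the prefix, it keeps working even if the characters after the prefix are no longer plain integers.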
that makes sense; I will edit this PR to just state that the purpose of the "c" is to partition the key space
I added a motivation for the "c" prefix on chunk keys. I'm happy to modify this if the reported motivation is factually or stylistically incorrect :)