zarr-developers / zarr-specs

Zarr core protocol for storage and retrieval of N-dimensional typed arrays
https://zarr-specs.readthedocs.io/
Creative Commons Attribution 4.0 International

Support for non-zip archive Stores? #209

Open mike-lawrence opened 1 year ago

mike-lawrence commented 1 year ago

When data is initially collected as a DirectoryStore and then compressed using 7z a -tzip ... as suggested in the docs, the resulting zip file is larger (~4x) than the original .zarr directory, and substantially larger (~40x) than if compressed without the -tzip flag (presumably due to zip's well-known overhead when archiving a large number of files?).

Is it fundamentally not possible to support non-zip formatted archives (like 7zip's native format, xz, etc.)?

jbms commented 1 year ago

Regarding the zarr spec: The zarr v2 spec does not mention stores at all --- and in practice the supported stores vary greatly between implementations.

In zarr v3 there may be some mention of stores but that does not preclude an implementation from supporting additional ones.

I believe you can already use 7z archives with zarr-python via fsspec: https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.implementations.libarchive.LibArchiveFileSystem

However, I have not actually tried that myself.
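For anyone wanting to try it, here is a minimal sketch of what that might look like (untested, per the caveat above; the archive name data.7z is hypothetical, and it assumes the archive root is a zarr group):

```python
import fsspec
import zarr

# Expose the 7z archive as a read-only filesystem via libarchive
# (requires the libarchive-c package; "data.7z" is a made-up path).
fs = fsspec.filesystem("libarchive", fo="data.7z")

# Wrap the archive contents as a key-value mapping and open it as a
# zarr store. Access is read-only: libarchive via fsspec cannot write.
store = fs.get_mapper("")
group = zarr.open(store, mode="r")
print(group.tree())
```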

Regarding the size increase: I'm rather surprised that the size increases significantly --- I would expect only a minimal increase, since as far as I am aware the per-file metadata in a zip file does not take up much space. Only if your chunks are extremely small would I expect it to have a significant impact.

In general, when choosing an archive format, since the chunks can already be compressed by zarr, I would not expect it to matter much what compression options the archive format supports --- you can just use no compression. I would expect the compression provided by the archive to be particularly useful only if you are storing a lot of JSON metadata rather than chunk data.

The main requirement for any archive format is the ability to read individual files efficiently. For example, tar is a poor choice because it only supports sequential access.
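As an illustration of the "no compression" point, here is a minimal sketch using zarr-python's ZipStore, whose default compression is already zipfile.ZIP_STORED (entries are stored uncompressed and only the chunk codec compresses data); the file name, shape, and chunking below are made up:

```python
import numpy as np
import zarr
import zipfile

# The chunks are compressed by zarr's codec (blosc by default), so the
# archive can store them as-is; ZIP_STORED is also ZipStore's default.
store = zarr.ZipStore("example.zip", mode="w",
                      compression=zipfile.ZIP_STORED)
z = zarr.open(store, mode="w", shape=(1000, 1000),
              chunks=(250, 250), dtype="f4")
z[:] = np.random.rand(1000, 1000).astype("f4")
store.close()  # finalizes the zip central directory
```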

mike-lawrence commented 1 year ago

I've toyed with python-zarr and 7z, and so far as I can tell, if you start with a DirectoryStore and compress with 7z, you have to use the -tzip flag to yield a file that python-zarr can read (and you have to name it .zip). The result is also much larger than the original DirectoryStore, presumably because the -tzip flag tells 7z to use zip as the archive format rather than the 7z-native format.

On your surprise at my observation that zip can increase file sizes: I do think it's the number of files, and you're right to mention chunk size, as I'm probably setting that rather non-optimally. I'm using zarr in a setting where data is visualized in real time by a separate process from the one writing to zarr, and rather than send the data over a queue, I just write to zarr. To optimize for latency, I made my chunk size 1 in the sample dimension, which makes for lots of chunk files. Seeing this failure mode amid python-zarr's zip-only limitation, I should probably revert to sending data over a queue and make the zarr chunks bigger.
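For concreteness, here's a sketch of the trade-off I'm describing (the array names, the 16-channel second dimension, and the 1024-sample chunk size are all made up for illustration):

```python
import numpy as np
import zarr

# One sample per chunk: minimal write latency, but a DirectoryStore
# ends up with one tiny chunk file per sample.
low_latency = zarr.open("acq.zarr", mode="w", shape=(0, 16),
                        chunks=(1, 16), dtype="f8")
low_latency.append(np.random.rand(1, 16))  # called once per sample

# Bigger chunks along the sample axis: far fewer files, at the cost
# of rewriting the trailing partial chunk on each append.
buffered = zarr.open("acq_buffered.zarr", mode="w", shape=(0, 16),
                     chunks=(1024, 16), dtype="f8")
buffered.append(np.random.rand(64, 16))
```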

If you want to play with an example data file of the type I'm working with, I've uploaded one (p10enrollment-112722_202211, a zip archive) here: https://drive.google.com/file/d/1pcuaqqdebZopcL7pfsAQJH2fLwnKA4T-/view?usp=drivesdk It's a DirectoryStore that's been zipped using the point-and-click "compress" built into Nautilus on Ubuntu (note: while they both presumably use zip as a format, the files created using "compress" are about 2x bigger than those created with 7z -tzip).


rabernat commented 1 year ago

@mike-lawrence - this is exactly one of the use cases that sharding (#134, #152) is designed to address.

jbms commented 1 year ago

I took a look at your zip file --- the issue is that your chunks are way too small for efficient access or storage. Some of your chunks contain just a single 8-byte value. Zarr compresses each chunk individually, and no compression is possible for only 8 bytes. Blosc adds a 16 byte header, such that each chunk in that case is a 24 byte file (already tripling the size). But that ignores the per-file overhead required by the filesystem or archive. On most filesystems, files always consume a multiple of the block size, typically 4KB. So when using a local filesystem each of your 8 bytes of data is actually consuming 4KB. In a zip archive the file size won't be padded but there is still per-file overhead to store the filename, etc.
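To put rough numbers on that (all figures below are illustrative assumptions, not measurements of the linked file):

```python
# Back-of-envelope for the per-chunk overhead described above.
payload = 8            # one float64 value per chunk
blosc_header = 16      # blosc frame header on an incompressible chunk
fs_block = 4096        # typical filesystem block size
n_chunks = 100_000     # hypothetical number of chunk files

print(f"logical data:        {n_chunks * payload / 1e6:6.1f} MB")                   #    0.8 MB
print(f"chunk files (bytes): {n_chunks * (payload + blosc_header) / 1e6:6.1f} MB")  #    2.4 MB
print(f"on disk (4K blocks): {n_chunks * fs_block / 1e6:6.1f} MB")                  # ~409.6 MB, ~512x
```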

Even with sharding I would still recommend a much larger chunk size, as most zarr implementations will have poor performance with such small chunks.

jakirkham commented 1 year ago

Should we move this issue to zarr-python? It doesn't seem like a spec issue.

mike-lawrence commented 1 year ago

Should we move this issue to zarr-python? It doesn't seem like a spec issue.

Sure, the only reason I posted here is because the zarr-python issue page recommends putting feature requests here rather than there.

mike-lawrence commented 1 year ago

Even with sharding I would still recommend a much larger chunk size, as most zarr implementations will have poor performance with such small chunks.

Ah, silly me. I'd forgotten that I'd set all the arrays to that one-sample-per-chunk mode, when only one was intended to be stored that way (and I should experiment to check whether increasing the chunk size on that one even affects my real-time performance; I can't remember if I've already tried that).