zarr-developers / numcodecs

A Python package providing buffer compression and transformation codecs for use in data storage and communication applications.
http://numcodecs.readthedocs.io
MIT License

LZ4 in N5 vs Zarr #175

Open jakirkham opened 5 years ago

jakirkham commented 5 years ago

It appears that LZ4 support in N5 differs from that in Zarr. I have not had a chance to dive deeply into it, but here is the gist.

N5 uses the lz4-java library to compress chunks. lz4-java provides its own custom blocked format.

Zarr's Numcodecs library uses LZ4_compress_fast, which comes from the lz4 C library.

I encountered this issue with N5Store in a PR (https://github.com/zarr-developers/zarr/pull/309), so I disabled LZ4 support in N5Store for now. I am not entirely sure how to bridge the gap between the two, but figured I'd raise it here for awareness and discussion.

jakirkham commented 5 years ago

cc @axtimwalde @funkey @constantinpape

jakirkham commented 5 years ago

Is there anything we still need to do on this one?

axtimwalde commented 5 years ago

Hi @jakirkham, I forgot about this one. Would using LZ4FrameOutputStream in N5 work for zarr? We could introduce this as a parameter, like the one in GzipCompression that switches between Gzip and Zlib, so that there is at least some intersection.

jakirkham commented 5 years ago

No worries. Me too. Thanks for looking into this. 🙂

I think so. We would have to test it on some data to be sure.

Sure, that could be reasonable. I think we won't be able to reproduce the current Java blocked algorithm in Python, but as long as we have something in common we should be OK. We will probably need some documentation once it is all sorted out.

alimanfoo commented 5 years ago

Hi folks, I took a brief look into this; here are the options (I think)...

The current LZ4 codec in numcodecs does the simplest possible thing, which is to add a 4 byte header to store the length of the uncompressed data, then it compresses all the data in a single call to LZ4_compress_fast. So the output is 4 byte header + single block of compressed data.
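For illustration, the layout described above (4 byte length header followed by one compressed block) can be sketched with the standard library alone. This is a minimal sketch, not the numcodecs implementation: the helper names `wrap_block` and `split_block` are hypothetical, the little-endian byte order of the header is an assumption to be checked against the numcodecs source, and the compressed block is treated as opaque bytes.

```python
import struct
from typing import Tuple

# Assumption: the length header is an unsigned 32-bit little-endian integer.
HEADER = struct.Struct("<I")


def wrap_block(uncompressed_size: int, compressed_block: bytes) -> bytes:
    """Prepend the 4 byte uncompressed-length header to one compressed block."""
    return HEADER.pack(uncompressed_size) + compressed_block


def split_block(buf: bytes) -> Tuple[int, bytes]:
    """Split a buffer into (uncompressed size, single compressed block)."""
    if len(buf) < HEADER.size:
        raise ValueError("buffer too short to hold the 4 byte length header")
    (uncompressed_size,) = HEADER.unpack_from(buf, 0)
    return uncompressed_size, bytes(buf[HEADER.size:])


# Placeholder payload standing in for the output of LZ4_compress_fast.
payload = b"\x50hello"
encoded = wrap_block(5, payload)
size, block = split_block(encoded)
```

A reader on the Java side would need the same two steps: read the length, then hand the remaining bytes to a raw LZ4 block decompressor sized to that length.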

The Java LZ4FrameOutputStream uses the LZ4 frame format, which has a different header + multiple blocks of compressed data + final checksum.
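Because the containers start differently, a chunk's leading bytes give a rough hint about which one produced it. The sketch below is a heuristic under stated assumptions, not part of numcodecs or N5: the function name `guess_lz4_container` is hypothetical, the LZ4 frame magic number is taken from the LZ4 frame format specification, and the `LZ4Block` prefix is assumed to be what the lz4-java blocked stream writes. The length-prefixed numcodecs layout has no magic number, so it can only be reported as "no known magic found".

```python
import struct

# Magic number of the LZ4 frame format, stored little-endian on disk.
LZ4_FRAME_MAGIC = 0x184D2204
# Assumption: prefix written by the lz4-java blocked output stream.
LZ4_JAVA_BLOCK_MAGIC = b"LZ4Block"


def guess_lz4_container(buf: bytes) -> str:
    """Guess the LZ4 container of a chunk from its leading bytes (heuristic)."""
    if buf[:8] == LZ4_JAVA_BLOCK_MAGIC:
        return "lz4-java-block-stream"
    if len(buf) >= 4 and struct.unpack_from("<I", buf, 0)[0] == LZ4_FRAME_MAGIC:
        return "lz4-frame"
    # No magic found: possibly the length-prefixed single-block layout,
    # which cannot be identified positively from the header alone.
    return "unknown-or-length-prefixed"
```

A length-prefixed chunk whose uncompressed size happens to equal the frame magic number would be misclassified, so a check like this is only suitable for diagnostics, not for choosing a decoder automatically.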

So option 1 would be that n5-java switches to use LZ4FrameOutputStream and we change numcodecs to also use the LZ4 frame format. (In numcodecs that would actually need to be implemented as a new codec, because it is a different format from the current "lz4" codec.)

Option 2 would be that n5-java switches to use the same encoding as the current numcodecs lz4 codec, i.e., 4 byte header plus single block of compressed data.

Both approaches are fine by me, just trying to lay out the options.

mkitti commented 3 years ago

Is there still an outstanding issue here? We were discussing this at the OME-Zarr NGFF meeting.

constantinpape commented 3 years ago

I am pretty sure that is still a problem; lz4 is not supported in the zarr N5Store yet, see https://github.com/zarr-developers/zarr-python/blob/master/zarr/n5.py#L403-L469.