zarr-developers / numcodecs

A Python package providing buffer compression and transformation codecs for use in data storage and communication applications.
http://numcodecs.readthedocs.io
MIT License

Support for WavPack codec #334

Open alejoe91 opened 2 years ago

alejoe91 commented 2 years ago

Hi numcodecs team!

First of all, thank you for the amazing resource that you put together!

I would like to ask whether you would be interested in including the WavPack codec as an available numcodecs compressor. WavPack is an audio codec developed by @dbry. It has a lossless mode (the default) and an interesting lossy mode (hybrid mode). In addition to working well for audio signals, it performs really well for any kind of time series, and it can compress up to 1024 channels simultaneously. We use it for data from high-density electrophysiology, where it gives very good compression performance.

I'm also the core developer of SpikeInterface, an open-source framework for electrophysiology analysis. We have a built-in save-to-Zarr function, so having a codec that is specifically designed for audio-like time series data would be very convenient for the electrophysiology community (I'm sure the NWB folks would also like to use it once the Zarr backend is available, hopefully soon).

We have a working numcodecs implementation of WavPack at https://github.com/AllenNeuralDynamics/wavpack_numcodecs. Internally, it uses the WavPack CLI to encode and decode, using pipes to pass binary data between processes. The WavPack binaries for Windows, macOS, and Linux are shipped with the package, and the tests are run on all three platforms. We currently use the CLI rather than binding the WavPack C library directly because there is not a clear way to do encode and decode in memory. But we are open to suggestions!

We look forward to hearing your thoughts!

Cheers Alessio

cgohlke commented 2 years ago

> there is not a clear way to do encode and decode in memory

I think it might be enough to provide your own callback functions for reading and writing from memory instead of file. It's a standard C API that can be implemented in Cython.

https://github.com/dbry/WavPack/blob/89ef99e84333534d9d43093a5264a398b5f1e14a/include/wavpack.h#L259-L290
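
The callback idea can be pictured in plain Python. The class below is only an analogy, assuming a codec that talks to its input and output through a small set of read, write, and seek hooks. The class name and method names are chosen for illustration; this is not the WavPack C API and not the Cython binding discussed in this thread.

```python
import io


class MemoryStream:
    """Illustrative in-memory replacement for file-based I/O hooks."""

    def __init__(self, data=b""):
        # All bytes live in RAM; no file is ever opened.
        self._buf = io.BytesIO(data)

    def read_bytes(self, count):
        # Return up to `count` bytes starting at the current position.
        return self._buf.read(count)

    def write_bytes(self, data):
        # Write at the current position and report how many bytes were written.
        return self._buf.write(data)

    def get_pos(self):
        return self._buf.tell()

    def set_pos_abs(self, pos):
        self._buf.seek(pos, io.SEEK_SET)

    def get_length(self):
        return len(self._buf.getvalue())

    def getvalue(self):
        # The complete buffer, e.g. the encoded bytes once writing is done.
        return self._buf.getvalue()


stream = MemoryStream()
stream.write_bytes(b"abcd")
stream.set_pos_abs(1)
chunk = stream.read_bytes(2)
```

A library that accepts hooks like these never needs to know whether the bytes come from disk or from memory, which is what makes an in-memory encode and decode possible.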

alejoe91 commented 2 years ago

@cgohlke we tried to reimplement the codec with Cython, following examples from other codecs in the numcodecs library. We needed to write two additional C functions (see encoder.c and decoder.c) for in-memory compression and decompression.

Would you mind taking a look? https://github.com/AllenNeuralDynamics/wavpack_numcodecs/tree/cython/wavpack_cython

If everything looks ok, we'll extend the cython version with additional options and start preparing a PR.

alejoe91 commented 2 years ago

Hi guys,

We successfully implemented the Cython version here: https://github.com/AllenNeuralDynamics/wavpack_numcodecs/pull/6

If you guys are ok with it, I'll start preparing a PR to numcodecs :)

A couple of comments:

Let us know if this approach sounds reasonable!

Cheers Alessio

martindurant commented 2 years ago

@joshmoore : given the entrypoints registration system (when it works), is there any reason to want to add more compiled codecs directly into numcodecs, or should they be separate packages?

joshmoore commented 2 years ago

The primary trade-off, I think, is how a user is to know which package needs installing. If there's another package that is likely to be installed and in which the codec could live, then it's fairly straightforward. Alternatively, the registry that's in progress should allow clients to find documentation on what package provides a given codec_id: https://alt-shivam.github.io/Codecs-Registry/

cc: @Alt-Shivam

alejoe91 commented 2 years ago

> @joshmoore : given the entrypoints registration system (when it works), is there any reason to want to add more compiled codecs directly into numcodecs, or should they be separate packages?

@martindurant sorry I'm not super familiar with the entrypoints registration. Could you explain what you mean?

martindurant commented 2 years ago

An argument to setup() in a typical setup.py:

    entry_points={
        # Each entry has the form "name = package.module:ClassName".
        "numcodecs.codecs": [
            "grib = kerchunk.codecs:GRIBCodec",
            "fill_hdf_strings = kerchunk.codecs:FillStringsCodec",
            "FITSAscii = kerchunk.codecs:AsciiTableCodec",
            "FITSVarBintable = kerchunk.codecs:VarArrCodec",
            "record_member = kerchunk.codecs:RecordArrayMember",
        ],
    },

(this one copied from kerchunk) will make each class on the right-hand side of an "=" available under the name given on the left-hand side.
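
To make the "name = module:Class" convention concrete, here is a minimal sketch of how such a string can be split and resolved. The helper names parse_entry_point and load_entry_point are hypothetical, and the parsing is deliberately simplified (it insists on the colon); it is not the code that setuptools or numcodecs actually runs.

```python
import importlib


def parse_entry_point(spec):
    """Split 'name = package.module:Attribute' into (name, module, attribute)."""
    name, _, target = (part.strip() for part in spec.partition("="))
    module_name, sep, attr = target.partition(":")
    if not sep:
        # This sketch requires the colon that separates module from object.
        raise ValueError(f"expected 'module:attribute', got {target!r}")
    return name, module_name.strip(), attr.strip()


def load_entry_point(spec):
    """Import the module and return the object the spec points at."""
    _, module_name, attr = parse_entry_point(spec)
    return getattr(importlib.import_module(module_name), attr)


parsed = parse_entry_point("grib = kerchunk.codecs:GRIBCodec")
# A standard-library target is used here so the example runs anywhere.
loaded = load_entry_point("enc = json:JSONEncoder")
```

The left-hand name is what users refer to; the right-hand side is only imported when the object is actually requested.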

joshmoore commented 2 years ago

@alejoe91: https://entrypoints.readthedocs.io/en/latest/ allows a project to define an extension point which other projects can then implement. So, as long as you have run e.g. pip install numcodecs-wavpack, the main numcodecs library will be able to find it at runtime (https://github.com/zarr-developers/numcodecs/blob/main/numcodecs/registry.py#L10).
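
A rough picture of what "find it at runtime" means: the registry keeps a table of entry points it has discovered but not yet imported, and loads a codec class the first time its id is requested. The sketch below is a simplified stand-in under that assumption, not the actual numcodecs registry code; FakeWavPack, FakeEntryPoint, and pending_entry_points are hypothetical names used only for the demonstration.

```python
codec_registry = {}        # codec_id -> class that is already imported
pending_entry_points = {}  # codec_id -> object with .load(), not yet imported


def get_codec(config):
    """Build a codec from a config dict such as {"id": "wavpack", "level": 3}."""
    config = dict(config)  # do not mutate the caller's dict
    codec_id = config.pop("id", None)
    cls = codec_registry.get(codec_id)
    if cls is None and codec_id in pending_entry_points:
        # First request for this id: import the plugin and cache the class.
        cls = pending_entry_points[codec_id].load()
        codec_registry[codec_id] = cls
    if cls is None:
        raise ValueError(f"codec not available: {codec_id!r}")
    return cls(**config)


class FakeWavPack:
    """Hypothetical codec class standing in for a real plugin."""

    codec_id = "wavpack"

    def __init__(self, level=1):
        self.level = level


class FakeEntryPoint:
    """Mimics an installed entry point: load() returns the codec class."""

    def __init__(self, cls):
        self._cls = cls

    def load(self):
        return self._cls


pending_entry_points["wavpack"] = FakeEntryPoint(FakeWavPack)
codec = get_codec({"id": "wavpack", "level": 3})
```

With this design the core library never imports a plugin package unless some data actually asks for its codec.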

martindurant commented 2 years ago

Thank you, Josh, that's more succinct and to the point than what I said :)

alejoe91 commented 2 years ago

Thank you for the explanation. So you suggest we make our own package and then add it as an entrypoint?

One question: if we use numcodecs-wavpack for compression, and then someone else (unaware of wavpack-numcodecs) wants to access and decode our data, is there a way to prompt the message: "you need to pip install numcodecs-wavpack", or will it print "wavpack codec not found in the registry"?

Would you guys be available to discuss this over a call?

Thanks! Alessio

martindurant commented 2 years ago

> is there a way to prompt the message:

As things stand, you would see a simple ImportError or an unknown-codec error. The online registry resource mentioned above is something we could (but do not yet) point the user to, and it ought to provide install instructions/links for each codec.
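
One way such a pointer could work, purely as a sketch: wrap the codec lookup and, on failure, translate the codec id into an install suggestion. Everything here is hypothetical, including the INSTALL_HINTS table, the MissingCodecError class, and the package name numcodecs-wavpack (taken from the example earlier in this thread); numcodecs does not currently behave this way.

```python
# Hypothetical table; in practice it could be fetched from an online registry.
INSTALL_HINTS = {"wavpack": "numcodecs-wavpack"}


class MissingCodecError(ValueError):
    """Raised when a codec id is unknown, with an install hint if one exists."""


def explain_missing_codec(codec_id, hints=INSTALL_HINTS):
    package = hints.get(codec_id)
    if package is None:
        return f"codec {codec_id!r} not found in the registry"
    return f"codec {codec_id!r} not found; try: pip install {package}"


def get_codec_with_hint(config, lookup):
    """Call `lookup(config)` and turn a failed lookup into a friendlier error."""
    try:
        return lookup(config)
    except (ImportError, ValueError) as err:
        raise MissingCodecError(explain_missing_codec(config.get("id"))) from err


message = explain_missing_codec("wavpack")
```

Here `lookup` stands for whatever function resolves a config to a codec; passing it in keeps the hint logic separate from the registry itself.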

As a reference, I have two plugin systems I maintain with opposite philosophies: