zarr-developers / numcodecs

A Python package providing buffer compression and transformation codecs for use in data storage and communication applications.
http://numcodecs.readthedocs.io
MIT License

Using JPEG2000 for chunk compression #73

Open jmswaney opened 6 years ago

jmswaney commented 6 years ago

I've been using chunk compressed Zarr arrays for some neuroscience image processing tasks, and it's been great so far. However, JPEG2000 might perform better than lz4 or Zstd for my images. I'd like to use Zarr to handle the image chunking with a JPEG2000 compressor, but I'm not sure if this is possible. I realize that this feature isn't as general as numcodecs would want, but I'm mostly asking what the steps would be to see if I should even try.

alimanfoo commented 6 years ago

Hi Justin, the general approach to implementing a new compression codec is to sub-class the numcodecs.Codec class and implement the methods encode(), decode(), get_config(), from_config(), and also the codec_id attribute. Docs here: http://numcodecs.readthedocs.io/en/stable/abc.html

To use the codec with zarr you need to register it with a call to numcodecs.register_codec(cls). That just sets up the mapping from codec ID to codec class. Docs here: http://numcodecs.readthedocs.io/en/stable/registry.html

In terms of implementation, any of the existing codec classes is worth looking at as an example. If you need to interface with external C code then there's various options. The existing codecs like Zstd, LZ4 and Blosc use Cython but there's other ways to do it.

I don't know anything about JPEG encoding but very happy to learn more if you find it useful.

jakirkham commented 6 years ago

Adding a JPEG2000 compression filter would be great. Know others use this compression for image data as well.

FWIW we made some changes described in this comment, which should make wrapping a compressor pretty simple. Feel free to ask questions if you need any help.

ryan-williams commented 4 years ago

@joshmoore, @jakirkham, and I looked into this for a while today.

@sofroniewn described how pyramidal-image support (cf. https://github.com/zarr-developers/zarr-specs/issues/23) is implemented in napari:

I have a zarr pyramid on s3://sofroniewn/image-data/camelyon16/ which came from https://camelyon16.grand-challenge.org/Data/ (there is a google drive with tiff if you poke around)

Each resolution-level is a sibling Dataset in a containing Zarr Group. Napari loads each resolution level as a Dask Array, and changes which resolution level it pulls chunks from based on the user's zoom level.

That process works pretty well today, and longer-term we'd like to clean up and factor that pyramiding code out of Napari (which could have a cleaner interface to it, in addition to benefitting from more general community support of pyramiding).

Napari's main pain point is that the Zarr pyramids are e.g. 60x larger on disk than the pyramidal TIFF files that they originated as. The Zarr pyramids use Zarr's default Blosc compressor codec (which is likely bad at compressing image data), while the original TIFFs likely use JPEG2000 (which is quite good), so we think adding a JPEG2000 codec to numcodecs, and having Napari use that, will solve Napari's main issue with its Zarr pyramids.

@jakirkham started prototyping a JPEG2000 codec today; a nice thing is that the Codec interface receives an ndarray as input (we originally thought it only received a BytesLike, which would be hard to reconstruct image dimensions from, which JPEG2000 would need). One caveat is that filters can't be applied before the JPEG2000 codec, because then the latter would just receive a BytesLike; raising an error seems appropriate in this situation.

Otherwise, we just need a good Python binding to a JPEG2000 codec. We looked at imageio, imagecodecs, and glymur. There was a mix of concerns about dependencies / installation hassle as well as API semantics (we need something shaped like Buffer ⇒ BytesLike, not PathLike ⇒ PathLike).

Dependency concerns could be mitigated by adding a pip qualifier (e.g. pip install numcodecs[jpeg]), and a light fork to expose in-memory access to one of those projects could be undertaken, if necessary.
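The "raise when a filter ran first" caveat could look roughly like the guard below. This is only a sketch: require_image_array is a hypothetical helper, and the assumption that a preceding filter hands over flat bytes (or a 1D buffer) comes from the discussion above rather than from any documented guarantee.

```python
import numpy as np


def require_image_array(buf):
    """Return buf if it still looks like an image chunk, otherwise raise."""
    if not isinstance(buf, np.ndarray) or buf.ndim < 2:
        raise TypeError(
            "JPEG2000 encoding needs an ndarray with at least 2 dimensions; "
            f"got {type(buf).__name__}. Was a filter applied before this codec?"
        )
    return buf


chunk = np.zeros((4, 4), dtype="u1")
require_image_array(chunk)  # passes through unchanged

try:
    # Flat bytes carry no image dimensions, so the guard rejects them.
    require_image_array(chunk.tobytes())
except TypeError as exc:
    print(exc)
```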

cgohlke commented 4 years ago

Imagecodecs includes a bytes<->numpy encoder and decoder for JPEG2000 based on the OpenJPEG library. I think it should be relatively easy to take the Cython code out of imagecodecs (BSD licensed) and adapt it for numcodecs.

jakirkham commented 4 years ago

Thanks Christoph! 😄

Using that I wrote the following. This seems like what we would want for a first pass.

from numcodecs.abc import Codec
from numcodecs.compat import ensure_ndarray
from numcodecs.registry import register_codec

from imagecodecs import jpeg2k_encode, jpeg2k_decode

class JPEG2000(Codec):
    codec_id = "JPEG2000"

    def encode(self, buf):
        return jpeg2k_encode(ensure_ndarray(buf))

    def decode(self, buf):
        return jpeg2k_decode(ensure_ndarray(buf))

register_codec(JPEG2000)

This works for encoding. However we have an issue on decoding. Maybe there's something I'm missing above? 🙂

---------------------------------------------------------------------------
Jpeg2kError                               Traceback (most recent call last)
<ipython-input-6-7d1c93c78b4f> in <module>
----> 1 c.decode(c.encode(a))

<ipython-input-1-01776c52a8bc> in decode(self, buf)
     13 
     14     def decode(self, buf):
---> 15         return jpeg2k_decode(ensure_ndarray(buf))
     16 
     17 

imagecodecs/_jpeg2k.pyx in imagecodecs._jpeg2k.jpeg2k_decode()

Jpeg2kError: opj_read_header failed

cgohlke commented 4 years ago

I didn't try to reproduce this yet, but it looks like this simple roundtrip should work if the output of ensure_ndarray(buf) can be cast to uint8_t[::1] by Cython, which appears to be the case since otherwise the detection of the codecformat would likely fail. Please try passing the buf bytes directly to jpeg2k_decode and enable OpenJPEG error handling and warnings with verbose=3. What is the shape and dtype of the input a?

jakirkham commented 4 years ago

Thanks Christoph!

Yeah was wondering about that too. So had tried with and without ensure_ndarray just in case, but got the same error. Either way the data provided to jpeg2k_decode was something that could be cast to uint8_t[::1] as it was just the output of jpeg2k_encode.

Sure let me provide a clear MRE.

Sorry if I missed something, but how do we set the verbosity?

jakirkham commented 4 years ago

Here's an MRE showing what I'm seeing. Happy to play with this more (adding verbosity and such) as is helpful 🙂

In [1]: import numpy as np                                                      

In [2]: a = np.arange(6, dtype="u4").reshape(2, 3)                              

In [3]: a                                                                       
Out[3]: 
array([[0, 1, 2],
       [3, 4, 5]], dtype=uint32)

In [4]: from imagecodecs import jpeg2k_encode, jpeg2k_decode                    

In [5]: b = jpeg2k_encode(a)                                                    

In [6]: b                                                                       
Out[6]: bytearray(b'\x00\x00\x00\x0cjP  \r\n\x87\n\x00\x00\x00\x14ftypjp2 \x00\x00\x00\x00jp2 \x00\x00\x00-jp2h\x00\x00\x00\x16ihdr\x00\x00\x00\x02\x00\x00\x00\x03\x00\x01\x1f\x07\x00\x00\x00\x00\x00\x0fcolr\x01\x00\x00\x00\x00\x00\x11\x00\x00\x00\x89jp2c\xffO\xffQ\x00)\x00\x00\x00\x00\x00\x03\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x1f\x01\x01\xffR\x00\x0c\x00\x00\x00\x01\x00\x00\x04\x04\x00\x01\xff\\\x00\x04@\x00\xffd\x00%\x00\x01Created by OpenJPEG version 2.3.1\xff\x90\x00\n\x00\x00\x00\x00\x00\x17\x00\x01\xff\x93\xc0\x00\x00\x00\xf8C\x0fwv\xff\xd9')

In [7]: len(b)                                                                  
Out[7]: 214

In [8]: jpeg2k_decode(b)                                                        
---------------------------------------------------------------------------
Jpeg2kError                               Traceback (most recent call last)
<ipython-input-8-d3265f5af6b1> in <module>
----> 1 jpeg2k_decode(b)

imagecodecs/_jpeg2k.pyx in imagecodecs._jpeg2k.jpeg2k_decode()

Jpeg2kError: opj_read_header failed

cgohlke commented 4 years ago

I see: dtype=uint32. The JPEG 2000 standard allows component precisions of up to 38 bits, but OpenJPEG only supports up to 31, so 32 bit integers are out. I obviously never fully tested these cases, only 8 and 16 bit. You can get the OpenJPEG warnings and errors as follows:

>>> b = jpeg2k_encode(a, verbose=3)
JPEG2K info: tile number 1 / 1
>>> jpeg2k_decode(b, verbose=3)
JPEG2K info: Start to read j2k main header (85).
imagecodecs._jpeg2k.Jpeg2kError: Invalid values for comp = 0 : prec=32 (should be between 1 and 38 according to the JPEG2000 norm. OpenJpeg only supports up to 31)
Exception ignored in: 'imagecodecs._jpeg2k.j2k_error_callback'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
imagecodecs._jpeg2k.Jpeg2kError: Invalid values for comp = 0 : prec=32 (should be between 1 and 38 according to the JPEG2000 norm. OpenJpeg only supports up to 31)
imagecodecs._jpeg2k.Jpeg2kError: Marker handler function failed to read the marker segment
Exception ignored in: 'imagecodecs._jpeg2k.j2k_error_callback'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
imagecodecs._jpeg2k.Jpeg2kError: Marker handler function failed to read the marker segment
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "imagecodecs\_jpeg2k.pyx", line 390, in imagecodecs._jpeg2k.jpeg2k_decode
imagecodecs._jpeg2k.Jpeg2kError: opj_read_header failed

Not sure why OpenJPEG doesn't throw an error in jpeg2k_encode. Maybe OpenJPEG does create a valid JPEG 2000 stream, but can't decode it...
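The precision that OpenJPEG rejects can also be read straight from the encoded bytes. A minimal sketch, assuming the JP2 container layout in which the ihdr box stores height, width, component count and a bits-per-component byte (bit depth minus one in the low 7 bits, sign flag in the high bit). JP2_PREFIX is copied from the start of the stream printed in the MRE above; read_ihdr is a hypothetical helper, not part of imagecodecs.

```python
import struct

# Start of the stream from the MRE, up to the end of the ihdr box.
JP2_PREFIX = (
    b"\x00\x00\x00\x0cjP  \r\n\x87\n"
    b"\x00\x00\x00\x14ftypjp2 \x00\x00\x00\x00jp2 "
    b"\x00\x00\x00-jp2h"
    b"\x00\x00\x00\x16ihdr\x00\x00\x00\x02\x00\x00\x00\x03\x00\x01\x1f\x07\x00\x00"
)


def read_ihdr(stream):
    """Return (height, width, components, bits, signed) from a JP2 stream."""
    # Shortcut: search for the box type instead of walking the box tree.
    start = stream.find(b"ihdr")
    if start < 0:
        raise ValueError("no ihdr box found; is this a JP2 file?")
    height, width, ncomp, bpc = struct.unpack_from(">IIHB", stream, start + 4)
    if bpc == 0xFF:
        raise ValueError("bit depth varies per component; see the bpcc box")
    return height, width, ncomp, (bpc & 0x7F) + 1, bool(bpc & 0x80)


print(read_ihdr(JP2_PREFIX))  # -> (2, 3, 1, 32, False)
```

The header records a 2 x 3 single-component image with 32 bit unsigned samples, which matches the prec=32 value in the error message.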

cgohlke commented 4 years ago

Another thought: since this issue is about efficiently compressing image data, you might want to have a look at JPEG-LS via the CharLS library. There's also JPEG-XR (used commonly in CZI files), which also handles float32, but the jxrlib library is not so nice to work with. Imagecodecs supports both, but I never benchmarked the codecs/implementations. None of these formats support 32 or 64 bit integers.

jakirkham commented 4 years ago

Ah ok. So this is bad usage on my part. Thanks for clarifying Christoph! Should there be an error if the user supplies an unsupported type or are there situations where this might work?

Great, thanks for the suggestions. Will check those out too. Yeah this is mostly about compression. Just trying to think what makes a reasonably generic/useful compressor here (given the different type support of these). Do you have any thoughts on this? 🙂

LeeKamentsky commented 3 years ago

If this issue needs a champion, I think I can make a case for taking on this work (@jmswaney above was part of our lab). @jakirkham I'm not sure if you have a branch going that I should contribute to - if not, I'm fine with restarting. Regarding support for 32 and 64 bit integers, we also would only use 8 and 16 bits, so pragmatically, I'd vote for rejecting 32 and 64 bit integers early in the encoding process, and for trying to detect decoding failures caused by 32 and 64 bit integers and reporting them clearly.

If that plan seems workable, I'll go ahead and start work towards the goal of a pull request.
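The "reject early" half of that plan might be sketched as below. check_jpeg2k_dtype is a hypothetical helper, and treating signed and unsigned 8 and 16 bit integers alike is an assumption; the thread only establishes that 8 and 16 bit data were tested and that uint32 fails on decode.

```python
import numpy as np


def check_jpeg2k_dtype(arr):
    """Raise before encoding for dtypes outside the assumed supported set."""
    arr = np.asarray(arr)
    # Assumed policy from the discussion: only 8 and 16 bit integers.
    if arr.dtype.kind not in "iu" or arr.dtype.itemsize > 2:
        raise ValueError(
            f"unsupported dtype {arr.dtype}; expected an 8 or 16 bit integer type"
        )
    return arr


check_jpeg2k_dtype(np.zeros((2, 3), dtype="u2"))  # accepted

try:
    # Same input as the MRE earlier in the thread.
    check_jpeg2k_dtype(np.arange(6, dtype="u4").reshape(2, 3))
except ValueError as exc:
    print(exc)  # unsupported dtype uint32; expected an 8 or 16 bit integer type
```

Failing at encode time avoids writing chunks that can never be read back, which is the situation the MRE ran into.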

LeeKamentsky commented 3 years ago

My use case for JPEG2000 is grayscale 3D stacks of JPEG2000 planes, and my plan was to JPEG2000-encode each of the planes separately (for 4D and 5D, stack over the leading dimensions and encode over the last 2 dimensions). An alternative would be to interpret arrays with 3 axes as Y, X and color if the size of the last axis is 3 (RGB), and encode them as a color image.

My gut tells me to avoid a heuristic that operates differently depending on the size of the last dimension and encode what might be a color image as three grayscale planes. This also has a side-effect of not requiring the LCMS library (see https://github.com/cgohlke/imagecodecs/tree/master/3rdparty/openjpeg in imagecodecs) which simplifies the build.

I'd appreciate any feedback.
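The plane-by-plane plan could be sketched as follows. encode_planes is a hypothetical helper, and zlib stands in for the per-plane encoder so the example runs without OpenJPEG; in a real codec the callable would be something like jpeg2k_encode.

```python
import zlib

import numpy as np


def encode_planes(arr, encode_plane):
    """Encode every 2D plane over the last two axes, whatever the leading axes."""
    arr = np.asarray(arr)
    if arr.ndim < 2:
        raise ValueError("need at least 2 dimensions (Y, X)")
    # Collapse all leading axes into one, leaving a stack of (Y, X) planes.
    planes = arr.reshape((-1,) + arr.shape[-2:])
    return [encode_plane(np.ascontiguousarray(plane)) for plane in planes]


def zlib_plane(plane):
    # Stand-in for a real 2D image encoder.
    return zlib.compress(plane.tobytes())


stack = np.zeros((4, 5, 16, 16), dtype="u1")  # e.g. a 4D chunk
encoded = encode_planes(stack, zlib_plane)
print(len(encoded))  # -> 20
```

Because the last axis is never inspected, an array of shape (Y, X, 3) is treated as Y planes rather than as an RGB image, in line with the preference above for avoiding a size-based heuristic.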

jakirkham commented 3 years ago

Thanks for offering to help here Lee! 😄

Unfortunately I don't have an existing branch, but I think the code in comment ( https://github.com/zarr-developers/numcodecs/issues/73#issuecomment-592111560 ) should be a good starting point and likely pretty close to what we need here. So would see if you can get that to run and go from there. Please let us know if you have any questions 🙂

jakirkham commented 3 years ago

Looks like @d-v-b did some work on a JPEG codec ( https://github.com/d-v-b/zarr-jpeg ). Not sure if JPEG2000 is considered there as well

LeeKamentsky commented 3 years ago

I think it's not. I started work on the codec, but have had to pause it recently.

martindurant commented 3 years ago

Does imagecodecs.numcodecs.register_codecs() suffice now to cover needs here?

jakirkham commented 3 years ago

^ @d-v-b @LeeKamentsky @joshmoore

d-v-b commented 3 years ago

For my own purposes the codec registration api in numcodecs sufficed perfectly

joshmoore commented 3 years ago

@d-v-b : just to clarify, you mean numcodecs API worked for you, not imagecodecs, right?

Thinking through some of the recent conversations with @DennisHeimbigner, if we're going to lean on imagecodecs for JPEG2000 support, we may want to go about defining an ID for it in this repo a la https://github.com/zarr-developers/numcodecs/issues/278

cc: @cgohlke

martindurant commented 3 years ago

Not a bad idea, but imagecodecs does already provide unambiguous numcodecs IDs for all the classes it registers - I would not suggest changing them (although adding aliases would be fine).

The current list I get in my installation:

['imagecodecs_aec',
 'imagecodecs_avif',
 'imagecodecs_bitorder',
 'imagecodecs_bitshuffle',
 'imagecodecs_blosc',
 'imagecodecs_brotli',
 'imagecodecs_bz2',
 'imagecodecs_deflate',
 'imagecodecs_delta',
 'imagecodecs_float24',
 'imagecodecs_floatpred',
 'imagecodecs_gif',
 'imagecodecs_jpeg',
 'imagecodecs_jpeg2k',
 'imagecodecs_jpegls',
 'imagecodecs_jpegxr',
 'imagecodecs_lerc',
 'imagecodecs_ljpeg',
 'imagecodecs_lz4',
 'imagecodecs_lz4f',
 'imagecodecs_lzf',
 'imagecodecs_lzma',
 'imagecodecs_lzw',
 'imagecodecs_packbits',
 'imagecodecs_png',
 'imagecodecs_snappy',
 'imagecodecs_tiff',
 'imagecodecs_webp',
 'imagecodecs_xor',
 'imagecodecs_zfp',
 'imagecodecs_zlib',
 'imagecodecs_zopfli',
 'imagecodecs_zstd']

d-v-b commented 3 years ago

@d-v-b : just to clarify, you mean numcodecs API worked for you, not imagecodecs, right?

Correct, I defined a jpeg compressor and registered it with the numcodecs register_codec function.

I should add that there's complexity involved in compressing 3D+ data with 2D codecs. You will almost certainly want to generate a 2D tiled version of the ND data, and compress that, but this requires codec metadata that defines the ND -> 2D transformation. I have not implemented this to my satisfaction.
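To make the metadata requirement concrete, here is a minimal sketch of one possible ND -> 2D transformation: stacking all planes vertically and recording the original shape so the step can be inverted. to_2d and from_2d are hypothetical helpers, and this particular layout is only one of several options, not what zarr-jpeg does.

```python
import numpy as np


def to_2d(arr):
    """Flatten an ND array to 2D and return the metadata needed to undo it."""
    arr = np.asarray(arr)
    if arr.ndim < 2:
        raise ValueError("need at least 2 dimensions")
    meta = {"shape": list(arr.shape)}  # JSON-friendly, so it can live in a codec config
    tiled = arr.reshape(-1, arr.shape[-1])  # (Z, Y, X) -> (Z * Y, X)
    return tiled, meta


def from_2d(tiled, meta):
    """Invert to_2d using the recorded shape."""
    return np.asarray(tiled).reshape(meta["shape"])


volume = np.arange(2 * 3 * 4, dtype="u2").reshape(2, 3, 4)
tiled, meta = to_2d(volume)
print(tiled.shape, meta)  # -> (6, 4) {'shape': [2, 3, 4]}
```

Without the recorded shape a decoder only sees a 6 x 4 image and cannot tell whether it came from a (2, 3, 4) or a (3, 2, 4) chunk.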

chris-allan commented 3 years ago

Hello all!

Based on the initial investigations of @cgohlke and @jakirkham on this thread along with some of our own by @muhanadz we have released, heavily inspired by the existing work from @d-v-b, a Zarr JPEG-2000 codec using imagecodecs and by extension OpenJPEG:

Any and all feedback welcome!

Similar to the discussion on d-v-b/zarr-jpeg#1, our primary motivation for the codec is the compression of interleaved RGB bright-field whole slide imaging data.

jakirkham commented 2 years ago

@martindurant what would one need to do to add an entrypoint to use zarr-jpeg2k above?

joshmoore commented 2 years ago

An entrypoint needs to be registered roughly of the form:

[numcodecs.codecs]
jpeg2k = zarr_jpeg2k.zarr_jpeg2k:jpeg2k

martindurant commented 2 years ago

I read further up the thread and deleted my comment...

I am a little confused. Why is there a different package for jpeg2k as a numcodecs codec, which calls imagecodecs, when imagecodecs already has one? All the codecs there can be registered with numcodecs by calling imagecodecs.numcodecs.register_codecs(). We just need a PR there to add the entrypoints, I'm sure it would be accepted. Perhaps when the conversation above happened, imagecodecs had not yet progressed as far.

cgohlke commented 2 years ago

All the codecs there can be registered with numcodecs by calling imagecodecs.numcodecs.register_codecs(). We just need a PR there to add the entrypoints, I'm sure it would be accepted. Perhaps when the conversation above happened, imagecodecs had not yet progressed as far.

For the time being I decided to distribute the numcodecs entry points as a separate package: https://pypi.org/project/imagecodecs-numcodecs/#files.
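To check which codec entry points are visible in a given environment, the standard-library importlib.metadata module can be queried for the numcodecs.codecs group mentioned earlier in the thread. This is a sketch: numcodecs_entry_points is a hypothetical helper, and the result depends entirely on which packages are installed.

```python
from importlib.metadata import entry_points


def numcodecs_entry_points(group="numcodecs.codecs"):
    """Return {entry point name: target} for the given entry point group."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ exposes a selection API.
        found = eps.select(group=group)
    else:
        # Python 3.8 and 3.9 return a plain mapping of group -> entry points.
        found = eps.get(group, [])
    return {ep.name: ep.value for ep in found}


for name, target in sorted(numcodecs_entry_points().items()):
    print(name, "->", target)
```

An empty result means no installed distribution advertises codecs through entry points; codecs registered at runtime with register_codec() do not show up here.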

martindurant commented 2 years ago

That sounds reasonable, @cgohlke . Unfortunately, it doesn't have a conda package.

rahedges commented 8 months ago

It looks like the work on integrating jpeg2000 was abandoned. Is there any progress on this I'm missing? This is the only numcodecs thread I found related to this work.

martindurant commented 8 months ago

jpeg2000 is included in imagecodecs, which has numcodecs wrappers

rahedges commented 8 months ago

Thanks. I guess I missed that in the docs. I'm trying to figure out how to use j2k as the compression scheme in a zarr file.

martindurant commented 8 months ago

I believe so long as you have https://pypi.org/project/imagecodecs-numcodecs/ installed, "imagecodecs_jpeg2k" will be an available codec without further effort.