zarr-developers / numcodecs

A Python package providing buffer compression and transformation codecs for use in data storage and communication applications.
http://numcodecs.readthedocs.io
MIT License
121 stars 82 forks source link

Add bitinfo codec #503

Open thodson-usgs opened 5 months ago

thodson-usgs commented 5 months ago

Add a new bitinfo codec, which reimplements a numpy-based version of the bitrounding algorithm from the Julia package BitInformation.jl (and the Python package xbitinfo). When used with ds.to_zarr, the codec would compute the real information chunk-wise, which yields better results for data with variable information content.

import xarray as xr
ds = xr.tutorial.open_dataset("air_temperature")

from numcodecs import Blosc, BitInfo
compressor = Blosc(cname="zstd", clevel=3)
filters = [BitInfo(info_level=0.99)]

encoding = {"air": {"compressor": compressor, "filters": filters}}

ds.to_zarr('xbit.zarr', mode="w", encoding=encoding)

TODO:

thodson-usgs commented 5 months ago

Thanks, @martindurant, I'll work on example, but as I dig into this, I realize that we can't pass a function in this way: https://github.com/zarr-developers/zarr-python/blob/main/zarr/util.py#L56-64 will lead to

TypeError: Object of type function is not JSON serializable

I'd hoped this would work without modifying zarr-python, but for now, I'll go ahead and make a small modification and continue testing.

martindurant commented 5 months ago

Ah, the default serialisation of the codecs, as dumped into the zarr metadata, looks at the __dict__ of the instance, and so is catching the function. Prefixing the attribute with "_" would solve this, or you could override get_config. There is actually a good argument for wanting to serialise the full state, since you can create an array without filling it or fill only part and come back to it later. I don't know how you would approach that without storing some indicator of the function in the metadata.

thodson-usgs commented 5 months ago

Those are both great suggestions. I'll advance one more: add a stripped down xbitinfo to the bitround codec. The implementation looks fairly minimal. Let's see what comes from observingClouds/xbitinfo/issues/257 then reassess.

joshmoore commented 5 months ago

A heads up that depending on custom code will impact the portability of the codec across language implementations.

thodson-usgs commented 5 months ago

observingClouds/xbitinfo/issues/257: their preference is my original suggestion. Potentially yielding something like:

from xbitinfo import helper_function
from numcodecs import Blosc, BitRound
compressor = Blosc(cname="zstd", clevel=3)
filters = [BitRound(helper_function)]

encoding = {"precip": {"compressor": compressor, "filters": filters}}
ds.to_zarr(<file_name>, encoding=encoding)
thodson-usgs commented 5 months ago

@martindurant, I tried your suggestion about prefixing, but I'm getting the general impression this approach won't work. In this example

from xbitinfo import helper_function
from numcodecs import Blosc, BitRound
compressor = Blosc(cname="zstd", clevel=3)
filters = [BitRound(helper_function)]

encoding = {"precip": {"compressor": compressor, "filters": filters}}
ds.to_zarr(<file_name>, encoding=encoding)

the object created by BitRound(helper_function) isn't actually passed through to the filtering step, only the dictionary. The filter is recreated from that dictionary before it is applied. In other words, if I prefix with _, that object won't get passed through the dictionary, and won't be accessible at filter time.

If that impression is correct, then reimplementing a basic bitinformation algorithm in numcodecs might be the simplest route.

martindurant commented 5 months ago

if I prefix with _, that object won't get passed through the dictionary, and won't be accessible at filter time.

Ah, sorry

pep8speaks commented 5 months ago

Hello @thodson-usgs! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 94:1: E402 module level import not at top of file

Comment last updated at 2024-04-30 21:18:13 UTC
thodson-usgs commented 5 months ago

Rather than including this in bitround, lets create a bitinfo codec that calls bitround. I've drafted the PR and solicited feedback from the xbitinfo devs. I will begin writing tests.

thodson-usgs commented 4 months ago

@observingClouds, @martindurant As my last commit indicates, I've reverted the codec to match the functionality of the xbitinfo implementation (albeit all computation will be done chunk-wise when called through ds.to_zarr()) I'm happy to continue working across libraries to improve functionality and reduce duplication; however, further development needs to come from xbitinfo, not the codec. In the meantime, I do not plan any further changes to this PR except any additional tests or bug fixes as needed.

thodson-usgs commented 2 months ago

@martindurant, sorry to bother. Would you run these checks, or have I missed something? Thanks

martindurant commented 2 months ago

I have no idea why the checks aren't running :|

thodson-usgs commented 2 months ago

Maybe it's waiting for me to tick all the boxes? Seems a bit of a catch 22, but I'll give it a try.

thodson-usgs commented 2 months ago

Let me know if this fixed it

git commit --amend --no-edit
git push --force-with-lease

otherwise, I'll open a new PR

thodson-usgs commented 2 months ago

Windows and OSX builds failing with:

900+ lines of errors and warnings...

  × Building editable for numcodecs (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> See above for output.

  note: This error originates from a subprocess, and is likely not a problem with pip.
  full command: /usr/local/miniconda/envs/env/bin/python /usr/local/miniconda/envs/env/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py build_editable /var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/tmp4l4qy_42
  cwd: /Users/runner/work/numcodecs/numcodecs
  Building editable for numcodecs (pyproject.toml): finished with status 'error'
  ERROR: Failed building editable for numcodecs
Failed to build numcodecs
ERROR: Could not build wheels for numcodecs, which is required to install pyproject.toml-based projects
Error: Process completed with exit code 1.

I'm not sure what happened here, but I'll try to replicate this locally.

thodson-usgs commented 2 months ago

The Windows build failed because I had a test using dtype=float128, but I don't understand why OSX is failing to install numcodecs with python>=3.10. Can we run these one more time before I go digging into OSX?

martindurant commented 2 months ago

Indeed, just osx left.

thodson-usgs commented 2 months ago

Added a usage example and rebased. I have still not resolved why the OSX 3.10 build failed. Hopefully later this week.

thodson-usgs commented 2 months ago

oh darn, my doc example gets tests too! 😞

thodson-usgs commented 2 months ago

Well, this passed just fine in a clean, local environment.

But the GitHub runner is erroring with

...
043     >>> import xarray as xr
044     >>> ds = xr.tutorial.open_dataset("air_temperature")
045     >>> from numcodecs import Blosc, BitInfo
046     >>> compressor = Blosc(cname="zstd", clevel=3)
047     >>> filters = [BitInfo(info_level=0.99)]
048     >>> encoding = {"air": {"compressor": compressor, "filters": filters}}
049     >>> _ = ds.to_zarr('xbit.zarr', mode="w", encoding=encoding)
UNEXPECTED EXCEPTION: ImportError("cannot import name 'MutableMapping' from 'collections' (/usr/share/miniconda/envs/env/lib/python3.11/collections/__init__.py)")
Traceback (most recent call last):
  File "/usr/share/miniconda/envs/env/lib/python3.11/doctest.py", line 1355, in __run
    exec(compile(example.source, filename, "single",
  File "<doctest numcodecs.bitinfo.BitInfo[6]>", line 1, in <module>
  File "/usr/share/miniconda/envs/env/lib/python3.11/site-packages/xarray/core/dataset.py", line 2520, in to_zarr
    return to_zarr(  # type: ignore[call-overload,misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...

Unless there is an obvious fix, I will keep the usage example but remove it from the doctest, then run some test builds on OSX, then resubmit.

martindurant commented 2 months ago

I suggest you remove the example, perhaps marking it as doctest skip.

thodson-usgs commented 1 month ago

(investigating local OSX build now)

thodson-usgs commented 1 month ago

@martindurant, the problem with OSX seems to be a broader issue. Note the previous commit to main is also failing (#527). Perhaps we rerun the tests and proceed with any other issues, while we wait for an OSX fix.

thodson-usgs commented 1 month ago

rebased on main