Open thodson-usgs opened 5 months ago
Thanks, @martindurant, I'll work on example, but as I dig into this, I realize that we can't pass a function in this way: https://github.com/zarr-developers/zarr-python/blob/main/zarr/util.py#L56-64 will lead to
TypeError: Object of type function is not JSON serializable
I'd hoped this would work without modifying zarr-python
, but for now, I'll go ahead and make a small modification and continue testing.
Ah, the default serialisation of the codecs, as dumped into the zarr metadata, looks at the __dict__
of the instance, and so is catching the function. Prefixing the attribute with "_" would solve this, or you could override get_config. There is actually a good argument for wanting to serialise the full state, since you can create an array without filling it or fill only part and come back to it later. I don't know how you would approach that without storing some indicator of the function in the metadata.
Those are both great suggestions. I'll advance one more: add a stripped down xbitinfo to the bitround
codec. The implementation looks fairly minimal. Let's see what comes from observingClouds/xbitinfo/issues/257 then reassess.
A heads up that depending on custom code will impact the portability of the codec across language implementations.
observingClouds/xbitinfo/issues/257: their preference is my original suggestion. Potentially yielding something like:
from xbitinfo import helper_function
from numcodecs import Blosc, BitRound
compressor = Blosc(cname="zstd", clevel=3)
filters = [BitRound(helper_function)]
encoding = {"precip": {"compressor": compressor, "filters": filters}}
ds.to_zarr(<file_name>, encoding=encoding)
@martindurant, I tried your suggestion about prefixing, but I'm getting the general impression this approach won't work. In this example
from xbitinfo import helper_function
from numcodecs import Blosc, BitRound
compressor = Blosc(cname="zstd", clevel=3)
filters = [BitRound(helper_function)]
encoding = {"precip": {"compressor": compressor, "filters": filters}}
ds.to_zarr(<file_name>, encoding=encoding)
the object created by BitRound(helper_function)
isn't actually passed through to the filtering step, only the dictionary. The filter is recreated from that dictionary before it is applied. In other words, if I prefix with _
, that object won't get passed through the dictionary, and won't be accessible at filter time.
If that impression is correct, then reimplementing a basic bitinformation algorithm in numcodecs
might be the simplest route.
if I prefix with _, that object won't get passed through the dictionary, and won't be accessible at filter time.
Ah, sorry
Hello @thodson-usgs! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
numcodecs/__init__.py
:Rather than including this in bitround
, lets create a bitinfo
codec that calls bitround
. I've drafted the PR and solicited feedback from the xbitinfo
devs. I will begin writing tests.
@observingClouds, @martindurant
As my last commit indicates, I've reverted the codec to match the functionality of the xbitinfo implementation (albeit all computation will be done chunk-wise when called through ds.to_zarr()
) I'm happy to continue working across libraries to improve functionality and reduce duplication; however, further development needs to come from xbitinfo, not the codec. In the meantime, I do not plan any further changes to this PR except any additional tests or bug fixes as needed.
@martindurant, sorry to bother. Would you run these checks, or have I missed something? Thanks
I have no idea why the checks aren't running :|
Maybe it's waiting for me to tick all the boxes? Seems a bit of a catch 22, but I'll give it a try.
Let me know if this fixed it
git commit --amend --no-edit
git push --force-with-lease
otherwise, I'll open a new PR
Windows and OSX builds failing with:
900+ lines of errors and warnings...
× Building editable for numcodecs (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> See above for output.
note: This error originates from a subprocess, and is likely not a problem with pip.
full command: /usr/local/miniconda/envs/env/bin/python /usr/local/miniconda/envs/env/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py build_editable /var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/tmp4l4qy_42
cwd: /Users/runner/work/numcodecs/numcodecs
Building editable for numcodecs (pyproject.toml): finished with status 'error'
ERROR: Failed building editable for numcodecs
Failed to build numcodecs
ERROR: Could not build wheels for numcodecs, which is required to install pyproject.toml-based projects
Error: Process completed with exit code 1.
I'm not sure what happened here, but I'll try to replicate this locally.
The Windows build failed because I had a test using dtype=float128
, but I don't understand why OSX is failing to install numcodecs
with python>=3.10. Can we run these one more time before I go digging into OSX?
Indeed, just osx left.
Added a usage example and rebased. I have still not resolved why the OSX 3.10 build failed. Hopefully later this week.
oh darn, my doc example gets tests too! 😞
Well, this passed just fine in a clean, local environment.
But the GitHub runner is erroring with
...
043 >>> import xarray as xr
044 >>> ds = xr.tutorial.open_dataset("air_temperature")
045 >>> from numcodecs import Blosc, BitInfo
046 >>> compressor = Blosc(cname="zstd", clevel=3)
047 >>> filters = [BitInfo(info_level=0.99)]
048 >>> encoding = {"air": {"compressor": compressor, "filters": filters}}
049 >>> _ = ds.to_zarr('xbit.zarr', mode="w", encoding=encoding)
UNEXPECTED EXCEPTION: ImportError("cannot import name 'MutableMapping' from 'collections' (/usr/share/miniconda/envs/env/lib/python3.11/collections/__init__.py)")
Traceback (most recent call last):
File "/usr/share/miniconda/envs/env/lib/python3.11/doctest.py", line 1355, in __run
exec(compile(example.source, filename, "single",
File "<doctest numcodecs.bitinfo.BitInfo[6]>", line 1, in <module>
File "/usr/share/miniconda/envs/env/lib/python3.11/site-packages/xarray/core/dataset.py", line 2520, in to_zarr
return to_zarr( # type: ignore[call-overload,misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...
Unless there is an obvious fix, I will keep the usage example but remove it from the doctest, then run some test builds on OSX, then resubmit.
I suggest you remove the example, perhaps marking it as doctest skip.
(investigating local OSX build now)
@martindurant, the problem with OSX seems to be a broader issue. Note the previous commit to main is also failing (#527). Perhaps we rerun the tests and proceed with any other issues, while we wait for an OSX fix.
rebased on main
Add a new
bitinfo
codec, which reimplements anumpy
-based version of the bitrounding algorithm from the Julia packageBitInformation.jl
(and the Python packagexbitinfo
). When used withds.to_zarr
, the codec would compute the real information chunk-wise, which yields better results for data with variable information content.TODO: