zarr-developers / numcodecs

A Python package providing buffer compression and transformation codecs for use in data storage and communication applications.
http://numcodecs.readthedocs.io
MIT License
128 stars 88 forks source link

Support `pcodec` v0.3 #639

Open slevang opened 1 week ago

slevang commented 1 week ago

Compatibility for the new delta_spec argument and modes in pcodec.ChunkConfig. Maintains backwards compatibility if only delta_encoding_order is passed. Also adds the paging_spec arg.

Closes #623

TODO:

codecov[bot] commented 1 week ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 99.89%. Comparing base (2309714) to head (39e454a).

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #639 +/- ## ======================================= Coverage 99.88% 99.89% ======================================= Files 62 62 Lines 2723 2752 +29 ======================================= + Hits 2720 2749 +29 Misses 3 3 ``` | [Files with missing lines](https://app.codecov.io/gh/zarr-developers/numcodecs/pull/639?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=zarr-developers) | Coverage Δ | | |---|---|---| | [numcodecs/pcodec.py](https://app.codecov.io/gh/zarr-developers/numcodecs/pull/639?src=pr&el=tree&filepath=numcodecs%2Fpcodec.py&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=zarr-developers#diff-bnVtY29kZWNzL3Bjb2RlYy5weQ==) | `100.00% <100.00%> (ø)` | | | [numcodecs/tests/test\_pcodec.py](https://app.codecov.io/gh/zarr-developers/numcodecs/pull/639?src=pr&el=tree&filepath=numcodecs%2Ftests%2Ftest_pcodec.py&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=zarr-developers#diff-bnVtY29kZWNzL3Rlc3RzL3Rlc3RfcGNvZGVjLnB5) | `100.00% <100.00%> (ø)` | |

🚨 Try these New Features:

slevang commented 1 week ago

Maybe we should expand the numcodecs API to support future specs?

delta_spec: Literal["auto", "none", "try_consecutive", "try_lookback"] = "auto" # ignored if <v0.3
delta_encoding_order: Optional[int] = None # ignored if >=v0.3 and spec != try_consecutive?

Or accept actual ModeSpec/DeltaSpec objects, with sensible auto defaults?

cc @rabernat @mwlon

rabernat commented 1 week ago

Maybe we should expand the numcodecs API to support future specs?

Could you say more about what specifically you mean here? I don't quite get the context for this comment.

slevang commented 1 week ago

@mwlon presumably understands the underlying changes much better than I do. But the gist is pcodec now takes a more flexible DeltaSpec object, rather than a simple int that routes to auto/consecutive modes.

This broke the numcodecs interface on the latest release, so I'm just trying to figure out the right path to support any future additions without breaking existing numcodecs users.

mwlon commented 1 week ago

Can we just bump pcodec version to >=0.3,<0.4 so we don't need to support both code paths?

You're correct: I added a new type of delta encoding, now handled by delta spec. This should be harder to break API wise in the future.

slevang commented 1 week ago

Yeah that makes a lot more sense :+1:

rabernat commented 1 week ago

A key consideration here, and an important priority for numcodecs, is backwards compatibility. Ideally any data written by Zarr will be readable for a long time into the future. This means that breaking changes in the codec parameters, which would cause decoding of existing data to fail, should be avoided.

This could be relaxed if we had some sort of versioning system for codecs. But unfortunately we don't.

For any of the proposals above, I would ask these questions:

slevang commented 1 week ago

I believe these are purely API changes and feature additions, since the ChunkConfig is only involved in setting up the encoding step. Presumably everything that's needed for decoding is self contained.

Would be good for @mwlon to confirm though. And I do wonder if the new modes (e.g. DeltaSpec.try_lookback, ModeSpec.try_float_quant) can be decoded by older versions of pcodec, although that's much less important.

Is the change future-proof to additional future evolution of the codec parameters?

Seems all of the config API has evolved to these spec objects, so maybe now is a good time to make the numcodecs API match fully:

level: int = 8,
delta_spec: DeltaSpec | None = None,
mode_spec: ModeSpec | None = None,
paging_spec: PagingSpec | None = None,

and the user is responsible for passing through an explicit pcodec object if they want custom behavior.

mwlon commented 1 week ago

Can we still read old data written with earlier versions of the codec?

Yes! Pcodec will always be able to decode older format versions.

And I think I agree with @slevang 's last comment: I'm in favor of passing the pcodec specs straight through to keep everything simple. Presumably we can convert non-spec kwargs into these (based on type) for numcodec's API backward compatibility. @rabernat do you agree?

slevang commented 1 week ago

Presumably we can convert non-spec kwargs into these (based on type) for numcodec's API backward compatibility.

It's entirely possible Ryan and I are the only ones actually using numcodecs.pcodec, and I've never used any options besides level. So API backwards compat is probably not super important here :slightly_smiling_face:

slevang commented 1 week ago

and the user is responsible for passing through an explicit pcodec object if they want custom behavior

Ah nevermind I think, because numcodecs expects the codec config to be json-serializable which is a requirement of zarr. I guess we'll have to just pile on the kwargs to support additional features.

rabernat commented 1 week ago

It's entirely possible Ryan and I are the only ones actually using numcodecs.pcodec,

Definitely not the case. We have customers with petabytes of Zarr data (now a little bit less! 🙌 ) encoded with pcodec.

slevang commented 1 week ago

Ok new approach, keep the string style args but also support delta_spec and paging_spec. It would be possible to add support for the try mode specs now, but wasn't straightforward to integrate with the existing tests since these are dtype specific. So I skipped it.