Closed davidbrochart closed 3 years ago
Thanks for working on this! I see the same behavior locally as on the CI here.
I could be wrong (I haven't worked with the Blosc library previously), but it might be related to the addition of `BLOSC_MAX_OVERHEAD` in this line. (I think it was correct to include `BLOSC_MAX_OVERHEAD` in the `sizeOut` computation several lines above, but it does not need to also be added to `sizeCompressed`.)
Apparently `BLOSC_MAX_OVERHEAD` is 16.
Thanks for looking into it @grlee77. zarr-python seems to be more tolerant, as it can read z5py-zarr-blosc successfully, but that doesn't mean these trailing 16 bytes are valid. I'll investigate before we merge this PR.
Let me know what you find, should hopefully be a simple fix in z5. Maybe I am accidentally padding something when writing blosc.
python-blosc doesn't seem to be able to read the chunks either:
```python
>>> import blosc
>>> with open("data/z5py.zr/blosc/lz4/0.0.0", "rb") as f:
...     data = f.read()
...
>>> d = blosc.decompress(data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/david/mambaforge/envs/zarr_implementations-dev/lib/python3.8/site-packages/blosc/toplevel.py", line 594, in decompress
    return _ext.decompress(bytes_like, as_bytearray)
blosc_extension.error: Error 10032 : not a Blosc buffer or header info is corrupted
```
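For context on the error above: the first 16 bytes of a Blosc v1 frame are its header, and bytes 12–15 store `cbytes`, the total compressed size including the header itself, which is exactly what `BLOSC_MAX_OVERHEAD` accounts for. A minimal sketch (standard library only, with a hand-built header rather than real compressed data) of how one could count trailing padding in a chunk file:

```python
import struct

def blosc_trailing_bytes(chunk: bytes) -> int:
    # Blosc v1 header: version, versionlz, flags, typesize (1 byte each),
    # then nbytes, blocksize, cbytes as little-endian uint32 values.
    # cbytes is the total compressed size *including* the 16-byte header.
    cbytes, = struct.unpack_from("<I", chunk, 12)
    return len(chunk) - cbytes

# Hand-built header for a tiny chunk: 4 uncompressed bytes, cbytes = 20.
header = struct.pack("<BBBBIII", 2, 1, 0, 1, 4, 4, 20)
chunk = header + b"\x00" * 4
print(blosc_trailing_bytes(chunk))                 # 0: sizes agree
print(blosc_trailing_bytes(chunk + b"\x00" * 16))  # 16: padded chunk
```

A chunk whose file size exceeds `cbytes` by 16 would match the symptom discussed here, and a strict reader can reasonably reject such a buffer as corrupted.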
@constantinpape I think you should not store the maximum possible size of the compressed data. Blosc gives you the actual size after it has compressed; you can see how we do it in xtensor-io.
Thanks, I will have a look.
Hopefully it is as simple as removing `BLOSC_MAX_OVERHEAD` from this line.
I had an initial look and I can reproduce this locally. Unfortunately, just removing the overhead here does not fix the issue. But maybe I missed something; I'll have a closer look when I have time (not sure if I'll manage before the weekend).
I had another look, and removing the `BLOSC_MAX_OVERHEAD` fixes the issue; I just didn't see it locally for some other reason.
I am trying to fix the Windows CI in z5 (https://github.com/constantinpape/z5/pull/181) and will then draft a release to fix the tests here as well.
Ok, I drafted a new release. (The Windows build issues seem to be unrelated: https://github.com/ilammy/msvc-dev-cmd/issues/34.) It should be on conda-forge in the next few days, then we can update the dependencies here.
I updated the env so that the correct z5py version is installed. That seems to work, but now a bunch of other tests fail. I am not quite sure what's going on.
It looks like there is a new `nested` parameter for the read functions? I will need to implement that.
I see, that's probably due to the zarr release that happened in the meantime. This might be solved by #33 already and it would be enough to rebase onto master. Maybe @grlee77 or @joshmoore could comment.
I just rebased, but that is not enough. I guess that this `nested` argument is related to the dimension separator; I'm going to look into it.
@davidbrochart, I can help take a look at this. I think the issue is that the nested vs. flat handling here was implemented prior to `dimension_separator`, so I was relying on parsing the filenames to determine what the separator was.
Basically there are v2 files where `dimension_separator` is `/`, but I don't think the `.zr` files currently have that attribute in the JSON; it is just being manually specified at read time, based on the filename having "nested" or "flat" in it, as here:
https://github.com/zarr-developers/zarr_implementations/blob/7349e03aa8622982611c13ad9af48a4937173546/test/test_read_all.py#L73-L84
I think Josh said that the `key_separator` argument may be going away, so I should update the zarr-python generators/tests to rely on the `dimension_separator` key instead.
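The approach described above can be sketched as follows (a hedged sketch; `get_separator` is a hypothetical helper, not code from this repository): read the separator from the `dimension_separator` key in the `.zarray` metadata rather than guessing it from the directory name, falling back to the zarr v2 default of `"."`:

```python
import json

def get_separator(zarray_text: str) -> str:
    """Hypothetical helper: read the chunk-key separator from .zarray
    metadata, defaulting to "." as the zarr v2 spec prescribes."""
    return json.loads(zarray_text).get("dimension_separator", ".")

print(get_separator('{"dimension_separator": "/"}'))  # /  (nested chunks)
print(get_separator('{"zarr_format": 2}'))            # .  (flat chunks)
```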
This PR will require zarr-python >= 2.8 so that the `dimension_separator` metadata key gets used. The conda-forge package for 2.8 was just uploaded, so it should work now. See: https://github.com/davidbrochart/zarr_implementations/pull/1
Thanks @grlee77, all green now!
🚀 Merging so I can rebase the jzarr PR.
It looks like xtensor-zarr cannot read zarr v2 written by z5py using blosc compression. What I can see: the metadata has `"fill_value": 0.0`, which is not a valid unsigned 8-bit integer value, although it can be cast to one.
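A tolerant reader could accept such a float fill value when it is exactly representable in the target dtype. A minimal sketch in pure Python (`coerce_uint8_fill` is a hypothetical helper; the `u1` bounds 0–255 are the only assumption):

```python
def coerce_uint8_fill(fill_value) -> int:
    """Cast a JSON fill_value (possibly a float like 0.0) to uint8,
    rejecting values that are not exactly representable."""
    if not float(fill_value).is_integer():
        raise ValueError(f"{fill_value!r} is not an integer value")
    value = int(fill_value)
    if not 0 <= value <= 255:
        raise ValueError(f"{value} is out of range for uint8")
    return value

print(coerce_uint8_fill(0.0))  # 0
```

A stricter reader that compares the JSON type against the dtype directly would reject `0.0` outright, which would explain the behavior seen here.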