zarr-developers / zarr_implementations

MIT License

Read with xtensor-zarr, support v3 #34

Closed davidbrochart closed 3 years ago

davidbrochart commented 3 years ago

It looks like xtensor-zarr cannot read zarr v2 data written by z5py with blosc compression. Here is what I can see:

grlee77 commented 3 years ago

Thanks for working on this! I see the same behavior locally as on the CI here.

I could be wrong (I haven't worked with the BLOSC library previously), but it might be related to the addition of BLOSC_MAX_OVERHEAD in this line. (I think it was correct to include BLOSC_MAX_OVERHEAD in the sizeOut computation several lines above, but it does not also need to be added to sizeCompressed.)

Apparently BLOSC_MAX_OVERHEAD is 16
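The sizing mistake being diagnosed here can be sketched with a stdlib stand-in (zlib, since blosc itself is not in the Python standard library; the buffer arithmetic is the same): reserving source size plus overhead for the *output buffer* is fine, but the bytes actually persisted must match what the compressor returned.

```python
import zlib

BLOSC_MAX_OVERHEAD = 16  # the value noted above

def write_chunk(raw: bytes) -> bytes:
    # Allocating an output buffer of len(raw) + BLOSC_MAX_OVERHEAD
    # (the sizeOut computation) is a valid upper bound.
    size_out = len(raw) + BLOSC_MAX_OVERHEAD
    compressed = zlib.compress(raw)  # stand-in for blosc_compress
    # The compressor reports the *actual* number of bytes produced;
    # that value -- not size_out, and not actual + overhead -- is what
    # must be written to the chunk file (sizeCompressed).
    size_compressed = len(compressed)
    assert size_compressed <= size_out
    return compressed[:size_compressed]

raw = b"z" * 4096
chunk = write_chunk(raw)
assert zlib.decompress(chunk) == raw
# The bug: persisting size_compressed + BLOSC_MAX_OVERHEAD would leave
# 16 trailing bytes that a strict reader rejects as corrupt.
```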

davidbrochart commented 3 years ago

> I could be wrong (I haven't worked with the BLOSC library previously), but it might be related to the addition of BLOSC_MAX_OVERHEAD in this line. (I think it was correct to include BLOSC_MAX_OVERHEAD in the sizeOut computation several lines above, but it does not also need to be added to sizeCompressed.)

Thanks for looking into it @grlee77. zarr-python seems to be more tolerant, as it can read z5py-zarr-blosc successfully, but that doesn't mean these trailing 16 bytes are valid. I'll investigate before we merge this PR.

constantinpape commented 3 years ago

> Thanks for looking into it @grlee77. zarr-python seems to be more tolerant, as it can read z5py-zarr-blosc successfully, but that doesn't mean these trailing 16 bytes are valid. I'll investigate before we merge this PR.

Let me know what you find, should hopefully be a simple fix in z5. Maybe I am accidentally padding something when writing blosc.

davidbrochart commented 3 years ago

python-blosc doesn't seem to be able to read the chunks either:

>>> import blosc
>>> with open("data/z5py.zr/blosc/lz4/0.0.0", "rb") as f:
...     data = f.read()
... 
>>> d = blosc.decompress(data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/david/mambaforge/envs/zarr_implementations-dev/lib/python3.8/site-packages/blosc/toplevel.py", line 594, in decompress
    return _ext.decompress(bytes_like, as_bytearray)
blosc_extension.error: Error 10032 : not a Blosc buffer or header info is corrupted

@constantinpape I think you should not store the maximum possible size of the compressed data. Blosc gives you the actual size after compression; you can see how we do it in xtensor-io.
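Concretely, the 16-byte blosc (v1) chunk header records the compressed size (cbytes) in its last four bytes, and a strict reader compares that field against the length of the buffer it was handed, which is why 16 bytes of padding produce a "header info is corrupted" error. A minimal stdlib sketch of that check (the hand-built header below is illustrative only, not a real compressed payload):

```python
import struct

BLOSC_HEADER_LEN = 16

def validate_blosc_buffer(buf: bytes) -> int:
    """Return the uncompressed size (nbytes) if the buffer is consistent.

    Mimics the strict check that makes python-blosc reject padded chunks:
    the header's cbytes field must equal the length of the buffer itself.
    """
    if len(buf) < BLOSC_HEADER_LEN:
        raise ValueError("not a Blosc buffer")
    nbytes, blocksize, cbytes = struct.unpack_from("<III", buf, 4)
    if cbytes != len(buf):
        raise ValueError("header info is corrupted (cbytes != buffer length)")
    return nbytes

payload = b"\xaa" * 24  # placeholder bytes, not real compressed data
header = struct.pack(
    "<BBBBIII",
    2,    # format version
    1,    # compressor version
    0,    # flags
    1,    # typesize
    64,   # nbytes: uncompressed size
    64,   # blocksize
    BLOSC_HEADER_LEN + len(payload),  # cbytes: total buffer length
)
chunk = header + payload

assert validate_blosc_buffer(chunk) == 64   # consistent buffer: accepted
```

Appending 16 zero bytes to `chunk` makes `len(buf)` disagree with the stored cbytes, and the validator raises, matching the traceback above.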

constantinpape commented 3 years ago

Thanks, I will have a look.

grlee77 commented 3 years ago

> Let me know what you find, should hopefully be a simple fix in z5. Maybe I am accidentally padding something when writing blosc.

Hopefully it is as simple as removing BLOSC_MAX_OVERHEAD from this line.

constantinpape commented 3 years ago

I had an initial look and I can reproduce this locally. Unfortunately, just removing the overhead here does not fix the issue. But maybe I missed something; I'll take a closer look when I have time (not sure if I'll manage before the weekend).

constantinpape commented 3 years ago

I had another look, and removing the BLOSC_MAX_OVERHEAD fixes the issue; I just didn't see it locally for some other reason. I am trying to fix the windows ci in z5 (https://github.com/constantinpape/z5/pull/181) and will then draft a release to fix the tests here as well.

constantinpape commented 3 years ago

Ok, I drafted a new release. (The Windows build issues seem to be unrelated: https://github.com/ilammy/msvc-dev-cmd/issues/34.) It should be on conda-forge in the next few days, and then we can update the dependencies here.

constantinpape commented 3 years ago

I updated the env so that the correct z5py version is installed. That seems to work, but now a bunch of other tests fail. I am not quite sure what's going on.

davidbrochart commented 3 years ago

It looks like there is a new nested parameter for the read functions? I will need to implement that.

constantinpape commented 3 years ago

> It looks like there is a new nested parameter for the read functions? I will need to implement that.

I see, that's probably due to the zarr release that happened in the meantime. This might already be solved by #33, and it would be enough to rebase onto master. Maybe @grlee77 or @joshmoore could comment.

davidbrochart commented 3 years ago

I just rebased, but that is not enough. I guess this nested argument is related to the dimension separator; I'm going to look into it.

grlee77 commented 3 years ago

@davidbrochart, I can help take a look at this. I think the issue is that the nested vs. flat handling here was implemented prior to dimension_separator, so I was relying on parsing the filenames to determine what the separator was.

Basically, there are v2 files where 'dimension_separator' is '/', but I don't think the .zr files currently have that attribute in their JSON; it is just being manually specified at read time, based on the filename containing "nested" or "flat", like here: https://github.com/zarr-developers/zarr_implementations/blob/7349e03aa8622982611c13ad9af48a4937173546/test/test_read_all.py#L73-L84
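A hedged sketch of the direction described here — prefer an explicit dimension_separator key in the .zarray metadata, and fall back to the "nested"/"flat" filename convention only when the key is absent. The helper name and fallback logic are illustrative, not the actual test code:

```python
import json
from pathlib import Path

def guess_separator(array_dir: Path) -> str:
    """Chunk-key separator for a zarr v2 array directory.

    Prefer the 'dimension_separator' metadata key (written by newer
    zarr-python); fall back to the 'nested'/'flat' naming convention
    used by the test matrix when the key is missing.
    """
    meta = json.loads((array_dir / ".zarray").read_text())
    sep = meta.get("dimension_separator")
    if sep in ("/", "."):
        return sep
    # Fallback: infer from the directory name, as the current tests do.
    return "/" if "nested" in array_dir.name else "."
```

This keeps existing fixtures working while letting newly written metadata take precedence.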

I think Josh said the key_separator argument may be going away, so I should update the zarr-python generators/tests to rely on the dimension_separator key instead.

grlee77 commented 3 years ago

This PR will require zarr-python >= 2.8 so that the dimension_separator metadata key gets used. The conda-forge package for 2.8 was just uploaded, so it should work now. See: https://github.com/davidbrochart/zarr_implementations/pull/1
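For reference, with zarr-python >= 2.8 the v2 .zarray metadata can carry the separator explicitly, so readers no longer need to infer it from filenames. A sketch of such a metadata document (field values here are illustrative, not taken from this repo's fixtures):

```json
{
    "zarr_format": 2,
    "shape": [1000, 1000],
    "chunks": [100, 100],
    "dtype": "<u1",
    "compressor": {"id": "blosc", "cname": "lz4", "clevel": 5, "shuffle": 1},
    "fill_value": 0,
    "order": "C",
    "filters": null,
    "dimension_separator": "/"
}
```

With "dimension_separator": "/", the chunk at index (0, 0) is stored under the nested key 0/0 rather than the flat 0.0.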

davidbrochart commented 3 years ago

Thanks @grlee77, all green now!

joshmoore commented 3 years ago

🚀 Merging so I can rebase the jzarr PR.