zarr-developers / zarr-python

An implementation of chunked, compressed, N-dimensional arrays for Python.
https://zarr.readthedocs.io
MIT License

`nbytes_stored` incorrect when `dimension_separator="/"` #2174

Open dstansby opened 1 month ago

dstansby commented 1 month ago

Zarr version

2.18.2

Numcodecs version

0.13.0

Python Version

3.10.4

Operating System

macOS

Installation

conda

Description

When an array is saved to disk with `dimension_separator="/"` and then reopened, `nbytes_stored` reports the wrong number of stored bytes. In this case it reports only the size of the `.zarray` file.

Steps to reproduce

import numpy as np
import zarr

zarr_path = "test.zarr"
data = np.random.randint(0, 2**8, size=(64, 64, 64), dtype=np.uint8)

for dimension_separator in [".", "/"]:
    zarr.save_array(zarr_path, data, dimension_separator=dimension_separator)
    zarr_arr = zarr.open(zarr_path)
    print(f"{dimension_separator=}")
    print("nbytes_stored:", zarr_arr.nbytes_stored)
    print()

Output

dimension_separator='.'
nbytes_stored: 262567

dimension_separator='/'
nbytes_stored: 391

Additional output

No response

kabilar commented 1 month ago

+1 Thank you, @dstansby.

dstansby commented 1 month ago

The root cause of this is https://github.com/zarr-developers/zarr-python/issues/253, but I'll leave this open as it gives a nice self-contained example of the issue.

Note that for OME-zarr the default separator is `/`, so currently zarr-python v2 will report the wrong size for all OME-zarr arrays 😱
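
As a possible stopgap for arrays stored in a local directory, the on-disk size can be measured by walking the directory tree instead of reading `nbytes_stored`. This is only a minimal sketch under that assumption: `directory_nbytes` is a hypothetical helper, not part of zarr-python, and it will not work for remote or zipped stores.

```python
import os


def directory_nbytes(path):
    """Sum the sizes of all regular files under `path`, including subdirectories.

    Hypothetical helper (not a zarr-python API). Assumes the array lives in a
    plain directory on the local filesystem.
    """
    total = 0
    # os.walk descends into nested chunk directories such as 0/0/0,
    # which is the layout produced by dimension_separator="/".
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total
```

Calling `directory_nbytes("test.zarr")` after the save in the reproduction above should count the nested chunk files as well as `.zarray`, so the result should be comparable for both separators.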

kabilar commented 1 month ago

I believe the "Chunks initialized" count (`nchunks_initialized`) is also incorrect. See the example below.

import numpy as np
import zarr

zarr_path = "test.zarr"
data = np.random.randint(0, 2**8, size=(1000, 1000), dtype=np.uint8)

for dimension_separator in [".", "/"]:
    zarr.save_array(zarr_path, data, chunks=(100,100), dimension_separator=dimension_separator)
    zarr_arr = zarr.open(zarr_path)
    print(f"{dimension_separator=}")
    print("nbytes_stored:", zarr_arr.nbytes_stored)
    print("nchunks_initialized:", zarr_arr.nchunks_initialized)
    print()

Output

dimension_separator='.'
nbytes_stored: 1001973
nchunks_initialized: 100

dimension_separator='/'
nbytes_stored: 373
nchunks_initialized: 10
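
For reference, a 1000 x 1000 array with 100 x 100 chunks has a 10 x 10 chunk grid, so 100 chunks is the expected count. With the `/` separator those chunks sit in ten top-level directories (`0` to `9`), which is consistent with the reported value of 10. The number of chunk files actually on disk can be cross-checked with a short sketch like the one below. It assumes a local directory store and that metadata files are the only files whose names start with a dot (`.zarray`, `.zattrs`); `count_chunk_files` is a hypothetical helper, not a zarr-python function.

```python
import os


def count_chunk_files(path):
    """Count chunk files under `path`, recursing into nested chunk directories.

    Hypothetical helper (not a zarr-python API). Assumes a local directory
    store in which every file not starting with "." is a chunk.
    """
    count = 0
    for _root, _dirs, files in os.walk(path):
        # Skip metadata files such as .zarray and .zattrs.
        count += sum(1 for name in files if not name.startswith("."))
    return count
```

Because it walks the whole tree, this counts chunks stored flat as `0.0` and chunks stored nested as `0/0` in the same way.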