zarr-developers / zarr-python

An implementation of chunked, compressed, N-dimensional arrays for Python.
http://zarr.readthedocs.io/

write behavior for empty chunks #2015

Open d-v-b opened 3 days ago

d-v-b commented 3 days ago

In v2, it is possible to set at array access time whether empty chunks (defined as chunks that are entirely fill_value) should be written to storage or skipped. This is an extremely useful feature for high-latency storage backends, or in any context where a large number of objects in storage is burdensome.
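For context, a minimal sketch of the v2 behavior being described, using the `write_empty_chunks` keyword exposed when creating/opening an array in zarr-python 2.x (the path, shape, and chunking below are illustrative):

```
import zarr

# zarr-python 2.x: write_empty_chunks=False skips storing chunks that are
# entirely fill_value, so only chunks with real data become objects in storage.
z = zarr.open(
    "example.zarr",
    mode="w",
    shape=(100, 100),
    chunks=(10, 10),
    dtype="f8",
    fill_value=0,
    write_empty_chunks=False,
)
z[:] = 0          # every chunk equals fill_value, so nothing is written
z[:10, :10] = 1.0 # only this non-empty chunk becomes an object in the store
```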

We don't support this in v3 yet, but we should. How should we do it? A few options, in rough order of practicality: expose write_empty_chunks as a keyword argument at array access time (as in v2), or control it through some kind of global configuration.

The first option seems pretty expedient, and I don't think we had many problems with this approach in v2. The only drawback is that if people want the same array to exhibit different write_empty_chunks behavior in different contexts, they might need something like the second approach, which has its own drawbacks IMO (I'm not a big fan of mutable global state).

I would propose that we emulate v2 for now (i.e., make write_empty_chunks a keyword argument to array access) and note any friction this causes, and consider ways to alleviate that in a subsequent design refresh if the friction is severe.
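A hypothetical sketch of what that could look like in v3; the exact function and keyword placement here are assumptions for illustration, not settled API:

```
import zarr

# Hypothetical: write_empty_chunks as a per-access keyword, mirroring v2.
# Two handles to the same stored array can then behave differently without
# relying on any global or mutable shared state.
arr_sparse = zarr.open_array("data.zarr", mode="a", write_empty_chunks=False)
arr_dense = zarr.open_array("data.zarr", mode="a", write_empty_chunks=True)
```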

cc @constantinpape

constantinpape commented 3 days ago

My 2 cents: I personally think this (i.e., skipping empty chunks) should be the default behavior. We have use-cases where writing empty chunks is extremely bad (as in bringing us over our IO node quota immediately and killing everyone's jobs), and I am hesitant to hand students a library that writes them by default.

Besides this, both option 1 and option 2 sound OK to me.

d-v-b commented 3 days ago

> My 2 cents: I personally think this should be the default behavior.

I tend to agree. The one counter-point is that for dense arrays, i.e. arrays where every chunk contains useful data, inspecting each chunk adds some overhead to writing. But I think our design should err on the side of spending some extra CPU cycles over potentially generating massive numbers of useless files.
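For concreteness, a minimal sketch of the per-chunk check being discussed; `chunk_is_empty` is an illustrative helper, not necessarily how zarr-python implements it:

```
import numpy as np


def chunk_is_empty(chunk: np.ndarray, fill_value) -> bool:
    """True if every element of the chunk equals fill_value (illustrative)."""
    if fill_value is None:
        return False
    # NaN fill values need special handling because NaN != NaN.
    if isinstance(fill_value, float) and np.isnan(fill_value):
        return bool(np.isnan(chunk).all())
    # array_equiv broadcasts the scalar fill_value against the chunk.
    return bool(np.array_equiv(chunk, fill_value))


# A writer would call this once per chunk and skip the store round-trip when
# it returns True, trading one pass over the data for an avoided object write.
```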

rabernat commented 2 days ago

> inspecting each chunk adds some overhead to writing

Here is some data on that from my laptop

| dtype | bytes | time (s) | throughput (MB/s) |
|-------|-------|----------|-------------------|
| i2 | 2000 | 1.40e-04 | 14.31 |
| i2 | 2000000 | 2.53e-04 | 7897.34 |
| i2 | 2000000000 | 6.26e-01 | 3195.49 |
| i4 | 4000 | 9.53e-05 | 41.98 |
| i4 | 4000000 | 2.01e-04 | 19863.44 |
| i4 | 4000000000 | 7.60e-01 | 5261.13 |
| i8 | 8000 | 3.64e-05 | 219.93 |
| i8 | 8000000 | 2.54e-04 | 31506.36 |
| i8 | 8000000000 | 1.24e+00 | 6441.56 |
| f4 | 4000 | 6.52e-05 | 61.38 |
| f4 | 4000000 | 2.60e-04 | 15382.13 |
| f4 | 4000000000 | 7.29e-01 | 5490.37 |
| f8 | 8000 | 4.01e-05 | 199.38 |
| f8 | 8000000 | 2.49e-04 | 32074.79 |
| f8 | 8000000000 | 1.19e+00 | 6706.42 |

Given that the throughput is many GB/s for larger chunk sizes, this seems unlikely to be a rate limiting step for most I/O bound workloads against disk or cloud storage. So I think it's a fine default.

Does it make sense to think of this as an Array -> Array codec which may just abort the entire writing pipeline?

```
from time import perf_counter  # missing from the original snippet

import numpy as np

for dtype in ['i2', 'i4', 'i8', 'f4', 'f8']:
    for n in [1000, 1_000_000, 1_000_000_000]:
        data = np.zeros(n, dtype=dtype)
        nbytes = data.nbytes
        tic = perf_counter()
        np.any(data > 0)  # the emptiness check being timed
        toc = perf_counter() - tic
        throughput = nbytes / toc / 1e6  # MB/s
        print(f"| {dtype} | {nbytes} | {toc:3.2e} | {throughput:4.2f} |")
```
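To make the codec question above more concrete, here is a rough sketch of an Array -> Array stage that signals "skip this chunk" by returning None; the `EmptyChunkFilter` class and its `encode`/`decode` interface are assumptions for illustration, not zarr-python's actual codec API:

```
from typing import Optional

import numpy as np


class EmptyChunkFilter:
    """Illustrative Array -> Array stage; not zarr-python's actual codec API."""

    def __init__(self, fill_value):
        self.fill_value = fill_value

    def encode(self, chunk: np.ndarray) -> Optional[np.ndarray]:
        # Returning None would tell a (hypothetical) write pipeline to skip
        # this chunk, or to delete any existing stored object for it.
        if np.array_equiv(chunk, self.fill_value):
            return None
        return chunk

    def decode(self, chunk: Optional[np.ndarray], shape, dtype) -> np.ndarray:
        # A chunk that was never written decodes to an array of fill_value.
        if chunk is None:
            return np.full(shape, self.fill_value, dtype=dtype)
        return chunk
```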