zarr-developers / zarr-python

An implementation of chunked, compressed, N-dimensional arrays for Python.
http://zarr.readthedocs.io/

write behavior for empty chunks #2015

Open d-v-b opened 3 days ago

d-v-b commented 3 days ago

In v2, it is possible to set at array access time whether empty chunks (defined as chunks that are entirely fill_value) should be written to storage or skipped. This is an extremely useful feature for high-latency storage backends, or in any context where a large number of objects in storage is burdensome.
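For context, a minimal sketch of the v2 behavior being described, using the `write_empty_chunks` keyword exposed when creating/opening an array in zarr-python 2.x (the path, shape, and chunking below are illustrative):

```
import zarr

# zarr-python 2.x: write_empty_chunks=False skips storing chunks that are
# entirely fill_value, so only chunks with real data become objects in storage.
z = zarr.open(
    "example.zarr",
    mode="w",
    shape=(100, 100),
    chunks=(10, 10),
    dtype="f8",
    fill_value=0,
    write_empty_chunks=False,
)
z[:] = 0          # every chunk equals fill_value, so nothing is written
z[:10, :10] = 1.0 # only this non-empty chunk becomes an object in the store
```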

We don't support this in v3 yet, but we should. How should we do it? A few options, in rough order of practicality: expose write_empty_chunks as a keyword argument at array access time (as in v2), or control it through some kind of global configuration.

The first option seems pretty expedient, and I don't think we had many problems with this approach in v2. The only drawback is that if people want the same array to exhibit different write_empty_chunks behavior in different contexts, they might need something like the second approach, which has its own drawbacks IMO (I'm not a big fan of mutable global state).

I would propose that we emulate v2 for now (i.e., make write_empty_chunks a keyword argument to array access) and note any friction this causes, and consider ways to alleviate that in a subsequent design refresh if the friction is severe.
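A hypothetical sketch of what that could look like in v3; the exact function and keyword placement here are assumptions for illustration, not settled API:

```
import zarr

# Hypothetical: write_empty_chunks as a per-access keyword, mirroring v2.
# Two handles to the same stored array can then behave differently without
# relying on any global or mutable shared state.
arr_sparse = zarr.open_array("data.zarr", mode="a", write_empty_chunks=False)
arr_dense = zarr.open_array("data.zarr", mode="a", write_empty_chunks=True)
```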

cc @constantinpape

constantinpape commented 3 days ago

My 2 cents: I personally think this (i.e., skipping empty chunks) should be the default behavior. We have use-cases where writing empty chunks is extremely bad (as in bringing us over our IO node quota immediately and killing everyone's jobs), and I am hesitant to hand students a library that writes them by default.

Besides this, both option 1 and option 2 sound OK to me.

d-v-b commented 3 days ago

> My 2 cents: I personally think this should be the default behavior.

I tend to agree. The one counter-point is that for dense arrays, i.e. arrays where every chunk contains useful data, inspecting each chunk adds some overhead to writing. But I think our design should err on the side of spending some extra CPU cycles over potentially generating massive numbers of useless files.
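For concreteness, a minimal sketch of the per-chunk check being discussed; `chunk_is_empty` is an illustrative helper, not necessarily how zarr-python implements it:

```
import numpy as np


def chunk_is_empty(chunk: np.ndarray, fill_value) -> bool:
    """True if every element of the chunk equals fill_value (illustrative)."""
    if fill_value is None:
        return False
    # NaN fill values need special handling because NaN != NaN.
    if isinstance(fill_value, float) and np.isnan(fill_value):
        return bool(np.isnan(chunk).all())
    # array_equiv broadcasts the scalar fill_value against the chunk.
    return bool(np.array_equiv(chunk, fill_value))


# A writer would call this once per chunk and skip the store round-trip when
# it returns True, trading one pass over the data for an avoided object write.
```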

rabernat commented 2 days ago

> inspecting each chunk adds some overhead to writing

Here is some data on that from my laptop

| dtype | bytes | time (s) | throughput (MB/s) |
|-------|-------|----------|-------------------|
| i2 | 2000 | 1.40e-04 | 14.31 |
| i2 | 2000000 | 2.53e-04 | 7897.34 |
| i2 | 2000000000 | 6.26e-01 | 3195.49 |
| i4 | 4000 | 9.53e-05 | 41.98 |
| i4 | 4000000 | 2.01e-04 | 19863.44 |
| i4 | 4000000000 | 7.60e-01 | 5261.13 |
| i8 | 8000 | 3.64e-05 | 219.93 |
| i8 | 8000000 | 2.54e-04 | 31506.36 |
| i8 | 8000000000 | 1.24e+00 | 6441.56 |
| f4 | 4000 | 6.52e-05 | 61.38 |
| f4 | 4000000 | 2.60e-04 | 15382.13 |
| f4 | 4000000000 | 7.29e-01 | 5490.37 |
| f8 | 8000 | 4.01e-05 | 199.38 |
| f8 | 8000000 | 2.49e-04 | 32074.79 |
| f8 | 8000000000 | 1.19e+00 | 6706.42 |

Given that the throughput is many GB/s for larger chunk sizes, this seems unlikely to be a rate limiting step for most I/O bound workloads against disk or cloud storage. So I think it's a fine default.

Does it make sense to think of this as an Array -> Array codec which may just abort the entire writing pipeline?

```
from time import perf_counter  # missing from the original snippet

import numpy as np

for dtype in ['i2', 'i4', 'i8', 'f4', 'f8']:
    for n in [1000, 1_000_000, 1_000_000_000]:
        data = np.zeros(n, dtype=dtype)
        nbytes = data.nbytes
        tic = perf_counter()
        np.any(data > 0)  # the emptiness check being timed
        toc = perf_counter() - tic
        throughput = nbytes / toc / 1e6  # MB/s
        print(f"| {dtype} | {nbytes} | {toc:3.2e} | {throughput:4.2f} |")
```
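To make the codec question above more concrete, here is a rough sketch of an Array -> Array stage that signals "skip this chunk" by returning None; the `EmptyChunkFilter` class and its `encode`/`decode` interface are assumptions for illustration, not zarr-python's actual codec API:

```
from typing import Optional

import numpy as np


class EmptyChunkFilter:
    """Illustrative Array -> Array stage; not zarr-python's actual codec API."""

    def __init__(self, fill_value):
        self.fill_value = fill_value

    def encode(self, chunk: np.ndarray) -> Optional[np.ndarray]:
        # Returning None would tell a (hypothetical) write pipeline to skip
        # this chunk, or to delete any existing stored object for it.
        if np.array_equiv(chunk, self.fill_value):
            return None
        return chunk

    def decode(self, chunk: Optional[np.ndarray], shape, dtype) -> np.ndarray:
        # A chunk that was never written decodes to an array of fill_value.
        if chunk is None:
            return np.full(shape, self.fill_value, dtype=dtype)
        return chunk
```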