michael-sutherland opened this issue 2 years ago
Hi @michael-sutherland. Can you include which version of dask?
dask: 2021.12.0
numpy docs for astype: https://numpy.org/doc/stable/reference/generated/numpy.ndarray.astype.html
dask docs for astype: https://docs.dask.org/en/latest/generated/dask.array.Array.astype.html
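The difference shows up directly. A minimal sketch (behavior as of dask 2021.12.0; the exact error message may vary between versions):

```python
import numpy as np
import dask.array as da

x = np.zeros((4, 4), dtype=np.uint16)
d = da.from_array(x, chunks=2)

x.astype(np.uint8, order="C")  # numpy accepts the `order` keyword
d.astype(np.uint8, order="C")  # dask rejects unknown astype keywords,
                               # raising a TypeError here
```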
Thanks for the info, @michael-sutherland. Looking at the stacktrace again, any objections to migrating this to github.com/ome/ome-zarr-py?
cc: @sbesson @will-moore
Oops! Sorry if I misidentified which library the problem was in. Would you like me to migrate the issue or would you prefer to do it?
No worries. Hmmm.... looks like I can't transfer automatically between orgs anyway. See https://github.com/ome/ome-zarr-py/issues/169
As mentioned on https://github.com/zarr-developers/zarr-python/issues/962, the example can be simplified to remove ome_zarr and get the same exception:
```python
import os

import numpy as np
import zarr
from zarr.storage import FSStore
import dask.array as da

path = "test_dask_image.zarr"
os.mkdir(path)

mean_val = 10
rng = np.random.default_rng(0)
data = rng.poisson(mean_val, size=(50, 128, 128)).astype(np.uint8)
data = da.from_array(data)  # <--- !!!THIS IS WHAT BREAKS THE EXAMPLE!!!

# write the image data
store = FSStore(path, mode="w")
root = zarr.group(store=store)
root.create_dataset("0", data=data, chunks=(1, 128, 128))
```
Updated @will-moore's example to define `mean_val = 10`.
Ok. I assume this is related to @madsbk's https://github.com/zarr-developers/zarr-python/pull/934 in the sense that typically dask arrays are assumed to wrap zarr arrays rather than vice versa.
cc: @jakirkham
Hi @michael-sutherland, is there a reason you don't want to use the dask.array.to_zarr() function?
I imagine there might be some cases where it makes sense, and this example could have lost that context when it was simplified to make debugging easier.
I poked around with this a little, and there seem to be two places where things go wrong: one in dask, and one in zarr.
1. In `dask/array/core.py`, in the `astype` method of the `Array` class. This line: https://github.com/dask/dask/blob/cfaf931072ae079313761d4d61e25956029abf6c/dask/array/core.py#L2237
If it were changed to this, I think we'd fix the dask part of the problem:

```python
extra = set(kwargs) - {"casting", "copy", "order"}
```
2. In `zarr/core.py`, line 2189, in `_encode_chunk()`. There is an `ensure_ndarray` function, which checks to see if the chunk can be written properly (this check happens with numcodecs). But because `chunk` here is a chunk of the dask array instead of a numpy array, the numcodecs library chokes on it: what it sees in the buffer is not a numpy array but something different, and so it throws an error.

```
> /Users/genevieb/mambaforge/envs/dask-dev/lib/python3.9/site-packages/zarr/core.py(2189)_encode_chunk()
   2187
   2188     # check object encoding
-> 2189     if ensure_ndarray(chunk).dtype == object:
   2190         raise RuntimeError('cannot write object array without object codec')
```
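The second failure can be reproduced in isolation. A minimal sketch, assuming numcodecs is installed (the exact exception type and message depend on the numcodecs version):

```python
import dask.array as da
from numcodecs.compat import ensure_ndarray

chunk = da.zeros((1, 128, 128), dtype="u1")

# ensure_ndarray expects something buffer-like (e.g. a numpy array);
# a lazy dask array is neither, so this raises rather than writing
ensure_ndarray(chunk)
```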
Zarr could potentially work around this second issue by trying to `.compute()` the dask chunk in `ensure_ndarray`, before passing it along. But then there's some question about whether that's actually a sensible thing to do or not. Things would probably also become uglier in the zarr code, if now zarr needs to check what type of array it is handling, especially since numpy and dask are not the only two options for array types.
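For illustration only, a hypothetical sketch of what that workaround could look like (this is not zarr code, and the function name is invented):

```python
from numcodecs.compat import ensure_ndarray

def ensure_ndarray_computing(chunk):
    # duck-type check: dask arrays (and some other lazy containers)
    # expose .compute(), which materializes the data in memory
    if hasattr(chunk, "compute"):
        chunk = chunk.compute()
    return ensure_ndarray(chunk)
```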
Josh says above that typically it is expected dask will wrap zarr, and not the other way around. Making changes to (2) above would be a bit in conflict with that expectation.
Hi guys, Google brought me here. I recently faced the same error as well, and I used the `dask.array.to_zarr()` method:
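A minimal sketch of that approach, with illustrative data (not the original snippet):

```python
import dask.array as da

# illustrative data, shaped and chunked like the example above
data = da.random.randint(
    0, 255, size=(50, 128, 128), chunks=(1, 128, 128)
).astype("u1")

# dask drives the write itself, so zarr only ever sees concrete,
# in-memory blocks rather than lazy dask chunks
da.to_zarr(data, "test_dask_image.zarr", component="0", overwrite=True)
```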
On the dask side of things, I've opened an issue and PR:
As discussed above, this won't completely fix the problem here.
Thanks @GenevieveBuckley, hopefully this will solve the issue on my side.
Yes, let us know!
@GenevieveBuckley in https://github.com/zarr-developers/zarr-python/issues/962#issuecomment-1193283690:

> 2. There is an `ensure_ndarray` function, which checks to see if the chunk can be written properly (this check happens with numcodecs). But, because chunk here is a chunk of the dask array instead of a numpy array, the numcodecs library chokes on it.
cc: @madsbk just in case his recent work will or could handle this.
> Zarr could potentially work around this second issue by trying to `.compute()` the dask chunk in `ensure_ndarray`, before passing it along. But then there's some question about whether that's actually a sensible thing to do or not. Things would probably also become uglier in the zarr code, if now zarr needs to check what type of array it is handling, especially since numpy and dask are not the only two options for array types.
I agree, calling `.compute()` in `ensure_ndarray` is doable, but I don't think it is desirable. We already support any data type that implements `NDArrayLike`, so it should be possible to wrap a `dask.array` in a class that implements `NDArrayLike` and calls `.compute()` when accessed.
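A rough sketch of that idea (the class name is invented, and this only structurally approximates zarr's `NDArrayLike` protocol rather than implementing every attribute it lists):

```python
import numpy as np
import dask.array as da

class LazyDaskChunk:
    """Wrap a dask array so it quacks like an ndarray, deferring compute."""

    def __init__(self, darr: da.Array):
        self._darr = darr

    # cheap metadata is forwarded without computing anything
    @property
    def shape(self):
        return self._darr.shape

    @property
    def dtype(self):
        return self._darr.dtype

    @property
    def ndim(self):
        return self._darr.ndim

    # actual data access is what finally triggers computation
    def __getitem__(self, key):
        return np.asarray(self._darr[key])

    def __array__(self, dtype=None):
        arr = self._darr.compute()
        return arr if dtype is None else arr.astype(dtype)
```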
I'm +1 on redirecting everyone back to using the `dask.array.to_zarr` method instead of making additional changes here.
@michael-sutherland, can you let us know if you've found a solution that works for you?
I ended up moving to a custom solution using an HDF5 file with a similar structure. My application involved displaying a 2D mosaic that was being generated from a tool (in this case an X-ray microscope) in "real time". I don't know the area ahead of time, and I needed to be able to expand the arrays and update the pyramid as I went. It is all working now, although I'd prefer a more standard format if possible. Also, expanding "to the left and up" is painful and involves copying, which I might be able to work around in a zarr file structure. If I can get the time, I'd like to try porting it to zarr.
Sorry if you were only doing this for me. I think supporting dask and other numpy-like arrays is important, although I think doing a custom call to `compute()` isn't the answer, since that is so dask-specific. Maybe wrapping in a call to `np.asarray()`, which will pull in data from h5py or dask or whatever lazy loading system someone might be using, would be better? It also won't make a copy if it is already a numpy array (as far as I know).
No worries @michael-sutherland, it is always good to discuss design decisions :)
I am not sure that converting Dask Arrays to a local `ndarray` seamlessly is a good idea. The data of a Dask Array is often distributed between multiple machines, so gathering all the data into a single local `ndarray` is very expensive in terms of memory usage and performance. In most cases, I think it is more desirable to use something like `DataFrame.map_partitions()` to operate on each local Zarr array.
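For dask arrays specifically, the analogous tool would be `dask.array.map_blocks`, which applies a function to each in-memory chunk wherever it lives, rather than gathering everything onto one machine. A minimal sketch:

```python
import dask.array as da

d = da.random.random((50, 128, 128), chunks=(1, 128, 128))

# the function receives each block as a plain numpy array on whatever
# worker holds it; no data is gathered centrally
doubled = d.map_blocks(lambda block: block * 2)
```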
cc: @mrocklin just to clarify that it wasn't just lack of Dask-:heart: but the deeper question of to `compute()` or not to `compute()`, in which case the duck-typed attributes (`.chunks`, etc.) aren't enough anyway. There would additionally need to be some recognition of the delayed-ness.
> but the deeper question of to compute() or not to compute(),
Typically we handle this by using protocols like `__array__`. For example, if Zarr were to call `np.asarray` on every chunk before inserting it into some store, then that might help to ensure that things work well.
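That protocol already does the right thing for both plain and lazy arrays. A quick illustration:

```python
import numpy as np
import dask.array as da

x = np.arange(4)
assert np.asarray(x) is x  # already a numpy array: returned as-is, no copy

d = da.arange(4)
a = np.asarray(d)  # invokes d.__array__(), i.e. computes the dask graph
assert isinstance(a, np.ndarray)
```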
Chatting live with @joshmoore. Some thoughts:
@GenevieveBuckley did some work, maybe stuff works. We should test this (I don't know this, Josh is suggesting it, don't blame me if he's wrong 🙂).
If it doesn't work, make sure that the sliced chunk you're about to write is concrete and in memory (probably a numpy but maybe something else, like a cupy array or some other buffer thing). In dumb code, this probably looks like this:
```python
# pseudocode: the iteration source and write() are placeholders
for s in ...:
    chunk = array[s]
    chunk = np.asarray(chunk)  # force the chunk into a concrete in-memory array
    write(chunk, path)
```
This is a bit circular, but Zarr could use dask.array to do the chunked writing for it:
```python
import dask.array as da

def give_array_to_zarr(self, arr, path):
    if arr.shape == self.chunks:
        # exactly one chunk: do the plain numpy write
        ...
    else:
        # larger than a single chunk: let dask split it up and write each piece
        x = da.from_array(arr)
        x.to_zarr(...)
```
We need to be a little careful to avoid an infinite zarr-dask-zarr loop, but I think the first if statement handles it.
There's also some concern of this code being split between two repositories. I generously offer that you all own this code in the future 🙂
Minimal, reproducible code sample, a copy-pastable example if possible
Problem description
I get an error when trying to write a dask array to a zarr file. I took the proposed example from https://github.com/ome/ome-zarr-py/pull/121 and wrapped the numpy array in a dask array. It appears that the dask array method `astype` doesn't support the `order` parameter. The example code works for me when using a numpy array. Here's the resulting traceback:
Version and installation information
Please provide the following:
- `zarr.__version__`: 2.10.3
- `numcodecs.__version__`: 0.9.1

Also, if you think it might be relevant, please provide the output from `pip freeze` or `conda env export`, depending on which was used to install Zarr.