zarr-developers / zarr-python

An implementation of chunked, compressed, N-dimensional arrays for Python.
https://zarr.readthedocs.io
MIT License

Passing dask array to zarr leads to a TypeError #962

Open michael-sutherland opened 2 years ago

michael-sutherland commented 2 years ago

Minimal, reproducible code sample, a copy-pastable example if possible

import numpy as np
import zarr
import os
import dask.array as da

from ome_zarr.io import parse_url
from ome_zarr.writer import write_image

path = "test_dask_image.zarr"
os.mkdir(path)

mean_val = 10
rng = np.random.default_rng(0)
data = rng.poisson(mean_val, size=(50, 128, 128)).astype(np.uint8)
data = da.from_array(data)  #  <--- !!!THIS IS WHAT BREAKS THE EXAMPLE!!!

# write the image data
store = parse_url(path, mode="w").store
root = zarr.group(store=store)
write_image(image=data, group=root, chunks=(1, 128, 128), axes="zyx")
root.attrs["omero"] = {
    "channels": [{
        "color": "00FFFF",
        "window": {"start": 0, "end": 20},
        "label": "random",
        "active": True,
    }]
}

Problem description

I get an error when trying to write a dask array to a zarr file. I took the proposed example from https://github.com/ome/ome-zarr-py/pull/121 and wrapped the numpy array in a dask array. It appears that dask's Array.astype method doesn't support the order parameter. The example code works for me when using a plain numpy array. Here's the resulting traceback:

Traceback (most recent call last):
  File "/home/sutherland/git/xray_napari/write_zarr_with_dask_failure.py", line 20, in <module>
    write_image(image=data, group=root, chunks=(1, 128, 128), axes="zyx")
  File "/home/sutherland/anaconda3/envs/napari2/lib/python3.9/site-packages/ome_zarr/writer.py", line 178, in write_image
    write_multiscale(image, group, chunks=chunks, fmt=fmt, axes=axes)
  File "/home/sutherland/anaconda3/envs/napari2/lib/python3.9/site-packages/ome_zarr/writer.py", line 105, in write_multiscale
    group.create_dataset(str(path), data=dataset, chunks=chunks)
  File "/home/sutherland/anaconda3/envs/napari2/lib/python3.9/site-packages/zarr/hierarchy.py", line 808, in create_dataset
    return self._write_op(self._create_dataset_nosync, name, **kwargs)
  File "/home/sutherland/anaconda3/envs/napari2/lib/python3.9/site-packages/zarr/hierarchy.py", line 661, in _write_op
    return f(*args, **kwargs)
  File "/home/sutherland/anaconda3/envs/napari2/lib/python3.9/site-packages/zarr/hierarchy.py", line 824, in _create_dataset_nosync
    a = array(data, store=self._store, path=path, chunk_store=self._chunk_store,
  File "/home/sutherland/anaconda3/envs/napari2/lib/python3.9/site-packages/zarr/creation.py", line 377, in array
    z[...] = data
  File "/home/sutherland/anaconda3/envs/napari2/lib/python3.9/site-packages/zarr/core.py", line 1224, in __setitem__
    self.set_basic_selection(selection, value, fields=fields)
  File "/home/sutherland/anaconda3/envs/napari2/lib/python3.9/site-packages/zarr/core.py", line 1319, in set_basic_selection
    return self._set_basic_selection_nd(selection, value, fields=fields)
  File "/home/sutherland/anaconda3/envs/napari2/lib/python3.9/site-packages/zarr/core.py", line 1610, in _set_basic_selection_nd
    self._set_selection(indexer, value, fields=fields)
  File "/home/sutherland/anaconda3/envs/napari2/lib/python3.9/site-packages/zarr/core.py", line 1682, in _set_selection
    self._chunk_setitems(lchunk_coords, lchunk_selection, chunk_values,
  File "/home/sutherland/anaconda3/envs/napari2/lib/python3.9/site-packages/zarr/core.py", line 1871, in _chunk_setitems
    cdatas = [self._process_for_setitem(key, sel, val, fields=fields)
  File "/home/sutherland/anaconda3/envs/napari2/lib/python3.9/site-packages/zarr/core.py", line 1871, in <listcomp>
    cdatas = [self._process_for_setitem(key, sel, val, fields=fields)
  File "/home/sutherland/anaconda3/envs/napari2/lib/python3.9/site-packages/zarr/core.py", line 1924, in _process_for_setitem
    chunk = value.astype(self._dtype, order=self._order, copy=False)
  File "/home/sutherland/anaconda3/envs/napari2/lib/python3.9/site-packages/dask/array/core.py", line 2086, in astype
    raise TypeError(
TypeError: astype does not take the following keyword arguments: ['order']

Version and installation information

joshmoore commented 2 years ago

Hi @michael-sutherland. Can you include which version of dask?

michael-sutherland commented 2 years ago

dask: 2021.12.0

michael-sutherland commented 2 years ago

numpy docs for astype: https://numpy.org/doc/stable/reference/generated/numpy.ndarray.astype.html
dask docs for astype: https://docs.dask.org/en/latest/generated/dask.array.Array.astype.html
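
The mismatch those docs describe is easy to demonstrate (a minimal sketch; behaviour as of dask 2021.12.0, newer dask releases may differ):

import numpy as np
import dask.array as da

x = np.ones((2, 2), dtype=np.uint8)
x.astype(np.float32, order="C", copy=False)      # numpy accepts order=

d = da.from_array(x)
try:
    d.astype(np.float32, order="C", copy=False)  # dask 2021.12.0 rejects it
except TypeError as err:
    print(err)  # astype does not take the following keyword arguments: ['order']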

joshmoore commented 2 years ago

Thanks for the info, @michael-sutherland. Looking at the stacktrace again, any objections to migrating this to github.com/ome/ome-zarr-py ?

cc: @sbesson @will-moore

michael-sutherland commented 2 years ago

Oops! Sorry if I misidentified which library the problem was in. Would you like me to migrate the issue or would you prefer to do it?

joshmoore commented 2 years ago

No worries. Hmmm.... looks like I can't transfer automatically between orgs anyway. See https://github.com/ome/ome-zarr-py/issues/169

will-moore commented 2 years ago

As mentioned on https://github.com/zarr-developers/zarr-python/issues/962, the example can be simplified to remove ome_zarr and get the same exception:

import numpy as np
import zarr
from zarr.storage import FSStore
import os
import dask.array as da

path = "test_dask_image.zarr"
os.mkdir(path)

mean_val = 10
rng = np.random.default_rng(0)
data = rng.poisson(mean_val, size=(50, 128, 128)).astype(np.uint8)
data = da.from_array(data)  #  <--- !!!THIS IS WHAT BREAKS THE EXAMPLE!!!

# write the image data
store = FSStore(path, mode="w")
root = zarr.group(store=store)
root.create_dataset("0", data=data, chunks=(1, 128, 128))

joshmoore commented 2 years ago

Updated @will-moore's example to define mean_val = 10.

joshmoore commented 2 years ago

Ok. I assume this is related to @madsbk's https://github.com/zarr-developers/zarr-python/pull/934 in the sense that typically dask arrays are assumed to wrap zarr arrays rather than vice versa.

cc: @jakirkham

GenevieveBuckley commented 2 years ago

Hi @michael-sutherland, is there a reason you don't want to use the dask.array.to_zarr() function?

I imagine there might be some cases where it makes sense, and this example could have lost that context when it was simplified to make debugging easier.
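
For reference, such a call might look roughly like this (a sketch only; it reuses the data and store path from the example above):

import numpy as np
import dask.array as da

rng = np.random.default_rng(0)
data = rng.poisson(10, size=(50, 128, 128)).astype(np.uint8)
arr = da.from_array(data, chunks=(1, 128, 128))

# dask handles the chunk-by-chunk writing itself; component selects
# the array path inside the store.
da.to_zarr(arr, "test_dask_image.zarr", component="0")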

GenevieveBuckley commented 2 years ago

I poked around with this a little, and there seem to be two places where things go wrong: one in dask, and one in zarr.

  1. In dask/array/core.py, in the astype method of the Array class

This line: https://github.com/dask/dask/blob/cfaf931072ae079313761d4d61e25956029abf6c/dask/array/core.py#L2237

If it were changed to this, I think we'd fix the dask part of the problem:

extra = set(kwargs) - {"casting", "copy", "order"}
  2. In zarr, in _encode_chunk() at zarr/core.py line 2189. There is an ensure_ndarray function which checks whether the chunk can be written properly (this check happens via numcodecs). But because chunk here is a chunk of the dask array instead of a numpy array, the numcodecs library chokes on it: what it sees in the buffer is not a numpy array but something different, and so it throws an error.
> /Users/genevieb/mambaforge/envs/dask-dev/lib/python3.9/site-packages/zarr/core.py(2189)_encode_chunk()
   2187
   2188         # check object encoding
-> 2189         if ensure_ndarray(chunk).dtype == object:
   2190             raise RuntimeError('cannot write object array without object codec')

Zarr could potentially work around this second issue by calling .compute() on the dask chunk in ensure_ndarray before passing it along. But there's some question about whether that's actually a sensible thing to do. The zarr code would probably also get uglier if zarr now needed to check what type of array it is handling, especially since numpy and dask are not the only two options for array types.
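
That workaround might look something like the following (a purely hypothetical sketch, not zarr's actual code; the real ensure_ndarray lives in numcodecs):

def _ensure_concrete(chunk):
    # Hypothetical helper: materialize lazy, dask-like objects before
    # the buffer is handed to numcodecs.
    if hasattr(chunk, "compute"):  # duck-type check for dask-like arrays
        chunk = chunk.compute()
    return chunk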

Josh says above that typically it is expected dask will wrap zarr, and not the other way around. Making changes to (2) above would be a bit in conflict with that expectation.

vietnguyengit commented 2 years ago

Hi all, Google brought me here. I recently hit the same error when using the dask.array.to_zarr() method:

[screenshots showing the same error]

GenevieveBuckley commented 2 years ago

On the dask side of things, I've opened an issue and PR:

As discussed above, this won't completely fix the problem here.

vietnguyengit commented 2 years ago

Thanks @GenevieveBuckley, hopefully this will solve the issue on my side.

GenevieveBuckley commented 2 years ago

Yes, let us know!

joshmoore commented 2 years ago

@GenevieveBuckley wrote in https://github.com/zarr-developers/zarr-python/issues/962#issuecomment-1193283690:

> 2. There is an ensure_ndarray function, which checks to see if the chunk can be written properly (this check happens with numcodecs). But, because chunk here is a chunk of the dask array instead of a numpy array, the numcodecs library chokes on it.

cc: @madsbk just in case his recent work will or could handle this.

madsbk commented 2 years ago

> Zarr could potentially work around this second issue by calling .compute() on the dask chunk in ensure_ndarray before passing it along. But there's some question about whether that's actually a sensible thing to do. The zarr code would probably also get uglier if zarr now needed to check what type of array it is handling, especially since numpy and dask are not the only two options for array types.

I agree, calling .compute() in ensure_ndarray is doable but I don't think it is desirable. We already support any data type that implements NDArrayLike so it should be possible to wrap a dask.array in a class that implements NDArrayLike and calls .compute() when accessed.
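
A rough duck-typed sketch of that idea (hypothetical; not checked against zarr's actual NDArrayLike protocol):

import numpy as np

class ComputingDaskWrapper:
    """Hypothetical wrapper: metadata is answered lazily; data is only
    computed when zarr actually indexes into the array."""

    def __init__(self, dask_array):
        self._arr = dask_array

    @property
    def shape(self):
        return self._arr.shape

    @property
    def dtype(self):
        return self._arr.dtype

    def __getitem__(self, key):
        # Materialize just the requested region as a numpy array.
        return np.asarray(self._arr[key])

    def astype(self, dtype, order="K", casting="unsafe", copy=True):
        # Present numpy's astype signature (including order=), unlike dask's.
        return np.asarray(self._arr).astype(dtype, order=order,
                                            casting=casting, copy=copy)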

GenevieveBuckley commented 2 years ago

I'm +1 on redirecting everyone back to use the dask.array.to_zarr method instead of making additional changes here.

joshmoore commented 2 years ago

@michael-sutherland, can you let us know if you've found a solution that works for you?

michael-sutherland commented 2 years ago

I ended up moving to a custom solution using an H5 file with a similar structure. My application involved displaying a 2D mosaic that was being generated from a tool (in this case an X-ray microscope) in "real time". I didn't know the area ahead of time, and I needed to be able to expand the arrays and update the pyramid as I went. It is all working now, although I'd prefer a more standard format if possible. Also, expanding "to the left and up" is painful and involves copying, which I might be able to work around in a zarr file structure. If I can get the time, I'd like to try porting it to zarr.

Sorry if you were only doing this for me. I think supporting dask and other numpy-like arrays is important, although I don't think a custom call to "compute()" is the answer, since that is so dask-specific. Maybe wrapping in a call to np.asarray(), which will pull in data from h5py or dask or whatever lazy-loading system someone might be using, would be better? It also won't make a copy if the input is already a numpy array (as far as I know).
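
That no-copy behaviour is easy to verify (a minimal sketch):

import numpy as np

a = np.zeros((3, 3), dtype=np.float32)
assert np.asarray(a) is a  # same object comes back: no copy is made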

madsbk commented 2 years ago

No worries @michael-sutherland, it is always good to discuss design decisions :)

I am not sure that seamlessly converting a Dask Array to a local ndarray is a good idea. The data of a Dask Array is often distributed across multiple machines, so gathering all of it into a single local ndarray is very expensive in terms of memory usage and performance. In most cases, I think it is more desirable to use something like map_blocks() (the array analogue of DataFrame.map_partitions()) to operate on each block locally.
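
For example (a minimal sketch using dask.array.map_blocks; the output path is made up):

import dask.array as da

# Each block is processed as a plain numpy array on whichever worker
# already holds it; no global gather into one ndarray is required.
arr = da.random.random((50, 128, 128), chunks=(1, 128, 128))
shifted = arr.map_blocks(lambda block: block - block.mean(),
                         dtype=arr.dtype)
shifted.to_zarr("shifted.zarr")  # still written out chunk-by-chunk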

joshmoore commented 1 year ago

cc: @mrocklin just to clarify that it wasn't just a lack of Dask-:heart: but the deeper question of to compute() or not to compute(), in which case duck-typing (.chunks, etc.) isn't enough anyway. There would additionally need to be some recognition of the delayed-ness.

mrocklin commented 1 year ago

but the deeper question of to compute() or not to compute(),

Typically we handle this by using protocols like __array__. For example, if Zarr were to call np.asarray on every chunk before inserting it into some store, then that might help ensure that things work well.
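
A small sketch of what that protocol buys you (dask arrays implement __array__, so np.asarray triggers a compute):

import numpy as np
import dask.array as da

lazy = da.ones((4, 4), chunks=(2, 2))

# np.asarray calls lazy.__array__(), which computes the graph and
# hands back a concrete in-memory array.
concrete = np.asarray(lazy)
assert isinstance(concrete, np.ndarray)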

mrocklin commented 1 year ago

Chatting live with @joshmoore . Some thoughts:

Maybe things work now?

@GenevieveBuckley did some work, maybe stuff works. We should test this (I don't know this, Josh is suggesting it, don't blame me if he's wrong 🙂 )

Maybe call np.asarray on each sliced chunk?

If it doesn't work, make sure that the sliced chunk you're about to write is concrete and in memory (probably a numpy but maybe something else, like a cupy array or some other buffer thing). In dumb code, this probably looks like this:

for sel in ...:                   # iterate over chunk-aligned selections
    chunk = array[sel]
    chunk = np.asarray(chunk)     # force the slice into a concrete numpy array
    write(chunk, path)

Maybe use Dask Array?

This is a bit circular, but Zarr could use dask array:

def give_array_to_zarr(self, arr, path):
    if arr.shape == self.chunks:
        # do numpy thing
        ...
    else:
        # Dask
        x = da.from_array(arr)
        x.to_zarr(...)

We need to be a little careful to avoid an infinite zarr-dask-zarr loop, but I think the first if statement handles it.

There's also some concern about this code being split between two repositories. I generously offer that you all own this code in the future 🙂