adamkglaser opened this issue 3 years ago
I tried the same test on my Linux workstation, and in all cases the store was written in parallel. Here are my results:
N5 + raw (note that wall time is much less than total CPU time, indicating parallelism)
%%time
darray.to_zarr(zarr.N5Store('test/test1.n5'), compressor = None)
CPU times: user 31.7 s, sys: 16.2 s, total: 48 s
Wall time: 14.3 s
N5 + GZip (more total time because of compression)
%%time
from numcodecs import GZip
darray.to_zarr(zarr.N5Store('test/test2.n5'), compressor=GZip(2))
CPU times: user 4min 39s, sys: 10.2 s, total: 4min 49s
Wall time: 11.3 s
Default Zarr + raw
%%time
darray.to_zarr(zarr.DirectoryStore('test/test3.zarr'), compressor = None)
CPU times: user 23.1 s, sys: 9.27 s, total: 32.3 s
Wall time: 9.08 s
Default Zarr + GZip
%%time
darray.to_zarr(zarr.DirectoryStore('test/test4.zarr'), compressor = GZip(level=2))
CPU times: user 4min 8s, sys: 6.52 s, total: 4min 15s
Wall time: 7.94 s
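For what it's worth, a quick way to double-check the parallelism beyond comparing CPU time to wall time is Dask's local ResourceProfiler, which samples the process while the store is written. This is just a minimal sketch, assuming the same darray as above; 'test/test5.n5' is only a placeholder scratch path, and psutil needs to be installed:
from dask.diagnostics import ResourceProfiler
import zarr
# Sample process CPU/memory every 0.25 s while the store is written.
with ResourceProfiler(dt=0.25) as rprof:
    darray.to_zarr(zarr.N5Store('test/test5.n5'), compressor=None)
# Peak CPU well above 100% means more than one thread was busy at once.
peak_cpu = max(r.cpu for r in rprof.results)
print(f'peak CPU during write: {peak_cpu:.0f}%')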
Zarr + N5 is pretty consistently slower in my tests than vanilla Zarr (and I have no idea why), but all of these implementations are being parallelized on my machine. I can try the same tests later on a Windows machine and see whether I observe any discrepancies.
The N5Store does some mapping between N5 and Zarr under the hood, and there may be some cost incurred by this. One would probably need to profile the methods more carefully with a mix of cProfile and line_profiler to determine where the slowdowns are.
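Not the author's exact workflow, but a starting point for that profiling could look something like the sketch below: capture a cProfile trace of a raw N5 write and print the functions with the most cumulative time (the 'test/test_profile.n5' path and the 20-row cutoff are arbitrary choices, and darray is the array from the benchmarks above):
import cProfile
import pstats
import zarr
# Profile a raw (uncompressed) N5 write and keep the stats in memory.
profiler = cProfile.Profile()
profiler.enable()
darray.to_zarr(zarr.N5Store('test/test_profile.n5'), compressor=None)
profiler.disable()
# Show the 20 functions with the most cumulative time; any N5Store
# methods that dominate here are candidates for line_profiler.
stats = pstats.Stats(profiler).sort_stats('cumulative')
stats.print_stats(20)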
Sorry, I missed that there were two locations. I did an analysis similar to @d-v-b's in https://github.com/adamkglaser/io_benchmarks/issues/1 and also saw multi-threading.
I have not yet done the profiling to figure out where the Zarr/N5 difference lies.
Thanks @d-v-b. It would be great to hear whether you also get parallel writing on Windows. Unfortunately, all of our lab PCs run Windows, and they will continue to run Windows to control our systems, with which I am hoping to move towards writing data as N5.
I transferred this issue from zarr-python to n5py.
Problem description
Using Zarr + Dask, when saving to N5, multi-threading only works when a compressor is used. When saving as raw with compressor=None, the operation runs single-threaded. When using Zarr + Dask and saving to the raw Zarr format with compressor=None, the operation runs multi-threaded. Is there a potential bug when saving to raw N5 that disables multi-threading?
Thanks! Adam
Python code
import zarr
import numpy as np
import dask.array as da
from numcodecs import GZip
data = np.random.randint(0, 2000, size=[512, 2048, 2048]).astype('uint16')
darray = da.from_array(data, chunks=(16, 256, 256))
no compression to n5 - no multi-threading bug (?)
compressor = None
store = zarr.N5Store('test1.n5')
darray.to_zarr(store, compressor=compressor)
with compression to n5 - multi-threading works
compressor = GZip(level=2)
store = zarr.N5Store('test2.n5')
darray.to_zarr(store, compressor=compressor)
no compression to zarr - multi-threading works
compressor = None
store = zarr.DirectoryStore('test3')
darray.to_zarr(store, compressor=compressor)
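One extra check that might help isolate this (not part of the original report): explicitly force Dask's threaded scheduler around the raw N5 write, to rule out scheduler selection as the cause. A minimal sketch reusing darray from above; 'test_threads.n5' is just a placeholder path:
import dask
import zarr
# Force the multi-threaded scheduler for this write only. If the raw N5
# case still pegs a single core, the serialization is happening in the
# store/compressor path rather than in the scheduler choice.
with dask.config.set(scheduler='threads'):
    darray.to_zarr(zarr.N5Store('test_threads.n5'), compressor=None)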
Version and installation information
Zarr 2.6.1
Dask 2020.12.0
Python 3.9.1
Windows 10
Zarr installed via Conda