aero-108 opened this issue 3 years ago
Not sure I follow. Zarr is a storage format for arrays. One can use those arrays in other pipelines with Numba, Dask, etc.
@jakirkham as mentioned in the example, Numba is not really friendly with Zarr arrays, and any modifications won't pass back.
Yeah, I wouldn't really expect in-place modifications to work. There's a lot of encoding, decompression, chunk aggregation, etc. logic that needs to happen. An expression like `zarr_arr[0:100]` will return a NumPy array. Modifying that in-place won't affect Zarr. One could pass the whole Zarr array into that function, but then one would need to explicitly write that out.
If one loaded this with Dask, one could use map_blocks to apply the function over the array and then write that back out to Zarr.
If I were to try with that code, I would do something like this...
```python
import numba as nb
import numpy as np
import zarr

@nb.njit()
def test(arr):
    for i in range(arr.shape[0]):
        arr[i] = 5.0

zarr_arr = zarr.full((100,), fill_value=np.nan, dtype='float64')
arr = zarr_arr[0:100]    # read into an in-memory NumPy array
test(arr)                # Numba modifies the NumPy copy in-place
zarr_arr[0:100] = arr    # explicitly write the result back to Zarr
```
(I should add that I haven't run this, so please check it as well.)
To pick up this thread a bit – I think numba support would be really great.
I think the biggest use case here is working with very large amounts of data as quickly as possible. Ideally I could iterate through chunks of my Zarr array and compute something from them without having to swap back and forth between Numba and Python. This process is a bit of a pain and has a fair bit of overhead.
I can see how this would be difficult. Not sure how much of the stack (e.g. possibly all of numcodecs?) would have to be compiled to make this work.
> I think the biggest use case here is working with very large amounts of data as quickly as possible.
This is generally what everyone wants to do. But I'm not sure that this proposed integration is needed for it. Zarr and Numba solve orthogonal problems. Zarr accelerates data I/O, which can speed up I/O bound problems. Zarr will help get data from files or object storage into memory quickly. Numba accelerates computation, which can speed up compute bound problems. Numba operates on in-memory data.
If you want to process a lot of data quickly using Zarr and Numba:

1. Use Zarr to read chunks of your data from storage into in-memory NumPy arrays.
2. Use Numba-compiled functions to process those NumPy arrays.
3. Write any results back to Zarr.
Can you explain why this workflow does not meet your needs?
A more sophisticated use case would involve using Dask to coordinate and schedule many simultaneous reading / processing tasks.
+1 to everything Ryan said.
> A more sophisticated use case would involve using Dask to coordinate and schedule many simultaneous reading / processing tasks.
I think an interesting question for users asking about this would be: are you using, or have you tried using, Zarr + Dask + Numba? If so, what pain points have you experienced when doing that? What do your workflows look like?
If we find enough common use cases of such a workflow above, we might be able to dive deeper into how these could be improved.
I wanted to write up a longer response to this with an example for indexing into an on disk sparse matrix, but that requires me digging up some old branches. I think I've got a nice small example though.
I would like to be able to efficiently search a sorted set of genomic intervals, e.g. bedtools intersect, but with my genomic ranges stored in a chunked columnar format. My current use case is ATAC-seq data, and would involve a file similar to how ranges are stored by ArchR.
Basically, I would need to do a pair of binary searches over the start and end columns. I would like to have just one implementation of the search.
I may also want to find all sets of overlaps, by iterating through a pair of interval sets.
This code gets quite messy if there has to be a function barrier between the Numba code and the code that retrieves the chunks from files. In both of these cases I would be dynamically choosing which chunks are read and when, so these are not good fits for Dask.
Hello, are there any plans to support numba njit? It could be very useful.