zarr-developers / zarr-python

An implementation of chunked, compressed, N-dimensional arrays for Python.
http://zarr.readthedocs.io/
MIT License

numba njit support #771

Open aero-108 opened 3 years ago

aero-108 commented 3 years ago

Hello, are there any plans to support numba njit? It could be very useful.

import numba as nb
import zarr
import numpy as np

@nb.njit()
def test(arr):
    for i in range(arr.shape[0]):
        arr[i] = 5.0

zarr_arr = zarr.full((100,), fill_value=np.nan, dtype='float64')

test(zarr_arr[0:100])

 zarr_arr[0:100]
Out[4]: 
array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan])
jakirkham commented 3 years ago

Not sure I follow. Zarr is a storage format for arrays. One can use those arrays in other pipelines with Numba, Dask, etc.

aero-108 commented 3 years ago

@jakirkham as mentioned in the example, numba isn't really friendly with zarr arrays, and any modifications won't propagate back.

jakirkham commented 3 years ago

Yeah, I wouldn't really expect in-place modifications to work. There's a lot of encoding, decompression, chunk aggregation, etc. logic that needs to happen.

zarr_arr[0:100] will return a NumPy array. Modifying that in-place won't affect Zarr. One could pass the whole Zarr array into that function, but then one will need to explicitly write the result back out.

If one loaded this with Dask, one could use map_blocks to apply the function over the array and then write that back out to Zarr.
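
Roughly, that might look like the following (an untested sketch; the chunk size and the 'filled.zarr' output path are just placeholders):

import dask.array as da
import numba as nb
import numpy as np
import zarr

@nb.njit()
def fill_five(block):
    out = block.copy()  # map_blocks expects each task to return a block
    for i in range(out.shape[0]):
        out[i] = 5.0
    return out

zarr_arr = zarr.full((100,), fill_value=np.nan, dtype='float64', chunks=(10,))

darr = da.from_zarr(zarr_arr)                         # lazy view over the Zarr chunks
result = darr.map_blocks(fill_five, dtype='float64')  # compiled kernel applied per chunk
result.to_zarr('filled.zarr')                         # placeholder output store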

jakirkham commented 3 years ago

If I were to try with that code, I would do something like this...

import numba as nb
import zarr
import numpy as np

@nb.njit()
def test(arr):
    for i in range(arr.shape[0]):
        arr[i] = 5.0

zarr_arr = zarr.full((100,), fill_value=np.nan, dtype='float64')

arr = zarr_arr[0:100]    # read the slice into an in-memory NumPy array
test(arr)                # modify it in place with the compiled function
zarr_arr[0:100] = arr    # write the result back to the Zarr array

(I should add that I haven't run this, so please check as well.)

ivirshup commented 2 years ago

To pick up this thread a bit – I think numba support would be really great.

I think the biggest use case here is working with very large amounts of data as quickly as possible. Ideally I could iterate through chunks of my zarr array and compute something from them without having to swap back and forth between numba and python. That process is a bit of a pain and has a fair bit of overhead.

I can see how this would be difficult. Not sure how much of the stack (e.g. possibly all of numcodecs?) would have to be compiled to make this work.

rabernat commented 2 years ago

I think the biggest use case here is working with very large amounts of data as quickly as possible.

This is generally what everyone wants to do. But I'm not sure that this proposed integration is needed for it. Zarr and Numba solve orthogonal problems. Zarr accelerates data I/O, which can speed up I/O bound problems. Zarr will help get data from files or object storage into memory quickly. Numba accelerates computation, which can speed up compute bound problems. Numba operates on in-memory data.

If you want to process a lot of data quickly using Zarr and Numba:

1. Use Zarr to read a chunk of data from storage into memory as a NumPy array.
2. Use Numba to run your computation on that in-memory array.
3. Use Zarr to write the result back out.

Can you explain why this workflow does not meet your needs?

A more sophisticated use case would involve using Dask to coordinate and schedule many simultaneous reading / processing tasks.
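
For reference, a minimal sketch of the basic (non-Dask) read / compute / write loop above; the shape and chunk size here are arbitrary:

import numba as nb
import numpy as np
import zarr

@nb.njit()
def process(block):
    # compute-bound work on an in-memory NumPy array
    for i in range(block.shape[0]):
        block[i] = block[i] * 2.0
    return block

z = zarr.full((1_000_000,), fill_value=1.0, dtype='float64', chunks=(100_000,))

# Zarr handles the I/O, Numba handles the compute, one chunk at a time
for start in range(0, z.shape[0], z.chunks[0]):
    stop = min(start + z.chunks[0], z.shape[0])
    block = z[start:stop]            # decode/decompress into a NumPy array
    z[start:stop] = process(block)   # write the processed chunk back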

jakirkham commented 2 years ago

+1 to everything Ryan said.

A more sophisticated use case would involve using Dask to coordinate and schedule many simultaneous reading / processing tasks.

I think an interesting question for users asking about this would be: are you using, or have you tried using, Zarr + Dask + Numba? If so, what pain points have you experienced when doing that? What do your workflows look like?

If we find enough common use cases for such a workflow, we might be able to dive deeper into how they could be improved.

ivirshup commented 2 years ago

I wanted to write up a longer response to this with an example of indexing into an on-disk sparse matrix, but that requires me to dig up some old branches. I think I've got a nice small example though.

I would like to be able to efficiently search a sorted set of genomic intervals, e.g. bedtools intersect, but with my genomic ranges stored in a chunked columnar format. My current use case is ATAC-seq data, and it would involve a file organized similarly to how ranges are stored by ArchR.

Basically I would need to do a pair of binary searches over the start and end columns, and I would like to have just one implementation of the search.

I may also want to find all sets of overlaps, by iterating through a pair of interval sets.

This code gets quite messy if there has to be a function barrier between the numba code and the code that retrieves the chunks from files. In both of these cases I would be dynamically choosing which chunks are read and when, so they are not a good fit for dask.
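
To make the shape of the problem concrete, a rough sketch (not the real code): the per-chunk binary search is compiled with Numba, while plain Python decides which chunk of a hypothetical sorted "start" column to read next.

import numba as nb
import numpy as np
import zarr

@nb.njit()
def left_bound(chunk, query):
    # binary search within a single in-memory chunk
    lo, hi = 0, chunk.shape[0]
    while lo < hi:
        mid = (lo + hi) // 2
        if chunk[mid] < query:
            lo = mid + 1
        else:
            hi = mid
    return lo

# hypothetical sorted "start" column stored as a chunked Zarr array
starts = zarr.array(
    np.sort(np.random.randint(0, 10_000_000, size=1_000_000)),
    chunks=(100_000,),
)

def find_start(z, query):
    # Python decides which chunk to read; only the inner search is compiled.
    # This boundary is the function barrier described above.
    for lo in range(0, z.shape[0], z.chunks[0]):
        chunk = z[lo:lo + z.chunks[0]]
        if chunk[-1] >= query:
            return lo + left_bound(chunk, query)
    return z.shape[0]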