Refactor retrieval to make it faster to run in numba mode

xhluca commented 2 months ago

This is a work in progress!

This PR will make numba mode faster by rewriting the entire retrieval process as a single numba JIT-compilable function (see _retrieve_internal_numba_parallel).
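To illustrate the idea, here is a minimal sketch of a retrieval loop that fits in one JIT-compiled function. It is not the code in this PR: the function name retrieve_topk, the CSC-style arrays (data, doc_ids, indptr) and the flattened query layout (query_tokens, query_ptr) are all assumptions made for the example. A no-op fallback is included so the sketch also runs when numba is not installed.

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:
    # Fallback so the sketch still runs (uncompiled) without numba.
    prange = range

    def njit(*args, **kwargs):
        def wrap(fn):
            return fn
        return wrap


@njit(parallel=True)
def retrieve_topk(data, doc_ids, indptr, query_tokens, query_ptr, num_docs, k):
    # Hypothetical layout: the postings of token t live in
    # data[indptr[t]:indptr[t + 1]] and doc_ids[indptr[t]:indptr[t + 1]].
    n_queries = len(query_ptr) - 1
    top_ids = np.empty((n_queries, k), dtype=np.int64)
    top_scores = np.empty((n_queries, k), dtype=np.float32)
    for q in prange(n_queries):  # one query per parallel iteration
        scores = np.zeros(num_docs, dtype=np.float32)
        for j in range(query_ptr[q], query_ptr[q + 1]):
            t = query_tokens[j]
            for p in range(indptr[t], indptr[t + 1]):
                scores[doc_ids[p]] += data[p]
        # Full sort for simplicity; a real implementation could use a heap.
        order = np.argsort(scores)[::-1][:k]
        for r in range(k):
            top_ids[q, r] = order[r]
            top_scores[q, r] = scores[order[r]]
    return top_ids, top_scores


# Toy index: 3 tokens, 4 documents.
data = np.array([1.0, 0.5, 0.2, 0.7, 0.4, 2.0], dtype=np.float32)
doc_ids = np.array([0, 2, 1, 2, 3, 3], dtype=np.int64)
indptr = np.array([0, 2, 5, 6], dtype=np.int64)

# Two queries, flattened: query 0 = tokens [0, 1], query 1 = tokens [1, 2].
query_tokens = np.array([0, 1, 1, 2], dtype=np.int64)
query_ptr = np.array([0, 2, 4], dtype=np.int64)

top_ids, top_scores = retrieve_topk(
    data, doc_ids, indptr, query_tokens, query_ptr, 4, 2
)
print(top_ids)
print(top_scores)
```

Keeping both scoring and top-k selection inside the compiled loop means there is no round trip to the Python interpreter per query, which is where the speedup would be expected to come from.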

TODO:

[x] Clean up retrieve_numba to make it compatible with retrieve when the BM25 object is initialized with backend="numba"
[x] Deprecate selection_backend in retrieve so that it happens at the object init time
[x] Potentially rename _retrieve_internal_numba_parallel
[x] Make tqdm work in _retrieve_internal_numba_parallel
[x] Potentially refactor the behavior of the selection and numba.selection modules
[x] Create a tokenizer class (perhaps in a separate PR? also should handle #31 at the same time)
[x] Add tests for numba in numpy-disk mode and with bm25+ (use non-occurrence matrix)

xhluca commented 2 months ago

I wonder if it is possible to do inverted indexing here, by creating an array that tracks the start and end of each entry: https://github.com/xhluca/bm25s/blob/daf29ceaa2fd77ca8601920502b7b8f05eb82be2/bm25s/scoring.py#L329-L352
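A rough sketch of what such an offsets array could look like, assuming the index is stored as flat (token, document, score) triples. This does not reproduce the linked lines in scoring.py; build_postings and postings_for are hypothetical helper names used only for illustration.

```python
import numpy as np


def build_postings(token_ids, doc_ids, scores, vocab_size):
    """Group flat (token, doc, score) triples by token and record offsets."""
    # A stable sort keeps documents in their original order within a token.
    order = np.argsort(token_ids, kind="stable")
    sorted_docs = doc_ids[order]
    sorted_scores = scores[order]
    # offsets[t] is the start and offsets[t + 1] the end of token t's slice.
    counts = np.bincount(token_ids, minlength=vocab_size)
    offsets = np.zeros(vocab_size + 1, dtype=np.int64)
    offsets[1:] = np.cumsum(counts)
    return sorted_docs, sorted_scores, offsets


def postings_for(token, sorted_docs, sorted_scores, offsets):
    """Return the documents and scores for one token as contiguous slices."""
    start, end = offsets[token], offsets[token + 1]
    return sorted_docs[start:end], sorted_scores[start:end]


token_ids = np.array([2, 0, 1, 0, 2, 2])
doc_ids = np.array([0, 0, 1, 2, 1, 3])
scores = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])

sorted_docs, sorted_scores, offsets = build_postings(
    token_ids, doc_ids, scores, vocab_size=3
)
docs_t2, scores_t2 = postings_for(2, sorted_docs, sorted_scores, offsets)
print(offsets)  # start/end boundaries for each token
print(docs_t2, scores_t2)
```

With this layout, looking up a token only touches one contiguous slice, which should also be straightforward to use from a numba-compiled loop.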

xhluca commented 2 months ago

Deprecate selection_backend in retrieve so that it happens at the object init time

In retrospect, it seems that selection_backend remains useful for testing purposes, as well as for using the jax backend. Let's not deprecate it in 0.2.0.
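For context, here is a minimal sketch of why a per-call selection option is handy for testing: two top-k strategies can be run on the same scores and compared. The topk function and its backend values ("numpy" and "sort") are hypothetical and only stand in for the idea; they are not the bm25s selection API.

```python
import numpy as np


def topk(scores, k, backend="numpy"):
    """Return (top_scores, top_indices) sorted by descending score."""
    if backend == "numpy":
        # Partition first so that only the k largest entries get sorted.
        part = np.argpartition(scores, -k)[-k:]
        idx = part[np.argsort(scores[part])[::-1]]
    elif backend == "sort":
        # Reference implementation: sort everything, keep the first k.
        idx = np.argsort(scores)[::-1][:k]
    else:
        raise ValueError(f"unknown backend: {backend}")
    return scores[idx], idx


scores = np.array([0.1, 0.9, 0.4, 0.7, 0.2])
fast_scores, fast_idx = topk(scores, k=2, backend="numpy")
ref_scores, ref_idx = topk(scores, k=2, backend="sort")
print(fast_idx, ref_idx)
```

Being able to switch the strategy per call makes it easy to check a faster backend against a simple reference one in a test.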

xhluca commented 2 months ago

Make tqdm work in _retrieve_internal_numba_parallel

Unfortunately tqdm won't work, so we can't add a progress bar to retrieve when backend is set to numba.

xhluca commented 2 months ago

Create a tokenizer class (perhaps in a separate PR? also should handle https://github.com/xhluca/bm25s/issues/31 at the same time)

Will do that in a separate PR

Refactor retrieval to make it faster to run in numba mode #47