xhluca / bm25s

Fast lexical search implementing BM25 in Python using Numpy, Numba and Scipy
https://bm25s.github.io
MIT License
862 stars, 35 forks

Refactor retrieval to make it faster to run in numba mode #47

Closed xhluca closed 2 months ago

xhluca commented 2 months ago

This is a work in progress!

This PR will make numba mode faster by rewriting the entire retrieval process as a single numba JIT-able function (see _retrieve_internal_numba_parallel).

TODO:

xhluca commented 2 months ago

I wonder if it is possible to do inverted indexing here by creating an array that tracks the start and end positions: https://github.com/xhluca/bm25s/blob/daf29ceaa2fd77ca8601920502b7b8f05eb82be2/bm25s/scoring.py#L329-L352
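
To make the idea concrete, here is a rough, dependency-free sketch of what an inverted index with a start/end offsets array could look like. It is not the code at the link above: the names build_inverted_index, postings, offsets, doc_ids and freqs are all made up for illustration, and it assumes documents are already lists of integer token IDs. The layout is similar in spirit to the index-pointer array of a compressed sparse matrix.

```python
def build_inverted_index(docs, vocab_size):
    """Flatten per-token postings into flat lists plus an offsets array.

    docs: list of documents, each a list of integer token IDs (assumed input).
    Returns (offsets, doc_ids, freqs); all three names are hypothetical.
    """
    # Per-document term frequencies, and the number of documents per token.
    counts = [0] * vocab_size
    doc_term_freqs = []
    for tokens in docs:
        tf = {}
        for t in tokens:
            tf[t] = tf.get(t, 0) + 1
        doc_term_freqs.append(tf)
        for t in tf:
            counts[t] += 1

    # offsets[t] is the start and offsets[t + 1] the end of token t's postings.
    offsets = [0] * (vocab_size + 1)
    for t in range(vocab_size):
        offsets[t + 1] = offsets[t] + counts[t]

    doc_ids = [0] * offsets[-1]
    freqs = [0] * offsets[-1]
    cursor = offsets[:-1].copy()  # next free slot for each token
    for doc_id, tf in enumerate(doc_term_freqs):
        for t, f in tf.items():
            pos = cursor[t]
            doc_ids[pos] = doc_id
            freqs[pos] = f
            cursor[t] += 1
    return offsets, doc_ids, freqs


def postings(offsets, doc_ids, freqs, token_id):
    """Return the (doc_id, term_frequency) pairs for one token."""
    start, end = offsets[token_id], offsets[token_id + 1]
    return list(zip(doc_ids[start:end], freqs[start:end]))


docs = [[0, 1, 1], [1, 2], [0, 2, 2, 2]]
offsets, doc_ids, freqs = build_inverted_index(docs, vocab_size=3)
```

With this layout, scoring a query only touches the slices belonging to its tokens, e.g. postings(offsets, doc_ids, freqs, 2) for token 2, instead of scanning every document. Flat arrays plus integer offsets are also the kind of structure that tends to be easy to pass into a JIT-compiled function.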

xhluca commented 2 months ago

Deprecate selection_backend in retrieve so that it happens at the object init time

In retrospect, it seems that selection_backend remains useful for testing purposes, as well as for using the jax backend. Let's not deprecate it in 0.2.0.

xhluca commented 2 months ago

Make tqdm work in _retrieve_internal_numba_parallel

Unfortunately, tqdm won't work, so we can't add a progress bar to retrieve when the backend is set to numba.

xhluca commented 2 months ago

Create a tokenizer class (perhaps in a separate PR? also should handle https://github.com/xhluca/bm25s/issues/31 at the same time)

Will do that in a separate PR