xhluca / bm25s

Fast lexical search implementing BM25 in Python using Numpy, Numba and Scipy
https://bm25s.github.io
MIT License

[Feature request] Document metadata and filtering #35

Closed by dl423 2 months ago

dl423 commented 3 months ago

Hi, I've used bm25s on a fairly large production dataset, and I'm super-impressed by the speed!! Having fumbled around with rank_bm25 quite a bit and suffered through the pain of its slow speed and large memory usage, I would say the speed and memory efficiency of bm25s is absolutely mind-blowing.

As a suggestion, I think it might be useful to add support for document metadata and filtering. The metadata would be fields like "author", "title", "date", etc., which wouldn't be included in the keyword tokenization but could be used to filter the search results at query time (e.g. only searching documents from a specific author).

Thanks!

xhluca commented 3 months ago

@dl423 Thank you for the suggestion! I think it's an interesting idea; however, I'm not sure how it could be incorporated into the current API without changing how the corpus is currently filtered, which might require breaking changes.

For now, could you perhaps share an example of how that would work? We could add it to the examples dir, so users can follow best practices. If it's something that would be straightforward to add, I'm happy to review a PR with the new feature (perhaps as a util function that can be called before retrieve?) and relevant unit tests!

dl423 commented 3 months ago

@xhluca I've been thinking about how to implement the filter, and I settled on a simpler solution compared to what I last mentioned.

I was originally envisioning filtering functionality where the query would be something like retriever.retrieve(query_tokens, k=2, filter={"author": "Charles Dickens"}). But implementing it might require significant changes to the code and could hurt performance. Supporting more advanced filtering, such as multiple filter conditions joined by AND or OR, could also be a challenge.

Instead, I'm now thinking of a simpler approach where a bitmask is passed to the retriever to do the filtering. The bitmask will be a list of 1's and 0's, each corresponding to a doc in the corpus. Only the docs corresponding to a 1 will be included in the search results.

Here's an example of what I mean:

# Suppose there are 5 docs in the corpus
bitmask = [1, 0, 1, 1, 0] 
retriever.retrieve(query_tokens, k=2, filter=bitmask)

Then only the first, third and fourth documents can appear in the search result.
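To make the intended semantics concrete, here is a minimal sketch of masked top-k selection. It is written in plain Python for clarity and is not how bm25s computes or selects scores internally; the function name masked_top_k and the score values are made up for illustration.

```python
import heapq

def masked_top_k(scores, mask, k):
    """Return up to k (doc_index, score) pairs, best first, using only
    documents whose mask entry is truthy. Hypothetical helper, not bm25s API."""
    if len(scores) != len(mask):
        raise ValueError("mask must have exactly one entry per document")
    # Keep only the indices of documents the mask allows.
    candidates = [i for i, keep in enumerate(mask) if keep]
    # Pick the k allowed documents with the highest scores.
    top = heapq.nlargest(k, candidates, key=lambda i: scores[i])
    return [(i, scores[i]) for i in top]

# Illustrative BM25 scores for the 5 docs (assumed values).
scores = [2.1, 3.5, 0.4, 1.7, 2.9]
bitmask = [1, 0, 1, 1, 0]
results = masked_top_k(scores, bitmask, k=2)
```

Docs 1 and 4 (0-based) have the highest raw scores here, but since their mask entries are 0 they are skipped, and the result contains docs 0 and 3 instead.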

Here's a practical use case for this kind of filtering:

  1. Suppose a corpus of docs is stored in a database (e.g. Postgres) along with their metadata such as author, title, etc.
  2. This corpus is indexed in bm25s and stored in the exact same order as in the database
  3. When I want to do a bm25 search on only the docs written by Charles Dickens, I would first construct a bitmask from the database, using a SQL query that returns a 1 for each row (doc) whose author is Charles Dickens and a 0 for every other row.
  4. Pass the bitmask to bm25s retriever, so the search results only contain docs written by Charles Dickens.

Essentially, this approach leverages the powerful filtering capability already offered by a database system to do the real heavy-lifting for the filtering. This way, the filtering functionality in bm25s can be kept quite simple.
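The bitmask construction in step 3 could look roughly like the sketch below. It uses Python's built-in sqlite3 with an in-memory database as a stand-in for Postgres; the docs table, its columns, and the build_bitmask helper are all assumptions for illustration. The key point is the ORDER BY, which keeps the mask aligned with the order in which the corpus was indexed.

```python
import sqlite3

# In-memory stand-in for the production database (table layout is assumed).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, author TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO docs (id, author, body) VALUES (?, ?, ?)",
    [
        (1, "Charles Dickens", "It was the best of times"),
        (2, "Jane Austen", "It is a truth universally acknowledged"),
        (3, "Charles Dickens", "Marley was dead, to begin with"),
        (4, "Charles Dickens", "Whether I shall turn out to be the hero"),
        (5, "Jane Austen", "Emma Woodhouse, handsome, clever, and rich"),
    ],
)

def build_bitmask(conn, author):
    """Return one 0/1 entry per row, in the same order the corpus was indexed.
    Hypothetical helper; assumes the corpus was indexed in ascending id order."""
    rows = conn.execute(
        "SELECT CASE WHEN author = ? THEN 1 ELSE 0 END FROM docs ORDER BY id",
        (author,),
    ).fetchall()
    return [flag for (flag,) in rows]

bitmask = build_bitmask(conn, "Charles Dickens")
```

More complex conditions (AND, OR, date ranges) would only change the WHERE-style expression inside the CASE, so the retriever itself would never need to know about them.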

I expect this to be a relatively minor change that's mostly going to be made in selection.py. I will submit a PR once I'm done, thanks. :))

xhluca commented 2 months ago

That would be pretty interesting! I think it's worth adding an example and tests for this; if it works well, I think it's a good idea to merge it. I also think naming the argument mask rather than filter would be better, as it is more explicit about what it accomplishes.