Closed dl423 closed 2 months ago
@dl423 Thank you for the suggestion! I think it's an interesting idea, however I'm not sure how that could be incorporated into the current API without changing how the corpus is currently filtered, which might require breaking changes.
For now, could you perhaps share an example of how that would work? We could add it to the examples dir, so users can follow best practices. If it's something that would be straightforward to add, I'm happy to review a PR with the new feature (perhaps as a util function that can be called before retrieve?) and relevant unit tests!
@xhluca I've been thinking about how to implement the filter, and I settled on a simpler solution compared to what I last mentioned.
I was originally envisioning a filtering functionality where the query would be something like retriever.retrieve(query_tokens, k=2, filter={"author": "Charles Dickens"}). But implementing it might require significant changes to the code and could have performance implications. Also, supporting more advanced filtering operations, such as multiple filter conditions joined by AND or OR, could be a challenge.
Instead, I'm now thinking of a simpler approach where a bitmask is passed to the retriever to do the filtering. The bitmask will be a list of 1's and 0's, each corresponding to a doc in the corpus. Only the docs corresponding to a 1 will be included in the search results.
Here's an example of what I mean:
# Suppose there are 5 docs in the corpus
bitmask = [1, 0, 1, 1, 0]
retriever.retrieve(query_tokens, k=2, filter=bitmask)
Then only the first, third and fourth documents can appear in the search result.
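To make the intended behavior concrete, here is a minimal sketch of mask-based selection in plain NumPy. It is not the bm25s implementation: the masked_top_k helper and the score values are hypothetical and only illustrate how a bitmask could restrict which docs are eligible for the top-k.

```python
import numpy as np

def masked_top_k(scores, mask, k):
    """Return (indices, scores) of the k best docs whose mask entry is 1.

    Hypothetical helper for illustration; not part of bm25s.
    """
    scores = np.asarray(scores, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    if scores.shape != mask.shape:
        raise ValueError("mask must have exactly one entry per document")
    # Excluded docs get -inf so they can never outrank an allowed doc.
    masked = np.where(mask, scores, -np.inf)
    # Never return more docs than the mask allows.
    k = min(k, int(mask.sum()))
    # Sort by descending score and keep the first k indices.
    top = np.argsort(-masked, kind="stable")[:k]
    return top, masked[top]

# Made-up relevance scores for a corpus of 5 docs
scores = [2.1, 3.5, 0.4, 1.7, 2.9]
bitmask = [1, 0, 1, 1, 0]

indices, top_scores = masked_top_k(scores, bitmask, k=2)
print(indices, top_scores)
```

The second and fifth docs have the highest raw scores here, but the mask removes them, so the result comes only from the first, third and fourth docs.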
Here's a practical use case for this kind of filtering:
Essentially, this approach leverages the powerful filtering capability already offered by a database system to do the real heavy-lifting for the filtering. This way, the filtering functionality in bm25s can be kept quite simple.
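As a rough sketch of that division of labor, the example below keeps the metadata in an in-memory SQLite table and turns a query result into a bitmask. The table layout, the build_mask helper, and the sample rows are all assumptions made for illustration; the only requirement is that doc_id matches each doc's position in the corpus.

```python
import sqlite3

# Hypothetical metadata table; doc_id is the doc's position in the corpus.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE docs (doc_id INTEGER PRIMARY KEY, author TEXT, year INTEGER)"
)
conn.executemany(
    "INSERT INTO docs VALUES (?, ?, ?)",
    [
        (0, "Charles Dickens", 1859),
        (1, "Jane Austen", 1813),
        (2, "Charles Dickens", 1861),
        (3, "Charles Dickens", 1838),
        (4, "Mary Shelley", 1818),
    ],
)

def build_mask(conn, where_sql, params, corpus_size):
    """Build a 0/1 list with a 1 for every doc matching the WHERE clause.

    where_sql is assumed to be written by the developer, never taken from
    user input; user-supplied values go through params placeholders.
    """
    mask = [0] * corpus_size
    query = f"SELECT doc_id FROM docs WHERE {where_sql}"
    for (doc_id,) in conn.execute(query, params):
        mask[doc_id] = 1
    return mask

# Single condition
mask = build_mask(conn, "author = ?", ("Charles Dickens",), corpus_size=5)

# AND / OR conditions are handled by the database, not by the retriever
combined = build_mask(
    conn, "author = ? AND year >= ?", ("Charles Dickens", 1850), corpus_size=5
)
print(mask, combined)
```

The resulting list can then be handed to the retriever as the bitmask, so arbitrarily complex conditions never need to be expressed in the retrieval API itself.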
I expect this to be a relatively minor change that's mostly going to be made in selection.py. I will submit a PR once I'm done, thanks. :))
That would be pretty interesting! I think it's worth adding an example and tests for this; if it works well, I think it's a good idea to merge it. I also think mask would be a better argument name than filter, as it is more explicit about what it accomplishes.
Hi, I've used bm25s on a fairly large production dataset, and I'm super-impressed by the speed!! Having fumbled around with rank_bm25 quite a bit and suffered through the pain of its slow speed and large memory usage, I would say the speed and memory efficiency of bm25s is absolutely mind-blowing.
As a suggestion, I think it might be useful to add support for document metadata and filtering. The metadata would be fields like "author", "title", "date", etc. which wouldn't be included in the keyword tokenization, but can be used for filtering the search results during query (e.g. only searching for documents from a specific author).
Thanks!