xhluca / bm25s

Fast lexical search library implementing BM25 in Python using Numpy and Scipy
https://bm25s.github.io
MIT License

Can you query without a tokenization step? #21

Closed snewcomer closed 2 months ago

snewcomer commented 2 months ago

When I already have an index built over the queries, I would like to retrieve the tokenized version of a query from that index instead of tokenizing it again. This use case comes up when running a BM25 evaluation across a matrix of known x and y object types.

x_corpus = [...]

y_corpus = [
    "fooo",
    "does the fish purr like a cat?"  # no trailing comma here, so Python concatenates this string with the next one
    "a bird is a beautiful animal that can fly",
    "a fish is a creature that lives in water and swims",
]

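import bm25s
import Stemmer  # PyStemmer; an assumption, the post does not show how stemmer was created

stemmer = Stemmer.Stemmer("english")

class XEntity:
  # Tokenize the x corpus and build a BM25 index over it
  corpus_tokens = bm25s.tokenize(x_corpus, stopwords="en", stemmer=stemmer)
  x_retriever = bm25s.BM25()
  x_retriever.index(corpus_tokens)

class YEntity:
  # Same steps for the y corpus, which gets its own vocabulary
  corpus_tokens = bm25s.tokenize(y_corpus, stopwords="en", stemmer=stemmer)
  y_retriever = bm25s.BM25()
  y_retriever.index(corpus_tokens)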

The resulting corpus_tokens for y_corpus:

Tokenized(ids=[[8], [10, 7, 9, 2, 11, 4, 13, 6, 5, 12], [7, 0, 3, 14, 1]], vocab={'creatur': 0, 'swim': 1, 'like': 2, 'live': 3, 'bird': 4, 'can': 5, 'anim': 6, 'fish': 7, 'fooo': 8, 'purr': 9, 'doe': 10, 'cat': 11, 'fli': 12, 'beauti': 13, 'water': 14})
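
To read this output, it can help to invert vocab and decode each list of IDs back into stems. This is a minimal sketch in plain Python that uses only the values printed above; the names id_to_token and decoded are introduced here for illustration and are not part of bm25s.

```python
# Values copied from the Tokenized output above.
ids = [[8], [10, 7, 9, 2, 11, 4, 13, 6, 5, 12], [7, 0, 3, 14, 1]]
vocab = {'creatur': 0, 'swim': 1, 'like': 2, 'live': 3, 'bird': 4,
         'can': 5, 'anim': 6, 'fish': 7, 'fooo': 8, 'purr': 9,
         'doe': 10, 'cat': 11, 'fli': 12, 'beauti': 13, 'water': 14}

# Invert the token -> id mapping so IDs can be turned back into stems.
id_to_token = {token_id: token for token, token_id in vocab.items()}

# Decode every document's ID list into its list of stems.
decoded = [[id_to_token[token_id] for token_id in doc] for doc in ids]

for doc in decoded:
    print(doc)
```

The second decoded document holds the stems of both the "fish purr" sentence and the "bird" sentence, which is why only three documents appear for the four strings in y_corpus.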

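If I tokenize the query independently, the IDs come from a fresh vocabulary and no longer match the ones above:

a_query = "does the fish purr like a cat?"
bm25s.tokenize(a_query, stopwords="en", stemmer=stemmer)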
Tokenized(ids=[[3, 1, 2, 0, 4]], vocab={'like': 0, 'fish': 1, 'purr': 2, 'doe': 3, 'cat': 4})

These IDs differ from the ones assigned when the corpus was tokenized (it looks like each call builds its own vocabulary, perhaps as an optimization). Is there a way to get, from the index, the representation of a query exactly as it was when the index was built?

query_from_y = precomputed_representation_of_a_query_without_tokenization_step
ranked_results = x_retriever.retrieve(query_from_y, k=5)
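
The mismatch described above can be illustrated without bm25s at all. The sketch below takes the two outputs printed earlier, decodes the independently tokenized query back to stems, and re-encodes it with the corpus vocabulary. The names query_tokens and remapped are hypothetical helpers for this illustration, and dropping stems that are missing from the target vocabulary is an assumption, not documented library behavior.

```python
# Vocabulary produced when y_corpus was tokenized (copied from above).
corpus_vocab = {'creatur': 0, 'swim': 1, 'like': 2, 'live': 3, 'bird': 4,
                'can': 5, 'anim': 6, 'fish': 7, 'fooo': 8, 'purr': 9,
                'doe': 10, 'cat': 11, 'fli': 12, 'beauti': 13, 'water': 14}

# Output of tokenizing the query on its own (copied from above).
query_ids = [3, 1, 2, 0, 4]
query_vocab = {'like': 0, 'fish': 1, 'purr': 2, 'doe': 3, 'cat': 4}

# Step 1: decode the query IDs into stems using the query's own vocabulary.
query_id_to_token = {token_id: token for token, token_id in query_vocab.items()}
query_tokens = [query_id_to_token[token_id] for token_id in query_ids]

# Step 2: re-encode the stems with the corpus vocabulary.
# Assumption: stems unknown to the corpus vocabulary are skipped.
remapped = [corpus_vocab[token] for token in query_tokens if token in corpus_vocab]

print(query_tokens)  # ['doe', 'fish', 'purr', 'like', 'cat']
print(remapped)      # [10, 7, 9, 2, 11]
```

The remapped IDs match the first five IDs of the second y_corpus document, so the stems themselves are the representation that carries over between vocabularies. The same reasoning would apply to x_retriever, which has its own vocabulary built from x_corpus; whether retrieve accepts such lists directly should be checked against the bm25s documentation.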