When the queries themselves are already part of an index, I would like to retrieve the tokenized version of a query from that index. This use case comes up when running a BM25 eval across a matrix of known x and y types of objects.
import bm25s
import Stemmer  # PyStemmer; assumed to be where `stemmer` comes from

stemmer = Stemmer.Stemmer("english")

x_corpus = [...]
y_corpus = [
    "fooo",
    "does the fish purr like a cat?",
    "a bird is a beautiful animal that can fly",
    "a fish is a creature that lives in water and swims",
]

class XEntity:
    # Tokenize and index the x corpus
    corpus_tokens = bm25s.tokenize(x_corpus, stopwords="en", stemmer=stemmer)
    x_retriever = bm25s.BM25()
    x_retriever.index(corpus_tokens)

class YEntity:
    # Tokenize and index the y corpus
    corpus_tokens = bm25s.tokenize(y_corpus, stopwords="en", stemmer=stemmer)
    y_retriever = bm25s.BM25()
    y_retriever.index(corpus_tokens)
If I tokenize the query independently, I get:
a_query = "does the fish purr like a cat?"
Tokenized(ids=[[3, 1, 2, 0, 4]], vocab={'like': 0, 'fish': 1, 'purr': 2, 'doe': 3, 'cat': 4})
Given what tokenizing the corpus for the index produces (it looks like some optimizations are happening), is there a way to get the subset of the index that represents this query exactly as it was tokenized when the index was built?
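To make the question concrete, here is a rough sketch of the lookup I have in mind. It does not call bm25s at all: `Tokenized` below is a stand-in namedtuple that only assumes the `ids` and `vocab` fields visible in the printed output above, and `tokens_for_doc` and `reencode` are hypothetical helper names, not part of the library.

```python
from collections import namedtuple

# Stand-in for the printed object above; assumes only `ids` and `vocab` fields.
Tokenized = namedtuple("Tokenized", ["ids", "vocab"])


def tokens_for_doc(tokenized, doc_index):
    """Return (ids, tokens) for one document, using the shared vocab."""
    id_to_token = {idx: tok for tok, idx in tokenized.vocab.items()}
    doc_ids = list(tokenized.ids[doc_index])
    return doc_ids, [id_to_token[i] for i in doc_ids]


def reencode(tokens, target_vocab):
    """Map token strings onto another vocab, dropping tokens it does not know."""
    return [target_vocab[t] for t in tokens if t in target_vocab]


# The independently tokenized query from above.
query_tokens = Tokenized(
    ids=[[3, 1, 2, 0, 4]],
    vocab={"like": 0, "fish": 1, "purr": 2, "doe": 3, "cat": 4},
)
ids, tokens = tokens_for_doc(query_tokens, 0)

# A made-up corpus vocab, purely for illustration.
corpus_vocab = {"fish": 7, "cat": 2, "bird": 5}
corpus_ids = reencode(tokens, corpus_vocab)
```

If the corpus was tokenized in one call, the same `tokens_for_doc` lookup with the document's position in `y_corpus` would, under these assumptions, give the query as it was represented at index time.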