Order-based matching of corpus metadata to tokens

fabiannagel commented 4 months ago

Hi! Thanks a lot for this nice little library, the timing is perfect :)

If I want to provide additional metadata in my corpus, how is it matched to the indexed corpus tokens at retrieval time? Is it entirely based on both structures having the same order such that the indices apply?

Just looking for a quick confirmation before using this in a real-world application :)

Quick example to illustrate:

import bm25s
import Stemmer

# corpus with metadata
corpus = [
    {"id": 0, "text": "a cat is a feline and likes to purr"},
    {"id": 1, "text": "a dog is the human's best friend and loves to play"},
    {"id": 2, "text": "a bird is a beautiful animal that can fly"},
    {"id": 3, "text": "a fish is a creature that lives in water and swims"},
]

stemmer = Stemmer.Stemmer("english")

# build corpus without metadata
corpus_tokens = bm25s.tokenize([d['text'] for d in corpus], stopwords="en", stemmer=stemmer)
retriever = bm25s.BM25()
retriever.index(corpus_tokens)

query = "does the fish purr like a cat?"
query_tokens = bm25s.tokenize(query, stemmer=stemmer)

results, scores = retriever.retrieve(query_tokens, corpus=corpus, k=2)

for i in range(results.shape[1]):

    # doc is a dictionary with "id" and "text" - how are they matched?
    doc, score = results[0, i], scores[0, i]
    print(f"Rank {i + 1} (score: {score:.2f}): {doc}")

xhluca commented 4 months ago

Yes, the matching depends entirely on order! If you don't provide corpus=corpus, retrieve simply returns the indices, which you can use to match manually. bm25s does not do any index checking (since it only sees lists).
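To illustrate the manual matching described above, here is a minimal sketch in plain Python. It does not call bm25s at all: `doc_indices` is a hypothetical stand-in for the indices that retrieve would return without corpus=corpus, and the helper `match_by_position` and the `ids_at_index_time` list are names made up for this example. It shows that the position in the list is the only link between an index and its metadata, and one possible way to guard against reordering.

```python
# Corpus with metadata, in the same order it was tokenized and indexed.
corpus = [
    {"id": 0, "text": "a cat is a feline and likes to purr"},
    {"id": 1, "text": "a dog is the human's best friend and loves to play"},
    {"id": 2, "text": "a bird is a beautiful animal that can fly"},
    {"id": 3, "text": "a fish is a creature that lives in water and swims"},
]

# Hypothetical stand-in for the indices returned for one query with k=2
# (shape: number of queries x k). These values are assumed, not computed.
doc_indices = [[3, 0]]


def match_by_position(indices, docs):
    """Look up each returned index in docs purely by list position."""
    return [[docs[i] for i in row] for row in indices]


# Correct as long as corpus has the same order as at indexing time.
matched = match_by_position(doc_indices, corpus)

# Pitfall: if the list is reordered after indexing, the same indices
# silently point at different documents, and nothing raises an error.
reordered = list(reversed(corpus))
mismatched = match_by_position(doc_indices, reordered)

# One possible safeguard: record the ids in indexing order once, then
# resolve indices through the ids instead of through list positions.
ids_at_index_time = [doc["id"] for doc in corpus]
docs_by_id = {doc["id"]: doc for doc in reordered}
recovered = [
    [docs_by_id[ids_at_index_time[i]] for i in row] for row in doc_indices
]

for rank, doc in enumerate(matched[0], start=1):
    print(f"Rank {rank}: {doc}")
```

Keeping a separate list of ids in indexing order is just one design choice; anything that preserves the original order (or never mutates the corpus list after indexing) works equally well.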

fabiannagel commented 4 months ago

Thanks for clarifying!

xhluca / bm25s

Order-based matching of corpus metadata to tokens #4