xhluca / bm25s

Fast lexical search library implementing BM25 in Python using Numpy and Scipy
https://bm25s.github.io
MIT License

[feature request] Implement BMX algorithm #40

Open · logan-markewich opened 1 month ago

logan-markewich commented 1 month ago

Recently introduced here https://www.mixedbread.ai/blog/intro-bmx

Seems like there's enough info, plus another open-source implementation, that this algorithm could make its way here

xhluca commented 1 month ago

I think the mixedbread.ai team did a great job implementing BMX in https://github.com/mixedbread-ai/baguetter/

I'm not sure it makes sense to copy their code into this repository, considering they are actively maintaining their new library. They might also change their algorithm in the future.

logan-markewich commented 1 month ago

That's fair. Just thought it might be nice to have a single place for all things bm25 😁

xhluca commented 1 month ago

I think if it is as easy as changing the term-frequency component or the IDF function, it would definitely be great to add! I'm not sure yet whether BMX is that simple, since I haven't had the chance to read the paper closely.

At the moment, BM25+, BM25L, etc. are all very similar to the base implementation, with only small changes, so those variants were easy to add.
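
For illustration, here is a rough sketch of what "small changes" means, in the Rust syntax used later in this thread (these are the textbook BM25L/BM25+ formulas, not bm25s's actual code, which is Python/Numpy):

// Standard BM25 term-frequency component.
fn tf_bm25(tf: f64, k1: f64, b: f64, dl: f64, avgdl: f64) -> f64 {
    (tf * (k1 + 1.0)) / (tf + k1 * (1.0 - b + b * dl / avgdl))
}

// BM25L: shift the length-normalized count c by a small delta.
fn tf_bm25l(tf: f64, k1: f64, b: f64, dl: f64, avgdl: f64, delta: f64) -> f64 {
    let c = tf / (1.0 - b + b * dl / avgdl);
    ((k1 + 1.0) * (c + delta)) / (k1 + c + delta)
}

// BM25+: add a flat delta bonus whenever the term occurs at all.
fn tf_bm25plus(tf: f64, k1: f64, b: f64, dl: f64, avgdl: f64, delta: f64) -> f64 {
    (tf * (k1 + 1.0)) / (tf + k1 * (1.0 - b + b * dl / avgdl)) + delta
}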

logan-markewich commented 13 hours ago

Was reading more about the paper yesterday. It looks like it requires computing two additional quantities -- term entropies, and a query-document "similarity"

i.e., in Rust syntax:

// Lucene-style idf, same as in BM25
let idf = ((num_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0).ln();
// BM25-style tf saturation, with the average term entropy folded into the denominator
let tf = (term_freq * (self.alpha + 1.0)) / (term_freq + (self.k * ((1.0 - self.b) + self.b * doc_length / avg_doc_length)) + self.alpha * term_entropies.avg);
// plus an entropy-weighted similarity bonus controlled by beta
score += idf * (tf + self.beta * term_entropies.normalized[i] * similarity);
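
If I'm reading this right, setting alpha and beta to 0 collapses the whole thing back to plain BM25 (with the (alpha + 1.0) numerator factor standing in for the usual (k + 1.0)), so BMX is essentially BM25 plus these two entropy-based pieces.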

Where similarity is just counting the number of terms shared between the query and the document (a sketch follows the snippet below), and term entropies is something like

// Entropy of one term, from its frequency distribution over the docs containing it
let term_freq_sum: u32 = doc_term_freqs.iter().sum();
let mut entropy = 0.0;
for freq in doc_term_freqs {
    let p = freq as f64 / term_freq_sum as f64;
    entropy -= p * p.log2();
}
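
And a rough sketch of the similarity count, assuming it is simply the number of query terms that also appear in the document (whether/how it gets normalized is something to double-check against the paper and baguetter):

use std::collections::HashSet;

// Number of query terms that also occur in the document.
fn similarity(query_terms: &[&str], doc_terms: &HashSet<&str>) -> f64 {
    query_terms.iter().filter(|t| doc_terms.contains(*t)).count() as f64
}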