Open logan-markewich opened 1 month ago
I think the mixedbread.ai team did a great job with implementing bmx in https://github.com/mixedbread-ai/baguetter/
I'm not sure if it makes sense to copy their code into this repository, considering they are actively maintaining their new library. They might also make changes to their algorithms in the future too.
That's fair. Just thought it might be nice to have a single place for all things bm25 😁
I think if it is as easy as changing the term frequency component or the idf function, it would definitely be great to add! Im not sure if BMX is as simple as that yet though, since I haven't had the chance to read the paper closely.
At the moment, bm25+, BM25L, etc. are all very similar to the base implementation, with only small changes, so it was easy to add those variants.
Was reading more about this paper yesterday. It looks like it requires calculating two additional parameters -- term entropies, and "similarity"
i.e. in rust syntax,
let idf = ((num_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0).ln();
let tf = (term_freq * (self.alpha + 1.0)) / (term_freq + (self.k * ((1.0 - self.b) + self.b * doc_length / avg_doc_length)) + self.alpha * term_entropies.avg);
score += idf * (tf + self.beta * term_entropies.normalized[i] * similarity);
Where similarity is just counting the number of common terms, and term entropies is something like
// Calculate probabilities and entropy
let mut entropy = 0.0;
for freq in doc_term_freqs {
let p = freq as f64 / term_freq_sum as f64;
entropy -= p * p.log2();
}
Recently introduced here https://www.mixedbread.ai/blog/intro-bmx
Seems like there's enough info + another open source implementation that this algorithm could make it's way here