xhluca / bm25s

Fast lexical search implementing BM25 in Python using Numpy, Numba and Scipy
https://bm25s.github.io
MIT License

Make empty strings an acceptable token #67

Closed xhluca closed 1 month ago

xhluca commented 1 month ago

Closes #60

xhluca commented 1 month ago

@zhuwenxing does this tackle the issue you encountered? Feel free to contribute tests for that specific instance.

zhuwenxing commented 1 month ago

Hi, @xhluca . From my perspective, raising an error when tokenizing an empty string is not a good practice. In real-world data distribution, it's common for data to be imperfect and contain null values (or empty strings), so it's normal for data to include empty strings.

Raising an error directly for empty strings might frustrate developers, who would then have to clean the data before using bm25s.

Additionally, I experimented with Elasticsearch, and inserting data with empty strings didn't cause any issues. Even if all the data were empty strings, searching wouldn't result in an error.

xhluca commented 1 month ago

You are right. I've removed the ValueError and instead made the empty string an accepted vocab token.

Can you let me know if that works on your end?
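For context, the idea behind the fix can be illustrated with a minimal sketch. This is not the actual bm25s internals; all function names here are hypothetical, and it only shows the approach of reserving a vocab id for the empty string so empty documents tokenize cleanly instead of raising an error:

```python
# Hypothetical sketch, not bm25s internals: reserve a vocab id for ""
# so that empty documents produce a valid token sequence.

def build_vocab(corpus):
    """Map every token (including the empty string) to an integer id."""
    vocab = {"": 0}  # id 0 is reserved for the empty-string token
    for doc in corpus:
        for token in doc.split():
            vocab.setdefault(token, len(vocab))
    return vocab

def tokenize(doc, vocab):
    """Return token ids; an empty document maps to the empty-string token."""
    tokens = doc.split()
    if not tokens:
        return [vocab[""]]  # instead of raising an error
    return [vocab[t] for t in tokens if t in vocab]

corpus = ["", "a cat sat", ""]
vocab = build_vocab(corpus)
print(tokenize("", vocab))       # -> [0]
print(tokenize("a cat", vocab))  # -> [1, 2]
```

An empty corpus entry is then indexed like any other document (it simply contributes nothing to the scores), which matches the Elasticsearch behavior described above.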

xhluca commented 1 month ago

@zhuwenxing I tried running your example; however, there were a few things I had to change. I also removed jieba to make it more generic (since the problem can occur in English as well). Here's the updated code:

import bm25s
from bm25s.tokenization import Tokenizer

# Create a corpus consisting only of empty strings
corpus = ["", "", "", ""]
# Create a list of queries
queries = ["what is the meaning of life?"]

# The tokenizer will return a list of list of tokens
tokenizer = Tokenizer()
corpus_tokens = tokenizer.tokenize(corpus, return_as="tuple", allow_empty=True)
print("Corpus tokens:", corpus_tokens)

query_tokens = tokenizer.tokenize(queries, return_as="ids", update_vocab=False, allow_empty=True)
print(f"Query tokens: {query_tokens}")

retriever = bm25s.BM25()
retriever.index(corpus_tokens)

results, scores = retriever.retrieve(query_tokens, corpus=corpus, k=2)
print("Results:", results)

If you run the updated code with the latest release of this library, you will encounter an error; however, if you run it after commit https://github.com/xhluca/bm25s/pull/67/commits/c87f2fe2306023ae7e0c71bfe7c3b87ba0a6d71a of this branch, the error is resolved.

Anyways, let me know if this is the intended behavior you were expecting.