xhluca / bm25s

Fast lexical search library implementing BM25 in Python using Numpy and Scipy
https://bm25s.github.io
MIT License
761 stars, 29 forks

On-the-fly stemming #31

Closed xhluca closed 1 day ago

xhluca commented 1 month ago

Right now, stemming is done after the strings are split and converted to IDs:

https://github.com/xhluca/bm25s/blob/73c7dea9ea7f88a23a7fa9a94e9a7bca48669f1c/bm25s/tokenization.py#L152-L177

However, it can probably be done here instead:

https://github.com/xhluca/bm25s/blob/73c7dea9ea7f88a23a7fa9a94e9a7bca48669f1c/bm25s/tokenization.py#L141-L142

This would probably need something like:

token_to_stem = {}  # do we need this? maybe useful to keep, though stemmer_fn should be sufficient
token_to_index = {}  # this is used to convert tokens to stem id (the true id) on the fly
stem_to_index = {}  # only tracks stems and their ID (this is the true vocab dict)

# example: changing -> chang, changed -> chang
# chang's stem_id = 42
# stem_to_index = {"chang": 42}  --> real vocab_dict
# token_to_index = {"changing": 42, "changed": 42}

# ...

for ...:
  if token not in token_to_index:
    stem = stemmer_fn(token)
    if stem not in stem_to_index:
      stem_to_index[stem] = len(stem_to_index)
    stem_id = stem_to_index[stem]
    token_to_index[token] = stem_id  # the token should now map to the stem's ID
  token_id = token_to_index[token]
# ...
vocab_dict = stem_to_index
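
To make the idea above concrete, here is a minimal, self-contained sketch of the loop. It is not the library's actual tokenizer: `toy_stemmer` and `tokenize_with_stemming` are hypothetical names, whitespace splitting stands in for the real splitting logic, and the suffix-stripping stemmer is only a stand-in for whatever `stemmer_fn` would be passed in.

```python
def toy_stemmer(token):
    # Hypothetical stand-in for stemmer_fn: strips a few common suffixes.
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def tokenize_with_stemming(texts, stemmer_fn):
    token_to_index = {}  # raw token -> stem ID (cache, so each token is stemmed once)
    stem_to_index = {}   # stem -> ID (the real vocabulary)
    corpus_ids = []
    for text in texts:
        doc_ids = []
        # Assumption: lowercase + whitespace split replaces the real splitter.
        for token in text.lower().split():
            if token not in token_to_index:
                stem = stemmer_fn(token)
                if stem not in stem_to_index:
                    stem_to_index[stem] = len(stem_to_index)
                token_to_index[token] = stem_to_index[stem]
            doc_ids.append(token_to_index[token])
        corpus_ids.append(doc_ids)
    return corpus_ids, stem_to_index


corpus_ids, vocab_dict = tokenize_with_stemming(
    ["changing plans", "changed plans again"], toy_stemmer
)
print(corpus_ids)
print(vocab_dict)
```

With this layout the stemmer runs once per unique token rather than once per occurrence, and no separate pass is needed to remap token IDs to stem IDs afterwards.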

xhluca commented 1 month ago

It would probably make sense to have a Tokenizer class at this point, to allow generator-based streaming, i.e.:

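class Tokenizer:
    def __init__(self):
        self.vocab_dict = {}

    def _iter_tokens(self, texts):
        for text in texts:
            # ...
            # update self.vocab_dict
            yield text_tokens

    def __call__(self, texts, stream=False):
        # A function body containing `yield` always returns a generator,
        # so the loop lives in a helper and is only materialized on request.
        token_stream = self._iter_tokens(texts)
        return token_stream if stream else list(token_stream)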

This would make it very memory efficient: with stream=True, only one document's tokens need to be held in memory at a time.
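
As a rough illustration of how such a class could behave, here is a runnable sketch. The name `StreamingTokenizer`, the helper `_iter_token_ids`, and the whitespace-based splitting are all assumptions made for the example, not the library's API; stemming is left out to keep it short.

```python
class StreamingTokenizer:
    """Hypothetical sketch: maps texts to token IDs, optionally lazily."""

    def __init__(self):
        self.vocab_dict = {}  # token -> ID, grows as texts are consumed

    def _iter_token_ids(self, texts):
        for text in texts:
            text_ids = []
            # Assumption: lowercase + whitespace split replaces the real splitter.
            for token in text.lower().split():
                if token not in self.vocab_dict:
                    self.vocab_dict[token] = len(self.vocab_dict)
                text_ids.append(self.vocab_dict[token])
            yield text_ids

    def __call__(self, texts, stream=False):
        ids = self._iter_token_ids(texts)
        # Return the lazy generator when streaming, a full list otherwise.
        return ids if stream else list(ids)


tokenizer = StreamingTokenizer()
streamed = tokenizer(["a b a", "b c"], stream=True)

first = next(streamed)                        # only the first text is processed
vocab_after_first = dict(tokenizer.vocab_dict)
rest = list(streamed)                         # consume the remaining texts
```

One consequence of this design is that in streaming mode `vocab_dict` is only complete once the generator has been fully consumed, so anything that needs the final vocabulary size has to wait until then.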