On-the-fly stemming

Right now, stemming is done after the strings are split and converted to IDs:

https://github.com/xhluca/bm25s/blob/73c7dea9ea7f88a23a7fa9a94e9a7bca48669f1c/bm25s/tokenization.py#L152-L177

However, it can probably be done here instead:

https://github.com/xhluca/bm25s/blob/73c7dea9ea7f88a23a7fa9a94e9a7bca48669f1c/bm25s/tokenization.py#L141-L142

This would probably need:

token_to_stem = {}  # do we need this? maybe useful to keep, though stemmer_fn should be sufficient
token_to_index = {}  # maps each raw token to its stem's ID (the true ID) on the fly
stem_to_index = {}  # only tracks stems and their ID (this is the true vocab dict)

# example: changing -> chang, changed -> chang
# chang's stem_id = 42
# stem_to_index = {"chang": 42}  --> real vocab_dict
# token_to_index = {"changing": 42, "changed": 42}

# ...

for ...:
  if token not in token_to_index:
    stem = stemmer_fn(token)
    if stem not in stem_to_index:
      stem_to_index[stem] = len(stem_to_index)
    stem_id = stem_to_index[stem]
    token_to_index[token] = stem_id  # the token should now map to the stem's ID
  token_id = token_to_index[token]
# ...
vocab_dict = stem_to_index
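
For reference, here is a self-contained sketch of the loop above that can be run outside the library. It is not the actual bm25s tokenizer: the function name `tokenize_with_stemming`, the whitespace/lowercase splitting, and `toy_stemmer` are all hypothetical stand-ins (a real stemmer callable would be passed as `stemmer_fn`). It only shows that each distinct token is stemmed once and that IDs are assigned per stem in a single pass.

```python
from typing import Callable, Dict, List, Tuple


def tokenize_with_stemming(
    texts: List[str], stemmer_fn: Callable[[str], str]
) -> Tuple[List[List[int]], Dict[str, int]]:
    """Hypothetical helper: split, stem, and assign IDs in one pass."""
    token_to_index: Dict[str, int] = {}  # raw token -> stem ID (cache)
    stem_to_index: Dict[str, int] = {}   # stem -> ID (the real vocab)
    corpus_ids: List[List[int]] = []

    for text in texts:
        doc_ids: List[int] = []
        # Assumption: plain lowercase + whitespace split, not the library's regex.
        for token in text.lower().split():
            if token not in token_to_index:
                # The stemmer runs only the first time a token is seen.
                stem = stemmer_fn(token)
                if stem not in stem_to_index:
                    stem_to_index[stem] = len(stem_to_index)
                token_to_index[token] = stem_to_index[stem]
            doc_ids.append(token_to_index[token])
        corpus_ids.append(doc_ids)

    return corpus_ids, stem_to_index


def toy_stemmer(token: str) -> str:
    """Toy suffix stripper for illustration only; not a real stemmer."""
    for suffix in ("ing", "ed"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


corpus_ids, vocab_dict = tokenize_with_stemming(
    ["changing plans", "changed plans again"], toy_stemmer
)
# "changing" and "changed" both map to the ID of the stem "chang".
print(corpus_ids)
print(vocab_dict)
```

With this layout, `token_to_stem` is not needed: `token_to_index` already acts as the cache that avoids calling the stemmer twice on the same token.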

On-the-fly stemming #31