Closed by xhluca, 2 months ago
Would probably make sense to have a Tokenizer class at this point, to allow for generator/streaming use. I.e.:
```python
class Tokenizer:
    def __init__(self):
        self.vocab_dict = {}

    def _tokenize(self, texts):
        for text in texts:
            # ...
            # update self.vocab_dict
            yield text_tokens

    def __call__(self, texts, stream=False):
        # Note: a function body containing `yield` is always a generator,
        # so the eager path has to materialize the stream explicitly.
        tokens = self._tokenize(texts)
        return tokens if stream else list(tokens)
```
This would make the tokenizer very memory-efficient, since only one document's tokens need to be in memory at a time.
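For concreteness, here is a minimal self-contained sketch of the idea (not bm25s's actual code): whitespace splitting stands in for the real tokenization, and the vocab maps tokens to integer ids as they are first seen.

```python
class Tokenizer:
    """Sketch only: whitespace splitting stands in for real tokenization."""

    def __init__(self):
        self.vocab_dict = {}  # token -> integer id

    def _tokenize(self, texts):
        for text in texts:
            ids = []
            for token in text.lower().split():
                if token not in self.vocab_dict:
                    self.vocab_dict[token] = len(self.vocab_dict)
                ids.append(self.vocab_dict[token])
            yield ids  # one document's token ids at a time

    def __call__(self, texts, stream=False):
        tokens = self._tokenize(texts)
        return tokens if stream else list(tokens)


tokenizer = Tokenizer()
corpus = ["the quick fox", "the lazy dog"]

# Eager: the full corpus is materialized as a list of id lists.
all_tokens = tokenizer(corpus)

# Streaming: documents are tokenized lazily, one at a time.
for doc_ids in tokenizer(corpus, stream=True):
    pass
```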
Right now, stemming is done after the strings are split and converted to IDs:
https://github.com/xhluca/bm25s/blob/73c7dea9ea7f88a23a7fa9a94e9a7bca48669f1c/bm25s/tokenization.py#L152-L177
However, it can probably be done here instead:
https://github.com/xhluca/bm25s/blob/73c7dea9ea7f88a23a7fa9a94e9a7bca48669f1c/bm25s/tokenization.py#L141-L142
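The idea of stemming at split time could look roughly like this. This is a hedged sketch, not bm25s's implementation: `naive_stem` is a placeholder for a real stemmer, and the point is only that stemming each token before the id lookup makes the vocab map stems, not surface forms.

```python
def naive_stem(token):
    # Placeholder stemmer (assumption, not a real algorithm):
    # strip a trailing "s" from longer tokens.
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


def tokenize_with_early_stemming(texts, stemmer=naive_stem):
    vocab = {}
    corpus_ids = []
    for text in texts:
        ids = []
        for token in text.lower().split():
            stem = stemmer(token)  # stem BEFORE assigning/looking up an id
            if stem not in vocab:
                vocab[stem] = len(vocab)
            ids.append(vocab[stem])
        corpus_ids.append(ids)
    return corpus_ids, vocab


ids, vocab = tokenize_with_early_stemming(["dogs run", "dog runs"])
# "dogs"/"dog" and "runs"/"run" collapse to the same ids,
# and the vocab never stores the unstemmed surface forms.
```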
Probably would need: