Open dhdaines opened 3 months ago
For (1) I can just extract them from the Node code, it's quite easy to do...
For (2), it seems like this might be on purpose:
https://github.com/yeraydiazdiaz/lunr.py/blob/master/lunr/lunr.py#L66
Can you explain why? Bug-compatibility with lunr.js? (EDIT: yes, bug-compatibility, it appears)
After digging a bit more, it appears this is due to the difficulty of registering the necessary trimmers and stopword filters when the serialized index is reloaded. Only the stemmers are registered: https://github.com/yeraydiazdiaz/lunr.py/blob/master/lunr/languages/__init__.py#L99
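To make the registration problem concrete, here is a toy model of a label-based pipeline registry. It is only a sketch of the general idea, not lunr.py's actual code: REGISTRY, register and load_pipeline are made-up names, and the registered functions are placeholders.

```python
REGISTRY = {}  # label -> function; hypothetical stand-in for a pipeline registry


def register(label, fn):
    """Make a pipeline function available under a serializable label."""
    REGISTRY[label] = fn


def load_pipeline(labels):
    """Rebuild a pipeline from serialized labels; fail on unknown labels."""
    missing = [label for label in labels if label not in REGISTRY]
    if missing:
        raise KeyError("unregistered pipeline functions: " + ", ".join(missing))
    return [REGISTRY[label] for label in labels]


# Only the stemmer is registered at first (placeholder, not a real stemmer).
register("stemmer-fr", lambda term: term)

serialized = ["lunr-multi-trimmer-fr", "stopWordFilter-fr", "stemmer-fr"]

try:
    load_pipeline(serialized)
    loaded_before = True
except KeyError:
    loaded_before = False  # trimmer and stopword filter are unknown labels

# Once the missing functions are registered, the same labels load fine.
register("lunr-multi-trimmer-fr", lambda term: term)
register("stopWordFilter-fr", lambda term: term)
pipeline = load_pipeline(serialized)
```

In this model, a serialized pipeline can only name functions that were registered before loading, which is why anything beyond the stemmer has to be registered explicitly first.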
The workaround I found is to explicitly add them to search_pipeline in the builder, then explicitly call get_nltk_builder for the language(s) in question before loading the serialized index, e.g.:
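from lunr.index import Index
from lunr.languages import get_nltk_builder

for funcname in ("lunr-multi-trimmer-fr", "stopWordFilter-fr"):
    # Insert each function ahead of the French stemmer in the search pipeline
    builder.search_pipeline.before(
        builder.search_pipeline.registered_functions["stemmer-fr"],
        builder.search_pipeline.registered_functions[funcname],
    )
...
# Call this for each language in question before loading the serialized index
get_nltk_builder(["fr"])
index = Index.load(...)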
(2) is addressed in #151 now
I've submitted a PR to lunr-languages to fix the problem with the trimmer missing important characters (it wasn't passing its own test suite): https://github.com/MihaiValentin/lunr-languages/pull/115
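For illustration, here is a minimal sketch of how a trimmer built from a character class can eat a legitimate letter when that letter is missing from the class. This is not the lunr-languages or lunr.py code: make_trimmer is a hypothetical helper, and both character classes are made up for the example.

```python
import re


def make_trimmer(word_chars):
    """Return a function that strips non-word characters from both ends of a term.

    `word_chars` is the body of a regex character class. Hypothetical helper,
    not part of the lunr.py or lunr-languages API.
    """
    start = re.compile("^[^" + word_chars + "]+")
    end = re.compile("[^" + word_chars + "]+$")

    def trim(term):
        return end.sub("", start.sub("", term))

    return trim


O_CIRC = "\u00f4"  # the letter o with circumflex

incomplete = "A-Za-z\u00e9\u00e8"  # made-up class: has e-acute, e-grave, no o-circumflex
complete = incomplete + O_CIRC

trim_bad = make_trimmer(incomplete)
trim_good = make_trimmer(complete)

leading = O_CIRC + "ter"         # o-circumflex at the start of the term
interior = "t" + O_CIRC + "t"    # o-circumflex inside the term

print(trim_bad(leading))   # leading letter is stripped as if it were punctuation
print(trim_good(leading))  # term is left intact
```

Note that in this sketch only the ends of a term are affected; a missing letter in the interior of a term is left alone, so the problem shows up only for some words.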
I think we can reuse the same JS code that generates the lunr-languages trimmers, stemmers, and stopword filters to generate Python code for lunr.py. I hope to open a new PR soon that does this and addresses this issue!
I notice that when using language support, some words cannot be searched:
This would seem to be due to the missing trimmer in the search pipeline:
I'm not really sure why, but it seems the trimmer thinks ô should be trimmed:
So, there are really two problems: