yeraydiazdiaz / lunr.py

A Python implementation of Lunr.js 🌖
http://lunr.readthedocs.io
MIT License

Indexing and search pipelines are mismatched with language support #149

Open dhdaines opened 3 months ago

dhdaines commented 3 months ago

I notice that when using language support, some words cannot be searched:

from lunr import lunr

index = lunr(
    ref="id",
    fields=["texte"],
    documents=[{"id": "1", "texte": "Allô tout le monde!"}],
    languages="fr",
)
print(index.search("allo"))  # prints [], should print something!

This would seem to be due to the missing trimmer in the search pipeline:

from lunr import get_default_builder

print(get_default_builder("fr").pipeline)
# <Pipeline stack="lunr-multi-trimmer-fr,stopWordFilter-fr,stemmer-fr">
print(index.pipeline)
# <Pipeline stack="stemmer-fr">

I'm not sure why, but the trimmer seems to think ô should be trimmed, since the index contains "all" rather than "allô":

print(index.serialize()["invertedIndex"])
# [['all', {'texte': {'1': defaultdict(<class 'list'>, {})}, '_index': 0}], ['mond', {'texte': {'1': defaultdict(<class 'list'>, {})}, '_index': 2}], ['tout', {'texte': {'1': defaultdict(<class 'list'>, {})}, '_index': 1}]]
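To make the mismatch concrete, here is a minimal, self-contained sketch that does not use lunr.py at all. The names toy_trimmer, toy_stemmer and run are hypothetical stand-ins, and the trimmer's ASCII-only character class is an assumption chosen to mimic the behaviour seen above, not the actual lunr.py trimmer. It shows that a term indexed through trimmer plus stemmer cannot be found when the query only goes through the stemmer:

```python
import re


def toy_trimmer(token):
    # Assumption: a trimmer whose character class omits accented letters,
    # so anything other than a-z is stripped from both ends of the token.
    return re.sub(r"^[^a-z]+|[^a-z]+$", "", token)


def toy_stemmer(token):
    # Placeholder for a real stemmer: drop a trailing "e" on longer words.
    return token[:-1] if len(token) > 3 and token.endswith("e") else token


def run(pipeline, text):
    # Lowercase, split on whitespace, apply each function in order,
    # and drop tokens that end up empty.
    tokens = text.lower().split()
    for fn in pipeline:
        tokens = [fn(t) for t in tokens]
    return [t for t in tokens if t]


index_pipeline = [toy_trimmer, toy_stemmer]
search_pipeline = [toy_stemmer]  # trimmer missing, as in the report

# "\u00f4" is the precomposed character ô, i.e. "Allô tout le monde!"
indexed_terms = set(run(index_pipeline, "All\u00f4 tout le monde!"))

mismatched_query = run(search_pipeline, "all\u00f4")
matched_query = run(index_pipeline, "all\u00f4")

print(indexed_terms)
print(mismatched_query[0] in indexed_terms)
print(matched_query[0] in indexed_terms)
```

In this toy model the query term is left untouched on the search side while the indexed term lost its last character, so the lookup misses. Running the query through the same pipeline as the index produces a matching term.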

So, there are really two problems:

  1. The trimmer has odd ideas about what characters are in the language (known problem, see https://github.com/yeraydiazdiaz/lunr.py/blob/master/lunr/languages/trimmer.py#L7)
  2. The trimmer and stopword filters are not in the search pipeline.

dhdaines commented 3 months ago

For (1), I can just extract the character lists from the Node code; it's quite easy to do.

dhdaines commented 3 months ago

For (2), it seems like this might be on purpose:

https://github.com/yeraydiazdiaz/lunr.py/blob/master/lunr/lunr.py#L66

Can you explain why? Bug-compatibility with lunr.js? (EDIT: yes, bug-compatibility, it appears)

dhdaines commented 3 months ago

After digging a bit more, it appears this is due to the difficulty of registering the necessary trimmers and stopword filters when the serialized index is reloaded. Only the stemmers are registered: https://github.com/yeraydiazdiaz/lunr.py/blob/master/lunr/languages/__init__.py#L99
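As a rough illustration of why registration matters on reload, here is a toy model of a name-based pipeline registry. Everything in it (REGISTRY, register, load_pipeline, the KeyError) is hypothetical and only sketches the general idea that a serialized index stores function names which must be resolvable at load time; it is not lunr.py's actual code:

```python
# Toy model only: all names and behaviour here are assumptions.
REGISTRY = {}


def register(name, fn):
    # Make a pipeline function resolvable by its serialized name.
    REGISTRY[name] = fn


def load_pipeline(names):
    # Rebuild a pipeline from the function names stored in a serialized index.
    missing = [n for n in names if n not in REGISTRY]
    if missing:
        raise KeyError(f"unregistered pipeline functions: {missing}")
    return [REGISTRY[n] for n in names]


# Only the stemmer is registered at first, mirroring the situation described.
register("stemmer-fr", lambda token: token)

serialized = ["lunr-multi-trimmer-fr", "stopWordFilter-fr", "stemmer-fr"]

try:
    load_pipeline(serialized)
    load_error = None
except KeyError as exc:
    load_error = exc

# Registering the remaining functions before loading makes the reload succeed.
register("lunr-multi-trimmer-fr", lambda token: token)
register("stopWordFilter-fr", lambda token: token)
pipeline = load_pipeline(serialized)
```

In this model, a search pipeline that names the trimmer and stopword filter can only be reloaded if something registers those functions first.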

The workaround I found is to explicitly add them to search_pipeline in the builder, then explicitly call get_nltk_builder for the language(s) in question before loading the serialized index, e.g.:

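from lunr import get_default_builder
from lunr.index import Index
from lunr.languages import get_nltk_builder

# Assumption: the builder comes from get_default_builder for the language.
builder = get_default_builder("fr")
# Insert the trimmer and stopword filter ahead of the stemmer.
for funcname in ("lunr-multi-trimmer-fr", "stopWordFilter-fr"):
    builder.search_pipeline.before(
        builder.search_pipeline.registered_functions["stemmer-fr"],
        builder.search_pipeline.registered_functions[funcname],
    )

...

# Register the French pipeline functions before loading the serialized index.
get_nltk_builder(["fr"])
index = Index.load(...)

dhdaines commented 3 months ago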

(2) is addressed in #151 now

dhdaines commented 2 months ago

I've submitted a PR to lunr-languages to fix the problem with the trimmer missing important characters (it wasn't passing its own test suite): https://github.com/MihaiValentin/lunr-languages/pull/115

I think we can re-use the same JS code that generates the lunr-languages trimmers, stemmers, and stopword filters to generate Python code for lunr.py. I hope to make a new PR soon that does this and addresses this issue!