yeraydiazdiaz / lunr.py

A Python implementation of Lunr.js 🌖
http://lunr.readthedocs.io
MIT License

Trimmer and stop word filter are missing from search pipelines #151

Open dhdaines opened 1 week ago

dhdaines commented 1 week ago

The trimmer and the stop word filter are not added to the search pipeline. This hurts recall whenever users include punctuation or stop words in their queries, because the query terms then no longer match the indexed terms.
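To make the mismatch concrete, here is a toy model in plain Python. It is not lunr.py's code: the `trim`, `index_terms` and `query_terms` helpers and the tiny stop word set are hypothetical, and stemming is left out because it applies on both sides anyway. It only sketches what happens when the indexing side trims punctuation and drops stop words while the query side does not.

```python
import re

# Tiny illustrative subset, not lunr's real stop word list.
STOP_WORDS = {"what", "is", "the", "that"}


def trim(token):
    # Strip leading and trailing non-word characters, e.g. "question!" -> "question".
    return re.sub(r"^\W+|\W+$", "", token)


def index_terms(text):
    # Indexing side: lowercase, trim punctuation, then drop stop words.
    trimmed = (trim(t) for t in text.lower().split())
    return {t for t in trimmed if t and t not in STOP_WORDS}


def query_terms(text, trim_and_filter=False):
    # Query side: by default the terms are only lowercased,
    # so punctuation and stop words survive.
    if trim_and_filter:
        return index_terms(text)
    return set(text.lower().split())


indexed = index_terms("That is the question!")
raw_query = query_terms("What is the question?")
fixed_query = query_terms("What is the question?", trim_and_filter=True)

print(indexed & raw_query)    # no overlap: "question?" != "question"
print(indexed & fixed_query)  # overlap on "question"
```

With the same processing on both sides the query term `question` lines up with the indexed term. Without it, `question?` never matches.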

Unfortunately, the omission is a matter of bug compatibility with lunr.js: https://github.com/olivernn/lunr.js/blob/aa5a878f62a6bba1e8e5b95714899e17e8150b38/lunr.js#L49

But the behaviour should be documented, along with a way to work around it. The workaround is pretty easy:

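from lunr import get_default_builder, stemmer, stop_word_filter, trimmer

# Mirror the indexing pipeline: trim and drop stop words before stemming
builder = get_default_builder()
builder.search_pipeline.before(stemmer.stemmer, trimmer.trimmer)
builder.search_pipeline.before(stemmer.stemmer, stop_word_filter.stop_word_filter)

dhdaines commented 1 week ago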

The lunr.js documentation is also incorrect on this point. For instance, https://lunrjs.com/docs/lunr.Pipeline.html says:

"An instance of lunr.Index created with the lunr shortcut will contain a pipeline with a stop word filter and an English language stemmer"

dhdaines commented 1 week ago

Here is a minimal example to show the problem, which I think you'll agree is pretty serious:

from lunr import lunr
index = lunr(
    ref="id",
    fields=["title", "body"],
    documents=[
        {"id": "1", "title": "To be or not to be?", "body": "That is the question!"}
    ],
)
print(index.search("What is the question?"))  # Should print something, but doesn't!