yeraydiazdiaz / lunr.py

A Python implementation of Lunr.js 🌖
http://lunr.readthedocs.io
MIT License

Cannot get some search results when typing the exact keyword match #112

Closed tristanlatr closed 2 years ago

tristanlatr commented 2 years ago

Hi @yeraydiazdiaz,

First, thanks for the latest release. I've almost finished the search system I was working on (you can check it out here: https://pydoctor.readthedocs.io/en/latest/api/index.html). But there is a strange behaviour that makes two apparently very similar queries return radically different results.

Basically, searching for "to_stan" gives the expected results, but searching for "to_node" gives nothing (same with "ParsedRstDocstring"). The keywords are in the index JSON file, and the search works as expected if I search for "to_node*" or "to_node~1" instead (weirdly, "ParsedRstDocstring*" works but "ParsedRstDocstring~1" doesn't). This issue probably happens with other objects too.

I've written a test file case here: https://github.com/twisted/pydoctor/blob/b3d70b10fbeaf04de452e62dd686665360229800/docs/tests/test.py#L173-L216

The code generating the index is here : https://github.com/twisted/pydoctor/blob/b3d70b10fbeaf04de452e62dd686665360229800/pydoctor/templatewriter/search.py

I don't understand this behaviour. It is the same if I remove the trimmer function from the index pipeline.

Any insight appreciated!

Thanks

yeraydiazdiaz commented 2 years ago

Hi @tristanlatr, I had a quick look and I believe the reason is that the search pipeline stems to_node to to_nod, which does not happen with to_stan:

>>> index.pipeline.run_string("to_node")
['to_nod']
>>> index.pipeline.run_string("to_stan")
['to_stan']
>>> index.pipeline.run_string("ParsedRstDocstring~1")
['ParsedRstDocstring~1']
>>> index.pipeline.run_string("ParsedRstDocstring")
['ParsedRstDocstr']

Removing the stemmer from the index pipeline with index.pipeline.remove(index.pipeline._stack[0]) makes all the tests pass. You should be able to remove it as well in your index creation code via builder.search_pipeline which should result in an empty list in the serialised index.

A potential caveat is that non-symbol searches would not yield the expected results since there is no stemming but looks like that's not a problem in your case.

Nice work on pydoctor btw, it works great 👍🏻

tristanlatr commented 2 years ago

Hum, I see. Thanks for the information.

Would it be possible to make the stemmer keep the original word as well as the stemmed word? I guess I would need to customise the JavaScript stemmer as well?

Otherwise I’ll just disable the stemmer for the index of names and enable it for the index with docstrings.

thanks for your reply

yeraydiazdiaz commented 2 years ago

The search pipeline is run on the search terms, and its output is then used to match the terms in the inverted index, so there is no concept of 'keeping' the original word: the stemmed term is used immediately. If you want both cases, I don't think you can avoid making two searches and combining the results.
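As a rough illustration of "two searches and combining the results", here is a minimal sketch. It assumes each search returns a list of dicts with "ref" and "score" keys; merge_results is a hypothetical helper, not part of lunr.py, and the hit lists are made-up data. Scores from two different indices are not strictly comparable, so keeping the higher score per ref is only one possible policy.

```python
def merge_results(*result_lists):
    """Merge several result lists, keeping the best-scoring hit per ref."""
    best = {}
    for results in result_lists:
        for hit in results:
            ref = hit["ref"]
            # Keep whichever hit for this ref has the higher score.
            if ref not in best or hit["score"] > best[ref]["score"]:
                best[ref] = hit
    # Highest score first, as a search result list would normally be ordered.
    return sorted(best.values(), key=lambda hit: hit["score"], reverse=True)


# Made-up results, standing in for a stemmed search and an exact (unstemmed) one.
stemmed_hits = [{"ref": "a", "score": 1.0}, {"ref": "b", "score": 0.5}]
exact_hits = [{"ref": "b", "score": 2.0}, {"ref": "c", "score": 0.2}]

merged = merge_results(stemmed_hits, exact_hits)
print([hit["ref"] for hit in merged])
```

A document found by both searches appears once in the merged list, so exact symbol matches and stemmed natural-language matches can share one result page.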

If you have two indices, one with names and another with docstrings, then the easiest approach is to disable the stemmer in the names one. You can disable it in lunr.py before serialisation so that it is not loaded in lunr.js, or you could use Pipeline.remove in JavaScript after the index is loaded.

tristanlatr commented 2 years ago

Thanks. Closing this issue.

I’ve simply removed the stemmer from the search pipeline.

tpederson commented 8 months ago

Removing the stemmer from the index pipeline with index.pipeline.remove(index.pipeline._stack[0]) makes all the tests pass. You should be able to remove it as well in your index creation code via builder.search_pipeline which should result in an empty list in the serialised index.

What is the proper way to remove the stemmer? Would this work, or do I need to remove it from (only) the main pipeline?

from lunr import lunr, get_default_builder, stemmer

builder = get_default_builder()
builder.search_pipeline.remove(stemmer)
idx = lunr(ref="page", fields=["text"], documents=documents, builder=builder)
serialized_idx = idx.serialize()

A potential caveat is that non-symbol searches would not yield the expected results since there is no stemming but looks like that's not a problem in your case.

Sorry, I am new to this. What does this mean?

yeraydiazdiaz commented 8 months ago

That's close. You need to change line 4 to builder.search_pipeline.remove(stemmer.stemmer), because stemmer refers to the module, so remove will not do anything. That silent no-op is incorrect and should raise an exception (I've opened #143 to track this).

You would need to remove the stemmer from both the index pipeline and the search pipeline. Otherwise the index will contain stemmed words that would not match the search terms, as they're not being stemmed by the search pipeline.

Sorry, I am new to this. What does this mean?

I mean that if you disable stemming everywhere and index documents with natural language, similar words will fail to show up in searches. For example, if the document contains 'flying' and you search for 'fly', the document will not show up in the results because 'flying' would not have been stemmed. The search term would have to be exactly 'flying'.
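To make the trade-off concrete, here is a toy sketch in plain Python. It does not use lunr.py: toy_stem, build_index and search are hypothetical stand-ins, and the "stemmer" only strips a trailing 'ing'. It shows the three cases discussed above: stemming on both sides, stemming on neither, and a stemmed index queried without stemming.

```python
def toy_stem(token):
    """Crude stand-in for a real stemmer: strips a trailing 'ing'."""
    if token.endswith("ing") and len(token) > 5:
        return token[:-3]
    return token


def build_index(documents, pipeline):
    """Map each processed token to the set of document refs containing it."""
    index = {}
    for ref, text in documents.items():
        for token in text.lower().split():
            for step in pipeline:
                token = step(token)
            index.setdefault(token, set()).add(ref)
    return index


def search(index, term, search_pipeline):
    """Run the search pipeline on the term, then look it up in the index."""
    term = term.lower()
    for step in search_pipeline:
        term = step(term)
    return index.get(term, set())


documents = {"doc1": "birds flying south"}
stemmed_index = build_index(documents, [toy_stem])  # stores 'fly'
plain_index = build_index(documents, [])            # stores 'flying'

# Stemming on both sides: 'fly' finds the document.
both = search(stemmed_index, "fly", [toy_stem])
# No stemming anywhere: only the exact word matches.
none_fly = search(plain_index, "fly", [])
none_flying = search(plain_index, "flying", [])
# Stemmed index, unstemmed query: the exact word no longer matches.
mismatch = search(stemmed_index, "flying", [])
```

The last case is why the stemmer has to come out of both pipelines together: the index and the query must be processed the same way for an exact term to line up.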