weixsong / elasticlunr.js

Based on lunr.js, but more flexible and customized.
http://elasticlunr.com
MIT License

Stop words in hyphenated compounds #121

Open FrankKooij opened 3 years ago

FrankKooij commented 3 years ago

When two words are used together to yield a new meaning, a compound is formed. Compound words can be written in three ways: as open compounds (spelled as two words, e.g., ice cream), closed compounds (joined to form a single word, e.g., doorknob), or hyphenated compounds (two words joined by a hyphen, e.g., long-term). Sometimes, more than two words can form a compound (e.g., mother-in-law). (source: https://www.grammarly.com/blog/open-and-closed-compound-words)

If a word in a hyphenated compound is a stop word, elasticlunr ignores it when searching, which makes the search less specific: it finds results not only for the compound itself, but also for the compound with all stop words removed. Taking one of the examples above with the default stop words, a search for mother-in-law becomes a search for mother and law, since in is a stop word, so the list of results may be much longer than for mother-in-law alone. A search for would-be, however, returns no results at all, since both parts of this compound are stop words.
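The behavior described above can be reproduced with a minimal sketch that does not use elasticlunr itself. It assumes the tokenizer splits on whitespace and hyphens and that stop words are dropped afterwards; `STOP_WORDS`, `tokenize`, and `searchTerms` are hypothetical names for illustration, and the stop word list is only a tiny subset of the real defaults.

```javascript
// Illustrative subset only; the real default stop word list is much longer.
const STOP_WORDS = new Set(['in', 'would', 'be', 'is', 'the']);

// Assumption: tokens are produced by splitting on whitespace and hyphens,
// then lowercased. Empty strings are discarded.
function tokenize(text) {
  return text.toLowerCase().split(/[\s-]+/).filter(Boolean);
}

// The terms that actually reach the index lookup once stop words are removed.
function searchTerms(query) {
  return tokenize(query).filter((token) => !STOP_WORDS.has(token));
}

console.log(searchTerms('mother-in-law')); // [ 'mother', 'law' ]
console.log(searchTerms('would-be'));      // []
```

Under these assumptions, mother-in-law degrades to two independent terms and would-be degrades to an empty query, which matches the two symptoms reported.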

blackholeearth commented 8 months ago

elasticlunr.clearStopWords();

This will remove all stop words.

Since TF-IDF already lowers the score of the most common words, you don't need to worry about them.
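To make the TF-IDF point concrete, here is a small worked example of inverse document frequency. The smoothed formula used is a common variant chosen for illustration and is an assumption; it is not necessarily the exact formula elasticlunr applies, and the document counts are made up.

```javascript
// Assumption: a smoothed IDF variant, 1 + ln(N / (df + 1)).
// totalDocs = number of documents in the index,
// docFreq   = number of documents containing the term.
function idf(totalDocs, docFreq) {
  return 1 + Math.log(totalDocs / (docFreq + 1));
}

// Hypothetical corpus of 1000 documents.
const commonWord = idf(1000, 900); // a term found in 900 documents
const rareWord = idf(1000, 5);     // a term found in only 5 documents

console.log(commonWord < rareWord); // true
```

A term that appears in almost every document gets a weight close to 1, while a rare term gets a much larger weight, so very common words contribute little to the final score even when they are indexed.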

FrankKooij commented 8 months ago

I would like to see the score of compound word matches raised. If you are searching for mother-in-law, an exact compound match should score higher than "mother is in the house studying law" (all parts of the compound word are present, but not as a compound) or "if you ask the lawyer mother, law comes first" (the same, but with the stop words removed).
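One possible workaround is to re-rank the results after the search. The sketch below is not an elasticlunr feature; `boostExactCompound` is a hypothetical helper, and it assumes the results are an array of `{ ref, score }` objects and that the original documents are available in a lookup keyed by `ref`. The scores and the boost factor are invented for the example.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical post-processing step: multiply the score of every result whose
// text contains the compound exactly as written, then sort by score again.
function boostExactCompound(results, docsByRef, field, compound, boost = 2) {
  // The compound must not be glued to other word characters or hyphens.
  const pattern = new RegExp(
    '(^|[^\\w-])' + escapeRegExp(compound) + '($|[^\\w-])',
    'i'
  );
  return results
    .map((r) => {
      const doc = docsByRef[r.ref];
      const text = (doc && doc[field]) || '';
      return pattern.test(text) ? { ref: r.ref, score: r.score * boost } : r;
    })
    .sort((a, b) => b.score - a.score);
}

// Made-up documents and scores, using the sentences from this comment.
const docs = {
  a: { body: 'My mother-in-law visits on Sunday.' },
  b: { body: 'Mother is in the house studying law.' },
  c: { body: 'If you ask the lawyer mother, law comes first.' },
};
const results = [
  { ref: 'b', score: 1.2 },
  { ref: 'a', score: 1.0 },
  { ref: 'c', score: 0.9 },
];

const ranked = boostExactCompound(results, docs, 'body', 'mother-in-law');
console.log(ranked.map((r) => r.ref)); // [ 'a', 'b', 'c' ]
```

Re-ranking only reorders documents the search already returned, so it does not help with a query such as would-be, where every term is a stop word and nothing is returned in the first place.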