zhigang-qi / lucene-hunspell

Automatically exported from code.google.com/p/lucene-hunspell

Interaction with LowerCaseFilterFactory #3

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Most dictionary words are lowercase, but the dictionaries also contain many
capitalized entries such as personal names.

As a result, stemming behaves differently depending on where lowercasing sits
in the filter chain.

If you apply LowerCaseFilterFactory before HunspellStemFilterFactory, you will
match both "dogs" and "Dogs" in the original text. But you will not match
proper names with endings, such as the Norwegian possessive: in "Christians
bil", "Christians" should be reduced to "Christian".

If, however, you apply LowerCaseFilterFactory after HunspellStemFilterFactory,
it is the other way around: you'll handle "Christians" but not "Dogs".
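
To make the two orderings concrete, here is a toy sketch in plain Java. It does not use the Lucene or Hunspell APIs: the two-word dictionary, the single "strip a trailing s" rule, and the names `OrderingDemo`, `lowerThenStem` and `stemThenLower` are all made up for illustration.

```java
import java.util.Locale;
import java.util.Set;

public class OrderingDemo {
    // Hypothetical dictionary: one common noun and one proper name,
    // cased the way they would appear in a .dic file.
    static final Set<String> DICT = Set.of("dog", "Christian");

    // Toy case-sensitive stemmer: strip a trailing "s" when the remainder
    // is a dictionary word, otherwise return the term unchanged.
    static String stem(String term) {
        if (DICT.contains(term)) {
            return term;
        }
        if (term.endsWith("s")) {
            String base = term.substring(0, term.length() - 1);
            if (DICT.contains(base)) {
                return base;
            }
        }
        return term;
    }

    // Models LowerCaseFilterFactory placed before the stemmer.
    static String lowerThenStem(String term) {
        return stem(term.toLowerCase(Locale.ROOT));
    }

    // Models LowerCaseFilterFactory placed after the stemmer.
    static String stemThenLower(String term) {
        return stem(term).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        for (String t : new String[] {"dogs", "Dogs", "Christians"}) {
            System.out.println(t + " -> lower first: " + lowerThenStem(t)
                    + ", stem first: " + stemThenLower(t));
        }
    }
}
```

In this model, lowercasing first leaves "Christians" unstemmed (as "christians"), while stemming first leaves "Dogs" unstemmed (as "dogs").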

Would it make sense to assume that .dic files should remain unchanged,
recommend putting LowerCaseFilterFactory after stemming, and then allow some
fallback logic inside HunspellStemFilterFactory: if no match is found, try to
look up the lowercase version of the term? This should be configurable with
fallbackLookupLowercase="true".
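
The fallback could look roughly like the sketch below. This is only a toy model in plain Java, not the actual filter code: the class `FallbackStemmer`, its `lookup` helper and the single "strip a trailing s" rule are hypothetical, and only the flag name `fallbackLookupLowercase` comes from the proposal above.

```java
import java.util.Locale;
import java.util.Set;

public class FallbackStemmer {
    private final Set<String> dictionary;
    private final boolean fallbackLookupLowercase;

    public FallbackStemmer(Set<String> dictionary, boolean fallbackLookupLowercase) {
        this.dictionary = dictionary;
        this.fallbackLookupLowercase = fallbackLookupLowercase;
    }

    // Case-sensitive lookup with one toy affix rule (strip a trailing "s").
    // Returns the stem, or null when nothing in the dictionary matches.
    private String lookup(String term) {
        if (dictionary.contains(term)) {
            return term;
        }
        if (term.endsWith("s")) {
            String base = term.substring(0, term.length() - 1);
            if (dictionary.contains(base)) {
                return base;
            }
        }
        return null;
    }

    // Try the term as written first; only if that fails, and the option is
    // enabled, retry with the lowercased term. Unmatched terms pass through.
    public String stem(String term) {
        String hit = lookup(term);
        if (hit == null && fallbackLookupLowercase) {
            hit = lookup(term.toLowerCase(Locale.ROOT));
        }
        return hit != null ? hit : term;
    }
}
```

With a dictionary containing "dog" and "Christian" and the fallback enabled, both "Dogs" and "Christians" get stemmed (to "dog" and "Christian"), and a LowerCaseFilterFactory placed afterwards can normalize the output case.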

Original issue reported on code.google.com by cominv...@gmail.com on 25 May 2010 at 12:04

GoogleCodeExporter commented 9 years ago
I noticed this in a lot of dictionary files too.

I like the idea of allowing Hunspell's dictionary matching to be either case
sensitive or not. Then you can have it work whichever way you want, by putting
the lowercase filter before or after the Hunspell filter and by setting this
case-sensitivity option to true or false.
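
As a rough illustration of such an option, here is a toy sketch in plain Java. Nothing here is the real filter: the class `CaseOptionStemmer`, the `caseSensitive` constructor flag and the single "strip a trailing s" rule are assumptions made for the example. It also uses `String.CASE_INSENSITIVE_ORDER`, which is not locale-aware, so a real implementation would probably need something more careful.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

public class CaseOptionStemmer {
    private final Set<String> dictionary;

    public CaseOptionStemmer(Collection<String> words, boolean caseSensitive) {
        if (caseSensitive) {
            // Exact matching: "christian" does not match the entry "Christian".
            this.dictionary = new HashSet<>(words);
        } else {
            // Case-insensitive matching via a comparator that ignores case.
            Set<String> ignoreCase = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
            ignoreCase.addAll(words);
            this.dictionary = ignoreCase;
        }
    }

    // Toy affix rule: strip a trailing "s" when the remainder is in the
    // dictionary. The result keeps the casing of the input term.
    public String stem(String term) {
        if (dictionary.contains(term)) {
            return term;
        }
        if (term.endsWith("s")) {
            String base = term.substring(0, term.length() - 1);
            if (dictionary.contains(base)) {
                return base;
            }
        }
        return term;
    }
}
```

With `caseSensitive` set to false, the lowercase filter can run before the stemmer and "christians" still reduces to "christian". With it set to true, the dictionary casing has to match the token exactly.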

Original comment by rcm...@gmail.com on 13 Jun 2010 at 10:06

GoogleCodeExporter commented 9 years ago
Added to Solr JIRA as https://issues.apache.org/jira/browse/SOLR-2792

Original comment by cominv...@gmail.com on 25 Sep 2011 at 12:14