stemming in the analyzer chain vs reference corpus (affecting termex, glossex, weirdness)

The analyzer chain normalizes candidate terms by stemming or lemmatizing them. Depending on the stemmer or lemmatizer chosen, a candidate term may be transformed into a form that is not a real word. For example, "analysis" => "analysi".

This affects algorithms that look up word/term frequency in a reference corpus. The reference corpus must be processed with the same analyzer chain, or at least the same stemming/lemmatizing step. Otherwise, the normalized candidate terms may not match the words/terms in the reference corpus, causing unexpected results.
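
To make the mismatch concrete, here is a minimal Python sketch. It is not JATE code: `toy_stem`, `build_reference_counts` and `reference_frequency` are hypothetical names, and `toy_stem` is a deliberately crude stand-in for whatever stemmer the analyzer chain uses (it only strips a trailing "s"). The tiny token list is made up for illustration. The point is only that a stemmed candidate finds nothing in an unstemmed reference table, but does match once the reference corpus goes through the same normalization.

```python
from collections import Counter


def toy_stem(token):
    """Hypothetical stand-in for the analyzer chain's stemmer.

    Lowercases the token and strips one trailing "s" (but not "ss").
    This is NOT a real stemming algorithm; it only mimics the kind of
    over-stemming described above ("analysis" => "analysi").
    """
    token = token.lower()
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def build_reference_counts(tokens, normalize=None):
    """Count reference-corpus tokens, optionally normalizing each one first."""
    if normalize is None:
        return Counter(tokens)
    return Counter(normalize(t) for t in tokens)


def reference_frequency(candidate, ref_counts):
    """Look up a (normalized) candidate term; unseen terms count as 0."""
    return ref_counts.get(candidate, 0)


# Made-up reference corpus tokens, for illustration only.
reference_tokens = ["analysis", "analysis", "analyses", "terms", "term"]

# Reference table built WITHOUT the analyzer chain's normalization.
raw_counts = build_reference_counts(reference_tokens)
# Reference table built WITH the same normalization as the candidates.
stemmed_counts = build_reference_counts(reference_tokens, normalize=toy_stem)

# The candidate term as it comes out of the analyzer chain.
candidate = toy_stem("analysis")

mismatched = reference_frequency(candidate, raw_counts)    # stemmed vs raw
matched = reference_frequency(candidate, stemmed_counts)   # stemmed vs stemmed

print(candidate, mismatched, matched)
```

With the unstemmed table the lookup returns 0, so a frequency-ratio score would treat a common word as unseen in the reference corpus. With the stemmed table the same lookup finds the expected count.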

ziqizhang / jate

stemming in the analyzer chain vs reference corpus (affecting termex, glossex, weirdness) #24