ziqizhang / jate

NEWS: JATE2.0 Beta.11 Released, see details below.
GNU Lesser General Public License v3.0

stemming in the analyzer chain vs reference corpus (affecting termex, glossex, weirdness) #24

Closed ziqizhang closed 8 years ago

ziqizhang commented 8 years ago

The analyzer chain normalizes candidate terms by stemming or lemmatizing. Depending on the choice of stemmer/lemmatizer, candidate terms may be transformed incorrectly. For example, "analysis" => "analysi".

This affects algorithms that look up word/term frequency in a reference corpus (termex, glossex, weirdness). The reference corpus must be processed with the same analyzer chain, or at least with the same stemmer/lemmatizer. Otherwise, the normalized candidate terms may not match the entries in the reference corpus, causing unexpected results.
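
To illustrate the mismatch, here is a minimal Python sketch. It is not JATE code: `toy_stem` is a hypothetical stand-in for whatever stemmer the analyzer chain uses, and `build_reference_counts` and the tiny token list are made up for the example. It only shows that a stemmed candidate gets a frequency of zero when the reference corpus was not normalized the same way.

```python
from collections import Counter


def toy_stem(token: str) -> str:
    """Hypothetical stand-in for a real stemmer: lowercase, strip one trailing 's'."""
    token = token.lower()
    if len(token) > 3 and token.endswith("s"):
        return token[:-1]
    return token


def build_reference_counts(tokens, normalize=None):
    """Count reference tokens, optionally applying the same normalizer as the candidates."""
    if normalize is None:
        # Reference corpus only lowercased, not stemmed.
        return Counter(t.lower() for t in tokens)
    return Counter(normalize(t) for t in tokens)


# Made-up reference corpus tokens, for illustration only.
reference_tokens = ["analysis", "analysis", "terms", "term", "corpus"]

# The candidate term as it comes out of the (toy) analyzer chain.
candidate = toy_stem("analysis")  # "analysi"

raw_counts = build_reference_counts(reference_tokens)
stemmed_counts = build_reference_counts(reference_tokens, normalize=toy_stem)

mismatched_freq = raw_counts[candidate]    # 0: "analysi" never occurs in the raw reference
matched_freq = stemmed_counts[candidate]   # 2: both sides were normalized the same way

print(candidate, mismatched_freq, matched_freq)
```

With a zero reference frequency, any score that compares domain frequency against reference frequency will be skewed for that term, which is the kind of unexpected result described above.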