yohasebe / lemmatizer

Lemmatizer for text in English. Inspired by Python's nltk.corpus.reader.wordnet.morphy
MIT License
108 stars 15 forks source link

The exceptions collection does not include full list #14

Open Mifrill opened 3 years ago

Mifrill commented 3 years ago

The logic w, s = line.split(/\s+/) compute only for 2 first matches even for cases with 3 matches

https://github.com/yohasebe/lemmatizer/blob/af70f993b22c631c08d636e89082a7f6e0e80b48/lib/lemmatizer/lemmatizer.rb#L123-L129

For example:

https://github.com/yohasebe/lemmatizer/blob/af70f993b22c631c08d636e89082a7f6e0e80b48/lib/dict/noun.exc#L2046

      open_file(exc) do |io|
        io.each_line do |line|
          w, s = line.split(/\s+/)
          if line =~ /zamin/
            puts line
            puts w
            puts s
          end
          @exceptions[pos][w] ||= []
          @exceptions[pos][w] << s
        end
      end

# => 
# zamindaris zamindari zemindari
# zamindaris
# zamindari

The word zemindari is out of the compute range, is it a bug?

yohasebe commented 3 years ago

Yes, there can be two or more tokenized forms corresponding to a single surface form. In fact, the default dictionary file does contain a few entries having such alternatives. However, I thought it would be more convenient to get the result as a single tokenized form rather than an array. This decision may not sound very thoughtful--especially if you have your own dict files that have many entries with multiple tokenized forms. Still, I would like to keep it this way so as not to break the existing code of many users. Thanks anyway!

Mifrill commented 3 years ago

I would like to keep it this way so as not to break the existing code of many users.

well if it is a not quite expected behavior which we can improve then, maybe we can think about release with this kind of "breaking" change? Because thing like: "break the existing code of many users" should be guarded by the gem version.