Open Mifrill opened 3 years ago
Yes, there can be two or more tokenized forms corresponding to a single surface form. In fact, the default dictionary file does contain a few entries having such alternatives. However, I thought it would be more convenient to get the result as a single tokenized form rather than an array. This decision may not sound very thoughtful--especially if you have your own dict files that have many entries with multiple tokenized forms. Still, I would like to keep it this way so as not to break the existing code of many users. Thanks anyway!
I would like to keep it this way so as not to break the existing code of many users.
well if it is a not quite expected behavior which we can improve then, maybe we can think about release with this kind of "breaking" change? Because thing like: "break the existing code of many users" should be guarded by the gem version.
The logic
w, s = line.split(/\s+/)
compute only for 2 first matches even for cases with 3 matcheshttps://github.com/yohasebe/lemmatizer/blob/af70f993b22c631c08d636e89082a7f6e0e80b48/lib/lemmatizer/lemmatizer.rb#L123-L129
For example:
https://github.com/yohasebe/lemmatizer/blob/af70f993b22c631c08d636e89082a7f6e0e80b48/lib/dict/noun.exc#L2046
The word
zemindari
is out of the compute range, is it a bug?