wooorm / franc

Natural language detection
https://wooorm.com/franc/
MIT License

Fix incorrect results for Japanese #77

Closed lorumic closed 5 years ago

lorumic commented 5 years ago

Since the Japanese regex included only Hiragana and Katakana, any input that contained kana but consisted of more than 50% Han characters was mistakenly detected as Chinese.

This fix is necessary because many Japanese texts (especially technical ones) have a kanji ratio higher than 50%. The detect method counts the occurrences of matching expressions (characters, in this case), so for Japanese it matched only kana (< 50% of total characters) and returned Chinese as the top language. But if a text contains even a single kana, it should be detected as Japanese.
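The counting behaviour described above can be sketched roughly as follows. This is a simplified illustration, not franc's actual code: the `expressions` table, the `occurrence` and `topScript` helpers, and the sample sentence are all hypothetical, and Unicode script properties stand in for franc's real expressions.

```javascript
// Hypothetical, simplified stand-ins for the per-language expressions.
// Before the fix, Japanese is matched by kana only.
const expressions = {
  cmn: /\p{Script=Han}/gu,
  jpn: /[\p{Script=Hiragana}\p{Script=Katakana}]/gu
};

// Share of the characters in `value` that `expression` matches.
function occurrence(value, expression) {
  const total = Array.from(value).length;
  const matches = value.match(expression);
  return total === 0 ? 0 : (matches ? matches.length : 0) / total;
}

// Return the [code, share] pair with the highest share.
function topScript(value) {
  let best = [undefined, 0];
  for (const [code, expression] of Object.entries(expressions)) {
    const share = occurrence(value, expression);
    if (share > best[1]) best = [code, share];
  }
  return best;
}

// Nine kanji and one hiragana particle: kana covers only 10% of the text,
// so the Han-only expression wins.
console.log(topScript("東京都は人口密度最高")); // [ 'cmn', 0.9 ]
```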

A trivial example:

持込申請最悪状態完了

contains only Han characters, so I would agree with it being detected as Chinese. But if I insert the hiragana particle は as follows:

持込申請は最悪状態完了

it should be detected as Japanese, because it is a mostly correct (if abbreviated) Japanese sentence. Instead, franc reports it as Chinese, because about 90% of the characters are Han and only about 10% are kana.

To prevent that, I have added the regex for Japanese kanji to the Japanese expression, which in my opinion is cleaner than throwing in the whole Han block, so that kanji are also matched when detecting Japanese. In the examples above, the first case would still be (correctly) detected as Chinese, while the second would score 90% for Chinese and 100% for Japanese, and hence be correctly detected as Japanese.
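A minimal sketch of the effect of the fix, under several assumptions: the `scripts` table, `score`, and `detectScript` are hypothetical names, the sample sentences are illustrative, and `\p{Script=Han}` is used as a stand-in for the kanji range added in this PR (the exact range is not reproduced here). The sketch also assumes that a tie is resolved in favour of the entry listed first, which is what keeps Han-only input classified as Chinese.

```javascript
// Hypothetical expressions after the fix: Japanese now also matches kanji.
// \p{Script=Han} is an assumption standing in for the PR's kanji range.
const scripts = [
  ["cmn", /\p{Script=Han}/gu],
  ["jpn", /[\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Han}]/gu]
];

// Share of the characters in `value` that `expression` matches.
function score(value, expression) {
  const total = Array.from(value).length;
  const matches = value.match(expression);
  return total === 0 ? 0 : (matches ? matches.length : 0) / total;
}

// Highest-scoring entry; on a tie the earlier entry is kept (assumption).
function detectScript(value) {
  let top = { code: "und", score: 0 };
  for (const [code, expression] of scripts) {
    const current = score(value, expression);
    if (current > top.score) top = { code, score: current };
  }
  return top;
}

// Han only: both expressions match every character, Chinese is kept.
console.log(detectScript("東京都人口密度最高").code); // cmn
// One kana added: Chinese drops to 90%, Japanese stays at 100%.
console.log(detectScript("東京都は人口密度最高").code); // jpn
```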