wooorm / franc

Natural language detection
https://wooorm.com/franc/
MIT License

Some Chinese sentences are detected as Japanese #84

Open kewang opened 4 years ago

kewang commented 4 years ago

sentence 1

特別推薦的必訪店家「ヤマシロヤ」,雖然不在阿美橫町上,但就位於JR上野站廣小路口對面

jpn 1
Google Translate correctly detects this as Chinese.

sentence 2

特別推薦的必訪店家,雖然不在阿美橫町上,但就位於JR上野站廣小路口對面

cmn 1
Google Translate correctly detects this as Chinese.

Sentence 1 consists almost entirely of Chinese characters and contains only 5 katakana characters, yet it is incorrectly detected as jpn.

Sentence 2 consists entirely of Chinese characters and is correctly detected as cmn.

This may be related to #77.
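
To make the character mix in the two sentences concrete, here is a minimal sketch that counts Han, katakana, and hiragana characters using Unicode property escapes. `countScripts` is a hypothetical helper written for illustration; it is not part of franc and does not reproduce how franc scores scripts.

```javascript
// Count how many characters of `text` belong to each script of interest.
// `countScripts` is a hypothetical helper for illustration, not franc's API.
function countScripts(text) {
  const patterns = {
    han: /\p{Script=Han}/gu,
    katakana: /\p{Script=Katakana}/gu,
    hiragana: /\p{Script=Hiragana}/gu
  }
  const counts = {}
  for (const [name, pattern] of Object.entries(patterns)) {
    // `match` returns null when nothing matches, so fall back to [].
    counts[name] = (text.match(pattern) || []).length
  }
  return counts
}

const sentence1 = '特別推薦的必訪店家「ヤマシロヤ」,雖然不在阿美橫町上,但就位於JR上野站廣小路口對面'
const sentence2 = '特別推薦的必訪店家,雖然不在阿美橫町上,但就位於JR上野站廣小路口對面'

console.log(countScripts(sentence1)) // { han: 31, katakana: 5, hiragana: 0 }
console.log(countScripts(sentence2)) // { han: 31, katakana: 0, hiragana: 0 }
```

The only difference between the two sentences is the 5 katakana characters of the shop name, which is a small share of the text next to 31 Han characters.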

wooorm commented 4 years ago

Thanks. I don’t read, write, or speak Japanese or Chinese, so I can’t really help. PRs like GH-77 are welcome!

kewang commented 4 years ago

Hi @wooorm, @the-worldly-monkey

From https://www.unicode.org/faq/han_cjk.html#4 (How can I recognize from the 32 bit value of a Unicode character if this is a Chinese, Korean or Japanese character?)

A better solution is to look at the text as a whole: if there's a fair amount of kana, it's probably Japanese, and if there's a fair amount of hangul, it's probably Korean.

Based on that FAQ, I will add some extra rules to getTopScript(value, scripts) for detecting CJK sentences.
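
A rough sketch of what such a rule could look like, following the "fair amount of kana / hangul" idea from the FAQ. Everything here is an assumption for illustration: `guessCjkLanguage` is a hypothetical standalone function (not franc's `getTopScript`), and the 0.2 threshold is an arbitrary, untuned value standing in for "a fair amount".

```javascript
// Hypothetical threshold: minimum share of kana (or hangul) among all CJK
// characters before the text is called Japanese (or Korean). The value 0.2
// is an assumption chosen for illustration, not a tuned constant.
const KANA_OR_HANGUL_THRESHOLD = 0.2

// Hypothetical helper, not part of franc: look at the text as a whole and
// pick a language code from the mix of Han, kana, and hangul characters.
function guessCjkLanguage(text, threshold = KANA_OR_HANGUL_THRESHOLD) {
  const count = (pattern) => (text.match(pattern) || []).length
  const han = count(/\p{Script=Han}/gu)
  const kana = count(/[\p{Script=Hiragana}\p{Script=Katakana}]/gu)
  const hangul = count(/\p{Script=Hangul}/gu)
  const total = han + kana + hangul

  if (total === 0) return undefined // no CJK characters at all
  if (hangul / total >= threshold) return 'kor'
  if (kana / total >= threshold) return 'jpn'
  // Mostly Han: treated as Chinese here; this sketch does not try to
  // distinguish between Chinese languages.
  return 'cmn'
}

// Sentence 1 from this issue: 5 kana out of 36 CJK characters (about 14%),
// which is below the assumed threshold.
console.log(
  guessCjkLanguage('特別推薦的必訪店家「ヤマシロヤ」,雖然不在阿美橫町上,但就位於JR上野站廣小路口對面')
) // 'cmn'
```

Using a ratio rather than the mere presence of kana means a Chinese sentence that quotes a Japanese name is not flipped to jpn by a handful of characters. The right threshold would need to be checked against real samples.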

niftylettuce commented 4 years ago

@kewang PR would be great on this!!