wooorm / franc

Natural language detection
https://wooorm.com/franc/
MIT License

Question about language accuracy #40

Closed · palmerabollo closed this issue 7 years ago

palmerabollo commented 7 years ago

Is it normal that a sentence such as "show me my services" gets classified as Spanish ahead of English?

> franc.all('show me my services', {minLength:1, whitelist:['eng','spa']})
[ [ 'spa', 1 ], [ 'eng', 0.9074778200253486 ] ]

It looks weird to me, since tokens like "sh", "my"... and some letters ("w" or "y" in a sentence) are really uncommon in Spanish.

wooorm commented 7 years ago

Yup, it’s pretty normal for such a small text.

Statistically, the two matches are really close to each other. There’s just a slightly bigger chance of the text being Spanish than English. Check out the data behind the guessing here: https://github.com/wooorm/franc/blob/45482a80634f595c460c097b7347c9db190e6959/lib/data.json#L3-L4.
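
For what it’s worth, a quick way to see that this is a small-sample effect (an untested sketch; the longer sentence is made up for illustration): give franc more text and the two scores should separate.

var franc = require('franc');

// Same options as above, but a longer input: with more trigrams to
// compare against the models, English should pull clearly ahead.
franc.all('show me my services and tell me which ones are running', {
  minLength: 1,
  whitelist: ['eng', 'spa']
});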

> really uncommon in Spanish.

Interesting. The internal model only contains things that are common; there’s no data based on uncommon things. It’s an interesting idea to model that too, maybe it would help. The downside is that it would hugely increase the library’s memory use and size and slow it down, and that it’s impossible for many of the other languages franc supports, as there’s simply not enough data to model them!
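
A minimal sketch of that idea, assuming a hypothetical per-language set of known trigrams (the `models` data below is made up, not franc’s real tables): every input trigram a language’s model has never seen adds to a penalty, so "show me my services" would be punished for its un-Spanish trigrams.

// Toy data, for illustration only.
var models = {
  eng: new Set([' sh', 'sho', 'how', ' my', 'my ']),
  spa: new Set([' de', 'de ', ' la', 'la ', 'os '])
};

// Padded, lowercased three-character windows over the input.
function trigrams(value) {
  var padded = ' ' + value.toLowerCase() + ' ';
  var result = [];
  for (var index = 0; index < padded.length - 2; index++) {
    result.push(padded.slice(index, index + 3));
  }
  return result;
}

// Count the input trigrams that never occur in a language's model.
function penalty(value, language) {
  return trigrams(value).filter(function (gram) {
    return !models[language].has(gram);
  }).length;
}

penalty('show me my services', 'spa'); // high: many trigrams unseen in Spanish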

palmerabollo commented 7 years ago

Thanks @wooorm. One more question: how do you generate that data? For example, why is "| my |" not one of the English tokens?

wooorm commented 7 years ago

They’re all trigrams (sequences of three characters), and they originate from the Universal Declaration of Human Rights (the most widely translated document in the world). Bigrams (2) or other n-grams do not seem to give great results, especially because this library tries to stay small: it does so by using only the top 300 trigrams per language.
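
A rough sketch of how such ranked-trigram data can drive a guess, in the spirit of the classic Cavnar & Trenkle "out-of-place" measure (the helper names and the toy model below are illustrative, not franc’s internals):

// Toy model: a language's top trigrams, most frequent first.
// Real models hold 300 entries; this one is shortened for illustration.
var englishModel = ['the', ' th', 'he ', 'nd ', ' an'];

var MAX_DISTANCE = 300; // penalty for a trigram the model has never seen

// Padded, lowercased three-character windows over the input.
function trigrams(value) {
  var padded = ' ' + value.toLowerCase() + ' ';
  var result = [];
  for (var index = 0; index < padded.length - 2; index++) {
    result.push(padded.slice(index, index + 3));
  }
  return result;
}

// Rank the input's trigrams by frequency, most frequent first.
function rank(value) {
  var counts = {};
  trigrams(value).forEach(function (gram) {
    counts[gram] = (counts[gram] || 0) + 1;
  });
  return Object.keys(counts).sort(function (a, b) {
    return counts[b] - counts[a];
  });
}

// Out-of-place distance between input and model: lower is a better match.
function distance(value, model) {
  return rank(value).reduce(function (total, gram, index) {
    var position = model.indexOf(gram);
    return total + (position === -1 ? MAX_DISTANCE : Math.abs(index - position));
  }, 0);
}

distance('show me my services', englishModel);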

The scripts/ directory may also prove insightful :)