Closed palmerabollo closed 7 years ago
Yup, it’s pretty normal for such a small text.
Statistically, the two matches are really close to each other. There’s just a slightly bigger chance of the text being Spanish rather than English. Check out the data behind the guessing here: https://github.com/wooorm/franc/blob/45482a80634f595c460c097b7347c9db190e6959/lib/data.json#L3-L4.
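To make “close matches” concrete, here’s a minimal sketch of franc-style rank-distance scoring. The profiles here are hypothetical toy data, not the real `data.json` values: each language is a ranked list of its most frequent trigrams, the input is ranked the same way, and the score is the summed rank differences (lower = better match). With a text this short, two languages can easily land near each other.

```javascript
// Cut a text into space-padded trigrams.
function trigrams(value) {
  const clean = ' ' + value.toLowerCase().replace(/[^a-z\s]/g, '').trim() + ' ';
  const out = [];
  for (let i = 0; i < clean.length - 2; i++) out.push(clean.slice(i, i + 3));
  return out;
}

// Map each trigram to its frequency rank (0 = most frequent).
function rank(grams) {
  const counts = {};
  for (const g of grams) counts[g] = (counts[g] || 0) + 1;
  const ranks = {};
  Object.keys(counts)
    .sort((a, b) => counts[b] - counts[a])
    .forEach((g, i) => { ranks[g] = i; });
  return ranks;
}

// Sum of rank differences between the input and a language profile;
// trigrams unknown to the profile get the maximum penalty.
function distance(inputRanks, profile) {
  const MAX = 300; // mirrors a top-300 trigram model
  let total = 0;
  for (const g in inputRanks) {
    total += g in profile ? Math.abs(inputRanks[g] - profile[g]) : MAX;
  }
  return total;
}
```

Scoring `'show me my services'` against two profiles then just means comparing `distance(rank(trigrams(text)), profileA)` with `distance(..., profileB)` and picking the smaller one.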
> really uncommon in spanish.
Interesting. The internal model only contains things that are common; there’s no data on uncommon things. Modelling those too is an interesting idea, and maybe it would help. The downside is that it would increase the memory, size, and runtime of the library by a huge factor, and that it’s impossible for many of the other languages supported by franc, as there’s not enough data to model them!
Thanks @wooorm. One more question: how do you generate that data? For example, why is "| my |" not one of the english tokens?
They’re all trigrams (sequences of three characters), and originate from the Universal Declaration of Human Rights (the document translated into the most languages in the world). Bigrams (2) and other n-grams don’t seem to get great results, especially because this library tries to stay small: it does so by using only the top 300 trigrams per language.
The scripts/ directory may provide some insight :)
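To illustrate why a token like "| my |" can’t appear in the data: every token is exactly three characters, with spaces marking word boundaries. A minimal sketch of that cutting step (an assumption about the preprocessing, not franc’s exact code):

```javascript
// Space-pad the text so word boundaries become part of the trigrams,
// then slide a 3-character window over it.
function trigrams(value) {
  const clean = ' ' + value.toLowerCase().trim() + ' ';
  const out = [];
  for (let i = 0; i < clean.length - 2; i++) out.push(clean.slice(i, i + 3));
  return out;
}

trigrams('my'); // → [' my', 'my ']
```

So the word “my” contributes the trigrams `' my'` and `'my '` — a five-character unit like "| my |" simply can’t exist in a trigram model.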
Is it normal that a sentence such as "show me my services" gets classified as spanish before english?
It looks weird to me, since tokens like "sh" and "my", and some letters ("w" or "y" in a sentence), are really uncommon in spanish.