BUG: Basic tests show that franc is extremely inaccurate

wooorm / franc

Natural language detection

https://wooorm.com/franc/

MIT License


Issue #86

Closed by niftylettuce 4 years ago

niftylettuce commented 4 years ago

> franc.all('Hola amiga', { only: [ 'eng', 'spa', 'por', 'ita', 'fra' ] })
[
  [ 'spa', 1 ],
  [ 'ita', 0.9323770491803278 ],
  [ 'fra', 0.5942622950819672 ],
  [ 'por', 0.5368852459016393 ],
  [ 'eng', 0 ]
]
> franc.all('Hola mi amiga', { only: [ 'eng', 'spa', 'por', 'ita', 'fra' ] })
[
  [ 'ita', 1 ],
  [ 'spa', 0.6840958605664488 ],
  [ 'fra', 0.6318082788671024 ],
  [ 'por', 0.08714596949891062 ],
  [ 'eng', 0 ]
]
> franc.all('Ciao amico!', { only: [ 'eng', 'spa', 'por', 'ita', 'fra' ] })
[
  [ 'spa', 1 ],
  [ 'por', 0.9940758293838863 ],
  [ 'ita', 0.9170616113744076 ],
  [ 'eng', 0.6232227488151658 ],
  [ 'fra', 0.46563981042654023 ]
]

These are all completely incorrect accuracies.

niftylettuce commented 4 years ago

Perhaps using Wikimedia word dictionary would be a better dataset for accuracy.

niftylettuce commented 4 years ago

Perhaps an error should be thrown if the string length doesn't reach a minimum number of characters, perhaps 200? Not sure if you've figured out what that magic number is.

niftylettuce commented 4 years ago

For insight into my comment on the Wikimedia dataset, there are basically downloadable tarballs of entire dictionaries of every language, which also includes topics/people/etc.

wooorm commented 4 years ago

There have been many issues about this, see my responses to most closed issues. It’s in the readme: https://github.com/wooorm/franc#whats-not-so-cool-about-franc

> Perhaps an error should be thrown if the string length doesn't reach a minimum number of characters

See the example in the readme: pass minLength: 200.

> Not sure if you've figured out what that magic number is.

There is none: this is data model based, there is no perfect answer. There is just “likeliness”.
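
To illustrate what "likeliness" means in practice, here is a minimal sketch that treats a narrow gap between the top two candidates as undetermined. It assumes the sorted `[code, score]` pair format shown in the outputs at the top of this thread. `pickWithMargin` and the `0.1` threshold are hypothetical and not part of franc.

```javascript
// Hypothetical helper, not part of franc: accept the top candidate only
// when it beats the runner-up by at least `minMargin`.
// Assumes `results` is sorted by score, highest first.
function pickWithMargin(results, minMargin = 0.1) {
  if (results.length === 0) return 'und'
  if (results.length === 1) return results[0][0]
  const [[topCode, topScore], [, runnerUpScore]] = results
  return topScore - runnerUpScore >= minMargin ? topCode : 'und'
}

// Scores copied from the first two outputs in this thread.
const holaAmiga = [
  ['spa', 1],
  ['ita', 0.9323770491803278],
  ['fra', 0.5942622950819672],
  ['por', 0.5368852459016393],
  ['eng', 0]
]
const holaMiAmiga = [
  ['ita', 1],
  ['spa', 0.6840958605664488],
  ['fra', 0.6318082788671024],
  ['por', 0.08714596949891062],
  ['eng', 0]
]

console.log(pickWithMargin(holaAmiga)) // gap of about 0.07, so 'und'
console.log(pickWithMargin(holaMiAmiga)) // gap of about 0.32, so 'ita'
```

A margin only flags close calls. The second input still comes back as `ita` with a wide gap, so this does not make short inputs reliable.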

> Perhaps using Wikimedia word dictionary would be a better dataset for accuracy.

There is no bigger copyright-free dataset than the universal declaration of human rights. Franc focusses on supporting many languages. Check out CLD-based projects if you care less about many languages.

niftylettuce commented 4 years ago

Awesome, it wasn't obvious at first that "und" meant undefined/not found. Might be useful to add this to the README and make more of an example.

wooorm commented 4 years ago

How about this?

 var franc = require('franc')

 franc('Alle menslike wesens word vry') // => 'afr'
 franc('এটি একটি ভাষা একক IBM স্ক্রিপ্ট') // => 'ben'
 franc('Alle menneske er fødde til fridom') // => 'nno'
-franc('') // => 'und'
-franc('the') // => 'und'
-
-/* You can change what’s too short (default: 10): */
-franc('the', {minLength: 3}) // => 'sco'
+
+// You can change what’s too short (default: 10):
+franc('the') // => 'und' (`und` is a language code which stands for undetermined)
+franc('the', {minLength: 3}) // => 'sco'
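
Since `und` is an ordinary return value rather than an error, a caller who wants the throwing behaviour requested earlier in this thread could wrap the call. A minimal sketch: `detectOrThrow` and `stubDetect` are hypothetical names, and the stub only mimics the "too short returns `und`" rule (default 10, as in the snippet above) so the example runs without installing anything.

```javascript
// Hypothetical wrapper, not part of franc: turn the 'und' result into an error.
// `detect` is any function with the (text, options) => code shape used above.
function detectOrThrow(detect, text, options) {
  const code = detect(text, options)
  if (code === 'und') {
    throw new RangeError(
      'Language could not be determined for: ' + JSON.stringify(text)
    )
  }
  return code
}

// Stand-in detector for illustration only. It performs no real detection:
// it returns 'und' for input shorter than minLength and 'eng' otherwise.
function stubDetect(text, options = {}) {
  const minLength = options.minLength === undefined ? 10 : options.minLength
  return text.length < minLength ? 'und' : 'eng'
}

try {
  detectOrThrow(stubDetect, 'the')
} catch (error) {
  console.log(error.name) // RangeError
}
```

With franc installed, passing `franc` in place of `stubDetect` together with `{minLength: 200}` should give the stricter behaviour, assuming the call shape shown in the snippet above.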

niftylettuce commented 4 years ago

👍 👍 👍

wooorm commented 4 years ago

Sweet, fixed!