wooorm / franc

Natural language detection
https://wooorm.com/franc/
MIT License

Issue in detecting English #38

Closed prasadKodeInCloud closed 7 years ago

prasadKodeInCloud commented 7 years ago

Hi, I found that language detection for basic English sentences is poor. For example: var lan = franc.all( "I am not good at detecting languages." )

result: [ [ "dan", 1 ], [ "pam", 0.9966273187183811 ], [ "cat", 0.9858347386172007 ], [ "tpi", 0.9021922428330522 ], [ "nob", 0.8954468802698146 ], [ "tgl", 0.8671163575042158 ], [ "swe", 0.8526138279932547 ], [ "nno", 0.8094435075885329 ], [ "eng", 0.8084317032040472 ], [ "ind", 0.7925801011804384 ], [ "afr", 0.7895446880269814 ], [ "bcl", 0.7736930860033727 ], [ "jav", 0.7602023608768971 ], [ "ace", 0.742327150084317 ], [ "hil", 0.736593591905565 ], [ "ceb", 0.736256323777403 ], [ "lav", 0.7251264755480606 ], [ "hms", 0.7234401349072512 ], [ "tzm", 0.7234401349072512 ], [ "bug", 0.6934232715008432 ], [ "sco", 0.6664418212478921 ], [ "fra", 0.6657672849915683 ], [ "ban", 0.6620573355817876 ], [ "min", 0.6590219224283305 ], [ "deu", 0.6586846543001686 ], [ "ssw", 0.6344013490725127 ], [ "nld", 0.6259696458684654 ], [ "sun", 0.6236087689713322 ], [ "mos", 0.6145025295109612 ], [ "aka", 0.6040472175379427 ], [ "wol", 0.5854974704890388 ], [ "ilo", 0.5517706576728499 ], [ "war", 0.5450252951096122 ], [ "bem", 0.5386172006745362 ], [ "glg", 0.5365935919055649 ], [ "tiv", 0.5342327150084317 ], [ "src", 0.5338954468802698 ], [ "mad", 0.5258010118043845 ], [ "ckb", 0.5204047217537943 ], [ "nso", 0.5166947723440135 ], [ "run", 0.512310286677909 ], [ "uzn", 0.5119730185497471 ], [ "toi", 0.5089376053962901 ], [ "bci", 0.500168634064081 ], [ "nds", 0.49409780775716694 ], [ "tsn", 0.478920741989882 ], [ "als", 0.47858347386172007 ], [ "por", 0.47386172006745364 ], [ "tso", 0.47082630691399663 ], [ "spa", 0.4674536256323777 ], [ "sot", 0.466441821247892 ], [ "bam", 0.45834738617200677 ], [ "nya", 0.457672849915683 ], [ "lit", 0.45059021922428333 ], [ "rmn", 0.4499156829679595 ], [ "ndo", 0.44957841483979766 ], [ "tuk", 0.4458684654300169 ], [ "nyn", 0.4441821247892074 ], [ "snk", 0.44215851602023604 ], [ "kin", 0.4411467116357505 ], [ "uig", 0.4404721753794266 ], [ "ron", 0.4300168634064081 ], [ "zul", 0.4269814502529511 ], [ "emk", 0.42495784148397975 ], [ "lun", 
0.42495784148397975 ], [ "nhn", 0.4215851602023609 ], [ "rmy", 0.41787521079258005 ], [ "hat", 0.41483979763912315 ], [ "ita", 0.41483979763912315 ], [ "ewe", 0.41180438448566614 ], [ "xho", 0.4101180438448566 ], [ "yao", 0.40775716694772346 ], [ "sna", 0.40067453625632377 ], [ "umb", 0.39932546374367617 ], [ "knc", 0.3942664418212479 ], [ "cjk", 0.3942664418212479 ], [ "kng", 0.39291736930860033 ], [ "hun", 0.3709949409780776 ], [ "plt", 0.37032040472175376 ], [ "kde", 0.36998313659359194 ], [ "som", 0.3595278246205733 ], [ "suk", 0.3591905564924115 ], [ "quy", 0.35750421585160197 ], [ "tur", 0.3534569983136594 ], [ "snn", 0.35143338954468806 ], [ "swh", 0.35109612141652613 ], [ "epo", 0.3504215851602024 ], [ "lug", 0.34974704890387853 ], [ "quz", 0.3490725126475548 ], [ "gaa", 0.34839797639123105 ], [ "men", 0.3463743676222597 ], [ "kmb", 0.34569983136593596 ], [ "ces", 0.3365935919055649 ], [ "dip", 0.33524451939291733 ], [ "est", 0.3349072512647555 ], [ "ayr", 0.33423271500843166 ], [ "hau", 0.3247892074198988 ], [ "dyu", 0.3163575042158516 ], [ "lin", 0.31365935919055654 ], [ "bin", 0.30826306913996626 ], [ "gax", 0.3032040472175379 ], [ "sag", 0.2930860033726813 ], [ "srp", 0.29072512647554805 ], [ "lua", 0.2897133220910624 ], [ "vmw", 0.28364249578414835 ], [ "vie", 0.2789207419898819 ], [ "ibb", 0.23440134907251264 ], [ "azj", 0.2249578414839798 ], [ "pol", 0.2236087689713322 ], [ "bos", 0.2165261382799325 ], [ "slk", 0.20674536256323772 ], [ "hrv", 0.2020236087689713 ], [ "qug", 0.19999999999999996 ], [ "tem", 0.19999999999999996 ], [ "ada", 0.18549747048903875 ], [ "slv", 0.18111298482293425 ], [ "fin", 0.1615514333895447 ], [ "kbp", 0.15210792580101185 ], [ "ibo", 0.13929173693086006 ], [ "yor", 0.127150084317032 ], [ "fon", 0.1183811129848229 ] ]
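To make the output above easier to read, here is a minimal sketch that looks up where a given language lands in a franc.all() result. It uses only the first nine candidates copied from the result above; `rankOf` and `scoreOf` are hypothetical helpers written for this illustration and are not part of franc.

```javascript
// Excerpt of the franc.all() output quoted above (first nine candidates only).
const results = [
  ['dan', 1],
  ['pam', 0.9966273187183811],
  ['cat', 0.9858347386172007],
  ['tpi', 0.9021922428330522],
  ['nob', 0.8954468802698146],
  ['tgl', 0.8671163575042158],
  ['swe', 0.8526138279932547],
  ['nno', 0.8094435075885329],
  ['eng', 0.8084317032040472]
];

// Hypothetical helper: 1-based rank of a language code, or -1 if absent.
function rankOf(candidates, code) {
  const index = candidates.findIndex(([lang]) => lang === code);
  return index === -1 ? -1 : index + 1;
}

// Hypothetical helper: score of a language code, or undefined if absent.
function scoreOf(candidates, code) {
  const entry = candidates.find(([lang]) => lang === code);
  return entry ? entry[1] : undefined;
}

console.log(rankOf(results, 'eng'));  // 9
console.log(scoreOf(results, 'eng')); // 0.8084317032040472
```

In this report English is the ninth candidate, with a score of about 0.81 against 1 for Danish.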

prasadKodeInCloud commented 7 years ago

Another example: var lan = franc.all( "School is a bad place for children" ). Result: sco: 1
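In both reports the top candidate scores exactly 1, which suggests the scores are relative to the best match rather than an absolute confidence. Under that assumption, one way to spot shaky detections is to compare the top score with the runner-up. The sketch below does this on the first three candidates from the earlier result; `topMargin`, `isAmbiguous`, and the 0.05 threshold are hypothetical and chosen only for illustration.

```javascript
// First three candidates from the franc.all() output quoted earlier in the thread.
const candidates = [
  ['dan', 1],
  ['pam', 0.9966273187183811],
  ['cat', 0.9858347386172007]
];

// Hypothetical helper: gap between the best and second-best score.
// Returns Infinity when there is no runner-up to compare against.
function topMargin(list) {
  if (list.length < 2) return Infinity;
  return list[0][1] - list[1][1];
}

// Hypothetical helper: treat a detection as ambiguous when the gap is small.
// The 0.05 default is an arbitrary illustration, not a value taken from franc.
function isAmbiguous(list, minMargin = 0.05) {
  return topMargin(list) < minMargin;
}

console.log(topMargin(candidates).toFixed(4)); // 0.0034
console.log(isAmbiguous(candidates));          // true
```

A caller could fall back to a default language, or ask for more text, whenever the margin is this small.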

wooorm commented 7 years ago

Duplicate of GH-27, GH-23, GH-16, and GH-8.

prasadKodeInCloud commented 7 years ago

OK, I am using the whitelist and blacklist options now. But it would still be better to improve the language detection logic. There is no open issue in the repo regarding this, and that is misleading.

Thanks!
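As a rough illustration of what restricting the candidate set does to the ranking, the sketch below filters a handful of candidates taken from the first result in this thread. `restrictCandidates` is a hypothetical helper that filters an existing result list; it is not franc's implementation, and franc itself may apply these lists differently internally. The option names mirror the `whitelist` option used later in the thread, and `blacklist` is assumed from the comment above.

```javascript
// A selection of candidates from the franc.all() output quoted earlier.
const candidates = [
  ['dan', 1],
  ['pam', 0.9966273187183811],
  ['cat', 0.9858347386172007],
  ['eng', 0.8084317032040472],
  ['fra', 0.6657672849915683],
  ['deu', 0.6586846543001686]
];

// Hypothetical helper: keep only whitelisted codes (if any are given)
// and drop blacklisted ones. Order and scores are left unchanged.
function restrictCandidates(list, { whitelist = [], blacklist = [] } = {}) {
  return list.filter(([code]) => {
    if (whitelist.length > 0 && !whitelist.includes(code)) return false;
    return !blacklist.includes(code);
  });
}

const allowed = restrictCandidates(candidates, { whitelist: ['eng', 'fra', 'deu'] });
console.log(allowed[0][0]); // eng

const denied = restrictCandidates(candidates, { blacklist: ['dan', 'pam', 'cat'] });
console.log(denied[0][0]); // eng
```

With the three higher-scoring candidates excluded either way, English becomes the top match for this sentence.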

wooorm commented 7 years ago

Great.

I spent ages on this, and you get it for free. I maintain this project. I went out of my way to get it to be MIT licensed. You can basically do anything you want with the code.

Create a PR for the docs / the algorithm if you want to contribute.

prasadKodeInCloud commented 7 years ago

No offence. If you maintain an open source library that has issues, you should not just close the issues. I really appreciate your work. But this does not work even when using only two languages, so "This is not a trade-off: accuracy vs. amount of supported languages". I cannot accept any of the duplicate issues you mentioned as a reason to close this issue. var matches = franc.all( 'I am not good at detecting languages.What is the best solution for this?' , { 'whitelist' : ['dan','en'] });
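One detail worth checking in the call above: the results earlier in the thread identify languages with three-letter codes such as dan and eng, while the whitelist passes 'en'. Assuming franc matches whitelist entries against those three-letter codes, 'en' would match nothing and Danish would be the only remaining candidate. A minimal sketch of a sanity check follows; `suspectCodes` is a hypothetical helper that only tests the shape of each code and does not verify that the code is actually supported.

```javascript
// Hypothetical helper: return the entries that are not three lowercase letters.
// This checks the shape only; it does not confirm that a code exists in franc.
function suspectCodes(codes) {
  return codes.filter((code) => !/^[a-z]{3}$/.test(code));
}

console.log(suspectCodes(['dan', 'en']));  // [ 'en' ]
console.log(suspectCodes(['dan', 'eng'])); // []
```

Running the same detection with 'eng' in place of 'en' would be the fairer test of the two-language case.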