wooorm / franc

Natural language detection
https://wooorm.com/franc/
MIT License

English and Chinese mixed text results in invalid scottish match with 100% probability #21

Closed tommedema closed 8 years ago

tommedema commented 8 years ago

Run the following snippet:

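    // Assumes a CommonJS release of franc that exposes `franc.all`.
    var franc = require('franc');

    var text  = 'That man is the richest whose pleasure are the cheapest. 能处处寻求快乐的人才是最富有的人。— 梭罗';
    var langs = franc.all(text);

    // Log every candidate language with its score, best match first.
    console.log(langs);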

The result is:

    [ [ 'sco', 1 ],
      [ 'eng', 0.9541225122770742 ],
      [ 'src', 0.7208581028689585 ],
      [ 'rmn', 0.7191780821917808 ],
      [ 'nds', 0.7121995347635048 ],
      [ 'ron', 0.6665805117601448 ],
      [ 'hat', 0.665158955802533 ],
      [ 'ita', 0.6585681054536056 ],
      [ 'als', 0.6544326699405532 ],
      [ 'fra', 0.6509433962264151 ],
      [ 'yao', 0.6367278366502973 ],
      [ 'ayr', 0.627681571465495 ],
      [ 'por', 0.6112690617730681 ],
      [ 'afr', 0.608942879296976 ],
      [ 'est', 0.6075213233393641 ],
      [ 'tzm', 0.6062289997415353 ],
      [ 'deu', 0.6039028172654433 ],
      [ 'bug', 0.6032566554665288 ],
      [ 'glg', 0.6000258464719566 ],
      [ 'nld', 0.5965365727578186 ],
      [ 'bin', 0.595890410958904 ],
      [ 'pam', 0.5922719048849832 ],
      [ 'ace', 0.5916257430860687 ],
      [ 'nso', 0.586585681054536 ],
      [ 'mad', 0.5864564486947532 ],
      [ 'nhn', 0.5861979839751874 ],
      [ 'sna', 0.5823210131817007 ],
      [ 'nno', 0.5753424657534247 ],
      [ 'run', 0.5721116567588524 ],
      [ 'cat', 0.5708193331610235 ],
      [ 'epo', 0.5692685448436288 ],
      [ 'ban', 0.569139312483846 ],
      [ 'min', 0.5682346859653657 ],
      [ 'snn', 0.5650038769707935 ],
      [ 'tiv', 0.5580253295425175 ],
      [ 'kin', 0.5569914706642543 ],
      [ 'tpi', 0.5568622383044715 ],
      [ 'tgl', 0.555052985267511 ],
      [ 'spa', 0.5547945205479452 ],
      [ 'gax', 0.553889894029465 ],
      [ 'quz', 0.5494959937968467 ],
      [ 'bci', 0.5478159731196692 ],
      [ 'war', 0.546911346601189 ],
      [ 'ibo', 0.5448436288446628 ],
      [ 'quy', 0.5403204962522616 ],
      [ 'jav', 0.5383820108555182 ],
      [ 'sot', 0.5377358490566038 ],
      [ 'tsn', 0.5373481519772552 ],
      [ 'snk', 0.5356681313000775 ],
      [ 'qug', 0.5339881106229 ],
      [ 'dip', 0.5324373223055052 ],
      [ 'dan', 0.5317911605065908 ],
      [ 'uig', 0.5306280692685448 ],
      [ 'bcl', 0.5273972602739726 ],
      [ 'ckb', 0.5252003101576634 ],
      [ 'hil', 0.5226156629620057 ],
      [ 'ilo', 0.5213233393641767 ],
      [ 'ndo', 0.5201602481261307 ],
      [ 'nya', 0.5160248126130783 ],
      [ 'tur', 0.5104678211424141 ],
      [ 'plt', 0.5089170328250194 ],
      [ 'ceb', 0.5064616179891445 ],
      [ 'aka', 0.5054277591108813 ],
      [ 'nob', 0.5045231325924011 ],
      [ 'ibb', 0.5036185060739209 ],
      [ 'emk', 0.5001292323597829 ],
      [ 'ind', 0.4957353321271647 ],
      [ 'sun', 0.4927629878521582 ],
      [ 'tem', 0.4919875936934608 ],
      [ 'ada', 0.4919875936934608 ],
      [ 'mos', 0.488239855259757 ],
      [ 'kde', 0.488239855259757 ],
      [ 'hau', 0.48216593434996124 ],
      [ 'rmy', 0.4797105195140863 ],
      [ 'hms', 0.47777203411734304 ],
      [ 'fuc', 0.4771258723184285 ],
      [ 'hun', 0.4768674075988627 ],
      [ 'ewe', 0.47389506332385634 ],
      [ 'bam', 0.47118118376841556 ],
      [ 'suk', 0.47066425432928405 ],
      [ 'uzn', 0.4685965365727578 ],
      [ 'tuk', 0.4609718273455673 ],
      [ 'lav', 0.4608425949857844 ],
      [ 'fin', 0.4605841302662187 ],
      [ 'pol', 0.4604548979064358 ],
      [ 'lit', 0.45993796846730417 ],
      [ 'som', 0.45838718014990953 ],
      [ 'xho', 0.4569656241922978 ],
      [ 'azj', 0.45463944171620574 ],
      [ 'vmw', 0.45076247092271904 ],
      [ 'bem', 0.45024554148358753 ],
      [ 'knc', 0.44339622641509435 ],
      [ 'swh', 0.44313776169552854 ],
      [ 'lin', 0.441457741018351 ],
      [ 'vie', 0.44029464978030497 ],
      [ 'ces', 0.44003618506073916 ],
      [ 'toi', 0.43874386146291033 ],
      [ 'zul', 0.4377100025846472 ],
      [ 'slk', 0.43473765830964073 ],
      [ 'ssw', 0.4340914965107263 ],
      [ 'cjk', 0.4334453347118118 ],
      [ 'gaa', 0.43254070819333157 ],
      [ 'men', 0.43228224347376587 ],
      [ 'srp', 0.4302145257172396 ],
      [ 'kbp', 0.4256913931248385 ],
      [ 'bos', 0.42401137244766085 ],
      [ 'lua', 0.4210390281726545 ],
      [ 'lun', 0.41664512794003616 ],
      [ 'hrv', 0.41250969242698377 ],
      [ 'tso', 0.40759886275523394 ],
      [ 'sag', 0.4073403980356681 ],
      [ 'slv', 0.40462651848022746 ],
      [ 'nyn', 0.40372189196174724 ],
      [ 'wol', 0.4025588007237012 ],
      [ 'fon', 0.4011372447660895 ],
      [ 'yor', 0.39622641509433965 ],
      [ 'swe', 0.3900232618247609 ],
      [ 'kng', 0.38097699663995865 ],
      [ 'umb', 0.37645386404755754 ],
      [ 'lug', 0.36495218402688034 ],
      [ 'kmb', 0.3509950891703283 ] ]

Obviously, this is wrong. Also, there is not a single occurrence of cmn in the results list.

wooorm commented 8 years ago

Interesting! There are two things at play here:

First, Scottish over English: that's just because the sample is short, and the chance of it being Scottish is slightly higher than the chance of it being English. If you pass the Latin-script part on its own to franc, Scottish should also be franc's primary guess. If you'd like to avoid this, either use a prebuilt version of franc with fewer languages, or build your own (see README).

Second, the multiple scripts: instead of testing the input against all possible languages, franc first determines the most-used script, and then only checks against languages that use that script. Latin is the most-used script in your sample, so cmn is never considered. I don't think it's franc's task to support multi-language and multi-script documents. That's a complexity I didn't account for, and it should be handled by another tool, IMHO. You could write a tool that uses the regexes in franc (see the unicode-7.0 package) to extract the different runs of text, then pass each run through franc.
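A minimal sketch of that idea, for readers who want to try it. It is not part of franc. It assumes a runtime with Unicode property escapes (`\p{Script=...}`), which it uses in place of the expressions shipped with franc, and it only knows about two scripts. The names `SCRIPTS`, `splitByScript`, and `detectRuns` are made up for this example, and `detect` stands for any function that maps a string to a language code (franc itself, if its default export is such a function in your version).

```javascript
// Scripts to separate. Extend this table as needed; the two entries here
// are only enough for the English/Chinese sample in this issue.
var SCRIPTS = {
  Latin: /\p{Script=Latin}/u,
  Han: /\p{Script=Han}/u
};

// Return the name of the script a character belongs to, or null for
// "neutral" characters (whitespace, punctuation, digits, other scripts).
function scriptOf(char) {
  for (var name in SCRIPTS) {
    if (SCRIPTS[name].test(char)) return name;
  }
  return null;
}

// Split text into consecutive runs that each contain a single script.
// Neutral characters stay with the run in progress; neutral characters
// before the first run are dropped.
function splitByScript(text) {
  var runs = [];
  var current = null;
  for (var char of text) {
    var script = scriptOf(char);
    if (script !== null && (!current || current.script !== script)) {
      current = {script: script, text: ''};
      runs.push(current);
    }
    if (current) current.text += char;
  }
  return runs.map(function (run) {
    return {script: run.script, text: run.text.trim()};
  });
}

// Run a caller-supplied detector (hypothetical `detect`) on every run.
function detectRuns(text, detect) {
  return splitByScript(text).map(function (run) {
    return {script: run.script, text: run.text, language: detect(run.text)};
  });
}
```

With the sample text from this issue, `splitByScript` should yield one Latin run and one Han run, so each can be handed to the detector separately. Note that short runs are still subject to the first point above: a single English sentence may well come back as `sco`.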

Those two problems shouldn't be related. Please reopen this issue if I'm wrong (I'm on holiday with slow internet).

tommedema commented 8 years ago

Thanks.

I understand both points and certainly agree with number one not being an issue.

However, since franc "detects language of text", and texts rarely contain only a single language, I'd argue that the probability factor returned by franc should be influenced by all of the text given, and not just a fraction of it. This seems to be within the scope of franc, i.e. detecting the language of text.

If you disagree I'll look into your suggestion of running franc multiple times for different parts of my input text.
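One way the results of several runs could be combined into a single list, sketched under assumptions: each part of the input has already been detected separately, giving objects with a `text` and a `language` field (both names are made up for this example), and each language is weighted by the share of characters it covers. This is not something franc does; it only mimics the `[code, score]` shape of the output above.

```javascript
// Combine per-part detections into one list of [language, share] pairs,
// sorted from largest share to smallest. `detections` is an array of
// {text, language} objects (hypothetical shape, see the note above).
function weightByLength(detections) {
  var total = 0;
  var lengths = {};

  detections.forEach(function (detection) {
    var size = detection.text.length; // UTF-16 code units, good enough here
    total += size;
    lengths[detection.language] = (lengths[detection.language] || 0) + size;
  });

  return Object.keys(lengths)
    .map(function (language) {
      // Guard against empty input parts, where total would be 0.
      return [language, total ? lengths[language] / total : 0];
    })
    .sort(function (a, b) {
      return b[1] - a[1];
    });
}
```

Weighting by length is a deliberately crude choice: it says how much of the text each language covers, not how confident the detector was about each part.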

tommedema commented 8 years ago

For anyone with similar interests: I resorted to using node-cld instead, which returns results as expected.