tommedema closed this issue 8 years ago
Interesting! There are two things at hand here:
First, Scottish over English: That's just because the sample is short, and the chance of it being Scottish is slightly higher than English. If you pass the Latin-script part on its own to franc, Scottish should also be franc's primary guess. If you'd like to avoid this, either use a prebuilt version of franc with fewer languages, or build your own (see README).
Second, the multiple scripts: instead of testing the input against all possible languages, franc first determines the most-used script, and then only checks against languages that use that script. I don't think it's franc's task to support multi-language and multi-script documents. That's a complexity I didn't account for, and should be handled by another tool IMHO. You could write a tool which uses the regexes in franc (see the unicode-7.0 package) to extract the different runs of text. Then, pass those through franc.
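A rough sketch of that splitting idea, in case it helps. This is not franc's own code: it uses JavaScript's built-in Unicode property escapes (`\p{Script=...}`) rather than the unicode-7.0 regexes, the script list is a small illustrative subset, and `splitByScript`, `scriptOf`, and `detectPerRun` are hypothetical names. The `detect` argument is an assumption too: any function that takes a string and returns a language code.

```javascript
// Illustrative subset of scripts; extend as needed.
const SCRIPTS = {
  Latin: /\p{Script=Latin}/u,
  Han: /\p{Script=Han}/u,
  Cyrillic: /\p{Script=Cyrillic}/u,
  Arabic: /\p{Script=Arabic}/u,
};

// Return the name of the first listed script matching the character,
// or null for neutral characters (spaces, digits, punctuation, other scripts).
function scriptOf(char) {
  for (const [name, pattern] of Object.entries(SCRIPTS)) {
    if (pattern.test(char)) return name;
  }
  return null;
}

// Split text into consecutive runs that each contain a single script.
function splitByScript(text) {
  const runs = [];
  let current = null;
  for (const char of text) { // for...of iterates by code point
    const script = scriptOf(char);
    if (script === null) {
      // Neutral characters stay with the current run, if there is one.
      if (current) current.text += char;
      continue;
    }
    if (!current || current.script !== script) {
      current = { script, text: '' };
      runs.push(current);
    }
    current.text += char;
  }
  return runs.map((run) => ({ script: run.script, text: run.text.trim() }));
}

// Run a caller-supplied detector on each run separately.
function detectPerRun(text, detect) {
  return splitByScript(text).map((run) => ({
    ...run,
    language: detect(run.text),
  }));
}
```

Assuming franc is installed and exposes a function that takes a string, something like `detectPerRun(input, franc)` should then give one guess per run instead of one guess for the dominant script only. Short runs will still be unreliable, for the reason given in the first point.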
Those two problems shouldn't be related. Please reopen this issue if I'm incorrect (I'm on holiday with slow Internet).
Thanks.
I understand both points and certainly agree with number one not being an issue.
However, since Franc "detects language of text", and texts rarely contain only a single language, I'd argue that the probability factor returned by Franc should be influenced by all of the text given, and not just a fraction of it. This seems to be inside the scope of Franc, i.e. detecting the language of text.
If you disagree I'll look into your suggestion of running franc multiple times for different parts of my input text.
For anyone with similar interests: I resorted to using node-cld instead, which returns results as expected.
Run the following snippet:
The result is:
Obviously, this is invalid. Also, there is not a single occurrence of `cmn` in the results list.