Open thorn0 opened 4 years ago
That’s a good idea, it’s similar to how Google works!
However, I don’t think it should be so “black and white”, as “the letter ы is not available in Bulgarian or Macedonian” should still be matched as English.
We could do something with a special character list that enhances scores of certain scripts?
I remember there is a Turkish i variant that isn’t used anywhere else as well, forgot what it was tho
The dotless i (ı) is used not only in Turkish. Other languages whose alphabets are based on the Turkish alphabet have it too. E.g. Azerbaijani and Crimean Tatar.
We could do something with a special character list that enhances scores of certain scripts?
Scripts like Latin, Cyrillic, etc.? You meant languages, not scripts then, right?
It's not only a matter of which characters the alphabet has, it's also about which ones it doesn't. In "Чекаю цієї хвилини.", there are 5 letters that aren't in the Uzbek alphabet. It's 31% of all the letters in the string. In no way should Uzbek get the highest ranking in such a situation.
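The "share of letters outside the alphabet" measure can be sketched as follows. This is only an illustration: the `SKETCH_ALPHABETS` table and the `foreignLetterShare` helper are hypothetical and not part of franc, and the alphabets are hardcoded by hand for just two languages.

```javascript
// Hypothetical, hand-written alphabets (lowercase only). Not part of franc.
const SKETCH_ALPHABETS = {
  rus: new Set('абвгдеёжзийклмнопрстуфхцчшщъыьэюя'),
  bul: new Set('абвгдежзийклмнопрстуфхцчшщъьюя'),
};

// Returns the fraction of letters in `text` that are missing from the
// alphabet of `lang`. Punctuation, digits and whitespace are ignored.
function foreignLetterShare(text, lang) {
  const alphabet = SKETCH_ALPHABETS[lang];
  if (!alphabet) {
    throw new Error('No alphabet data for ' + lang);
  }
  const letters = Array.from(text.toLowerCase()).filter((ch) =>
    /\p{L}/u.test(ch)
  );
  if (letters.length === 0) {
    return 0;
  }
  const foreign = letters.filter((ch) => !alphabet.has(ch));
  return foreign.length / letters.length;
}

console.log(foreignLetterShare('Что это за язык?', 'bul'));
console.log(foreignLetterShare('Что это за язык?', 'rus'));
```

For "Что это за язык?", 2 of the 12 letters (э and ы) are missing from the Bulgarian set and none are missing from the Russian set. A nonzero share could then be used to penalize or exclude a candidate language.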
@wooorm Do you happen to know a programmatic way to get the alphabet (the set of used characters) for a given language?
I think it’s vague what even an alphabet is, but I did find this list on Wikipedia: https://en.wikipedia.org/wiki/Wikipedia:Language_recognition_chart. Interesting stuff!
Franc supports the most languages possible, as it uses the biggest training set (UDHR). It’s designed to not discriminate against languages with few speakers, and I can see how adding a feature such as this would (because there is no data about alphabets for lots of languages).
There are projects that focus on fewer languages and do things like what you’re proposing. Have you looked at https://github.com/CLD2Owners/cld2?
I thought I saw something on the Unicode site where, for each character, there was information about which languages use it, but now I can't find it.
I think it’s vague what even an alphabet is
Right. Some characters sometimes aren't considered separate letters of the alphabet (e.g. umlauts in German), etc. That's why I wrote "alphabet (the set of used characters)".
I don’t think there’s an automated way to do it.
I think it could be possible to either do it character-based, e.g., like so:
"э": {
  "bul": -3,
  "mkd": -3,
  "rus": 3,
  "bel": 3
  // ...or so
}
Or based on n-grams/regexes:
"tje$": [["nld", 2]]
"^z": [["nld", 1]]
But this is an error-prone and “soft” approach, compared to the current “hard” data model.
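A rough sketch of how the character-based variant could be applied on top of a result list shaped like the output of `franc.all` (an array of `[language, score]` pairs). Everything here is an assumption for illustration: `CHARACTER_WEIGHTS`, `WEIGHT_STEP` and `adjustScores` are hypothetical names, the weight for ы is made up to mirror the one for э, and how a weight maps onto franc's 0..1 scores is an arbitrary choice.

```javascript
// Hypothetical weights in the shape suggested above. Not real franc data.
const CHARACTER_WEIGHTS = {
  'э': { bul: -3, mkd: -3, rus: 3, bel: 3 },
  'ы': { bul: -3, mkd: -3, rus: 3, bel: 3 },
};

// Assumption: one weight unit shifts a score by 0.05.
const WEIGHT_STEP = 0.05;

// `results` is an array of [language, score] pairs.
// Returns a new array with adjusted scores, sorted from best to worst.
function adjustScores(text, results) {
  const bonus = {};
  for (const ch of text.toLowerCase()) {
    const weights = CHARACTER_WEIGHTS[ch];
    if (!weights) continue;
    for (const [lang, weight] of Object.entries(weights)) {
      bonus[lang] = (bonus[lang] || 0) + weight * WEIGHT_STEP;
    }
  }
  return results
    .map(([lang, score]) => [lang, score + (bonus[lang] || 0)])
    .sort((a, b) => b[1] - a[1]);
}

// Scores reported in this thread for "Что это за язык?".
const reported = [
  ['bul', 1],
  ['rus', 0.938953488372093],
  ['mkd', 0.9353197674418605],
];
console.log(adjustScores('Что это за язык?', reported));
```

With these made-up weights, Russian moves ahead of Bulgarian and Macedonian for that sentence. The adjusted values are no longer normalized to the 0..1 range, which is one of the ways this approach is “softer” than the current model.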
An alternative idea is to look at the TRY field in hunspell dictionaries.
E.g., the Russian dictionary defines:
TRY оаитенрсвйлпкьыяудмзшбчгщюжцёхфэъАВСМКГПТЕИЛФНДОЭРЗЮЯБХЖШЦУЧЬЫЪЩЙЁ
And Macedonian:
TRY аеоинвтрслпкудмзбчгјшцњжфхќџѓљѕѐѝАЕОИНВТРСЛПКУДМЗБЧГЈШЦЊЖФХЌЏЃЉЅЀЍ-’!.
These are already mostly ordered from frequent to infrequent.
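As a sketch of how the TRY field could be turned into a character set, assuming the contents of a hunspell `.aff` file are already available as a string: the `parseTryCharacters` helper is hypothetical, and the two inputs below are just the TRY lines quoted above.

```javascript
// Extracts the set of lowercase letters listed on the TRY line of a
// hunspell .aff file. Non-letters such as "-" or "." are dropped.
function parseTryCharacters(affSource) {
  const line = affSource
    .split(/\r?\n/)
    .find((candidate) => candidate.startsWith('TRY '));
  if (!line) {
    return new Set();
  }
  const characters = Array.from(line.slice(4).trim().toLowerCase());
  return new Set(characters.filter((ch) => /\p{L}/u.test(ch)));
}

const russianAff =
  'TRY оаитенрсвйлпкьыяудмзшбчгщюжцёхфэъАВСМКГПТЕИЛФНДОЭРЗЮЯБХЖШЦУЧЬЫЪЩЙЁ';
const macedonianAff =
  'TRY аеоинвтрслпкудмзбчгјшцњжфхќџѓљѕѐѝАЕОИНВТРСЛПКУДМЗБЧГЈШЦЊЖФХЌЏЃЉЅЀЍ-’!.';

const russianLetters = parseTryCharacters(russianAff);
const macedonianLetters = parseTryCharacters(macedonianAff);

console.log(russianLetters.has('э'), macedonianLetters.has('э'));
console.log(russianLetters.has('ѕ'), macedonianLetters.has('ѕ'));
```

Because `Set` keeps insertion order, the rough frequent-to-infrequent ordering of the TRY line is preserved when iterating over the result.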
Found it! http://cldr.unicode.org/translation/-core-data/exemplars
Letter frequency is an important thing too, but on the other hand letters that are unique to some language are often infrequent in it. E.g. ѕ (Cyrillic) in Macedonian and є in Ukrainian.
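Letters that are unique to one language can be derived mechanically once alphabets are available. A minimal sketch, assuming a hand-written table of four alphabets (`CYRILLIC_SAMPLE` and `exclusiveLetters` are hypothetical names): "unique" here only means unique among the languages in the table, so a letter such as і would stop being exclusive as soon as, say, Belarusian were added.

```javascript
// Hypothetical, hand-written alphabets (lowercase only).
const CYRILLIC_SAMPLE = {
  rus: 'абвгдеёжзийклмнопрстуфхцчшщъыьэюя',
  ukr: 'абвгґдеєжзиіїйклмнопрстуфхцчшщьюя',
  bul: 'абвгдежзийклмнопрстуфхцчшщъьюя',
  mkd: 'абвгдѓежзѕијклљмнњопрстќуфхцчџш',
};

// For each language, lists the letters that appear in no other alphabet
// of the given table.
function exclusiveLetters(alphabets) {
  const owners = new Map();
  for (const [lang, letters] of Object.entries(alphabets)) {
    for (const ch of new Set(letters)) {
      if (!owners.has(ch)) owners.set(ch, []);
      owners.get(ch).push(lang);
    }
  }
  const result = {};
  for (const lang of Object.keys(alphabets)) {
    result[lang] = [];
  }
  for (const [ch, langs] of owners) {
    if (langs.length === 1) result[langs[0]].push(ch);
  }
  return result;
}

console.log(exclusiveLetters(CYRILLIC_SAMPLE));
```

In this table, ѕ comes out as exclusive to Macedonian and є as exclusive to Ukrainian, while Bulgarian has no exclusive letters at all, which shows why the absence of letters matters as much as their presence.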
@thorn0 Is this something you’d be interested to work on?
It's unlikely I'll have time for this any time soon.
@thorn0 @wooorm I would put a $50 bug bounty on this payable by PayPal if anyone had the time!
Setup:
const franc = require('franc');

const text = 'Was wäre, wenn ich heute mal mit ja antworten würde?';
const options = {
  only: ['deu', 'eng', 'fra', 'ita', 'jav', 'spa', 'nld', 'pol', 'por', 'rus', 'zho']
};

console.log(franc.all(text, options));
The result is:
[
  [ 'nld', 1 ],
  [ 'deu', 0.9141248240262787 ],
  [ 'jav', 0.679962458939465 ],
  [ 'fra', 0.651337400281558 ],
  [ 'spa', 0.631628343500704 ],
  [ 'eng', 0.5945565462224307 ],
  [ 'por', 0.49835757860159546 ],
  [ 'ita', 0.4084936649460347 ],
  [ 'pol', 0.3779915532613797 ]
]
This result should be impossible: the text contains the German special characters 'ä' and 'ü', and there are no such letters in Dutch.
Using the dictionaries and checking the alphabet as mentioned above might solve this, making franc a really mighty tool. So upvote from me on this.
I remember there is a Turkish i variant that isn’t used anywhere else as well, forgot what it was tho
@wooorm Yes, ı and İ are specific to Turkish.
"Что это за язык?" is a Russian sentence, which is detected as Bulgarian (bul 1, rus 0.938953488372093, mkd 0.9353197674418605). However, neither Bulgarian nor Macedonian has the letters э and ы in its alphabet.
Same with "Чекаю цієї хвилини.", which is Ukrainian, but is detected as Northern Uzbek with probability 1 whereas Ukrainian gets only 0.33999999999999997. However, the letters є and ї are used only in Ukrainian whereas the Uzbek Cyrillic alphabet doesn't include as many as five letters from this sentence, namely: ю, ц, і, є and ї.
I know that Franc is supposed to be not good with short input strings, but taking alphabets into account seems to be a promising way to improve the accuracy.