wooorm / franc

Natural language detection
https://wooorm.com/franc/
MIT License

Use languages' alphabets to make detection more accurate #83

Open thorn0 opened 4 years ago

thorn0 commented 4 years ago

Что это за язык? is a Russian sentence, which is detected as Bulgarian (bul 1, rus 0.938953488372093, mkd 0.9353197674418605). However, neither Bulgarian nor Macedonian has the letters э and ы in its alphabet.

Same with Чекаю цієї хвилини., which is Ukrainian, but is detected as Northern Uzbek with probability 1 whereas Ukrainian gets only 0.33999999999999997. However, the letters є and ї are used only in Ukrainian whereas the Uzbek Cyrillic alphabet doesn't include as many as five letters from this sentence, namely: ю, ц, і, є and ї.

I know that Franc isn't expected to do well with short input strings, but taking alphabets into account seems to be a promising way to improve the accuracy.

wooorm commented 4 years ago

That’s a good idea, it’s similar to how Google works! However, I don’t think it should be so “black and white”, as “the letter ы is not available in Bulgarian or Macedonian” should still be matched as English.

We could do something with a special character list that enhances scores of certain scripts?

I remember there is a Turkish i variant that isn’t used anywhere else as well, forgot what it was tho

thorn0 commented 4 years ago

The dotless i (ı) is used not only in Turkish. Other languages whose alphabets are based on the Turkish alphabet have it too. E.g. Azerbaijani and Crimean Tatar.

thorn0 commented 4 years ago

> We could do something with a special character list that enhances scores of certain scripts?

Scripts like Latin, Cyrillic, etc.? You meant languages, not scripts then, right?

thorn0 commented 4 years ago

It's not only a matter of which characters the alphabet has, it's also about which ones it doesn't. In Чекаю цієї хвилини., there are 5 letters that aren't in the Uzbek alphabet. It's 31% of all the letters in the string. In no way should Uzbek get the highest ranking in such a situation.
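A rough sketch of that kind of check, run on the first example from this thread (the Russian sentence against the Bulgarian alphabet). The function name `outsideAlphabet` and the hand-typed alphabet string are assumptions made for illustration; nothing here is part of franc.

```javascript
// Bulgarian alphabet (30 letters), typed by hand for this illustration.
const bulgarian = new Set('абвгдежзийклмнопрстуфхцчшщъьюя')

// Hypothetical helper: lists the letters of `text` that are missing from
// `alphabet` and reports their share of all letters in the text.
function outsideAlphabet(text, alphabet) {
  // \p{L} matches any Unicode letter, so spaces and punctuation are ignored.
  const letters = text.toLowerCase().match(/\p{L}/gu) || []
  const missing = letters.filter((letter) => !alphabet.has(letter))
  return {
    missing: [...new Set(missing)],
    ratio: letters.length === 0 ? 0 : missing.length / letters.length
  }
}

console.log(outsideAlphabet('Что это за язык?', bulgarian))
```

A detector could use a high ratio to lower or drop a candidate language, though how much weight to give it is an open question.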

thorn0 commented 4 years ago

@wooorm Do you happen to know a programmatic way to get the alphabet (the set of used characters) for a given language?

wooorm commented 4 years ago

I think it’s vague what even an alphabet is, but I did find this list on Wikipedia: https://en.wikipedia.org/wiki/Wikipedia:Language_recognition_chart. Interesting stuff!

Franc supports the most languages possible, as it uses the biggest training set (UDHR). It’s designed to not discriminate against languages with few speakers, and I can see how adding a feature such as this would (because there is no data about alphabets for lots of languages).

There are projects that focus on fewer languages and do things like what you’re proposing. Have you looked at https://github.com/CLD2Owners/cld2?

thorn0 commented 4 years ago

I thought I saw something on the Unicode site where, for each character, there was information about which languages use it, but now I can't find it.

> I think it’s vague what even an alphabet is

Right. Some characters sometimes aren't considered separate letters of the alphabet (e.g. umlauts in German), etc. That's why I wrote "alphabet (the set of used characters)".

wooorm commented 4 years ago

I don’t think there’s an automated way to do it.


I think it could be possible to either do it character-based, e.g., like so:

  "э": {
     "bul": -3,
     "mkd": -3,
     "rus": 3,
     "bel": 3,
     // ...or so
  }

Or based on n-grams/regexes:

  "tje$": [["nld", 2]],
  "^z": [["nld", 1]]

But this is an error-prone and “soft” approach compared to the current “hard” data model.
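To make the two shapes above concrete, here is a minimal sketch of how such weights might be applied to a result list shaped like the output of `franc.all` (an array of `[language, score]` pairs). The table names, the `adjust` function, and the `STEP` factor that turns a weight into a score change are all assumptions for illustration, not franc's API or data model.

```javascript
// Hypothetical weight tables, following the shapes sketched above.
const characterWeights = {
  'э': {bul: -3, mkd: -3, rus: 3, bel: 3}
}
const patternWeights = [
  [/tje$/, {nld: 2}],
  [/^z/, {nld: 1}]
]

// Assumed conversion from one weight unit to a score change (arbitrary value).
const STEP = 0.02

// Takes [[language, score], ...] and returns a re-ranked copy.
function adjust(results, text) {
  const delta = {}
  const add = (weights) => {
    for (const [language, weight] of Object.entries(weights)) {
      delta[language] = (delta[language] || 0) + weight * STEP
    }
  }

  const lower = text.toLowerCase()

  // Character-based weights: applied once per occurrence.
  for (const character of lower) {
    if (characterWeights[character]) add(characterWeights[character])
  }

  // Pattern-based weights: tested against each word separately.
  for (const word of lower.match(/\p{L}+/gu) || []) {
    for (const [pattern, weights] of patternWeights) {
      if (pattern.test(word)) add(weights)
    }
  }

  return results
    .map(([language, score]) => [language, score + (delta[language] || 0)])
    .sort((a, b) => b[1] - a[1])
}

// Scores taken from the first comment in this thread.
const before = [['bul', 1], ['rus', 0.938953488372093], ['mkd', 0.9353197674418605]]
console.log(adjust(before, 'Что это за язык?'))
```

With these made-up numbers a single э is enough to move `rus` ahead of `bul`; a different `STEP` would give a different ranking, which is the “soft” part.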


An alternative idea is to look at the TRY field in hunspell dictionaries. E.g., the Russian dictionary defines:

TRY оаитенрсвйлпкьыяудмзшбчгщюжцёхфэъАВСМКГПТЕИЛФНДОЭРЗЮЯБХЖШЦУЧЬЫЪЩЙЁ

And Macedonian:

TRY аеоинвтрслпкудмзбчгјшцњжфхќџѓљѕѐѝАЕОИНВТРСЛПКУДМЗБЧГЈШЦЊЖФХЌЏЃЉЅЀЍ-’!.

These are mostly already ordered from frequent to infrequent.
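Pulling the letters out of a `TRY` line could look roughly like this. The function name `parseTry` is made up, and the two sample strings are shortened stand-ins for real `.aff` files that contain only the `TRY` lines quoted above.

```javascript
// Hypothetical helper: returns the letters listed on the TRY line of a
// hunspell .aff file, lowercased, deduplicated, and in their original order.
function parseTry(aff) {
  const match = aff.match(/^TRY[ \t]+(.+)$/m)
  if (!match) return []
  const letters = [...match[1].trim().toLowerCase()].filter((character) =>
    /\p{L}/u.test(character) // drop punctuation such as - ’ ! .
  )
  return [...new Set(letters)]
}

// Stand-ins for real .aff files, using the TRY lines from this comment.
const russianAff =
  'SET UTF-8\nTRY оаитенрсвйлпкьыяудмзшбчгщюжцёхфэъАВСМКГПТЕИЛФНДОЭРЗЮЯБХЖШЦУЧЬЫЪЩЙЁ\n'
const macedonianAff =
  'SET UTF-8\nTRY аеоинвтрслпкудмзбчгјшцњжфхќџѓљѕѐѝАЕОИНВТРСЛПКУДМЗБЧГЈШЦЊЖФХЌЏЃЉЅЀЍ-’!.\n'

const russian = parseTry(russianAff)
const macedonian = parseTry(macedonianAff)

console.log(russian.includes('э'), macedonian.includes('э'))
```

Because the order is preserved, the position of a letter in the returned array could also serve as a crude frequency rank.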

thorn0 commented 4 years ago

Found it! http://cldr.unicode.org/translation/-core-data/exemplars

Letter frequency is an important thing too, but on the other hand letters that are unique to some language are often infrequent in it. E.g. ѕ (Cyrillic) in Macedonian and є in Ukrainian.
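Once per-language character sets are available, finding the letters unique to one language within a group is straightforward. In this sketch the four alphabets are typed by hand rather than taken from CLDR, and `uniqueLetters` is a hypothetical name; "unique" only means unique among the languages passed in.

```javascript
// Hand-typed alphabets for illustration; real data would come from CLDR exemplars.
const alphabets = {
  rus: 'абвгдеёжзийклмнопрстуфхцчшщъыьэюя',
  ukr: 'абвгґдеєжзиіїйклмнопрстуфхцчшщьюя',
  bul: 'абвгдежзийклмнопрстуфхцчшщъьюя',
  mkd: 'абвгдѓежзѕијклљмнњопрстќуфхцчџш'
}

// For each language, keeps the letters that no other listed language uses.
function uniqueLetters(alphabets) {
  // Count in how many alphabets each letter appears.
  const counts = new Map()
  for (const letters of Object.values(alphabets)) {
    for (const letter of new Set(letters)) {
      counts.set(letter, (counts.get(letter) || 0) + 1)
    }
  }

  const result = {}
  for (const [language, letters] of Object.entries(alphabets)) {
    result[language] = [...new Set(letters)].filter(
      (letter) => counts.get(letter) === 1
    )
  }
  return result
}

console.log(uniqueLetters(alphabets))
```

Adding more languages shrinks these sets: with Belarusian in the list, for example, ы, э and і would no longer be unique to anything.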

wooorm commented 4 years ago

Nice, we can crawl them from cldr: bg, ru, mk

wooorm commented 4 years ago

@thorn0 Is this something you’d be interested to work on?

thorn0 commented 4 years ago

It's unlikely I'll have time for this any time soon.

niftylettuce commented 4 years ago

@thorn0 @wooorm I would put a $50 bug bounty on this payable by PayPal if anyone had the time!

Rakiiv commented 4 years ago

Setup:

  const franc = require('franc');
  const text = 'Was wäre, wenn ich heute mal mit ja antworten würde?';
  const options = {
    only: ['deu', 'eng', 'fra', 'ita', 'jav', 'spa', 'nld', 'pol', 'por', 'rus', 'zho']
  };
  console.log(franc.all(text, options));

result is:

  [
    [ 'nld', 1 ],
    [ 'deu', 0.9141248240262787 ],
    [ 'jav', 0.679962458939465 ],
    [ 'fra', 0.651337400281558 ],
    [ 'spa', 0.631628343500704 ],
    [ 'eng', 0.5945565462224307 ],
    [ 'por', 0.49835757860159546 ],
    [ 'ita', 0.4084936649460347 ],
    [ 'pol', 0.3779915532613797 ]
  ]

This result should be impossible: the sentence contains the German special characters 'ä' and 'ü', and there are no such letters in Dutch.

Using the dictionaries and checking the alphabet as mentioned above might solve this, making franc a really mighty tool. So upvote from me on this.

muratcorlu commented 3 years ago

> I remember there is a turkish i variant that isn’t used anywhere else as well, forgot what it was tho

@wooorm Yes, ı and İ are specific to Turkish.