wooorm / franc

Natural language detection
https://wooorm.com/franc/
MIT License

Add support for Maghrebi Arabic #75

Closed (imedadel closed 5 years ago)

imedadel commented 5 years ago

Maghrebi Arabic is the variety of Arabic spoken (and written) across the Maghreb region. Depending on the situation, it can be written in Arabic script, Hebrew script, or Latin script.

For now, I would like to focus on the Arabic script.

Adding support for Maghrebi Arabic written in the Arabic script should be straightforward, since detecting any of the letters ڨ, ڥ, ڭ, or ڜ is enough. The letters ڢ and ڧ are also attested, though mostly in older or printed documents and rarely online.
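A minimal sketch of that letter check, in plain JavaScript. The names `STRONG_MARKERS`, `RARE_MARKERS`, `findMaghrebiMarkers`, and `looksMaghrebi` are made up for illustration and are not part of franc. Some of these letters may also appear in other languages written in the Arabic script, so this only illustrates the proposal and is not a complete detector.

```javascript
// Letters proposed above as indicators of Maghrebi Arabic (hypothetical constants).
const STRONG_MARKERS = ['ڨ', 'ڥ', 'ڭ', 'ڜ'];
// Letters attested mostly in older or printed documents.
const RARE_MARKERS = ['ڢ', 'ڧ'];

// Return the distinct marker letters that occur in `text`.
function findMaghrebiMarkers(text, includeRare = false) {
  const markers = includeRare
    ? STRONG_MARKERS.concat(RARE_MARKERS)
    : STRONG_MARKERS;
  const found = new Set();
  // for...of iterates by code point, so each Arabic letter is one step.
  for (const char of text) {
    if (markers.includes(char)) found.add(char);
  }
  return [...found];
}

// Assumption: a single marker letter is treated as sufficient evidence.
function looksMaghrebi(text, includeRare = false) {
  return findMaghrebiMarkers(text, includeRare).length > 0;
}
```

Calling `looksMaghrebi(someText)` checks only the four common letters. Passing `true` as the second argument also counts the two rarer ones.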

wooorm commented 5 years ago

Unfortunately, adding new languages is rather involved and depends on an external standards body. I recently wrote about it here: https://github.com/wooorm/franc/issues/74#issuecomment-490362308

imedadel commented 5 years ago

I see. Well, it seems that UDHR translations are quite outdated (the one for Tamazight goes back to 1998, before the official recognition of the language and the standardization of its orthography).

So, could you please explain the process of making franc? I would like to make a similar project but using Bible translations instead of UDHR 😄 Thanks!

PS: the part I understand the least is the data.json files.

wooorm commented 5 years ago

Fun! The Bible is a good idea as well. I believe there are fewer Bibles available (in Unicode), though it is a longer text.

wooorm/udhr crawls and generates documents in Unicode, wooorm/trigrams generates trigrams from them, and script/build.js generates the data.
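To give a rough idea of the trigram step, here is a sketch of how a ranked trigram profile could be built from one document. It is an illustration only: `trigramProfile`, the normalization, and the `limit` default are assumptions, and it is not the actual code or output format of wooorm/trigrams or script/build.js.

```javascript
// Hypothetical helper: build a ranked list of the most frequent trigrams in `text`.
function trigramProfile(text, limit = 300) {
  // Assumed normalization: lowercase, collapse whitespace, pad with spaces
  // so that word boundaries contribute trigrams too.
  const clean = ' ' + text.toLowerCase().replace(/\s+/g, ' ').trim() + ' ';
  const chars = Array.from(clean); // split by code point, not UTF-16 unit
  const counts = new Map();
  for (let i = 0; i + 3 <= chars.length; i++) {
    const gram = chars.slice(i, i + 3).join('');
    counts.set(gram, (counts.get(gram) || 0) + 1);
  }
  // Most frequent first; ties broken by string order for stable output.
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1] || (a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0))
    .slice(0, limit)
    .map(([gram]) => gram);
}
```

A detector could then compare the profile of an input text against one stored profile per language and pick the closest match. How closeness is scored is left out here.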

imedadel commented 5 years ago

I see 😃 Bible.com hosts around 1300 translations (including multiple varieties of Arabic and Tamazight), although they are not all complete. So I'll look for the most translated chapter and use it.
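Picking that chapter could look roughly like this, assuming I first collect which chapters each translation has. The input shape (translation id mapped to a list of chapter ids) and the name `mostTranslatedChapter` are my own placeholders, not anything Bible.com provides.

```javascript
// Hypothetical input: { translationId: [chapterId, ...], ... }
function mostTranslatedChapter(availability) {
  const counts = new Map();
  for (const chapters of Object.values(availability)) {
    // Use a Set so a chapter listed twice in one translation counts once.
    for (const chapter of new Set(chapters)) {
      counts.set(chapter, (counts.get(chapter) || 0) + 1);
    }
  }
  let best = null;
  let bestCount = 0;
  for (const [chapter, count] of counts) {
    if (count > bestCount) {
      best = chapter;
      bestCount = count;
    }
  }
  // null when no translation lists any chapter.
  return best === null ? null : { chapter: best, translations: bestCount };
}
```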

Yeah, this is gonna be fun :D