wooorm / franc

Natural language detection
https://wooorm.com/franc/
MIT License

Reference of source document #78

Closed: DonaldTsang closed this issue 4 years ago

DonaldTsang commented 4 years ago

It seems that NONE of the languages have document sources listed for the data.json 3-gram model. Is it possible to provide the source documents for each language, so that we can review the material and possibly generate 2-gram and 4-gram models (or 2/3-, 3/4-, or 2/3/4-gram combinations)?
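To make the request concrete, here is a minimal TypeScript sketch of how character n-gram frequencies could be generated from any reviewed source document. This is not franc's actual build pipeline (that lives in the udhr and related packages); the function name and sample sentence are invented for the example.

```ts
// Minimal sketch: count character n-grams in a training text.
// NOT franc's real build step; it only shows how 2-/3-/4-gram
// models could be produced from an arbitrary source document.
function ngramFrequencies(text: string, n: number): Map<string, number> {
  // Rough normalization: lower-case, strip punctuation, collapse whitespace,
  // and pad with spaces so word boundaries are counted too.
  const clean =
    ' ' +
    text
      .toLowerCase()
      .replace(/[^\p{L}\p{M}\s]/gu, '')
      .replace(/\s+/g, ' ')
      .trim() +
    ' ';
  const counts = new Map<string, number>();
  for (let i = 0; i <= clean.length - n; i++) {
    const gram = clean.slice(i, i + n);
    counts.set(gram, (counts.get(gram) ?? 0) + 1);
  }
  return counts;
}

// Example: the ten most frequent trigrams of a sample sentence.
const top = [...ngramFrequencies('All human beings are born free and equal in dignity and rights.', 3)]
  .sort((a, b) => b[1] - a[1])
  .slice(0, 10);
console.log(top);
```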

wooorm commented 4 years ago

Franc is built from udhr, which has this. You’ll have to read the source code of franc and the other projects, but I made sure everything does one thing well, to allow for exactly these things!
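As a hedged illustration of that split: the franc package only maps text to an ISO 639-3 guess, while the training data comes from the separate udhr package. The import style below is franc's current ESM API; older releases expose the same behaviour via require('franc').

```ts
// franc only detects; the data behind it is generated from udhr.
// ESM API of recent franc versions; adjust for older CommonJS releases.
import {franc, francAll} from 'franc';

console.log(franc('Alle menslike wesens word vry, met gelyke waardigheid en regte, gebore.'));
// => 'afr' (Afrikaans)

console.log(francAll('Alle menslike wesens word vry, met gelyke waardigheid en regte, gebore.').slice(0, 3));
// => the top candidates with their relative weights
```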

DonaldTsang commented 4 years ago

Here are two questions:

  1. One problem with using the UDHR is that biases in its linguistic characteristics could skew the data one way or another. Has this been fully researched?
  2. Are there any other datasets that cover as many languages as possible? I would like to try, test, and compare, since I really want to apply this to vocabulary and short texts, which require bigger n-gram datasets (see the sketch below).
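
Since short texts come up here, a hedged sketch of how they play out with franc itself: by default it refuses to guess below a minimum input length, and the minLength option documented in franc's README lowers that threshold at the cost of reliability. The outputs in the comments are indicative, not guaranteed.

```ts
// Short inputs are the weak spot of trigram models: franc returns 'und'
// (undetermined) below its default minimum length unless told otherwise.
import {franc, francAll} from 'franc';

console.log(franc('hello'));                  // 'und' – too short to decide
console.log(franc('hello', {minLength: 3}));  // a real guess, but a shaky one
console.log(francAll('hello', {minLength: 3}).slice(0, 5)); // competing candidates score close together
```
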
wooorm commented 4 years ago

There are many data sets out there, none that support so many languages as UDHR.

DonaldTsang commented 4 years ago

@wooorm Maybe it is the choice of words, but when I say "researched" I meant "optimized", as in whether using the UDHR is the best route to the highest accuracy.

> There are many data sets out there, none that support so many languages as UDHR.

Do you know of any examples that have at least 75, 100, or 125 languages? Maybe 400 languages is a bit too "extreme", but I would like to know if you have already encountered such data sets that people could share.

wooorm commented 4 years ago

> When I say "researched" I meant "optimized", as in whether using the UDHR is the best route to the highest accuracy.

UDHR definitely does not give the highest accuracy, but it does support the most languages.

> Do you know of any examples that have at least 75, 100, or 125 languages?

I don’t. You could look into the Bible. There have been several issues over the years with conversations going in similar directions to this one, e.g., https://github.com/wooorm/franc/issues/76 and https://github.com/wooorm/franc/issues/75. You can read through the closed issues to find out more.

wooorm commented 4 years ago

BTW, I think the non-n-gram approach (CLD2) is often better than n-grams.

DonaldTsang commented 4 years ago

@wooorm So does CLD2 use codepoint filtering for detecting languages? I might need a primer on how it works, because codepoint filtering is something that I would like to see data on.
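To make "codepoint filtering" concrete, here is a hedged sketch of the general idea (not CLD2's actual internals): narrow the candidate languages by Unicode script before any n-gram scoring, which is roughly what franc itself does before its trigram step. The script list and examples are illustrative only.

```ts
// Sketch: pick the dominant Unicode script of the input and use it to
// filter candidate languages before any n-gram comparison.
const scripts: Array<[string, RegExp]> = [
  ['Hangul',   /\p{Script=Hangul}/u],
  ['Hiragana', /\p{Script=Hiragana}/u],
  ['Han',      /\p{Script=Han}/u],
  ['Cyrillic', /\p{Script=Cyrillic}/u],
  ['Latin',    /\p{Script=Latin}/u],
];

function dominantScript(text: string): string {
  let best = 'Unknown';
  let bestCount = 0;
  for (const [name, re] of scripts) {
    // Count the characters that belong to this script.
    const count = [...text].filter((ch) => re.test(ch)).length;
    if (count > bestCount) {
      bestCount = count;
      best = name;
    }
  }
  return best;
}

console.log(dominantScript('안녕하세요'));    // 'Hangul' → Korean is the only candidate
console.log(dominantScript('Привет, мир')); // 'Cyrillic' → still needs n-grams to pick among Slavic languages
```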

Also, wow, machine learning for https://github.com/google/cld3 and https://github.com/ropensci/cld3.

wooorm commented 4 years ago

I don't know; I maintain this project and give it away for free 🤷‍♂️

DonaldTsang commented 4 years ago

Okay, thanks for the help.

DonaldTsang commented 4 years ago

BTW, I think we could improvise with any collection of fictional and religious books, given a tool to remove proper nouns from such works and leave only the common word structures. Problem: copyright. See: the Bible.
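A crude sketch of what such a tool might look like, using nothing more than a capitalization heuristic; a real pipeline would want proper named-entity recognition, and the example sentence here is invented.

```ts
// Sketch of the "remove proper nouns" idea: drop capitalized words unless
// they start a sentence. A naive heuristic, not a production solution.
function stripLikelyProperNouns(text: string): string {
  const words = text.split(/\s+/);
  const kept: string[] = [];
  let sentenceStart = true;
  for (const word of words) {
    const capitalized = /^[A-Z][a-z]/.test(word);
    if (!capitalized || sentenceStart) kept.push(word);
    // The next word starts a sentence if this one ends with . ! or ?
    sentenceStart = /[.!?]["')\]]?$/.test(word);
  }
  return kept.join(' ');
}

console.log(stripLikelyProperNouns('Then Moses went up the mountain. The people waited below.'));
// → 'Then went up the mountain. The people waited below.'
```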

BTW, here is something unique: https://github.com/pemistahl/lingua#4--how-good-is-it-top- Also, these three use Wikipedia as a base:

There are also others that use http://wortschatz.uni-leipzig.de/en/download/ and, more exotic, https://github.com/google/corpuscrawler and, built on tweets, https://github.com/mitjat/langid_eval

https://github.com/davidjurgens/equilid#model-details is even more comprehensive. But https://github.com/landrok/language-detector basically has a hidden dataset.