DonaldTsang closed this issue 4 years ago
Franc is built from udhr, which has this. You’ll have to read the source code of franc and the other projects, but I made sure each one does one thing well, to allow for these things!
Here are two questions:
@wooorm Maybe it is the choice of words, but when I say "researched" I mean "optimized", as in whether using the UDHR is the best route to the highest accuracy.
There are many data sets out there, but none that supports as many languages as the UDHR.
Do you know of any examples that have at least 75, 100, or 125 languages? Maybe 400 languages is a bit too "extreme", but I would like to know if you have already encountered such data sets that people could share.
When I say "researched" I meant "optimized" as in if using the UDHR is the best route in having the highest accuracy.
UDHR definitely does not give the highest accuracy, but it does support the most languages.
Do you know of any examples that have at least 75 or 100 languages
I don’t. You could look into the Bible. There have been several issues over the years with conversations going in similar directions as this one, e.g., https://github.com/wooorm/franc/issues/76 and https://github.com/wooorm/franc/issues/75. You can read through the closed issues to find out more.
BTW, I think the non-n-gram (CLD2) approach is often better than n-grams.
@wooorm Yes, so does CLD2 use codepoint filtering for detecting languages? I might need a primer on how it works, because codepoint filtering is something I would like to see data on.
Also, wow, machine learning for https://github.com/google/cld3 and https://github.com/ropensci/cld3
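To make the question concrete: my understanding of codepoint filtering is that you count which Unicode script's ranges dominate the input before (or instead of) running any n-gram model. A minimal sketch of that idea, with made-up ranges and no relation to CLD2's or franc's actual tables:

```javascript
// Sketch of codepoint filtering: guess the dominant script of a string
// by counting characters in a few Unicode ranges. The ranges below are
// illustrative only, not the tables CLD2 or franc actually use.
const scripts = [
  { name: 'Cyrillic', re: /[\u0400-\u04FF]/g },
  { name: 'Greek', re: /[\u0370-\u03FF]/g },
  { name: 'Latin', re: /[A-Za-z\u00C0-\u024F]/g },
];

function dominantScript(text) {
  let best = { name: 'Unknown', count: 0 };
  for (const { name, re } of scripts) {
    const count = (text.match(re) || []).length;
    if (count > best.count) best = { name, count };
  }
  return best.name;
}

console.log(dominantScript('Привет, мир')); // 'Cyrillic'
console.log(dominantScript('hello world')); // 'Latin'
```

For single-script languages (e.g. Greek, Hebrew) this alone can decide the language; for shared scripts like Latin or Cyrillic you would still need a per-language model afterwards.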
I don't know; I maintain this project and give it away for free 🤷♂️
Okay, thanks for the help.
BTW, I think we could improvise with any collection of fictional and religious books, given a tool to remove proper nouns from such works, leaving only the common word structures. Problem: copyright. See: the Bible.
BTW, here is something unique: https://github.com/pemistahl/lingua#4--how-good-is-it-top- Also, these three use Wikipedia as a base:
There are also others that use http://wortschatz.uni-leipzig.de/en/download/, and, even more exotic, https://github.com/google/corpuscrawler, and, with tweets, https://github.com/mitjat/langid_eval
https://github.com/davidjurgens/equilid#model-details is even more comprehensive. But https://github.com/landrok/language-detector basically has a hidden dataset.
It seems that NONE of the languages have sources for the `data.json` 3-gram model. Is it possible to provide document sources for each language, so that we can review the material and possibly generate 2-gram and 4-gram models (or 2/3-, 3/4-, or 2/3/4-gram combos)?
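Generating such alternative models from a source corpus is mechanically simple, which is why having the documents matters more than the gram size. A hedged sketch of building a frequency-ranked n-gram profile from raw text (this mimics the general trigram-profile idea, not franc's actual `data.json` format or its normalization rules):

```javascript
// Sketch: build a frequency-ranked n-gram profile from a text sample.
// With the source documents per language, the same function could emit
// 2-gram, 3-gram, or 4-gram profiles. This is NOT franc's exact pipeline.
function ngrams(text, n = 3) {
  // Crude normalization: lowercase, collapse non-letters to single spaces,
  // and pad with spaces so word boundaries appear in the grams.
  const clean = ' ' + text.toLowerCase().replace(/[^a-z\u00E0-\u00FF]+/g, ' ').trim() + ' ';
  const counts = new Map();
  for (let i = 0; i <= clean.length - n; i++) {
    const gram = clean.slice(i, i + n);
    counts.set(gram, (counts.get(gram) || 0) + 1);
  }
  // Rank grams by frequency; the top slice becomes the language profile.
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([gram]) => gram);
}

console.log(ngrams('the quick brown fox jumps over the lazy dog', 3).slice(0, 5));
```

Classification would then compare an input's profile against each language's profile by rank distance, so reviewing (and cleaning) the per-language source material directly affects accuracy.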