wikimedia-france / Lingua-Libre

LinguaLibre – Massive Open Audio Recording system
http://v1.lingualibre.fr
GNU General Public License v3.0

[Mini-project] Build solid words frequency lists for most common languages #4

Closed: hugolpz closed this issue 6 years ago

hugolpz commented 7 years ago

Existing words lists by frequency

Word lists: Various research projects have produced high-quality word frequency lists. The most notable are:

Ongoing efforts

| Lang/Data | Subtlex ? | Worldlex ? | Wiktionary ? | Note |
|---|---|---|---|---|
| AL | lost | worldlex | ? | no opendata + time = lost |
| CMN | yes > Files S1 | worldlex | ? | Subtlex-CH data > others |
| BG | none | none | OpenSubtitle 5k, 50k | / |
| tam | no | no | ? | TA-WP-word-list |
| heb | none | none | OpenSubtitle 5k, 50k | / |
| ory | none | none | / | Help needed |
| FRA | yes | worldlex | OpenSubtitle 5k, 50k | Lexique3=Subtlex-FR 2007. Use: FRA-Lexique381-leme.txt, FRA-Lexique381-ortho.txt |

Indian languages

There should be some Indian language institutes providing lists for Indian languages.

Naming convention

See Wikipédia:Atelier identification/Nommage des photos d'animaux

screenshot from 2017-06-14 12-49-31

Notes

hugolpz commented 7 years ago

GitHub > NLP tool and data (Hermit Dave, 2016) is a tremendous source! Since the author tackled about fifty languages, the "words" may be linguistically unclean. Curation would be welcome.

IDEA: for each language, create a Tinder-like app so that native speakers can tag each word as real | artifact. See https://github.com/wikimedia-france/Lingua-Libre/issues/14

hugolpz commented 7 years ago

Creating frequency data {item}{occurrences} from a corpus
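As a rough illustration of this step, here is a minimal Python sketch that counts word occurrences in a text and writes them out as one `item count` pair per line. The function names, the space separator, and the sample sentence are assumptions made for the example, not something taken from the tools linked in this thread. The tokenizer is deliberately naive: it keeps runs of letters only, so it will not work for languages written without spaces (e.g. CMN) and may split words in scripts that rely on combining marks.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count how often each lowercased word occurs in `text`."""
    # [^\W\d_]+ matches runs of letters only (no digits, no underscore).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(tokens)


def format_frequency_list(freq):
    """Return 'item count' lines, most frequent first, ties sorted by item."""
    ranked = sorted(freq.items(), key=lambda pair: (-pair[1], pair[0]))
    return [f"{item} {count}" for item, count in ranked]


# Hypothetical one-line corpus, for illustration only.
sample_corpus = "Le chat dort. Le chien dort aussi."
frequency_lines = format_frequency_list(word_frequencies(sample_corpus))
print("\n".join(frequency_lines))
```

For a real corpus such as a subtitle dump, the text would be read from files and a language-aware tokenizer would likely be needed.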

Data cleanup to lists of {item}s
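One possible way to do this cleanup, sketched in Python under the assumption that the frequency data is stored as one `item count` pair per line, separated by whitespace. The function name, the `limit` parameter, and the sample lines are hypothetical. Lines that do not match the expected shape are skipped rather than guessed at.

```python
def items_only(frequency_lines, limit=None):
    """Strip the counts from 'item count' lines and return the items in order.

    `limit` optionally keeps only the first N items (e.g. 5000 for a 5k list).
    """
    items = []
    for line in frequency_lines:
        parts = line.split()
        # Keep only well-formed lines: exactly one item and one integer count.
        if len(parts) != 2 or not parts[1].isdigit():
            continue
        items.append(parts[0])
    return items if limit is None else items[:limit]


# Hypothetical input, for illustration only.
raw_lines = ["de 100", "la 90", "not a valid line", "et 80"]
top_two = items_only(raw_lines, limit=2)
print(top_two)
```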

hugolpz commented 7 years ago

French word list added. I used http://textmechanic.com/text-tools/basic-text-tools/remove-duplicate-lines/ to merge duplicate lemmas.
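For anyone who prefers to do this offline, a small Python sketch of the same idea follows: drop repeated lines while keeping the first occurrence and the original order. This is an assumed equivalent, not a description of how the online tool works; the function name and sample data are made up for the example.

```python
def remove_duplicate_lines(lines):
    """Return the lines with duplicates removed, keeping first occurrences in order."""
    seen = set()
    unique = []
    for line in lines:
        key = line.strip()  # ignore surrounding whitespace when comparing
        if key and key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# Hypothetical lemma list, for illustration only.
lemmas = ["être", "avoir", "être ", "", "faire", "avoir"]
unique_lemmas = remove_duplicate_lines(lemmas)
print(unique_lemmas)
```

Keeping the original order matters here because the list is sorted by frequency, so a plain `set()` would lose the ranking.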

hugolpz commented 7 years ago

Indian languages

cc: @tshrinivasan .

Note: Wikipedia, being an encyclopedia, is not the best corpus for educational vocabulary lists. The academics behind Subtlex argue that subtitles, such as those from Open Subtitles, are the most relevant for language learning because they stick most closely to real spoken language.

Academics

IIIT Hyderabad > Language Technologies Research Center > NLP-MT

iiit-new screenshot from 2017-06-12 16-54-14

Data can be downloaded...

screenshot from 2017-06-12 16-53-38

...but appears to me as:

screenshot from 2017-06-12 17-20-41

CIIL

Central Institute of Indian Languages: http://www.ciil.org/ and http://www.ciil-lisindia.net/Tamil/Tamil_tech.html

See also :

NO DATA ONLINE ... #ridiculous #wasteTaxpayersMoney

Note:

screenshot from 2017-06-12 18-58-39

TAMIL

Wordlists

hugolpz commented 7 years ago

I contacted the author of Worldlex and asked them to release their data as open data.

screenshot from 2017-07-25 17-30-42