hugolpz commented 7 years ago

Existing words lists by frequency

Word lists: Various research projects have produced high-quality word frequency lists. The most notable are:

The Subtlex movement : FR, EN-US, NL, ZH, ES, GR, VI, AL, PL.
Worldlex: Twitter and blog word frequencies for 66 languages (Manuel Gimenes & Boris New)
- data
Wiktionary>Frequency_lists
- Frequency Word Lists (Hermit Dave, 2016)
- Github > NLP tool and data (Hermit Dave, 2016)

Ongoing efforts

| Lang/Data | Subtlex ? | Worldlex ? | Wiktionary ? | note |
|---|---|---|---|---|
| AL | lost | worldlex | ? | no opendata + time = lost |
| CMN | yes > Files S1 | worldlex | ? | Subtlex-CH data > others |
| BG | none | none | OpenSubtitle 5k, 50k | / |
| tam | no | no | ? | TA-WP-word-list |
| heb | none | none | OpenSubtitle 5k, 50k | / |
| ory | none | none | / | Help needed |
| FRA | yes | worldlex | OpenSubtitle 5k, 50k | Lexique3=Subtlex-FR 2007. Use: FRA-Lexique381-leme.txt, FRA-Lexique381-ortho.txt |

Indian languages

There should be some Indian language institutes providing lists for Indian languages.

Naming convention

See Wikipédia:Atelier identification/Nommage des photos d'animaux (French Wikipedia's naming guide for animal photos).

Notes

Hugo previously did theoretical PhD research on Chinese vocabulary learning. Accordingly, Hugo helps assess the quality of online research and data. Hugo's recommended academic sources are listed above.
Some data curation and processing are needed to provide simple-to-use word lists. Namely, the Worldlex data generally include 3 corpora (Blog, Twitter, News) with their rankings, which are not combined but need to be, with weights. Who could work on this?
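One possible way to combine the three corpora with weights, as a minimal sketch. It assumes each corpus has already been reduced to a word -> frequency-per-million mapping; the function name `combine_frequencies`, the corpus keys, and the sample numbers are all hypothetical, and the choice of weights is left open.

```python
from collections import defaultdict


def combine_frequencies(corpora, weights):
    """Merge per-corpus frequencies into one weighted score per word.

    corpora: dict of corpus name -> dict of word -> frequency per million
             (assumed input shape, not the actual Worldlex file layout)
    weights: dict of corpus name -> non-negative weight
    """
    total_weight = sum(weights[name] for name in corpora)
    combined = defaultdict(float)
    for name, freqs in corpora.items():
        share = weights[name] / total_weight  # normalise weights to sum to 1
        for word, per_million in freqs.items():
            # A word missing from a corpus simply contributes 0 for it.
            combined[word] += share * per_million
    # Highest score first; ties broken alphabetically for a stable ranking.
    return sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical sample data, for illustration only.
sample = {
    "blog": {"the": 100.0, "cat": 10.0},
    "twitter": {"the": 50.0, "lol": 40.0},
    "news": {"the": 150.0},
}
ranking = combine_frequencies(sample, {"blog": 1, "twitter": 1, "news": 1})
```

With equal weights this is a plain average over the corpora; giving, say, News a lower weight would favour everyday vocabulary.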

hugolpz commented 7 years ago

Github > NLP tool and data (Hermit Dave, 2016) is a tremendous source! Since the author covered about fifty languages, the "words" may be linguistically unclean. Curation would be welcome.

IDEA: for each language, create a Tinder-like app so native speakers can tag each word as real | artifact. See https://github.com/wikimedia-france/Lingua-Libre/issues/14

hugolpz commented 7 years ago

Creating frequency data {item}{occurrences} from corpus

Data cleanup to lists of {item}s

See Wiki's tutorials
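As a rough illustration of both steps (not the method from the wiki tutorials), here is a minimal sketch that counts items in a corpus and then reduces the counts to a plain list. It assumes whitespace-separated text, strips ASCII punctuation only, and uses a tab between item and count; the function names are hypothetical.

```python
import string
from collections import Counter


def count_words(lines):
    """Count occurrences of each whitespace-separated token."""
    counts = Counter()
    for line in lines:
        for token in line.lower().split():
            # Strip ASCII punctuation only; other scripts may need more.
            word = token.strip(string.punctuation)
            if word:
                counts[word] += 1
    return counts


def to_frequency_lines(counts):
    """Format counts as 'item<TAB>occurrences', most frequent first."""
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{word}\t{n}" for word, n in ordered]


def to_word_list(counts, min_count=1):
    """Drop the counts and keep items seen at least min_count times."""
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, n in ordered if n >= min_count]
```

Splitting on whitespace keeps combining marks attached to their base characters, which matters for Indic scripts, but it will not work for languages written without spaces.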

hugolpz commented 7 years ago

French word list added. I used http://textmechanic.com/text-tools/basic-text-tools/remove-duplicate-lines/ to merge duplicate lemmas.
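The same duplicate-line removal can be done locally. This is a minimal sketch, assuming one lemma per line; `unique_lines` is a hypothetical helper, not what the online tool runs.

```python
def unique_lines(lines):
    """Return lines with duplicates removed, keeping first-seen order."""
    seen = set()
    result = []
    for line in lines:
        entry = line.strip()  # ignore surrounding whitespace and newlines
        if entry and entry not in seen:
            seen.add(entry)
            result.append(entry)
    return result
```

Keeping first-seen order matters when the input is already sorted by frequency, since the most frequent occurrence of each lemma then stays in place.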

hugolpz commented 7 years ago

Indian languages

cc: @tshrinivasan .

Note: Wikipedia, being an encyclopedia, is not the best corpus for educational vocabulary lists. The Subtlex academics argue that subtitles, such as those from Open Subtitles, are the most relevant for language learning because they stick most closely to real spoken language.

Academics

IIIT Hyderabad > Language Technologies Research Center > NLP-MT

IIIT Hyderabad : International Institute of Information Technology, Hyderabad https://www.iiit.ac.in/ >
- LTRC : Language Technologies Research Center https://ltrc.iiit.ac.in
- NLPMT : Machine Translation and Natural Language Processing Lab https://ltrc.iiit.ac.in/nlpmt/
  - High Frequency Words List for Indian Languages https://ltrc.iiit.ac.in/showfile.php?filename=ltrc/internal/nlp/corpus/index.html

Data can be downloaded (screenshot from 2017-06-12 16-53-38), but it appears to me as shown in a second screenshot (2017-06-12 17-20-41).

CIIL

Central Institute for Indian Languages http://www.ciil.org/ http://www.ciil-lisindia.net/Tamil/Tamil_tech.html

CIIL Tamil corpus, 3M words
CIIL Telugu corpus, 3M words
CIIL Kannada corpus, 3M words
CIIL Malayalam corpus, 3M words
partially tagged corpus.
Available on CD; one can get a free copy from CIIL for research purposes. XD

TAMIL

Wordlists

1000 common words -- light cleanup to do - wordlist (text-file): words.txt
~~Tamil-English Basic Vocabulary, University of Pennsylvania -- double column table to clean up~~ (bad list? seems like Thai, can someone with Tamil localization properly installed on their computer check?)
- TAM-University _Pennsylvania-words-1831.txt -- TAM-ENG, easier to clean up.
Tamil

hugolpz commented 7 years ago

Contacted the author of Worldlex and asked them to free their data.

wikimedia-france / Lingua-Libre

[Mini-project] Build solid words frequency lists for most common languages #4
