wikimedia-france / Lingua-Libre

LinguaLibre – Massive Open Audio Recording system
http://v1.lingualibre.fr
GNU General Public License v3.0

Load the words from a text file or database. #39

Closed: tshrinivasan closed this issue 6 years ago

tshrinivasan commented 7 years ago

We can query Wiktionary to get the list of words that don't have audio files. Using a command-line script, we can put those words in a text file (in the root path of the application or somewhere else) or in a database.

Lingua Libre should load the words from this text file, so that users can contribute easily and avoid duplicates.

Once a word is recorded, it should be deleted from the text file or database.
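
To make the proposal concrete, here is a minimal Python sketch of the text-file variant, assuming one word per line in a UTF-8 file. The function names `load_words` and `mark_recorded` are hypothetical and are not part of Lingua Libre; this only illustrates the load-then-delete behaviour described above.

```python
from pathlib import Path


def load_words(path):
    """Return the non-empty, de-duplicated words in the list, in file order."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    seen = set()
    words = []
    for line in lines:
        word = line.strip()
        # Skip blank lines and words already seen earlier in the file.
        if word and word not in seen:
            seen.add(word)
            words.append(word)
    return words


def mark_recorded(path, word):
    """Rewrite the list without `word`; return True if it was present."""
    words = load_words(path)
    if word not in words:
        return False
    remaining = [w for w in words if w != word]
    # One word per line, each terminated by a newline.
    Path(path).write_text("".join(w + "\n" for w in remaining), encoding="utf-8")
    return True
```

The application would call `load_words` to offer words to the user and `mark_recorded` after a successful recording. Rewriting the whole file is simple but not safe under concurrent recordings; a database would be the better fit if several users record at once.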

I can collect the list of words that don't have audio from Wiktionary. Please provide details on where to store the list of words, and provide a feature to load the words from there.

hugolpz commented 7 years ago

Issue #4 led to the creation of a directory gathering solid, curated word lists.

The structure looks like:

./lists/
|-- {ISO_639-3}-{source}-{kind}.txt
|-- {ISO_639-3}-{source}-{kind}.txt
|-- {ISO_639-3}-{source}-{kind}.txt
|-- {ISO_639-3}-{source}-{kind}.txt

Kind is either word forms (e.g. eat, ate, eaten, dog, dogs, doggy) or word families / lemmas (e.g. eat, dog).
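
As an illustration of the naming pattern above, the following Python sketch splits a list filename into its three parts. The helper `parse_list_name`, the `ListName` type, and the example filenames are hypothetical. It assumes the source part contains no hyphen, so anything after the second hyphen is treated as the kind.

```python
from pathlib import Path
from typing import NamedTuple


class ListName(NamedTuple):
    language: str  # ISO 639-3 code, e.g. "fra"
    source: str    # where the list comes from
    kind: str      # e.g. word forms or lemmas


def parse_list_name(filename):
    """Parse '{ISO_639-3}-{source}-{kind}.txt' into its components."""
    path = Path(filename)
    if path.suffix != ".txt":
        raise ValueError(f"not a .txt word list: {filename}")
    # Split on the first two hyphens only, so the kind may itself contain hyphens.
    parts = path.stem.split("-", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected language-source-kind: {filename}")
    language, source, kind = parts
    # ISO 639-3 codes are exactly three ASCII letters.
    if len(language) != 3 or not (language.isascii() and language.isalpha()):
        raise ValueError(f"invalid ISO 639-3 code: {language}")
    return ListName(language.lower(), source, kind)
```

For example, `parse_list_name("./lists/fra-subtlex-lemmas.txt")` yields language `fra`, source `subtlex`, and kind `lemmas`, which would let the application pick the right starting list for each language.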

The aim is to provide a starting wordlist for each language we work on.

Issue #4 also has a list of potential sources for major languages: wordlex, subtlex, and others.