yogurt-cultures / kefir

🥛turkic morphology project

Consider using a dictionary #6

Open mdakin opened 6 years ago

mdakin commented 6 years ago

A dictionary lets you identify exceptional cases that cannot be detected by inspecting the letters alone.
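To illustrate the kind of case meant here, a rough sketch: "saat" ends in a back vowel, so a letter-only rule would produce "saatlar", while the correct plural is "saatler". The `EXCEPTIONS` table, the `inverse_harmony` flag name, and the `pluralize` function are all hypothetical names made up for this example and are not part of kefir or zemberek.

```python
BACK_VOWELS = set("aıou")
FRONT_VOWELS = set("eiöü")

# Hypothetical exception table; in practice it would be loaded from a dictionary file.
EXCEPTIONS = {"saat": {"inverse_harmony"}}


def pluralize(word, exceptions=EXCEPTIONS):
    """Pick -lar/-ler by vowel harmony, consulting the dictionary first."""
    vowels = BACK_VOWELS | FRONT_VOWELS
    # Letter-based rule: look at the last vowel of the word.
    last_vowel = next((ch for ch in reversed(word) if ch in vowels), None)
    back = last_vowel in BACK_VOWELS
    # Dictionary override: flagged words take the opposite harmony.
    if "inverse_harmony" in exceptions.get(word, ()):
        back = not back
    return word + ("lar" if back else "ler")


print(pluralize("kitap"))      # regular word, letter rule is enough
print(pluralize("saat"))       # exception handled through the dictionary
print(pluralize("saat", {}))   # same word without the dictionary
```

Without the lookup the last call falls back to the letter rule and gets the suffix wrong, which is the point of the suggestion.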

One possibility is Zemberek: https://github.com/ahmetaa/zemberek-nlp/blob/master/morphology/src/main/resources/tr/master-dictionary.dict

These are the rules: https://github.com/ahmetaa/zemberek-nlp/wiki/Text-Dictionary-Rules
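A minimal parsing sketch, assuming a simplified line shape of a word followed by optional bracketed metadata such as `[P:Noun; A:InverseHarmony]`. The sample lines and attribute names below are assumptions for illustration only; the rules page linked above is the authority on the actual format, and `parse_entry` is a hypothetical helper.

```python
import re

# Assumed shape: "<word> [Key:val1, val2; Key2:val3]", with the bracket part optional.
LINE_RE = re.compile(r"^(?P<word>[^\[\]]+?)\s*(?:\[(?P<meta>[^\]]*)\])?\s*$")


def parse_entry(line):
    """Split one dictionary line into (word, {key: [values]})."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized dictionary line: {line!r}")
    meta = {}
    # Groups are separated by ';', each group is "Key:comma, separated, values".
    for group in (match.group("meta") or "").split(";"):
        key, sep, values = group.partition(":")
        if not sep:
            continue  # skip empty or malformed groups
        meta[key.strip()] = [v.strip() for v in values.split(",") if v.strip()]
    return match.group("word"), meta


# Illustrative lines, not copied from the real dictionary file.
for sample in ["kitap", "saat [A:InverseHarmony]", "ağız [P:Noun; A:LastVowelDrop]"]:
    print(parse_entry(sample))
```

The resulting mapping could then be used as a lookup table for per-word exceptions.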

Zemberek also has a binary version based on Protocol Buffers, with preprocessed attributes for fast loading. It should be possible to read it directly from Python. Proto: https://github.com/ahmetaa/zemberek-nlp/blob/master/morphology/src/main/proto/lexicon.proto

Binary dictionary: https://github.com/ahmetaa/zemberek-nlp/blob/master/morphology/src/main/resources/tr/lexicon.bin