xigt / lgid

language identification of linguistic examples
MIT License

Better n-gram language models #7

Open goodmami opened 7 years ago

goodmami commented 7 years ago

The word and character n-gram models are pretty simple. The feature, as defined, is set to True if some percentage of the tokens on the line exist in the language model for the given language. We could try to build a more typical n-gram model, but data sparsity would be a problem.

Consider using a resource like Ethnologue to get language family hierarchies (this data can be extracted from the included Crubadan.csv file), then combining data from related languages to create a more general model. This could help distinguish between language candidates that are radically different. These n-gram language models could then be arranged in a decision tree (or similar structure) in order to more finely select the matching language.
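For concreteness, here is a rough sketch of both pieces: the thresholded "percentage of tokens in the model" feature, and pooling related languages into family-level models. This is illustrative only; the Crubadan.csv column names (`ISO 639-3`, `Family`) and all function names are assumptions, not lgid's actual code.

```python
from collections import Counter
import csv

def char_ngrams(text, n=3):
    """Yield character n-grams from a line of text."""
    for i in range(len(text) - n + 1):
        yield text[i:i + n]

def in_model_ratio(line, model, n=3):
    """Fraction of the line's character n-grams attested in a language's
    n-gram inventory; the existing feature is roughly a thresholded
    version of something like this."""
    grams = list(char_ngrams(line, n))
    if not grams:
        return 0.0
    return sum(g in model for g in grams) / len(grams)

def load_families(path):
    """Map ISO 639-3 codes to family names from Crubadan.csv.
    The column names here are guesses about that file's layout."""
    with open(path, newline='', encoding='utf-8') as f:
        return {row['ISO 639-3']: row['Family']
                for row in csv.DictReader(f)}

def merge_family_models(models, families):
    """Pool per-language n-gram counts into one coarser model per
    family, giving the more general models suggested above."""
    merged = {}
    for lang, counts in models.items():
        fam = families.get(lang)
        if fam:
            merged.setdefault(fam, Counter()).update(counts)
    return merged
```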

Note that spelling differences across even closely related languages could make this method infeasible, but it's worth considering if there's time.

rgeorgi commented 7 years ago

Generate Language Models

MackieBlackburn commented 7 years ago

Should this be implemented as a method that will return larger LMs made by combining the LMs of languages that are distance n away in a decision tree?

goodmami commented 7 years ago

(note: this enhancement probably shouldn't be attempted until we have a working system using the existing feature definitions)

> Should this be implemented as a method that will return larger LMs made by combining the LMs of languages that are distance n away in a decision tree?

Essentially, yes. What I had in mind was to bisect all the data in some way (e.g., by language families, or by lowest perplexity or cross-entropy of individual models), use the resulting splits to train a language-model classifier (e.g., does the language belong in class A or B?), then continue splitting. The end cases (i.e., leaf classifiers) would discriminate between individual languages (it may be that A is a specific language model and B is a combined class, which is then further split, etc.). This structure would essentially be a decision tree.
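A minimal sketch of that structure, assuming count-based character n-gram models scored with add-one-smoothed cross-entropy at each split (the node layout and the scorer are illustrative assumptions, not an existing lgid API):

```python
import math

def cross_entropy(line, model, n=3):
    """Per-gram cross-entropy of a line under a count-based character
    n-gram model with add-one smoothing."""
    total = sum(model.values()) + len(model) + 1
    grams = [line[i:i + n] for i in range(len(line) - n + 1)]
    if not grams:
        return float('inf')
    return -sum(math.log((model.get(g, 0) + 1) / total)
                for g in grams) / len(grams)

class Node:
    """Internal nodes split between two pooled models; leaves name a
    single language."""
    def __init__(self, label=None, model_a=None, model_b=None,
                 left=None, right=None):
        self.label = label      # set only at leaves
        self.model_a = model_a  # pooled model for the left branch
        self.model_b = model_b  # pooled model for the right branch
        self.left = left
        self.right = right

def classify(node, line):
    """Walk from the root to a leaf, at each split taking the branch
    whose pooled model assigns the line lower cross-entropy."""
    while node.label is None:
        if cross_entropy(line, node.model_a) <= cross_entropy(line, node.model_b):
            node = node.left
        else:
            node = node.right
    return node.label
```

Here a class like "A is a specific language, B is everything else in the family" would just be a leaf on one side and a further subtree on the other.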

The motivation is that we don't have enough language data to do multi-class classification with a single classifier. But this is just my idea, and I don't have much experience in language classification. See if there is literature on n-gram language classification of low-resource languages.

MackieBlackburn commented 7 years ago

I'll do some reading and try out a few things to extend the language models. I'll try to train and test the lgid system to look at performance improvements, but where should I get more freki files?

goodmami commented 7 years ago

> where should I get more freki files?

Check under `patas:/projects/grants/riples/odin2.1-pdfs/4-match` for training data and `patas:/projects/grants/riples/odin2.1-pdfs/5-gold*` for testing data.