lgid

Language identification of linguistic examples

This utility identifies the subject language of linguistic examples, specifically for use in the ODIN data-acquisition pipeline. Unlike common language identification methods that look for characteristic n-grams of characters or words, this tool attempts to identify languages for which there might be little or no other example data. This is accomplished by looking for language mentions in the context of the examples. If a language name is mentioned in the document containing the example, it is a candidate for the subject language of the example. Other features, such as the proximity of language mentions to the example, or n-gram matches to a language model (if such a model exists) also contribute to the determination.
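
The list-mentions command exposes this mention-finding step directly. As a rough illustration of the idea only (the helper names and the lang_table.txt layout shown here are assumptions, not the package's actual code):

import re

def load_language_table(path):
    # Assumed layout: one "name<TAB>code" pair per line (cf. res/lang_table.txt).
    names = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip('\n').split('\t')
            if len(parts) >= 2:
                names[parts[0].lower()] = parts[1]
    return names

def find_language_mentions(document_text, names):
    # Return (line_number, name, code) for every language name found in the document.
    mentions = []
    for lineno, line in enumerate(document_text.splitlines(), 1):
        lowered = line.lower()
        for name, code in names.items():
            if re.search(r'\b' + re.escape(name) + r'\b', lowered):
                mentions.append((lineno, name, code))
    return mentions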

In addition to accounting for a lack of language model data, this package also does not assume any supervised training data. In order to make a classifier that can predict a language when no positive instances of the language have been seen, we generalize the problem so the classifier predicts the probability that some mentioned language is associated with the current example. The language with the highest probability is then chosen.
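
A minimal sketch of that decision rule, assuming a scikit-learn-style binary classifier and a stand-in feature extractor (the real model and feature code live in lgid/models.py and lgid/features.py and differ in detail):

def choose_language(example, candidate_languages, extract_features, model):
    # Score every (example, mentioned language) pair and keep the best one.
    # `model` is any binary classifier exposing predict_proba();
    # `extract_features` stands in for the package's real feature functions.
    best_lang, best_prob = None, -1.0
    for lang in candidate_languages:
        vec = extract_features(example, lang)
        prob = model.predict_proba([vec])[0][1]  # P(lang is the subject language)
        if prob > best_prob:
            best_lang, best_prob = lang, prob
    return best_lang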

The lgid package includes functionality for building resources from training data, listing language mentions in a document, training models, classifying data, and testing the trained models.

Installation

The basic requirements (Git, Bash, and Python) are probably satisfied by any modern Linux or Mac setup.

First clone the repository:

~$ git clone https://github.com/xigt/lgid.git

If you wish to install a pre-trained model and ODIN language model files alongside the code, you can download the latest code_and_data.zip file from the releases page and unzip that instead of cloning the git repository. You may also clone the git repository, then download the latest data_only.zip file from the releases page and unzip that in the cloned directory.

After cloning this repository, run the setup-env.sh script to create a virtual environment and download the necessary dependencies into it:

~$ cd lgid/
~/lgid$ bash setup-env.sh

The script installs the required Python dependencies into the virtual environment.

Once setup is complete, run lgid.sh as the front end for all tasks; it manages activation and deactivation of the virtual environment:

~/lgid$ ./lgid.sh 
Usage:
  lgid [-v...] train    --model=PATH [--vectors=DIR] CONFIG INFILE...
  lgid [-v...] test     --model=PATH [--vectors=DIR] CONFIG INFILE...
  lgid [-v...] validate     --model=PATH [--vectors=DIR] CONFIG INFILE...
  lgid [-v...] classify --model=PATH --out=PATH [--vectors=DIR] CONFIG INFILE...
  lgid [-v...] get-lg-recall    CONFIG INFILE...
  lgid [-v...] list-model-weights   --model=PATH    CONFIG
  lgid [-v...] list-mentions          CONFIG INFILE...
  lgid [-v...] count-mentions         CONFIG INFILE...
  lgid [-v...] find-common-codes      CONFIG INFILE...
  lgid [-v...] download-crubadan-data CONFIG
  lgid [-v...] build-odin-lm          CONFIG

Try lgid.sh --help for more usage information.

Examples of Usage

Test the performance of the prebuilt model on the included sample input:

./lgid.sh -v test --model=model/sample_model config.ini sample/input/*

Classify the sample input and output the files with predicted languages marked:

./lgid.sh -v classify --model=model/sample_model --out=sample/output config.ini sample/input/*

Code and Resources

Here is an overview of the important files in the repository:

lgid
├── setup-env.sh        # first-time setup
├── lgid.sh             # main script for running the program
├── config.ini          # configuration parameters
├── lgid                # source code
│   ├── analyzers.py    # functions for extracting info from input documents
│   ├── buildlms.py     # functions for building language model files out of ODIN data
│   ├── features.py     # functions for activating model features
│   ├── main.py         # main process; entry point for the lgid.sh command
│   ├── models.py       # abstraction of the machine learning model(s)
│   └── util.py         # utility functions for reading/transforming resources
├── res                 # resources for running the program
│   ├── Crubadan.csv                     # index file for downloading Crúbadán data
│   ├── lang_table.txt                   # language name-code mapping
│   ├── common_codes.txt                 # code most commonly paired with each language name
│   ├── english_word_language_names.txt  # language names that are also English words
│   ├── crubadan_directory_index.csv     # which directory holds Crúbadán data for each language
│   ├── language_index.txt               # each language and its ID
│   ├── word_index.txt                   # each word present in a language name and its ID
│   └── word_language_mapping.txt        # which languages (by ID) each language-name word appears in
├── sample              # results from sample runs
└── test                # files for testing the program
    ├── mentions_gold_output.txt         # gold-standard output of list-mentions on mentions_test.freki
    ├── mentions_single_gold_output.txt  # gold-standard output of list-mentions with single-longest-mention = yes
    └── mentions_test.freki              # freki file for testing list-mentions

In the repository, the lgid/ subdirectory contains all code for data analysis, model building, and document classification. The res/ subdirectory contains resource files used in model-building. Only static files that we have rights to, like the language name-code mapping table, should be checked in. Other resources, like the compiled language model or Crúbadán data, may reside here on a local machine, but they should not be committed to the remote repository.

File Formats

All of the functions that take INFILE as an argument expect the file or files to be in Freki format. The classify function produces Freki files as output.

The build-odin-lm function expects its input files (location specified in the config file) to be in the Xigt format, version 2.1.

The ODIN language model files have one ngram on each line, with the format <ngram>\t<count>. There are no special symbols used for beginning or end of line. Each file contains ngrams for all values of n, 1-3 for characters and 1-2 for words. The morpheme language models are built using the word data.

The Crúbadán language model files have one ngram on each line, with the format <ngram> <count>. The \n character is used to indicate beginning or end of line for word ngrams. The < and > characters are used for the beginning and end of word, respectively, for character ngrams. Each file contains ngrams for only one value of n. The Crúbadán language models have only trigrams for characters and both unigrams and bigrams for words.
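
As a reading sketch of these two layouts only (the helper names here are made up; the package's own loading code may differ):

def read_odin_lm(path):
    # ODIN layout: one ngram per line, separated from its count by a tab.
    counts = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            ngram, count = line.rstrip('\n').rsplit('\t', 1)
            counts[ngram] = int(count)
    return counts

def read_crubadan_lm(path):
    # Crúbadán layout: one ngram per line, separated from its count by a space;
    # '<' and '>' mark word boundaries in character ngrams, as described above.
    counts = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            ngram, count = line.rstrip('\n').rsplit(' ', 1)
            counts[ngram] = int(count)
    return counts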

Configuration

The config.ini file contains parameters for managing builds of the model.
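
The sections and keys are described below; as a rough sketch of the file's shape (the keys appear in the tables that follow, but the paths and values here are illustrative, not the shipped defaults):

[locations]
lgid-dir = /home/user/lgid
language-table = /home/user/lgid/res/lang_table.txt
odin-language-model = /home/user/lgid/models/odin-lms

[parameters]
window-size = 10
close-window-size = 3
mention-capitalization = lower

[features]
GL-first-lines = yes
W-prev = yes
L-LMc = yes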

The [locations] section contains paths for finding various resources.

location                    description
lgid-dir                    location of the system files on disk
language-table              location of the master language table
most-common-codes           location of the table of most common codes for each language
english-word-names          location of the list of languages whose names are also English words
word-index                  location of the file mapping words to IDs
language-index              location of the file mapping languages to IDs
word-language-mapping       location of the file mapping words to the languages they appear in
odin-source                 location of the Xigt files used to build ODIN language models
odin-language-model         directory containing the ODIN language model files
crubadan-index              location of the file containing the index and download location for Crúbadán language model data
crubadan-base-uri           base URL from which Crúbadán data files are downloaded
crubadan-language-model     directory containing the Crúbadán language model
crubadan-directory-index    location of the table tracking the location of Crúbadán language model data for each language
classify-error-file         text file to which language-id errors are written

The [parameters] section contains parameters for modifying the behavior of feature functions. Available parameters are described below:

parameter                             description
window-size                           number of lines before an IGT to consider
after-window-size                     number of lines after an IGT to consider
close-window-size                     smaller window before an IGT
after-close-window-size               smaller window after an IGT
word-n-gram-size                      number of tokens in word-lm n-grams
morpheme-n-gram-size                  number of tokens in morpheme-lm n-grams
character-n-gram-size                 number of chars in character-lm n-grams
crubadan-char-size                    number of chars in Crúbadán character-lm n-grams
crubadan-word-size                    number of tokens in Crúbadán word-lm n-grams
morpheme-delimiter                    regular expression for tokenizing morphemes
frequent-mention-threshold            minimum window mentions to be "frequent"
after-frequent-mention-threshold      min. mentions after an IGT to be "frequent"
article-frequent-mention-threshold    min. mentions in document to be "frequent"
mention-capitalization                case-normalization for language mentions
short-name-size                       a language name shorter than or equal to this length is flagged, as very short names are often false-positive mentions
single-longest-mention                return all mentions in a given span or only the single longest one; yes for on, anything else for off
code-only-odin-lms                    use ODIN language models for code+language or just the code; yes for on, anything else for off

The [features] section has boolean flags for turning on/off specific features. The value yes turns the feature on, any other value turns it off. The available features are:

feature name                description
GL-first-lines              language mentioned in the first window of the document
GL-last-lines               language mentioned in the last window of the document
GL-frequent                 language mentioned N+ times in the document
GL-most-frequent            language is the most frequently mentioned one
GL-most-frequent-code       code is the most frequent one paired with the language
GL-possible-english-word    language name is possibly an English word or name
GL-short-lang-name          language name is shorter than short-name-size and may be a false positive because it occurs as a word in some language
GL-is-english               language is English
GL-multi-word-name          language name is multiple words
W-prev                      language mentioned within the IGT's preceding window
W-close                     language mentioned within a smaller preceding window
W-closest                   language is closest to the IGT in the preceding window
W-frequent                  language mentioned N+ times in the preceding window
W-after                     language mentioned within the IGT's following window
W-close-after               language mentioned within a smaller following window
W-closest-after             language is closest to the IGT in the following window
W-frequent-after            language mentioned N+ times in the following window
L-in-line                   language mentioned in the IGT's language line
G-in-line                   language mentioned in the IGT's gloss line
T-in-line                   language mentioned in the IGT's translation line
M-in-line                   language mentioned in the IGT's meta lines
L-LMw                       more than M% of word ngrams occur in the training data, using ODIN data
L-LMm                       more than M% of morpheme ngrams occur in the training data, using ODIN data
L-LMc                       more than M% of character ngrams occur in the training data, using ODIN data
L-CR-LMw                    same as L-LMw, but using Crúbadán data
L-CR-LMc                    same as L-LMc, but using Crúbadán data
G-overlap                   at least M% of gloss tokens occur in the training data (not implemented)
W-prevclass                 language is predicted for the previous IGT (not implemented)

Note that the features have prefixes that group them into categories. The categories are:

feature prefix    description
GL-               feature is relevant globally
W-                feature is relevant within a window
L-                feature is relevant for the language line
G-                feature is relevant for the gloss line
T-                feature is relevant for the translation line
M-                feature is relevant for a meta line
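
To connect these features back to the classifier described above: each (IGT, candidate language) pair gets a vector of such boolean features. A toy sketch of that assembly, using feature names from the table (the attribute names, thresholds, and computations here are assumptions, not the package's actual code):

def extract_features(igt, lang, mentions, window_size):
    # `mentions` holds (line_number, language_name) pairs found in the document;
    # `igt.start_line` marks where the example begins (attribute name assumed).
    preceding = [name for line, name in mentions
                 if 0 <= igt.start_line - line <= window_size]
    feats = {
        'W-prev': lang in preceding,               # mentioned in the preceding window
        'W-frequent': preceding.count(lang) >= 2,  # threshold illustrative
        'GL-is-english': lang.lower() == 'english',
        'GL-multi-word-name': ' ' in lang,
    }
    # Fixed ordering so every pair yields a vector of the same length.
    return [int(feats[name]) for name in sorted(feats)]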