yunsukim86/wbw-lm

Context-aware Beam Search for Unsupervised Word-by-Word Translation

This code implements a simple beam search that combines cross-lingual word embeddings with a language model. It is compatible with MUSE embeddings and kenlm language models. The output translation can then be fed to a denoising autoencoder for improved reordering.
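To illustrate the idea, here is a minimal sketch of a word-by-word beam search that mixes a lexicon score with a language model score. It is not the code from this repository: the candidate table, the toy bigram model, the `lm_weight` interpolation, and all function names are assumptions made for the example. In practice the candidates would come from nearest neighbors in the cross-lingual embedding space and the language model scores from kenlm.

```python
import math

# Hypothetical candidate translations with lexicon probabilities.
# In a real setup these would be derived from embedding similarities.
CANDIDATES = {
    "die": [("the", 0.9), ("that", 0.1)],
    "Bank": [("bench", 0.55), ("bank", 0.45)],
    "schloss": [("closed", 0.6), ("castle", 0.4)],
}

# Toy bigram probabilities standing in for a real language model.
BIGRAMS = {
    ("<s>", "the"): 0.5,
    ("the", "bank"): 0.3,
    ("the", "bench"): 0.05,
    ("bank", "closed"): 0.4,
    ("bench", "closed"): 0.01,
}


def toy_lm_logprob(history, word):
    """Log probability of `word` given the previous target word."""
    prev = history[-1] if history else "<s>"
    return math.log(BIGRAMS.get((prev, word), 1e-4))


def beam_search(source, candidates, lm_logprob, beam_size=3, lm_weight=0.5):
    """Translate word by word, keeping the `beam_size` best hypotheses."""
    beam = [((), 0.0)]  # (target words so far, accumulated score)
    for src_word in source:
        # Source words without candidates are copied through unchanged.
        options = candidates.get(src_word, [(src_word, 1.0)])
        expanded = []
        for history, score in beam:
            for tgt_word, lex_prob in options:
                # Interpolate lexicon and language model log scores.
                step = ((1.0 - lm_weight) * math.log(lex_prob)
                        + lm_weight * lm_logprob(history, tgt_word))
                expanded.append((history + (tgt_word,), score + step))
        expanded.sort(key=lambda hyp: hyp[1], reverse=True)
        beam = expanded[:beam_size]
    return list(beam[0][0])


if __name__ == "__main__":
    sentence = ["die", "Bank", "schloss"]
    print(beam_search(sentence, CANDIDATES, toy_lm_logprob, lm_weight=0.0))
    print(beam_search(sentence, CANDIDATES, toy_lm_logprob, lm_weight=0.5))
```

With `lm_weight=0.0` the search relies on the lexicon alone and picks "bench" for "Bank". With `lm_weight=0.5` the context from the toy language model favors "bank" instead.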

If you use this code, please cite:

Yunsu Kim, Jiahui Geng and Hermann Ney. Improving Unsupervised Word-by-Word Translation Using Language Model and Denoising Autoencoder. EMNLP 2018.
Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer and Hervé Jégou. Word Translation Without Parallel Data. arXiv preprint.

If you are looking for the denoising autoencoder, please go to sockeye-noise.

Installation

First, please install all dependencies:

Python 2/3 with NumPy/SciPy
PyTorch
Faiss (recommended) for fast nearest neighbor search (CPU or GPU).
kenlm (with Python bindings)
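
A quick way to check whether the dependencies above are importable is sketched below. The import names (`numpy`, `scipy`, `torch`, `faiss`, `kenlm`) are the usual ones for these packages but are assumed here, and the helper `missing_modules` is hypothetical. The sketch uses `importlib.util.find_spec`, so it requires Python 3.

```python
import importlib.util

# Import names assumed for the dependencies listed above.
REQUIRED = ["numpy", "scipy", "torch", "kenlm"]
OPTIONAL = ["faiss"]  # recommended, not mandatory


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names
            if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    for label, names in (("required", REQUIRED), ("optional", OPTIONAL)):
        missing = missing_modules(names)
        status = ", ".join(missing) if missing else "all found"
        print("{} modules missing: {}".format(label, status))
```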

Then clone this repository.

Usage

Here is a simple example for translation:

> cat {input_corpus} | python translate.py --src_emb {source_embedding} \
                                           --tgt_emb {target_embedding} \
                                           --emb_dim {embedding_dimension} \
                                           --lm {language_model} > {output_translation}

Please refer to the help message (-h) for other detailed options.
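
The same command can also be driven from Python, for example when translating several files in a loop. The sketch below is an assumption-laden convenience wrapper, not part of this repository: `build_command` and `translate_file` are hypothetical helpers, and only the four flags shown in the example above are used. It assumes `translate.py` is in the current working directory.

```python
import subprocess


def build_command(src_emb, tgt_emb, emb_dim, lm):
    """Build the argument list for translate.py from the documented flags."""
    return [
        "python", "translate.py",
        "--src_emb", src_emb,
        "--tgt_emb", tgt_emb,
        "--emb_dim", str(emb_dim),
        "--lm", lm,
    ]


def translate_file(input_path, output_path, **options):
    """Pipe input_path through translate.py and write stdout to output_path."""
    with open(input_path, "rb") as fin, open(output_path, "wb") as fout:
        subprocess.run(build_command(**options),
                       stdin=fin, stdout=fout, check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.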