This repository contains the code for the WMT14 translation experiments in the paper Mixture of Attention Heads: Selecting Attention Heads Per Token.
Python 3, PyTorch, and fairseq are required for the current codebase.
Install PyTorch and fairseq.
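A typical pip-based installation would look like the following; the repository does not pin specific versions, so adjust them to match your environment if needed.
pip install torch
pip install fairseq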
Generate the WMT14 translation dataset with Transformer Clinic.
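After tokenization and BPE following the Transformer Clinic instructions, the data still needs to be binarized for fairseq. A rough sketch of that step is below; the file prefixes and destination directory are placeholders, not paths used by this repository.
fairseq-preprocess --source-lang en --target-lang de \
    --trainpref /path/to/wmt14/train --validpref /path/to/wmt14/valid --testpref /path/to/wmt14/test \
    --destdir data-bin/wmt14_en_de --joined-dictionary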
Scripts and commands
Train the translation model
sh run.sh /path/to/your/data
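run.sh is expected to wrap a fairseq training call on the binarized data. The sketch below only illustrates the general shape of such a call; the architecture name, hyperparameters, and any MoA-specific arguments are assumptions and not the actual contents of the script.
fairseq-train /path/to/your/data \
    --arch transformer_wmt_en_de \
    --optimizer adam --adam-betas '(0.9, 0.98)' --lr 7e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 4096 --save-dir checkpoints/moa_wmt14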
Evaluate on the WMT14 test set
sh test.sh /path/to/checkpoint
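test.sh presumably runs fairseq generation with the given checkpoint and reports BLEU. A hedged sketch of that kind of call is shown below; the data path, beam size, and post-processing flags are assumptions, not the script's actual settings.
fairseq-generate data-bin/wmt14_en_de \
    --path /path/to/checkpoint \
    --beam 4 --lenpen 0.6 --remove-bpe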
With the default settings, MoA achieves a BLEU score of approximately 28.4 on the WMT14 EN-DE test set.