yikangshen / MoA

Mixture of Attention Heads
BSD 3-Clause "New" or "Revised" License

Mixture of Attention Heads

This repository contains the code used for the WMT14 translation experiments in the paper "Mixture of Attention Heads: Selecting Attention Heads Per Token".

Software Requirements

Python 3, fairseq and PyTorch are required for the current codebase.
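Before running the scripts, it can help to confirm that the required packages are importable. The snippet below is a minimal sketch, not part of this repository: the helper name `missing_requirements` is hypothetical, and it assumes the packages are importable under the module names `torch` and `fairseq`. It only checks that the modules can be found; it does not verify versions.

```python
import importlib.util


def missing_requirements(modules=("torch", "fairseq")):
    """Return the names of modules that cannot be found in this environment.

    Assumption: PyTorch and fairseq are importable as `torch` and `fairseq`.
    """
    return [name for name in modules if importlib.util.find_spec(name) is None]


# Report which of the required packages (if any) are not installed.
missing = missing_requirements()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages found.")
```

If the check reports missing packages, install them (step 1 below) before continuing.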

Steps

  1. Install PyTorch and fairseq

  2. Generate the WMT14 translation dataset with Transformer Clinic.

  3. Scripts and commands

    • Train the translation model: sh run.sh /path/to/your/data

    • Test the translation model: sh test.sh /path/to/checkpoint

    With the default settings, MoA achieves a BLEU score of approximately 28.4 on the WMT14 EN-DE test set.