
Scripts and noise data for Belinkov & Bisk 2018

charNMT-noise

Scripts and noise data for Belinkov & Bisk, "Synthetic and Natural Noise Both Break Neural Machine Translation", ICLR 2018.

MT Data

The experiments reported in the paper were conducted on the TED talks corpus prepared for IWSLT 2016, which is available on the WIT3 website.

Pretrained Models

Nematus: http://data.statmt.org/rsennrich/wmt16_systems/

char2char: https://github.com/nyu-dl/dl4mt-c2c

Sources of Natural Noise

French:

Aurélien Max and Guillaume Wisniewski. "Mining Naturally-occurring Corrections and Paraphrases from Wikipedia's Revision History". LREC 2010. (corpus)

German:

Katrin Wisniewski et al. "MERLIN: an online trilingual learner corpus empirically grounding the European Reference Levels in authentic learner data". 2013. (corpus1, corpus2)

Czech:

Karel Šebesta et al. "CzeSL grammatical error correction dataset (CzeSL-GEC)". Tech report, LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University, 2017. (corpus)
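
One way to use error corpora like these is to build a lookup from correct words to observed misspellings, then replace each word in a clean sentence with one of its recorded errors. The following is a minimal sketch of that idea, not the repository's own script. The one-entry-per-line format ("correct err1 err2 ..."), the function names `load_lexicon` and `add_natural_noise`, and the toy English entries are all assumptions made for illustration.

```python
import random


def load_lexicon(lines):
    """Parse lines of the assumed form 'correct err1 err2 ...' into a dict."""
    lexicon = {}
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:  # skip blank lines and words with no recorded errors
            lexicon.setdefault(parts[0], []).extend(parts[1:])
    return lexicon


def add_natural_noise(sentence, lexicon, rng=random):
    """Replace each whitespace token that has recorded errors with one of them.

    When several errors exist for a word, one is sampled uniformly.
    Tokens without an entry are left unchanged.
    """
    noisy = []
    for token in sentence.split():
        errors = lexicon.get(token)
        noisy.append(rng.choice(errors) if errors else token)
    return " ".join(noisy)


if __name__ == "__main__":
    # Toy entries invented for this example; real entries would come from
    # the corpora listed above.
    toy_entries = ["because becuase becasue", "their thier"]
    lexicon = load_lexicon(toy_entries)
    rng = random.Random(0)  # fixed seed so repeated runs give the same output
    print(add_natural_noise("they left because their train was late", lexicon, rng))
```

Passing a seeded `random.Random` instance keeps the noised test set reproducible across runs, which matters when comparing models on the same noisy input.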