Scripts and noise data for Belinkov & Bisk, "Synthetic and Natural Noise Both Break Neural Machine Translation", ICLR 2018.
The experiments reported in the paper are conducted on the TED talks corpus prepared for IWSLT 2016, which is available on the WIT3 website.
Nematus: http://data.statmt.org/rsennrich/wmt16_systems/
char2char: https://github.com/nyu-dl/dl4mt-c2c
French:
Aurélien Max and Guillaume Wisniewski. Mining Naturally-occurring Corrections and Paraphrases from Wikipedia's Revision History. LREC 2010. (corpus)
German:
Katrin Wisniewski et al. MERLIN: An online trilingual learner corpus empirically grounding the European Reference Levels in authentic learner data. 2013. (corpus1, corpus2)
Czech:
Karel Šebesta et al. CzeSL Grammatical Error Correction Dataset (CzeSL-GEC). Technical report, LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University, 2017. (corpus)