ybisk / charNMT-noise

Scripts and noise data for Belinkov & Bisk 2018

How is en.natural constructed? #4

Closed by yuanliping 2 years ago

yuanliping commented 2 years ago

The paper 'Synthetic and Natural Noise Both Break Neural Machine Translation' only describes how naturally occurring errors were harvested for French, German, and Czech. How were the naturally occurring errors for English (listed in en.natural) constructed?

boknilev commented 2 years ago

I don't remember where the English natural errors came from (maybe @ybisk does?), but in any case I don't think we ended up using them, since we only insert noise on the source side and we always translate from another language into English.

ybisk commented 2 years ago

Agreed that it wasn't used, but unfortunately I'm having trouble tracking this source down. My best guess is that it is based on Torsten Zesch's work, but I can't find an exact match. I did find some data at https://github.com/zesch/spelling-experiments where individual examples line up. He also has some work based on Wikipedia edits, which might be a good place to look. Sorry!