Sorry, I haven't cleaned up the scripts yet (tokenization, sentencepiece, detokenization, etc.).
However, I can outline the pipeline.
We extracted en->{nl, it, ro} from the IWSLT17 bilingual corpora to use as monolingual corpora, since the pairs are parallel.
We used mosesdecoder to tokenize, lowercase, and normalize the text.
We combined the monolingual corpora and trained the sentencepiece model with the hyperparameters mentioned in the paper. By the way, the script is the same as the example provided by Fairseq.
We added an artificial language token at the beginning of each sentence.
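The language-token step can be sketched as a one-liner; note the token format (e.g. <2nl>) is an assumption here, so match whatever tokens your model was trained with:

```shell
# Prepend an artificial target-language token to every source sentence.
# "<2nl>" is an assumed token format, not necessarily the one from the paper.
printf 'hello world\nhow are you\n' | sed 's/^/<2nl> /'
```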
We preprocessed the data with the Fairseq script, without any additional hyperparameters.
When generating, we passed --remove-bpe sentencepiece to fairseq-generate.
After generating the test output, we used mosesdecoder to detokenize.
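Putting the steps together, the pipeline looks roughly like this. Paths, language codes, file names, and the language-token format are illustrative assumptions, and the sentencepiece vocab size should come from the paper's hyperparameters:

```shell
# 1. Normalize, tokenize, and lowercase with mosesdecoder
#    (MOSES path and file names are assumptions)
MOSES=~/mosesdecoder/scripts
$MOSES/tokenizer/normalize-punctuation.perl -l en < train.en \
  | $MOSES/tokenizer/tokenizer.perl -l en \
  | $MOSES/tokenizer/lowercase.perl > train.tok.en

# 2. Train sentencepiece on the combined monolingual corpora,
#    then encode each split with the trained model
cat train.tok.* > combined.txt
spm_train --input=combined.txt --model_prefix=spm --vocab_size=... # use the paper's hyperparameters
spm_encode --model=spm.model < train.tok.en > train.sp.en

# 3. Prepend the artificial language token ("<2nl>" is an assumed format)
sed -i 's/^/<2nl> /' train.sp.en

# 4. Binarize with fairseq, no additional hyperparameters
fairseq-preprocess --source-lang en --target-lang nl \
  --trainpref train.sp --validpref valid.sp --testpref test.sp \
  --destdir data-bin

# 5. Generate with sentencepiece BPE removal, then detokenize the hypotheses
fairseq-generate data-bin --path checkpoint.pt --remove-bpe sentencepiece \
  | grep '^H-' | cut -f3 \
  | $MOSES/tokenizer/detokenizer.perl -l nl > output.detok.nl
```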
Hey, thanks for releasing the code. Would it be possible to provide the scripts for the IWSLT baseline?