Sorry, I haven't cleaned up the scripts yet (tokenization, sentencepiece, detokenization, etc.).
However, I can outline the pipeline.
We extracted en->{nl, it, ro} from the IWSLT17 bilingual corpora to use as monolingual corpora, since the pairs are parallel.
We used mosesdecoder to tokenize, lowercase, and normalize the text.
We combined the monolingual corpora and trained the sentencepiece model with the hyperparameters mentioned in the paper. By the way, the script is the same as the example provided by Fairseq.
We added an artificial language token at the beginning of each sentence.
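The language-token step can be sketched as a one-liner; note the token format (e.g. <2nl>) is an assumption here, so match whatever tokens your model was trained with:

```shell
# Prepend an artificial target-language token to every source sentence.
# "<2nl>" is an assumed token format, not necessarily the one from the paper.
printf 'hello world\nhow are you\n' | sed 's/^/<2nl> /'
```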
We preprocessed the data with the Fairseq script, without any additional hyperparameters.
When generating, we passed --remove-bpe sentencepiece to fairseq-generate.
After generating the test output, we used mosesdecoder to detokenize.
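Putting the steps together, the pipeline looks roughly like this. Paths, language codes, file names, and the language-token format are illustrative assumptions, and the sentencepiece vocab size should come from the paper's hyperparameters:

```shell
# 1. Normalize, tokenize, and lowercase with mosesdecoder
#    (MOSES path and file names are assumptions)
MOSES=~/mosesdecoder/scripts
$MOSES/tokenizer/normalize-punctuation.perl -l en < train.en \
  | $MOSES/tokenizer/tokenizer.perl -l en \
  | $MOSES/tokenizer/lowercase.perl > train.tok.en

# 2. Train sentencepiece on the combined monolingual corpora,
#    then encode each split with the trained model
cat train.tok.* > combined.txt
spm_train --input=combined.txt --model_prefix=spm --vocab_size=... # use the paper's hyperparameters
spm_encode --model=spm.model < train.tok.en > train.sp.en

# 3. Prepend the artificial language token ("<2nl>" is an assumed format)
sed -i 's/^/<2nl> /' train.sp.en

# 4. Binarize with fairseq, no additional hyperparameters
fairseq-preprocess --source-lang en --target-lang nl \
  --trainpref train.sp --validpref valid.sp --testpref test.sp \
  --destdir data-bin

# 5. Generate with sentencepiece BPE removal, then detokenize the hypotheses
fairseq-generate data-bin --path checkpoint.pt --remove-bpe sentencepiece \
  | grep '^H-' | cut -f3 \
  | $MOSES/tokenizer/detokenizer.perl -l nl > output.detok.nl
```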
Hey, thanks for releasing the code. Would it be possible to provide the scripts for the IWSLT baseline?