pre-process scripts for the WMT14 dataset or sentencepiece model

zomux / lanmt

LaNMT: Latent-variable Non-autoregressive Neural Machine Translation with Deterministic Inference

MIT License

79 stars 4 forks

pre-process scripts for the WMT14 dataset or sentencepiece model #1

Closed BogdanDidenko closed 4 years ago

BogdanDidenko commented 4 years ago

Thank you for the great research! Could you please provide the tokenization scripts that you used to pre-process the dataset? I want to try evaluating my own data with your pre-trained model.

zomux commented 4 years ago

Let me check ...

zomux commented 4 years ago

You mean the English-to-German dataset, right? I'm using SentencePiece for that one. Let me copy the command.

zomux commented 4 years ago

Oh, I'm just using the OpenNMT procedure.

Please check here: https://github.com/OpenNMT/OpenNMT-tf/tree/master/scripts/wmt

BogdanDidenko commented 4 years ago

I'm just looking for a way to convert text like this:

Schulen werden zu größerem Fokus auf Mathematik, Rechtschreibung und Grammatik angehalten

into this:

▁Schulen ▁werden ▁zu ▁größere m ▁F okus ▁auf ▁Mathematik , ▁Recht schreibung ▁und ▁Gramm atik ▁an gehalten

As I understand it, "./prepare_data.sh raw_data" trains a new sentencepiece model. So if I want to convert my own data in the same way, I should use the model file with the '.model' extension (something like data/wmtende.model).

zomux commented 4 years ago

@BogdanDidenko Yes, you are right. I'm also using wmtende.model for conversion.

BogdanDidenko commented 4 years ago

> @BogdanDidenko Yes you are right, I'm also using wmtende.model for conversion.

Could you please publish it or add it to the gdown.pl download step?

zomux commented 4 years ago

@BogdanDidenko I'm uploading the model to the GitHub repo now; please wait a few minutes.

zomux commented 4 years ago

@BogdanDidenko I just uploaded the SentencePiece model for WMT14. You can find it here: https://github.com/zomux/lanmt/blob/master/mydata/preprocessing/wmtende.model

I'm now testing it to see whether it works correctly.

zomux commented 4 years ago

@BogdanDidenko Okay, the model is working. The commands for segmentation are:

spm_encode --model=wmtende.model --output_format=piece < text.en > text.en.sp
spm_encode --model=wmtende.model --output_format=piece < text.de > text.de.sp

The same model can be used for both languages.
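
For anyone who needs to go the other way (segmented pieces back to plain text, e.g. to read model output), here is a minimal sketch of what the piece format encodes. It is not the SentencePiece decoder and not part of lanmt: `desegment` is a hypothetical helper name, and it assumes pieces are separated by single spaces with "▁" marking the start of each word, as in the example earlier in this thread. With SentencePiece installed, `spm_decode` is the proper tool (shown as a comment).

```shell
#!/bin/sh
# With SentencePiece installed, the library's own decoder would be used instead:
#   spm_decode --model=wmtende.model --input_format=piece < text.de.sp > text.de

# Hypothetical helper (an illustration only): reverse piece-format segmentation.
#   1. delete the spaces that separate pieces
#   2. turn every "▁" word-start marker back into a space
#   3. drop the single leading space left by the first marker
desegment() {
  sed -e 's/ //g' -e 's/▁/ /g' -e 's/^ //'
}

# Example input taken from the thread above.
printf '%s\n' '▁Schulen ▁werden ▁zu ▁größere m ▁F okus ▁auf ▁Mathematik , ▁Recht schreibung ▁und ▁Gramm atik ▁an gehalten' | desegment
```

Run on the example sentence, this should print the original German text. It only illustrates why the segmentation is lossless; for real data, `spm_decode` with the same wmtende.model is the safer choice.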