Closed BogdanDidenko closed 4 years ago
Let me check ...
You mean the English-to-German dataset, right? I'm using SentencePiece for that one. Let me copy the command.
Oh, I'm just using the OpenNMT procedure.
Please check here: https://github.com/OpenNMT/OpenNMT-tf/tree/master/scripts/wmt
I'm just looking for a way to convert text
like:
Schulen werden zu größerem Fokus auf Mathematik, Rechtschreibung und Grammatik angehalten
->
▁Schulen ▁werden ▁zu ▁größere m ▁F okus ▁auf ▁Mathematik , ▁Recht schreibung ▁und ▁Gramm atik ▁an gehalten
As I understand it, "./prepare_data.sh raw_data" trains a new SentencePiece model. And if I want to convert my own data in the same manner, I should use the model file with the '.model' extension (something like data/wmtende.model).
@BogdanDidenko Yes, you are right, I'm also using wmtende.model for conversion.
Could you please publish it or add it to the gdown.pl download step?
@BogdanDidenko I'm now uploading the model to the GitHub repo, wait a few minutes.
@BogdanDidenko I just uploaded the SentencePiece model for WMT14, you can find it here: https://github.com/zomux/lanmt/blob/master/mydata/preprocessing/wmtende.model
I'm now testing it to see whether it works correctly.
@BogdanDidenko Okay, the model is working. The commands for segmentation are:
spm_encode --model=wmtende.model --output_format=piece < text.en > text.en.sp
spm_encode --model=wmtende.model --output_format=piece < text.de > text.de.sp
The same model can be used for both languages.
Thank you for the great research! Could you please provide the tokenization scripts you used to pre-process the dataset? I want to try evaluating my own data using your pre-trained model.