subword-nmt - GitHub Issues

528031207 commented 5 years ago

Why doesn't subword-nmt get-vocab --input tmp/raw-train.zh-en.en --output en.vocab produce any output for me? Is it a version incompatibility?

yanwii commented 5 years ago

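For English tokenization, I'd suggest taking a look at subword-nmt, and also switching the problem to the native SubwordEncoder. Done that way, Chinese-to-English results get a lot better.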
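
For readers unfamiliar with what subword-nmt learns, here is a toy sketch of the byte-pair-encoding merge loop. It is an illustration of the general BPE idea only, not subword-nmt's actual implementation or API; the function names (`get_pair_counts`, `merge_pair`, `learn_bpe`), the `</w>` end-of-word marker, and the tiny word-frequency table are all assumptions made for the example.

```python
import collections
import re


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of the symbol pair with one merged symbol."""
    bigram = re.escape(" ".join(pair))
    # Lookarounds make sure we only match whole symbols, not symbol suffixes.
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(merged, word): freq for word, freq in vocab.items()}


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge operations from a word-frequency dict."""
    # Start from characters, with a hypothetical end-of-word marker.
    vocab = {" ".join(word) + " </w>": freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair; ties go to the first seen
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges, vocab


if __name__ == "__main__":
    toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
    learned, segmented = learn_bpe(toy_counts, 5)
    print(learned)
    print(segmented)
```

The number of merges is what bounds the subword vocabulary size, which is why BPE makes the English side easier to control than a plain word vocabulary.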

528031207 commented 5 years ago

Thanks, I'll give it a try. I've been fiddling with this model for days without any progress, so your suggestion is a big help!

528031207 commented 5 years ago

English-to-Chinese trains fine with the training code, but Chinese-to-English always fails with the error below, even when I set batch_size to 128. How should I handle this?

(0) Resource exhausted: Ran out of GPU memory when allocating 688855104 bytes for [[{{node transformer/parallel_0_5/transformer/transformer/padded_cross_entropy/smoothing_cross_entropy/softmax_cross_entropy_with_logits}}]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[training/control_dependency/_6751]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

528031207 commented 5 years ago

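I changed --decode_hparams="batch_size=1024" to --hparams="batch_size=1024" and now it runs normally. What is the difference between these two parameters, and how do they affect the results? Each 100 steps now takes half the time it did before.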

yanwii commented 5 years ago

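hparams should be the training-time parameters, and decode_hparams the decoding-time parameters. The default batch_size is 2048, so the time is now halved. Also, OOM means you ran out of GPU memory.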
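
To make the link between batch_size and the OOM more concrete, here is a rough back-of-the-envelope sketch. It assumes that batch_size counts subword tokens per batch (as I understand tensor2tensor does for text problems) and that the softmax cross-entropy step holds a float32 tensor of roughly tokens × vocabulary size. The 50,000 vocabulary size is a guess based on the "sub50k" in the problem name, and the helper names are made up for this example. It ignores activations, gradients, optimizer state and padding, so real usage is much higher.

```python
def logits_bytes(batch_tokens, vocab_size, bytes_per_value=4):
    """Rough size in bytes of one [batch_tokens, vocab_size] float tensor."""
    return batch_tokens * vocab_size * bytes_per_value


def to_mib(num_bytes):
    """Convert bytes to mebibytes."""
    return num_bytes / (1024 ** 2)


if __name__ == "__main__":
    assumed_vocab = 50_000  # assumption, not taken from the thread
    for tokens in (2048, 1024, 512):
        size = logits_bytes(tokens, assumed_vocab)
        print(f"batch_size={tokens}: about {to_mib(size):.0f} MiB per logits-sized tensor")
```

Under these assumptions the estimate scales linearly with batch_size, so halving the training batch_size halves this part of the memory footprint as well as the work per step.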

528031207 commented 5 years ago

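Using the approach you provided, English-to-Chinese gives very good results. Why does the loss for Chinese-to-English get stuck at around 4.9 and stop decreasing? BLEU is also below 2. Here are my parameters:

export CUDA_VISIBLE_DEVICES=0
t2t-trainer --data_dir=data --output_dir=model_rev --problem=translate_enzh_sub50k_rev --model=transformer --hparams_set=transformer_big --train_steps=200000 --eval_steps=100 --t2t_usr_dir=user_dir --tmp_dir=tmp/ --hparams="batch_size=2048" --worker_gpu_memory_fraction=0.92 --decode_hparams="batch_size=1024"

When I lower the learning rate, it goes OOM, and it still goes OOM even with the batch size reduced to 512. Is something wrong somewhere? Any pointers would be much appreciated!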

yanwii commented 5 years ago

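Chinese-to-English performance depends heavily on your tokenization. At the time, I split Chinese by character and used BPE for English. The loss didn't go down much and BLEU was around 20, but in actual testing the results were still decent.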

cfwin commented 5 years ago

Why not use jieba for word segmentation? Does it give poor results?

yanwii commented 5 years ago

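With Chinese word segmentation the vocabulary size is hard to control and it's easy to run into OOV; going character by character works much better.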
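
A minimal sketch of what splitting by character could look like, assuming every non-whitespace character simply becomes its own token. The function name `split_by_char` is made up for this example. The thread does not say how embedded Latin words, digits or URLs were handled, so here they are split into single characters too.

```python
def split_by_char(line):
    """Return the line as space-separated characters, dropping whitespace."""
    return " ".join(ch for ch in line if not ch.isspace())


def char_vocab(lines):
    """Collect the set of distinct characters seen across all lines."""
    vocab = set()
    for line in lines:
        vocab.update(split_by_char(line).split())
    return vocab


if __name__ == "__main__":
    corpus = ["我爱机器翻译", "机器翻译很有趣"]
    for sentence in corpus:
        print(split_by_char(sentence))
    print(len(char_vocab(corpus)))
```

The vocabulary is then bounded by the number of distinct characters in the corpus, rather than by the open-ended set of words a segmenter can produce.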

cfwin commented 4 years ago

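How did you preprocess English words and special fields (such as URLs) that appear in the Chinese text? Do they need any handling? And when the translation system is translating online, how should OOV be handled?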

yanwii / machine-translation

subword-nmt #2