zhanghainan / ReCoSa

ReCoSa: Detecting the Relevant Contexts with Self-Attention for Multi-turn Dialogue Generation

Run the code on DailyDialog but have terrible result #5

Closed gmftbyGMFTBY closed 4 years ago

gmftbyGMFTBY commented 5 years ago

Hi, thanks for open-sourcing the code for this work. I tried to apply it to a new dataset, DailyDialog, but the model's outputs are all the token '.', which is meaningless.

So I'm curious whether this code is appropriate for other datasets. Can you help me troubleshoot the issue?

zhanghainan commented 4 years ago

It is suitable for other datasets. You can change the data input process in data_load.py.

gmftbyGMFTBY commented 4 years ago

Hi, thanks for your response. I have a question about reproducing ReCoSa. After analyzing the model architecture, I think the main difference between ReCoSa and HRED-based hierarchical models is the Transformer-based context encoder.
Can I simply think of ReCoSa as replacing the RNN context encoder with a Transformer?

I look forward to your response. Thank you very much.

zhanghainan commented 4 years ago

Yes. ReCoSa adds an LSTM utterance encoder in front of the Transformer.
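
That design can be sketched in a few lines of PyTorch, as a minimal illustration with made-up module and dimension names (this is not the code in this repo): each utterance is encoded by a word-level LSTM, and self-attention then runs over the resulting utterance vectors to pick out the relevant contexts.

```python
import torch
import torch.nn as nn

class ReCoSaStyleEncoder(nn.Module):
    """Word-level LSTM per utterance, then Transformer self-attention
    over the utterance representations (the 'relevant context' step)."""
    def __init__(self, vocab_size, emb_dim=256, hidden=256, heads=8, layers=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.utt_rnn = nn.LSTM(emb_dim, hidden, batch_first=True)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=heads)
        self.ctx_attn = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, context):
        # context: (batch, n_turns, seq_len) token ids
        b, t, s = context.shape
        words = self.emb(context.view(b * t, s))      # (b*t, s, emb)
        _, (h, _) = self.utt_rnn(words)               # h: (1, b*t, hidden)
        utts = h.squeeze(0).view(b, t, -1)            # (b, t, hidden)
        # nn.TransformerEncoder expects (turns, batch, hidden)
        ctx = self.ctx_attn(utts.transpose(0, 1))     # (t, b, hidden)
        return ctx.transpose(0, 1)                    # (b, t, hidden)

enc = ReCoSaStyleEncoder(vocab_size=1000)
out = enc(torch.randint(0, 1000, (4, 3, 10)))  # 4 dialogues, 3 turns, 10 tokens
print(out.shape)  # torch.Size([4, 3, 256])
```

Note that the paper also adds turn-level positional embeddings before the self-attention, which this sketch omits.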

ttzhang511 commented 4 years ago

> Hi, thanks for open-sourcing the code for this work. I tried to apply it to a new dataset, DailyDialog, but the model's outputs are all the token '.', which is meaningless.
>
> So I'm curious whether this code is appropriate for other datasets. Can you help me troubleshoot the issue?

I have the same problem, and I have already changed the data input process in data_load.py. How can I solve it?

zhanghainan commented 4 years ago

Have you run `python propro.py` to build the vocab? At line 216 of train.py, you can print x and y to check that they match the vocab in the preprocessed folder.
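
For readers hitting the same problem, one quick way to do that check is to decode the id tensors back to tokens (the `idx2word` mapping below is hypothetical; load the real vocab from the preprocessed folder instead):

```python
# Decode a batch of id sequences back to tokens to verify that x and y
# line up with the vocabulary built by the preprocessing script.
def ids_to_tokens(batch_ids, idx2word, pad_id=0):
    return [[idx2word.get(i, "<UNK>") for i in seq if i != pad_id]
            for seq in batch_ids]

# Hypothetical toy vocab; in the repo it would be loaded from the
# preprocessed folder instead.
idx2word = {0: "<PAD>", 1: "<S>", 2: "</S>", 3: "hello", 4: "world"}
x = [[1, 3, 4, 2, 0, 0]]
print(ids_to_tokens(x, idx2word))  # [['<S>', 'hello', 'world', '</S>']]
```

If the decoded sequences are mostly `<UNK>` or look scrambled, the vocab and the data files are out of sync.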

ttzhang511 commented 4 years ago

Yes, x and y are correct, but the preds look like this: (image)

gmftbyGMFTBY commented 4 years ago

Oh, I reproduced ReCoSa on the DailyDialog and Cornell datasets, and the performance is slightly worse than HRED with an attention module.

It should be noted that the performance of the code in this repo is poor, so I reproduced ReCoSa from your paper instead (replacing the RNN context encoder with a Transformer; the Transformer implementation is borrowed from PyTorch 1.3).

zhanghainan commented 4 years ago

> Yes, x and y are correct, but the preds look like this: (image)

Could you give me 100 lines of your training data? I will run on it to see the problem.

zhanghainan commented 4 years ago

> Oh, I reproduced ReCoSa on the DailyDialog and Cornell datasets, and the performance is slightly worse than HRED with an attention module.

Maybe the characteristics of the datasets are different. JDC is a customer-service dataset with a topic-drift phenomenon.

gmftbyGMFTBY commented 4 years ago

Thank you for your response; I agree with you. I will try to tune the hyperparameters of my implementation and find a better setting.

ttzhang511 commented 4 years ago

Thank you for your reply! I have just started learning about dialogue generation and found your paper very interesting. My experiments use the Ubuntu dataset, and I am not sure whether this format is correct; below are about 600 lines of data.

ubuntu_train_answer_8.txt ubuntu_train_query_8.txt

zhanghainan commented 4 years ago

> Thank you for your reply! I have just started learning about dialogue generation and found your paper very interesting. My experiments use the Ubuntu dataset, and I am not sure whether this format is correct; below are about 600 lines of data.
>
> ubuntu_train_answer_8.txt ubuntu_train_query_8.txt

The format is correct; you just need to train longer. I tried these 600 lines, and it takes about 100 epochs before the results show; training the full model takes roughly 20,000 epochs in my code. Whether a dialogue generation model has converged is not only a matter of the dev-set metrics. Don't worry about overfitting; train as long as possible to see the effect. The final generation quality will hit a bottleneck, but the dev metrics start fluctuating early while the quality is still improving, mainly because the metrics do not reflect dialogue generation quality well.

zhanghainan commented 4 years ago

> Thank you for your response; I agree with you. I will try to tune the hyperparameters of my implementation and find a better setting.

I have tried ttzhang511's data of about 600 lines; it takes about 100 epochs to train on this small dataset. For my code, it needs at least 20,000 epochs. You could train the model longer, regardless of the dev metrics.

gmftbyGMFTBY commented 4 years ago

Emmm, but the DailyDialog dataset is actually big, so that many epochs would take a very long time to converge.

zhanghainan commented 4 years ago

Yes, you could check the generated sentences after training longer.

gmftbyGMFTBY commented 4 years ago

Now I have run 30 epochs on DailyDialog (Cornell is bigger than DailyDialog). From the performance curves, I found that nearly all the metrics have converged (BLEU-1~4, ROUGE, Dist-1, Dist-2, BERTScore).

gmftbyGMFTBY commented 4 years ago

I will try 100 epochs and analyze the performance, then report the results. Thank you for your response.

ttzhang511 commented 4 years ago

I ran 500 epochs on DailyDialog and got a lot of "I'm sorry" responses. Do I need to run more epochs?

gmftbyGMFTBY commented 4 years ago

I reproduced ReCoSa in PyTorch 1.3, using its official Transformer implementation. I trained ReCoSa for 30 epochs on the DailyDialog dataset, and here are my partial results:

(images 1, 2, 3)

Due to some special reasons, I can't show more information about the comparison. It can be seen that ReCoSa is slightly worse than the baselines on some automatic evaluation metrics. We will use human annotation to test ReCoSa's performance in the future.

@ttzhang511, I hope the comparison is helpful. I will make my repo public in about a month.

gmftbyGMFTBY commented 4 years ago

Oh, I forgot some essential information about my experiments:

  1. Training ReCoSa for 30 epochs took nearly 8.5 hours.
  2. I didn't use the whole DailyDialog dataset, due to some special reasons (so the low BLEU is explicable). If the whole DailyDialog dataset were used, the training time would be even longer.

ttzhang511 commented 4 years ago

Thank you so much! @gmftbyGMFTBY

katherinelyx commented 4 years ago

@gmftbyGMFTBY Hi. When I ran ReCoSa on DailyDialog, I got some frequently occurring bad examples:

  1. 'yes, and and and and and': many repetitions.
  2. 'i i will take it': responses always start with the word 'i'.

Have you seen such problems? Or maybe something is wrong with my setup? Could you please provide some suggestions for handling such issues? Thank you very much.

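
Repetition like this is a common failure mode of greedy decoding in general, not something specific to ReCoSa. One generic mitigation, offered here as an assumption rather than something from the paper, is to ban any token that would recreate an n-gram already present in the generated sequence:

```python
def banned_next_tokens(generated, n=2):
    """Return the tokens that would repeat an n-gram already in `generated`."""
    if len(generated) < n - 1:
        return set()
    # The last n-1 generated tokens form the prefix of the candidate n-gram.
    prefix = tuple(generated[-(n - 1):]) if n > 1 else ()
    banned = set()
    for i in range(len(generated) - n + 1):
        if tuple(generated[i:i + n - 1]) == prefix:
            banned.add(generated[i + n - 1])
    return banned

# 'and' is banned because emitting it would repeat the bigram ('and', 'and').
print(banned_next_tokens(["yes", "and", "and"], n=2))  # {'and'}
```

During decoding, you would set the logits of the banned tokens to -inf before the argmax or sampling step.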
gmftbyGMFTBY commented 4 years ago

Hi, I ran the author's code in this repo, but it didn't work for me, so I reproduced ReCoSa myself in PyTorch (the source code in this repo is written in TensorFlow, but I don't think the issue is caused by the deep learning framework).

In my opinion, which the author has confirmed, ReCoSa simply replaces the RNN-based context encoder with a vanilla Transformer encoder. So in my implementation I use the official Transformer modules in PyTorch 1.3 (you could also try transformers 2.0), and the generation seems to be normal.

But after comparing with HRED and other baselines, I found that ReCoSa is slightly worse than these baselines (although BLEU and other automatic metrics may not be suitable for evaluating open-domain dialogue systems). I also tried some different hyperparameters, but the conclusion is the same. You can try to reproduce ReCoSa yourself and check the result.

katherinelyx commented 4 years ago

Thank you. Do you have any suggestions for a high-quality TensorFlow (<2.0) implementation of the Transformer? I am not sure whether the implementation I used is what causes these bad examples.

gmftbyGMFTBY commented 4 years ago

I'm not very familiar with TensorFlow. Transformers 2.0 (Hugging Face) seems to support TensorFlow, so you could try it.

gmftbyGMFTBY commented 4 years ago

@ttzhang511 @katherinelyx @zhanghainan Hi guys, I have made my repo public; it contains the PyTorch version of ReCoSa and other multi-turn dialogue models. You are welcome to use it. If you have any questions, feel free to open an issue and let me know, and I will do my best to respond. Thank you all; I will close this issue.

Repo: MultiTurnDialogZoo