How to reproduce the performance on the ReDial dataset?

dandyxxxx commented 1 year ago

I trained according to the code provided on GitHub, but since the dataset link you provided cannot be opened, I used mapping based objects Lang=en 202112.ttl dataset. The final results of my training are as follows:

conv: 'test/dist@2': 0.310709750246931, 'test/dist@3': 0.49851841399746016, 'test/dist@4': 0.6383519119514605 rec: 'test/recall@1': 0.029324894514767934, 'test/recall@10': 0.16729957805907172, 'test/recall@50': 0.37953586497890296

(1)These results differ greatly from the results presented in the paper. Can you give me some guidance? I hope to reproduce results similar to yours. Thank you very much. (2)According to your paper, do I need to set --n_prefix_conv 50 in the train_conv.py and --use_resp in the train_rec. py?

linshan-79 commented 11 months ago

I also encountered the same problem as you. Since the author provided the missing files, I trained according to the guidance. However the metric results also similar to yours. Here are the detail of redial dataset metrics:

conv:

'test/dist@2': 0.26710879074361504, 'test/dist@3': 0.4199238041484408, 'test/dist@4': 0.5233526174686045,

rec:

'test/recall@1': 0.035443037974683546, 'test/recall@10': 0.1729957805907173, 'test/recall@50': 0.3744725738396624,

Here is my config of conversational task:

accelerate launch train_conv.py \
           --dataset redial \
           --tokenizer ~/model/DialoGPT-small \
           --model ~/model/DialoGPT-small \
           --text_tokenizer ~/model/roberta-base \
           --text_encoder ~/model/roberta-base \
           --n_prefix_conv 50 \
           --prompt_encoder ${prompt_encoder_dir}/final \
           --num_train_epochs 10 \
           --gradient_accumulation_steps 1 \
           --ignore_pad_token_for_loss \
           --per_device_train_batch_size 8 \
           --per_device_eval_batch_size 16 \
           --num_warmup_steps 6345 \
           --context_max_length 200 \
           --resp_max_length 183 \
           --prompt_max_length 200 \
           --entity_max_length 32 \
           --learning_rate 1e-4 \
           --output_dir ${output_dir} \
           --log_all

(1)@dandyxxxx, you can see that I set 'n_prefix_conv=50,' but the results don't match the paper. Could you share your configuration details? Maybe we can work together to solve the problem. Thank you very much! (2)@wxl1999 Thanks for your work! I learned a lot from your paper and code as a beginner. I'm thinking the issue might be related to the kg module not correctly capturing relations from the dataset. Could you provide some guidance? Thank you very much!

wxl1999 commented 11 months ago

Sorry for the late reply!

The pre-training stage is very important for the final performance. You should observe a very good performance since the answer is actually provided in the response.
- If your pre-training is not so good, you cannot observe a continuous drop in the loss curve.
Once your pre-training is well conducted, you will observe similar performance for the recommendation task with fine-tuning.
As for the conversation task, since distinct is not a very reliable metric (you can observe continuous performance gain if you do not stop training), I suggest you do not pay too much attention to this, but focus more on human evaluation. This is also the practice for large language models.
About the evaluation for conversational recommendation, you can also refer to this paper: Rethinking the Evaluation for Conversational Recommendation in the Era of Large Language Models

Hope this can help you!

linshan-79 commented 11 months ago

Thanks for your replying! This help me a lot.

careerists commented 10 months ago

@linshan-79 I have the same problem. Did you finally solve it? Thank you so much.

wxl1999 / UniCRS

How to reproduce the performance on the ReDial dataset? #8