xlang-ai / UnifiedSKG

[EMNLP 2022] Unifying and multi-tasking structured knowledge grounding with language models
https://arxiv.org/abs/2201.05966
Apache License 2.0

The different results between eval mode and test mode. #26

Open eyuansu62 opened 2 years ago

eyuansu62 commented 2 years ago

Why do I get different results between eval mode and test mode? [image]

ChenWu98 commented 2 years ago

Hi,

Could you share the command you ran for this experiment?

eyuansu62 commented 2 years ago

The command is as follows:

python -m torch.distributed.launch --nproc_per_node 4 --master_port 12 train.py --seed 2 --cfg Salesforce/T5_3b_finetune_spider_with_cell_value.cfg --run_name T5_3b_finetune_spider_with_cell_value --logging_strategy steps --logging_first_step true --logging_steps 4 --evaluation_strategy steps --eval_steps 500 --metric_for_best_model avr --greater_is_better true --save_strategy steps --save_steps 500 --save_total_limit 1 --load_best_model_at_end --gradient_accumulation_steps 8 --num_train_epochs 400 --adafactor true --learning_rate 1e-4 --do_train --do_eval --do_predict --predict_with_generate --output_dir output/T5_large_finetune_spider_with_cell_value --overwrite_output_dir --per_device_train_batch_size 2 --per_device_eval_batch_size 8 --generation_num_beams 1 --generation_max_length 512 --input_max_length 512 --ddp_find_unused_parameters true

ChenWu98 commented 2 years ago

Is the highest eval score the same as the test score?

eyuansu62 commented 2 years ago

The checkpoint I chose is the one with the highest eval score during training. As you can see, that eval score is different from the test score.

ChenWu98 commented 2 years ago

Can you run the following command on the same machine (so that the previous checkpoints are still there) and see whether the results still differ?

python -m torch.distributed.launch --nproc_per_node 4 --master_port 12 train.py --seed 2 --cfg Salesforce/T5_3b_finetune_spider_with_cell_value.cfg --run_name T5_3b_finetune_spider_with_cell_value --logging_strategy steps --logging_first_step true --logging_steps 4 --evaluation_strategy steps --eval_steps 500 --metric_for_best_model avr --greater_is_better true --save_strategy steps --save_steps 500 --save_total_limit 1 --load_best_model_at_end --gradient_accumulation_steps 8 --num_train_epochs 0 --adafactor true --learning_rate 1e-4 --do_train --do_eval --do_predict --predict_with_generate --output_dir output/T5_large_finetune_spider_with_cell_value --per_device_train_batch_size 2 --per_device_eval_batch_size 8 --generation_num_beams 1 --generation_max_length 512 --input_max_length 512 --ddp_find_unused_parameters true

Timothyxxx commented 2 years ago

@eyuansu62 Hi, any new progress on your side? We double-checked our earlier experiment logs and didn't find the discrepancy you showed. We also looked through the PICARD issues and saw that you opened a similar issue there. It is very likely the same problem, caused by the same factor on your machine.

Hope we can figure that out together!

eyuansu62 commented 2 years ago

They are still a little different. [image]

Timothyxxx commented 2 years ago

Could you double-check the evaluation and prediction JSON files? That could help us locate where the problem lies.

eyuansu62 commented 2 years ago

I checked the evaluation and prediction JSON files and found that they are indeed different, whether I set do_train=False or num_train_epochs=0.

The differing SQL queries look like the following; only a few conditions are wrong:

select singer.name from concert join singer_in_concert on concert.concert_id = singer_in_concert.concert_id where concert.year = 2014

select singer.name from concert join singer_in_concert on concert.concert_id = singer_in_concert.singer_id where concert.year = 2014
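A comparison like the one described above can be scripted rather than done by eye. The following is a minimal sketch, not code from the repository: it assumes both JSON files hold a list of per-example records in the same order, and that the generated SQL is stored under a key named `prediction`. The key name and the helper names are hypothetical and would need to be adapted to the actual files.

```python
import json
from typing import Any


def load_records(path: str) -> list[dict[str, Any]]:
    """Load a JSON file that is assumed to hold a list of per-example records."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def diff_predictions(
    records_a: list[dict[str, Any]],
    records_b: list[dict[str, Any]],
    key: str = "prediction",  # hypothetical field name, adjust to the real files
) -> list[tuple[int, Any, Any]]:
    """Return (index, value_a, value_b) for every position where the two runs disagree."""
    if len(records_a) != len(records_b):
        raise ValueError("the two files hold a different number of examples")
    diffs = []
    for i, (a, b) in enumerate(zip(records_a, records_b)):
        if a.get(key) != b.get(key):
            diffs.append((i, a.get(key), b.get(key)))
    return diffs


# In-memory example using the two queries quoted in this thread.
shared = "select count(*) from singer"
query_1 = (
    "select singer.name from concert join singer_in_concert on "
    "concert.concert_id = singer_in_concert.concert_id where concert.year = 2014"
)
query_2 = (
    "select singer.name from concert join singer_in_concert on "
    "concert.concert_id = singer_in_concert.singer_id where concert.year = 2014"
)
run_a = [{"prediction": shared}, {"prediction": query_1}]
run_b = [{"prediction": shared}, {"prediction": query_2}]

for index, value_a, value_b in diff_predictions(run_a, run_b):
    print(f"example {index}:\n  A: {value_a}\n  B: {value_b}")
```

With real files, `load_records` would be called once per file and the two results passed to `diff_predictions`. Sharing the list of differing indices makes it easier for others to check whether the same examples diverge on their machines.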

Timothyxxx commented 2 years ago

Okay, I will keep this issue open and see if anyone else finds a similar problem!

ChenWu98 commented 2 years ago

I just realized that the command you provided is for T5-3b without using deepspeed. I remember that we didn't manage to run without deepspeed even on an A100. What kind of GPU are you using, if you remember?

eyuansu62 commented 2 years ago

Well, it is actually t5-large in this cfg file. I forgot to change the file name.

Timothyxxx commented 2 years ago

Hey, we asked someone else to test it on his side, and he didn't get different results between eval mode and test mode (which is consistent with ours). We therefore think it may be caused by the machine on your side. Could you provide more information about your hardware and system?
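For reference, the kind of hardware and system information requested here can be gathered with a short script. This is a minimal sketch, not part of the repository: the function name `collect_environment` is hypothetical, and the PyTorch fields are only filled in when `torch` is importable.

```python
import platform
import sys


def collect_environment() -> dict:
    """Collect basic system details, plus PyTorch and GPU details when torch is installed."""
    info = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
    }
    try:
        import torch
    except ImportError:
        # torch is not installed; report that and return the system details only
        info["torch"] = None
        return info
    info["torch"] = torch.__version__
    info["cuda_runtime"] = torch.version.cuda  # None for CPU-only builds
    info["cuda_available"] = torch.cuda.is_available()
    info["gpus"] = [
        torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())
    ]
    return info


if __name__ == "__main__":
    for name, value in collect_environment().items():
        print(f"{name}: {value}")
```

PyTorch also ships a more complete report that can be pasted into an issue: `python -m torch.utils.collect_env`.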

eyuansu62 commented 2 years ago

[image] [image]