Hi,
Sorry for the wait. We have prepared a guide below to reproduce all the results of our paper; contact us if you have any further questions about reproduction.
The examples below use T5-base on a single GPU. Just change the prefix to `T5_large` or `T5_3b` for the scaled-up models, and add the adjustments (multiple GPUs and DeepSpeed) shown in the README to fit your machines (a rough sketch of a scaled-up launch follows the base command below).
We take Spider as the full example and provide specific info for each experiment.
python train.py --seed 2 --cfg Salesforce/T5_base_finetune_spider_with_cell_value.cfg --run_name T5_base_finetune_spider_with_cell_value --logging_strategy steps --logging_first_step true --logging_steps 4 --evaluation_strategy steps --eval_steps 500 --metric_for_best_model avr --greater_is_better true --save_strategy steps --save_steps 500 --save_total_limit 1 --load_best_model_at_end --gradient_accumulation_steps 8 --num_train_epochs 400 --adafactor true --learning_rate 5e-5 --do_train --do_eval --do_predict --predict_with_generate --output_dir output/T5_base_finetune_spider_with_cell_value --overwrite_output_dir --per_device_train_batch_size 4 --per_device_eval_batch_size 16 --generation_num_beams 1 --generation_max_length 128 --input_max_length 512 --ddp_find_unused_parameters true
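For the scaled-up models, the same command is launched through DeepSpeed on multiple GPUs. A rough sketch only, not a verified command: the cfg name follows the prefix-swap rule above, and the DeepSpeed config path, GPU count, batch sizes, and hyperparameters are placeholders; take the real values from the README and Table 17. Note `--adafactor false`, since AdamW is used with DeepSpeed (see the notes below).

```bash
# Sketch only: T5-large on multiple GPUs with DeepSpeed. The DeepSpeed config path,
# GPU count, batch sizes, and hyperparameters are placeholders; follow the README
# and Table 17 for the exact setup, and verify the cfg filename exists.
deepspeed --num_gpus 4 train.py --deepspeed deepspeed_config.json --seed 2 \
    --cfg Salesforce/T5_large_finetune_spider_with_cell_value.cfg \
    --run_name T5_large_finetune_spider_with_cell_value \
    --logging_strategy steps --logging_first_step true --logging_steps 4 \
    --evaluation_strategy steps --eval_steps 500 \
    --metric_for_best_model avr --greater_is_better true \
    --save_strategy steps --save_steps 500 --save_total_limit 1 --load_best_model_at_end \
    --gradient_accumulation_steps 8 --num_train_epochs 400 --adafactor false --learning_rate 5e-5 \
    --do_train --do_eval --do_predict --predict_with_generate \
    --output_dir output/T5_large_finetune_spider_with_cell_value --overwrite_output_dir \
    --per_device_train_batch_size 1 --per_device_eval_batch_size 4 \
    --generation_num_beams 1 --generation_max_length 128 --input_max_length 512 \
    --ddp_find_unused_parameters true
```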
Some notes on the arguments:

- `input_max_length` and `generation_max_length`: essential to performance, since grounding based on concatenation consumes a lot of tokens. Please check Tables 14, 15, and 16 for a proper number to avoid performance loss caused by truncation! (Thanks @Chacha-Chen for pointing out this omission!)
- `run_name`, `logging_strategy`, `logging_first_step`, `logging_steps`: anything; they won't affect the result.
- `save_steps`, `save_total_limit`, `load_best_model_at_end`: affect which checkpoints are saved; adjust them according to your disk space.
- `metric_for_best_model`: set to `avr`, which means the average value of all metrics evaluated on the dev set; it can be used for all experiments.
- `adafactor`: we use Adafactor when running experiments on T5-base and T5-large, and set it to false to use AdamW instead when tuning with DeepSpeed.
- `gradient_accumulation_steps`, `num_train_epochs`, `learning_rate`: set these parameters according to Table 17 in the Appendix.
- `do_train`, `do_eval`, `do_predict`, `predict_with_generate`: whether to train, or just test.
- `output_dir`, `overwrite_output_dir`: where to save the checkpoints, and whether to overwrite previous ones (if run before).
- `ddp_find_unused_parameters`: makes our experiments reproducible.
Change the `--cfg` as below, and set `gradient_accumulation_steps`, `num_train_epochs`, and `learning_rate` according to Table 17 in the Appendix (an example command with a swapped cfg follows this list):
- `--cfg Salesforce/T5_base_finetune_grailqa.cfg`
- `--cfg Salesforce/T5_base_finetune_webqsp.cfg`
- `--cfg Salesforce/T5_base_finetune_mtop.cfg`
- `--cfg Salesforce/T5_base_finetune_wikitq.cfg`
- `--cfg Salesforce/T5_base_finetune_wikisql.cfg`
- `--cfg Salesforce/T5_base_finetune_compwebq.cfg`
- `--cfg Salesforce/T5_base_finetune_hybridqa.cfg`
- `--cfg Salesforce/T5_base_finetune_mmqa.cfg`
- `--cfg Salesforce/T5_base_finetune_fetaqa.cfg`
- `--cfg Salesforce/T5_base_finetune_dart.cfg`
- `--cfg Salesforce/T5_base_finetune_totto.cfg`
- `--cfg Salesforce/T5_base_finetune_multiwoz.cfg`
- `--cfg Salesforce/T5_base_finetune_kvret.cfg`
- `--cfg Salesforce/T5_base_finetune_sparc_with_cell_value.cfg`
- `--cfg Salesforce/T5_base_finetune_cosql_with_cell_value.cfg`
- `--cfg Salesforce/T5_base_finetune_sqa.cfg`
  Note: this result may be a little higher than reported, since we fixed a bug in the huggingface msr_sqa dataset loader, as shown in [this commit](https://github.com/HKUNLP/UnifiedSKG/commit/95fbd503e0d1fab890de3198685115857168bcb5).
- `--cfg Salesforce/T5_base_finetune_tab_fact.cfg`
- `--cfg Salesforce/T5_base_finetune_feverous.cfg`
- `--cfg Salesforce/T5_base_finetune_sql2text.cfg`
- `--cfg Salesforce/T5_base_finetune_logic2text.cfg`
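For example, the WikiTQ run reuses the Spider command with only `--cfg`, `--run_name`, and `--output_dir` swapped. The `gradient_accumulation_steps`, `num_train_epochs`, `learning_rate`, and `input_max_length` values below are placeholders; take the real numbers from Tables 14–17.

```bash
# Example: same flag set as the Spider command, with only cfg / run_name / output_dir swapped.
# gradient_accumulation_steps, num_train_epochs, learning_rate, and input_max_length are
# placeholders here; use the values from Tables 14-17 of the paper.
python train.py --seed 2 --cfg Salesforce/T5_base_finetune_wikitq.cfg \
    --run_name T5_base_finetune_wikitq \
    --logging_strategy steps --logging_first_step true --logging_steps 4 \
    --evaluation_strategy steps --eval_steps 500 \
    --metric_for_best_model avr --greater_is_better true \
    --save_strategy steps --save_steps 500 --save_total_limit 1 --load_best_model_at_end \
    --gradient_accumulation_steps 8 --num_train_epochs 400 --adafactor true --learning_rate 5e-5 \
    --do_train --do_eval --do_predict --predict_with_generate \
    --output_dir output/T5_base_finetune_wikitq --overwrite_output_dir \
    --per_device_train_batch_size 4 --per_device_eval_batch_size 16 \
    --generation_num_beams 1 --generation_max_length 128 --input_max_length 1024 \
    --ddp_find_unused_parameters true
```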
Change the configuration file's prefix to `T0_xxx` and run the same code.
This table's experiments fall into three groups: prefix-tuning, multi-task fine-tuning, and multi-task prefix-tuning.
For prefix-tuning, just change the configuration file's prefix from `T5_xxx_finetune_xxx` to `T5_xxx_prefix_xxx`. Note that prefix-tuning takes more training steps to reach performance comparable to fine-tuning; see Table 18 for more details.
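For example, the Spider prefix-tuning run would swap in the cfg below (the filename is inferred from the renaming rule above; please verify the exact name in the Salesforce cfg folder), with the longer training schedule from Table 18:

```bash
--cfg Salesforce/T5_base_prefix_spider_with_cell_value.cfg
```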
For multi-task fine-tuning, change the configuration file to `T5_base_finetune_all_tasks_2upsample.cfg` and run the same command.
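Relative to the single-task command, only the cfg, run name, and output dir change (hyperparameters may need adjusting for the multi-task setting):

```bash
--cfg Salesforce/T5_base_finetune_all_tasks_2upsample.cfg --run_name T5_base_finetune_all_tasks_2upsample --output_dir output/T5_base_finetune_all_tasks_2upsample
```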
For multi-task prefix-tuning, first train a shared prefix module on multiple tasks, using `T5_base_prefix_all_tasks_2upsample.cfg` as the configuration.
Then load the shared weights via the `load_weights_from` parameter and continue training the prefix on each separate task (here we assume you saved your shared prefix in `output/T5_base_prefix_all_tasks_2upsample2/checkpoint-220000`).
Here is an example:
export RUN_NAME=T5_base_prefix_grailqa
export SEED=2
python train.py --seed $SEED --cfg Salesforce/$RUN_NAME.cfg --run_name from_all_$RUN_NAME$SEED --logging_strategy steps --logging_first_step true --logging_steps 4 --evaluation_strategy steps --eval_steps 500 --metric_for_best_model avr --greater_is_better true --save_strategy steps --save_steps 500 --save_total_limit 1 --load_best_model_at_end --gradient_accumulation_steps 4 --num_train_epochs 200 --adafactor true --learning_rate 5e-5 --do_train --do_eval --do_predict --predict_with_generate --output_dir output/from_all_$RUN_NAME$SEED --per_device_train_batch_size 4 --per_device_eval_batch_size 8 --generation_num_beams 4 --generation_max_length 128 --input_max_length 576 --ddp_find_unused_parameters true --load_weights_from output/T5_base_prefix_all_tasks_2upsample2/checkpoint-220000 > $RUN_NAME$SEED.log 2>&1 &
Last but not least, all the weights we trained can be downloaded from https://huggingface.co/hkunlp.
For `A to B` tasks, first train & evaluate a model on task A, then continue training & evaluating it on task B by passing `--load_weights_from` with the directory where you saved the checkpoint of A.
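A minimal sketch, assuming task A is WikiSQL and task B is WikiTQ (an illustrative pair, not necessarily one from the paper) and that A's best checkpoint sits under a hypothetical path; the hyperparameters are placeholders copied from the generic command above:

```bash
# Illustrative sketch: continue training on task B (WikiTQ) from a checkpoint trained on
# task A (WikiSQL). The checkpoint path (checkpoint-XXXX) and hyperparameters are placeholders.
python train.py --seed 2 --cfg Salesforce/T5_base_finetune_wikitq.cfg \
    --run_name from_wikisql_T5_base_finetune_wikitq \
    --logging_strategy steps --logging_first_step true --logging_steps 4 \
    --evaluation_strategy steps --eval_steps 500 \
    --metric_for_best_model avr --greater_is_better true \
    --save_strategy steps --save_steps 500 --save_total_limit 1 --load_best_model_at_end \
    --gradient_accumulation_steps 8 --num_train_epochs 400 --adafactor true --learning_rate 5e-5 \
    --do_train --do_eval --do_predict --predict_with_generate \
    --output_dir output/from_wikisql_T5_base_finetune_wikitq --overwrite_output_dir \
    --per_device_train_batch_size 4 --per_device_eval_batch_size 16 \
    --generation_num_beams 1 --generation_max_length 128 --input_max_length 1024 \
    --ddp_find_unused_parameters true \
    --load_weights_from output/T5_base_finetune_wikisql/checkpoint-XXXX
```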
See C.3 and C.4 in the Appendix. Unfortunately, due to the updates to GPT-3 and Codex, we are not sure whether the results will be exactly the same, since those models have improved substantially and the latest versions are much more powerful.
Done by modifying the order of `text_in` (query part), `struct_in`, and `text_in` (context part) in `utils/dataset`.
Done by modifying the order inside `struct_in` in each file in `./seq2seq_construction`.
See Appendix H: Natural Language Template Examples. The Spider experiment can be run with the cfg `T5_base_finetune_spider_with_cell_value_nl.cfg`.
Here are the experiments on the GLMP-processed version of KVRET, which is the de-facto version used in the task-oriented dialogue field.
Sorry, I just found that we forgot to add `T5_base_finetune_kvret_glmp.cfg` to `config/Salesforce`. Please replace `kvret.cfg` with `kvret_glmp.cfg` inside `T5_base_finetune_kvret.cfg` to make `T5_base_finetune_kvret_glmp.cfg`.
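If it helps, here is a one-line sketch of that edit (assuming the cfg lives under `config/Salesforce/` as mentioned above; adjust the path if your cfg folder differs):

```bash
# Create the missing cfg by replacing the kvret.cfg reference inside the existing
# kvret fine-tune cfg; adjust the directory if your cfg folder is located elsewhere.
sed 's/kvret\.cfg/kvret_glmp.cfg/' config/Salesforce/T5_base_finetune_kvret.cfg \
    > config/Salesforce/T5_base_finetune_kvret_glmp.cfg
```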
Then running the following will reproduce our result:
python train.py --seed 2 --cfg Salesforce/T5_base_finetune_kvret_glmp.cfg --run_name T5_base_finetune_kvret_glmp --logging_strategy steps --logging_first_step true --logging_steps 4 --evaluation_strategy steps --eval_steps 500 --metric_for_best_model avr --greater_is_better true --save_strategy steps --save_steps 500 --save_total_limit 1 --load_best_model_at_end --gradient_accumulation_steps 8 --num_train_epochs 400 --adafactor true --learning_rate 5e-5 --do_train --do_eval --do_predict --predict_with_generate --output_dir output/T5_base_finetune_kvret_glmp --overwrite_output_dir --per_device_train_batch_size 4 --per_device_eval_batch_size 16 --generation_num_beams 4 --generation_max_length 128 --input_max_length 1024 --ddp_find_unused_parameters true
We tokenize the sequences with the tokenizer from T5 every time we load a dataset from our `./datasets` to calculate the length distribution; the plotting code is simple, so we won't show it here.
Hope this information is helpful!
Thanks!
Wow, that is a very detailed answer. 👍 Thank you very much! Maybe you can put this guidance in the README or somewhere else.
Is there a unified entry point to reproduce all the paper experiments, like a `scripts` folder? Or, for all datasets, is the command to run the experiment exactly the same as in the README's Training section?