xlang-ai / UnifiedSKG

[EMNLP 2022] Unifying and multi-tasking structured knowledge grounding with language models
https://arxiv.org/abs/2201.05966
Apache License 2.0

Is there a unified entry point to reproduce all the paper experiments? #14

Closed ShaneTian closed 2 years ago

ShaneTian commented 2 years ago

Is there a unified entry point to reproduce all the paper experiments?

For example, a scripts folder:

scripts/train_spider.sh
scripts/train_wikitq.sh
...

Or, for all datasets, is the command to run the experiment exactly the same as in the README's Training section?

Timothyxxx commented 2 years ago

Hi,

Sorry for the wait. We have prepared guidance below to reproduce all the results of our paper; contact us if you have any further questions about reproduction.

Reproduction guidance

For results in Table 2, Table 11 and Table 12

The examples below are T5-base runs on a single GPU. Just change the prefix to T5_large or T5_3b for the scaled-up models, and add the adjustments (multiple GPUs and deepspeed) shown in the README to fit your machines.

We take Spider as a full example and provide the specific info for each of the other experiments.

Spider:

python train.py --seed 2 --cfg Salesforce/T5_base_finetune_spider_with_cell_value.cfg --run_name T5_base_finetune_spider_with_cell_value --logging_strategy steps --logging_first_step true --logging_steps 4 --evaluation_strategy steps --eval_steps 500 --metric_for_best_model avr --greater_is_better true --save_strategy steps --save_steps 500 --save_total_limit 1 --load_best_model_at_end --gradient_accumulation_steps 8 --num_train_epochs 400 --adafactor true --learning_rate 5e-5 --do_train --do_eval --do_predict --predict_with_generate --output_dir output/T5_base_finetune_spider_with_cell_value --overwrite_output_dir --per_device_train_batch_size 4 --per_device_eval_batch_size 16 --generation_num_beams 1 --generation_max_length 128 --input_max_length 512 --ddp_find_unused_parameters true
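
To scale this command up, the same run could be launched through deepspeed on multiple GPUs, roughly like below. The T5_3b config name, the per-device batch sizes, and the deepspeed json path here are only placeholders; take the exact settings from the README and Table 17.

# Multi-GPU sketch: deepspeed launcher + AdamW (adafactor off). The ds_config path and batch sizes are assumptions; adjust to your checkout and machine.
deepspeed train.py --deepspeed deepspeed/ds_config_zero2.json --seed 2 --cfg Salesforce/T5_3b_finetune_spider_with_cell_value.cfg --run_name T5_3b_finetune_spider_with_cell_value --logging_strategy steps --logging_first_step true --logging_steps 4 --evaluation_strategy steps --eval_steps 500 --metric_for_best_model avr --greater_is_better true --save_strategy steps --save_steps 500 --save_total_limit 1 --load_best_model_at_end --gradient_accumulation_steps 8 --num_train_epochs 400 --adafactor false --learning_rate 5e-5 --do_train --do_eval --do_predict --predict_with_generate --output_dir output/T5_3b_finetune_spider_with_cell_value --overwrite_output_dir --per_device_train_batch_size 1 --per_device_eval_batch_size 4 --generation_num_beams 1 --generation_max_length 128 --input_max_length 512 --ddp_find_unused_parameters true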

input_max_length and generation_max_length: essential to performance, since grounding based on concatenation consumes a lot of tokens. Please check Tables 14, 15, and 16 for proper values to avoid performance loss caused by truncation! (Thanks @Chacha-Chen for pointing out our omission here!)

run_name, logging_strategy, logging_first_step, logging_steps: can be anything; they won't affect the results.

save_steps, save_total_limit, load_best_model_at_end: affect which checkpoints get saved; adjust according to your disk space.

metric_for_best_model: set to avr, i.e. the average of all metrics evaluated on the dev set; it can be used for all experiments.

adafactor: we use Adafactor when running experiments on T5-base and T5-large; set it to false to use AdamW instead when tuning with deepspeed.

gradient_accumulation_steps, num_train_epochs, learning_rate: set these parameters according to Table 17 in the Appendix.

do_train, do_eval, do_predict, predict_with_generate: control whether to train, evaluate, or just run prediction.

output_dir, overwrite_output_dir: where to save the checkpoints and whether to overwrite previous ones (if run before).

ddp_find_unused_parameters: makes our experiments reproducible.

For the other tasks, change the cfg as listed below, and set gradient_accumulation_steps, num_train_epochs, and learning_rate according to Table 17 in the Appendix.

GrailQA:

--cfg Salesforce/T5_base_finetune_grailqa.cfg

WebQSP

--cfg Salesforce/T5_base_finetune_webqsp.cfg

MTOP

--cfg Salesforce/T5_base_finetune_mtop.cfg

WikiTQ

--cfg Salesforce/T5_base_finetune_wikitq.cfg

WikiSQL

--cfg Salesforce/T5_base_finetune_wikisql.cfg

CompWebQ

--cfg Salesforce/T5_base_finetune_compwebq.cfg

HybridQA

--cfg Salesforce/T5_base_finetune_hybridqa.cfg

MultiModalQA

--cfg Salesforce/T5_base_finetune_mmqa.cfg

FeTaQA

--cfg Salesforce/T5_base_finetune_fetaqa.cfg

DART

--cfg Salesforce/T5_base_finetune_dart.cfg

ToTTo

--cfg Salesforce/T5_base_finetune_totto.cfg

MultiWoZ2.1

--cfg Salesforce/T5_base_finetune_multiwoz.cfg

KVRET

--cfg Salesforce/T5_base_finetune_kvret.cfg

SParC

--cfg Salesforce/T5_base_finetune_sparc_with_cell_value.cfg

CoSQL

--cfg Salesforce/T5_base_finetune_cosql_with_cell_value.cfg

SQA

--cfg Salesforce/T5_base_finetune_sqa.cfg

Note: the result may be a little higher than we reported, since we fixed a bug in the HuggingFace msr_sqa dataset loader, as shown in [this commit](https://github.com/HKUNLP/UnifiedSKG/commit/95fbd503e0d1fab890de3198685115857168bcb5).

TabFact

--cfg Salesforce/T5_base_finetune_tab_fact.cfg

FEVEROUS

--cfg Salesforce/T5_base_finetune_feverous.cfg

SQL2Text

--cfg Salesforce/T5_base_finetune_sql2text.cfg

Logic2Text

--cfg Salesforce/T5_base_finetune_logic2text.cfg

For results in Table 3

Change the configuration file's prefix to T0_xxx and run the same command.

For results in Table 4

This table's experiments fall into three groups: prefix-tuning, multi-task finetuning, and multi-task prefix-tuning.

Prefix tuning experiments

Just change the configuration file's prefix from T5_xxx_finetune_xxx to T5_xxx_prefix_xxx. Note that prefix tuning takes more training steps to reach performance comparable to finetuning; see Table 18 for more details.
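
For example, the Spider prefix-tuning run would just swap the cfg, assuming the prefix-tuning configs follow the same naming pattern as the prefix cfg used for GrailQA further below:

--cfg Salesforce/T5_base_prefix_spider_with_cell_value.cfg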

Multi-task finetuning experiments

Just change the configuration file to T5_base_finetune_all_tasks_2upsample.cfg to run the experiment.

Multi-task-prefix tuning experiments

Step 1,

Train a shared prefix module on multiple tasks, using T5_base_prefix_all_tasks_2upsample.cfg as configuration.
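
Step 1 is launched like any other run above, only with the multi-task prefix cfg. A rough sketch is below; the run name, output dir, and hyperparameters are placeholders only, so set the real values per Tables 17 and 18.

# Shared multi-task prefix training; hyperparameters below are illustrative, not the ones used in the paper.
python train.py --seed 2 --cfg Salesforce/T5_base_prefix_all_tasks_2upsample.cfg --run_name T5_base_prefix_all_tasks_2upsample2 --logging_strategy steps --logging_first_step true --logging_steps 4 --evaluation_strategy steps --eval_steps 500 --metric_for_best_model avr --greater_is_better true --save_strategy steps --save_steps 500 --save_total_limit 1 --load_best_model_at_end --gradient_accumulation_steps 8 --num_train_epochs 50 --adafactor true --learning_rate 5e-5 --do_train --do_eval --do_predict --predict_with_generate --output_dir output/T5_base_prefix_all_tasks_2upsample2 --overwrite_output_dir --per_device_train_batch_size 4 --per_device_eval_batch_size 16 --generation_num_beams 1 --generation_max_length 128 --input_max_length 512 --ddp_find_unused_parameters true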

Step 2,

Load in the shared weights via the load_weights_from parameter and continue to train the prefix on each separate task (here we assume you saved your shared prefix in output/T5_base_prefix_all_tasks_2upsample2/checkpoint-220000).

Here is an example:

export RUN_NAME=T5_base_prefix_grailqa
export SEED=2
python train.py --seed $SEED --cfg Salesforce/$RUN_NAME.cfg --run_name from_all_$RUN_NAME$SEED --logging_strategy steps --logging_first_step true --logging_steps 4 --evaluation_strategy steps --eval_steps 500 --metric_for_best_model avr --greater_is_better true --save_strategy steps --save_steps 500 --save_total_limit 1 --load_best_model_at_end --gradient_accumulation_steps 4 --num_train_epochs 200 --adafactor true --learning_rate 5e-5 --do_train --do_eval --do_predict --predict_with_generate --output_dir output/from_all_$RUN_NAME$SEED --per_device_train_batch_size 4 --per_device_eval_batch_size 8 --generation_num_beams 4 --generation_max_length 128 --input_max_length 576 --ddp_find_unused_parameters true --load_weights_from output/T5_base_prefix_all_tasks_2upsample2/checkpoint-220000 > $RUN_NAME$SEED.log 2>&1 &

Last but not least, all the weights we trained can be loaded from https://huggingface.co/hkunlp.

For results in Table 5

For the A-to-B tasks, first train and evaluate a model on A, then continue to train and evaluate it on B by passing the directory of A's saved checkpoint to load_weights_from.
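
For example, here is a minimal sketch with a hypothetical Spider-to-SParC pair; the task pair, run names, output dirs, checkpoint step, and hyperparameters are placeholders, so use the real per-task settings from Table 17.

# Shared flags, reused for both runs; values are illustrative only.
COMMON="--seed 2 --logging_strategy steps --logging_first_step true --logging_steps 4 --evaluation_strategy steps --eval_steps 500 --metric_for_best_model avr --greater_is_better true --save_strategy steps --save_steps 500 --save_total_limit 1 --load_best_model_at_end --adafactor true --learning_rate 5e-5 --do_train --do_eval --do_predict --predict_with_generate --per_device_train_batch_size 4 --per_device_eval_batch_size 16 --generation_num_beams 1 --generation_max_length 128 --input_max_length 512 --ddp_find_unused_parameters true"
# Train and evaluate on task A (Spider in this sketch).
python train.py $COMMON --cfg Salesforce/T5_base_finetune_spider_with_cell_value.cfg --run_name A_spider --gradient_accumulation_steps 8 --num_train_epochs 400 --output_dir output/A_spider --overwrite_output_dir
# Continue on task B (SParC in this sketch), warm-started from A's checkpoint; replace checkpoint-500 with the checkpoint you actually want to transfer.
python train.py $COMMON --cfg Salesforce/T5_base_finetune_sparc_with_cell_value.cfg --run_name B_sparc_from_A --gradient_accumulation_steps 8 --num_train_epochs 400 --output_dir output/B_sparc_from_A --overwrite_output_dir --load_weights_from output/A_spider/checkpoint-500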

For results in Table 6

See C.3 and C.4 in the Appendix. Unfortunately, due to the updates to GPT-3 and Codex, we are not sure whether the results will be exactly the same, since those models have improved considerably and the latest versions are much more powerful.

For results in Table 7

Done by modifying the order of text_in (query part), struct_in, and text_in (context part) in utils/dataset.

For results in Table 8

Done by modifying the order of the items inside struct_in in each file in ./seq2seq_construction.

For results in Table 9

See Appendix H: Natural Language Template Examples.

The Spider experiment can be run with the cfg T5_base_finetune_spider_with_cell_value_nl.cfg.

For results in Table 13

Here are the experiments on the GLMP-processed version of KVRET, which is the de-facto standard version used in the task-oriented dialogue system field.

Sorry, I just found that we forgot to add T5_base_finetune_kvret_glmp.cfg to the Salesforce config folder. Please copy T5_base_finetune_kvret.cfg and replace the reference to kvret.cfg with kvret_glmp.cfg to create T5_base_finetune_kvret_glmp.cfg.
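
One way to create it from the shell is sketched below; the configure/Salesforce path is an assumption, so adjust it to wherever the Salesforce cfgs live in your checkout.

# Copy the KVRET cfg and point it at the GLMP-processed dataset cfg instead (path is assumed).
cp configure/Salesforce/T5_base_finetune_kvret.cfg configure/Salesforce/T5_base_finetune_kvret_glmp.cfg
sed -i 's/kvret\.cfg/kvret_glmp.cfg/' configure/Salesforce/T5_base_finetune_kvret_glmp.cfg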

Then running this will reproduce our result:

python train.py --seed 2 --cfg Salesforce/T5_base_finetune_kvret_glmp.cfg --run_name T5_base_finetune_kvret_glmp --logging_strategy steps --logging_first_step true --logging_steps 4 --evaluation_strategy steps --eval_steps 500 --metric_for_best_model avr --greater_is_better true --save_strategy steps --save_steps 500 --save_total_limit 1 --load_best_model_at_end --gradient_accumulation_steps 8 --num_train_epochs 400 --adafactor true --learning_rate 5e-5 --do_train --do_eval --do_predict --predict_with_generate --output_dir output/T5_base_finetune_kvret_glmp --overwrite_output_dir --per_device_train_batch_size 4 --per_device_eval_batch_size 16 --generation_num_beams 4 --generation_max_length 128 --input_max_length 1024 --ddp_find_unused_parameters true

For results in Figure 5 and Tables 14, 15, and 16

We tokenize the sequences with the T5 tokenizer every time we load a dataset from our ./datasets to calculate the length distribution; the plotting code is simple, so we won't show it here.

Hope this information is helpful!

Thanks!

ShaneTian commented 2 years ago

Wow, that is a very detailed answer. 👍 Thank you very much! Maybe you can put this guidance in the README or somewhere else.