zhuchen03 / FreeLB

Adversarial Training for Natural Language Understanding

Reproducing results from the paper with RoBERTa using fairseq #11

Closed · bminixhofer closed 4 years ago

bminixhofer commented 4 years ago

Hi! Thanks for this repository.

I've been trying to reproduce the results from the paper but ran into some problems. I tried the script at fairseq-RoBERTa/launch/FreeLB/rte-fp32-clip.sh, which, based on Table 1 in the paper, I expected to score 88.13 on average.

I tried:

  1. five seeds with the setup currently checked into the repo:
    # run_exp   GPU    TOTAL_NUM_UPDATES    WARMUP_UPDATES  LR      NUM_CLASSES MAX_SENTENCES   FREQ    DATA    ADV_LR  ADV_STEP  INIT_MAG  SEED    MNORM
    run_exp      1        2036                 122         1e-5       2           2            8     RTE           6e-2      3      1.6e-1     1  1.4e-1
    run_exp      1        2036                 122         1e-5       2           2            8     RTE           6e-2      3      1.6e-1     2  1.4e-1
    run_exp      1        2036                 122         1e-5       2           2            8     RTE           6e-2      3      1.6e-1     3  1.4e-1
    run_exp      1        2036                 122         1e-5       2           2            8     RTE           6e-2      3      1.6e-1     4  1.4e-1
    run_exp      1        2036                 122         1e-5       2           2            8     RTE           6e-2      3      1.6e-1     9016  1.4e-1

and got the scores 0.8597, 0.8884, 0.8057, 0.8669, 0.8633 (mean 0.8568). logs here.

  2. five seeds with the parameters from Table 7 in the paper, using the default fairseq parameters (from https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.glue.md#3-fine-tuning-on-glue-task) for the parameters that are not specified:
# run_exp   GPU    TOTAL_NUM_UPDATES    WARMUP_UPDATES  LR      NUM_CLASSES MAX_SENTENCES   FREQ    DATA    ADV_LR  ADV_STEP  INIT_MAG  SEED    MNORM
run_exp      1        2036                 122         2e-5       2           2            8     RTE           3e-2      3      1.5e-1     1  0
run_exp      1        2036                 122         2e-5       2           2            8     RTE           3e-2      3      1.5e-1     2  0
run_exp      1        2036                 122         2e-5       2           2            8     RTE           3e-2      3      1.5e-1     3  0
run_exp      1        2036                 122         2e-5       2           2            8     RTE           3e-2      3      1.5e-1     4  0
run_exp      1        2036                 122         2e-5       2           2            8     RTE           3e-2      3      1.5e-1     5  0

and got the scores 0.8741, 0.7949, 0.8417, 0.6330, 0.6007 (mean 0.7488). logs here.

Appreciate any help!

zhuchen03 commented 4 years ago

Hi Benjamin,

Sorry for the confusion. I took a look at the hyperparameters. The result reported in the paper was actually obtained with adv_lr=3e-2; I did not check carefully, and the released launch script includes some other hyperparameter values I tried during grid search. Could you try run_exp 1 2036 122 1e-5 2 2 8 RTE 3e-2 3 1.6e-1 1 1.4e-1 and see if it reproduces the results?

Also, since RTE is the smallest dataset, its results may have the highest variance. You could try more runs, or tune adv_lr a little around this value, e.g. with a small sweep like the one below.
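
A minimal sketch of such a sweep, assuming run_exp is sourced from launch/FreeLB/rte-fp32-clip.sh; the alternative adv_lr values here are just illustrative, not values we validated:

    # Small grid search over adv_lr and seeds; argument order follows the
    # header comment above: GPU TOTAL_NUM_UPDATES WARMUP_UPDATES LR NUM_CLASSES
    # MAX_SENTENCES FREQ DATA ADV_LR ADV_STEP INIT_MAG SEED MNORM
    for adv_lr in 2e-2 3e-2 4e-2; do
        for seed in 1 2 3 4 5; do
            run_exp 1 2036 122 1e-5 2 2 8 RTE ${adv_lr} 3 1.6e-1 ${seed} 1.4e-1
        done
    done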

bminixhofer commented 4 years ago

Thanks! I tried five runs with that setup:

run_exp      1        2036                 122         1e-5       2           2            8     RTE           3e-2      3       1.6e-1         1     1.4e-1
run_exp      1        2036                 122         1e-5       2           2            8     RTE           3e-2      3       1.6e-1         2     1.4e-1
run_exp      1        2036                 122         1e-5       2           2            8     RTE           3e-2      3       1.6e-1         3     1.4e-1
run_exp      1        2036                 122         1e-5       2           2            8     RTE           3e-2      3       1.6e-1         4     1.4e-1
run_exp      1        2036                 122         1e-5       2           2            8     RTE           3e-2      3       1.6e-1         5     1.4e-1

and got the scores 0.8597, 0.8597, 0.8669, 0.8776, 0.8525 (mean 0.8633). logs here.

That is significantly better than before, but still 1.8 points worse than the mean reported in the paper, which seems like a lot to attribute to variance alone.

I guess I'll try tuning the parameters a bit, especially adv_lr (and I'll also do some more runs with the parameters from above, to make sure it isn't just "bad" random seeds).

zhuchen03 commented 4 years ago

Another potential issue is how the scores are defined. In the paper, the score for each run is the highest result across the checkpoints evaluated during training, but it is possible that you are looking at the result from the last checkpoint only.

Regarding the variance: if you compare your scores across the 5 runs, the variance is somewhat large, especially for the RoBERTa baseline.
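
If you want to quantify the spread, something like this works (a quick awk sketch; the scores are the five from your latest runs):

    # Mean and population standard deviation of the five run scores.
    printf '%s\n' 0.8597 0.8597 0.8669 0.8776 0.8525 |
        awk '{ s += $1; ss += $1 * $1 }
             END { m = s / NR; printf "mean %.4f  std %.4f\n", m, sqrt(ss / NR - m * m) }'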

bminixhofer commented 4 years ago

Hmm, yes, that's strange. I did use the highest score from multiple checkpoints (shown as, e.g., Best metric: 0.8525179856115108 in the log files), not the last score, so that is not the issue.
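
For reference, this is how I pull those values out of the logs (the logs/*.log path is just where my runs happen to write, not anything defined by the repo):

    # Print the best checkpoint score from each run's log file.
    grep -h "Best metric" logs/*.log | awk '{ print $NF }'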

bminixhofer commented 4 years ago

I can reproduce the results now! I ran five more seeds with the same setup:

run_exp      1        2036                 122         1e-5       2           2            8     RTE           3e-2      3       1.6e-1         123     1.4e-1
run_exp      1        2036                 122         1e-5       2           2            8     RTE           3e-2      3       1.6e-1         456     1.4e-1
run_exp      1        2036                 122         1e-5       2           2            8     RTE           3e-2      3       1.6e-1         789     1.4e-1
run_exp      1        2036                 122         1e-5       2           2            8     RTE           3e-2      3       1.6e-1         10112   1.4e-1
run_exp      1        2036                 122         1e-5       2           2            8     RTE           3e-2      3       1.6e-1         131415  1.4e-1

and got the scores 0.8849, 0.8812, 0.8669, 0.8849, 0.8777 (mean 0.8791), which is definitely within the margin of error of the paper's result.

The only thing I changed was the seeds. Looking at these scores, I'd be tempted to conclude that low seed values behave strangely, since these runs are consistently better than the previous ones (although that is surely just coincidence).

Probably just high variance because of the dataset size, as you mentioned.

Feel free to close this issue. Thanks for your help.