Closed bminixhofer closed 4 years ago
Hi Benjamin,
Sorry for the confusion. I took a look at the hyperparameters. The result reported in the paper actually used adv_lr=3e-2, but I did not check carefully and included some other hyperparameters from my grid search in the released launch script. Could you try
run_exp 1 2036 122 1e-5 2 2 8 RTE 3e-2 3 1.6e-1 1 1.4e-1
and see if it reproduces the results?
Also, since RTE is the smallest dataset, its results likely have the highest variance. You could try more runs, or tune adv_lr a little around this value.
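The suggestion above can be sketched as a small sweep. This is a dry run that only prints the commands (pipe it to `sh`, or drop the `echo`, once `run_exp` is sourced from the repo's launch script); the argument order follows the commands in this thread, i.e. the 9th positional argument is adv_lr and the 12th is the random seed, and the adv_lr grid values are just an example:

```shell
# Dry-run sweep over seeds and adv_lr values around 3e-2.
# Assumes run_exp's 9th argument is adv_lr and its 12th is the seed,
# matching the commands in this thread; the adv_lr grid is illustrative.
for adv_lr in 2e-2 3e-2 4e-2; do
  for seed in 1 2 3 4 5; do
    echo run_exp 1 2036 122 1e-5 2 2 8 RTE "$adv_lr" 3 1.6e-1 "$seed" 1.4e-1
  done
done
```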
Thanks! I tried five runs with that setup:
run_exp 1 2036 122 1e-5 2 2 8 RTE 3e-2 3 1.6e-1 1 1.4e-1
run_exp 1 2036 122 1e-5 2 2 8 RTE 3e-2 3 1.6e-1 2 1.4e-1
run_exp 1 2036 122 1e-5 2 2 8 RTE 3e-2 3 1.6e-1 3 1.4e-1
run_exp 1 2036 122 1e-5 2 2 8 RTE 3e-2 3 1.6e-1 4 1.4e-1
run_exp 1 2036 122 1e-5 2 2 8 RTE 3e-2 3 1.6e-1 5 1.4e-1
and got the scores 0.8597, 0.8597, 0.8669, 0.8776, 0.8525
(mean 0.8633). logs here.
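For reference, the mean and standard deviation of a set of run scores can be checked with a quick awk one-liner, shown here on the five scores above:

```shell
# Compute mean and sample standard deviation of the five run scores.
printf '%s\n' 0.8597 0.8597 0.8669 0.8776 0.8525 |
  awk '{s+=$1; q+=$1*$1; n++}
       END {m=s/n; printf "mean=%.4f sd=%.4f\n", m, sqrt((q-n*m*m)/(n-1))}'
# prints mean=0.8633 sd=0.0095
```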
That is significantly better than before, but still 1.8% below the mean reported in the paper, which seems too large to attribute to variance alone.
I guess I'll try tuning the hyperparameters a bit, especially adv_lr (and I'll also do some more runs with the parameters above to make sure it isn't just "bad" random seeds).
Another potential issue is how the scores are defined. In the paper, each run's score is the highest result across multiple evaluation checkpoints, but it is possible that you are looking at the result from the last checkpoint only.
Regarding the variance: if you compare your scores across the 5 runs, the spread is already fairly large, especially for the RoBERTa baseline.
Hmm yes, that's strange. I did use the highest score across checkpoints (shown as, e.g., Best metric: 0.8525179856115108 in the log files), not the last score, so that is not the issue.
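As a sanity check, the best-checkpoint scores can be pulled straight out of the logs. This is a sketch assuming each log contains a `Best metric:` line as shown above; the log file names are hypothetical:

```shell
# Collect the best-checkpoint score from each run's log, assuming each
# log contains a line like "Best metric: 0.8525179856115108".
# The run_seed_*.log naming is hypothetical; adjust to the actual logs.
grep -h 'Best metric:' run_seed_*.log | awk '{printf "%.4f\n", $3}'
```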
I can reproduce the results now! I ran five more seeds with the same setup:
run_exp 1 2036 122 1e-5 2 2 8 RTE 3e-2 3 1.6e-1 123 1.4e-1
run_exp 1 2036 122 1e-5 2 2 8 RTE 3e-2 3 1.6e-1 456 1.4e-1
run_exp 1 2036 122 1e-5 2 2 8 RTE 3e-2 3 1.6e-1 789 1.4e-1
run_exp 1 2036 122 1e-5 2 2 8 RTE 3e-2 3 1.6e-1 10112 1.4e-1
run_exp 1 2036 122 1e-5 2 2 8 RTE 3e-2 3 1.6e-1 131415 1.4e-1
and got the results 0.8849, 0.8812, 0.8669, 0.8849, 0.8777
(mean 0.8791), which is well within the margin of error of the paper.
The only thing I changed is the seeds. From these scores I'd be inclined to think that low seeds behave strangely, because the new scores are consistently better than the previous ones (although that shouldn't really be possible).
Probably just high variance because of the dataset size, as you mentioned.
Feel free to close this issue. Thanks for your help.
Hi! Thanks for this repository.
I've been trying to reproduce the results from the paper but ran into some problems. I tried the script in
fairseq-RoBERTa/launch/FreeLB/rte-fp32-clip.sh
which I would've expected to score 88.13 on average, as shown in Table 1 of the paper. I tried:
and got the scores
0.8597, 0.8884, 0.8057, 0.8669, 0.8633
(mean 0.8568). logs here.
and got the scores
0.8741, 0.7949, 0.8417, 0.6330, 0.6007
(mean 0.7488). logs here.
Appreciate any help!