Low ROUGE scores for Tk-instruct large?

Hi Yizhong,

Thanks for the great work and for making everything public!

I'm trying to reproduce and better understand the results you showed in the paper:

Looking at this graph, it seems like T5 Large (770M) should be getting 48.0 ROUGE-L on unseen tasks. Am I reading the graph correctly?

Some questions:

Is this the allenai/tk-instruct-large-def-pos checkpoint on the Hugging Face Hub?
How can I reproduce this training result? I took scripts/train_tk_instruct.sh and simply swapped in T5 Large as the base model instead of T5-XL, but I'm getting substantially lower ROUGE scores (see screenshot). Did this training run use different hyperparameters?
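
When comparing numbers like these, it can help to rule out a metric mismatch first. Below is a minimal sketch of a ROUGE-L F-measure on a 0-100 scale. It is not the repository's evaluation script: the whitespace tokenization, the lowercasing, and taking the maximum over multiple references are all assumptions made for illustration, and the function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction, reference):
    """ROUGE-L F-measure for one prediction/reference pair (range 0.0-1.0)."""
    # Assumption: lowercase + whitespace tokenization, no stemming.
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def corpus_rouge_l(predictions, references):
    """Mean ROUGE-L over examples, scaled to 0-100.

    `references` holds a list of reference strings per example.
    Assumption: each example is scored against its best-matching reference.
    """
    scores = [
        max(rouge_l_f1(pred, ref) for ref in refs)
        for pred, refs in zip(predictions, references)
    ]
    return 100.0 * sum(scores) / len(scores)
```

Running the same function over the predictions from both the released checkpoint and the retrained model would show whether the gap comes from the model or from how the score is computed.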

I noticed you said in issue #1 that you found a "remarkable gap between the smaller models and the 11B or 3B models in generalizing to new tasks".

But the scaling results in the original paper don't seem too bad, i.e. 48 ROUGE-L for T5 Large vs. 54 for the 3B model. On the other hand, the results I'm getting from finetuning T5 Large are indeed substantially worse, so I'm just trying to reconcile the two.
