yizhongw / Tk-Instruct

Tk-Instruct is a Transformer model that is tuned to solve many NLP tasks by following instructions.
https://arxiv.org/abs/2204.07705
MIT License

Low ROUGE scores for Tk-instruct large? #20

Closed jayelm closed 1 year ago

jayelm commented 1 year ago

Hi Yizhong,

Thanks for the great work and for making everything public!

I'm trying to reproduce/better understand these results you showed in the paper here:

[image: graph from the paper]

Looking at this graph, it seems like T5 Large (770M) should be getting 48.0 ROUGE-L on unseen tasks. Am I reading the graph correctly?

Some questions:

  1. Is this the allenai/tk-instruct-large-def-pos model on the Hugging Face Hub?
  2. How can I reproduce this training result? I took scripts/train_tk_instruct.sh and simply swapped in T5-large as the base model instead of T5-XL, but I'm getting substantially lower ROUGE scores (see screenshot). Were different hyperparameters used for this training run?
[image: screenshot]

I notice you said in issue #1 that you found a

"remarkable gap between the smaller models and the 11B or 3B models in generalizing to new tasks"

But the scaling results in the original paper don't seem too bad: 48 ROUGE for T5 Large vs. 54 ROUGE for the 3B model. On the other hand, the results I'm getting from fine-tuning T5 Large are substantially worse. So I'm just trying to reconcile the two.
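
For anyone chasing a similar gap, one quick sanity check is to score a handful of predictions by hand and compare against what the evaluation script reports. Below is a minimal sketch of an LCS-based ROUGE-L F-measure. It assumes lowercased whitespace tokenization and takes the best score over multiple references; the function names are hypothetical, and this is not the repo's evaluation code, which may normalize and tokenize differently.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction, reference):
    """ROUGE-L F-measure for one prediction/reference pair, in [0, 1].

    Assumption: lowercased whitespace tokens, no stemming.
    """
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return 0.0
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def corpus_rouge_l(predictions, references):
    """Mean over examples of the best score across references, scaled to 0-100.

    `references` holds one list of reference strings per prediction.
    """
    if not predictions:
        return 0.0
    scores = [
        max((rouge_l_f1(pred, ref) for ref in refs), default=0.0)
        for pred, refs in zip(predictions, references)
    ]
    return 100.0 * sum(scores) / len(scores)
```

If hand-scored examples disagree sharply with the reported number, the problem is more likely in decoding or post-processing (for example, truncated generations or mismatched prediction/reference ordering) than in the model itself.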

jayelm commented 1 year ago

Never mind, please disregard! This was due to a bug on my end :)