yizhongw / Tk-Instruct

Tk-Instruct is a Transformer model that is tuned to solve many NLP tasks by following instructions.
https://arxiv.org/abs/2204.07705
MIT License

Unable to reproduce Tk-Instruct predictions on Natural Instructions test #12

Closed: timoschick closed this issue 2 years ago

timoschick commented 2 years ago

I'm unable to reproduce the predictions found in Tk-Instruct/output/default/tk-instruct-3b-def-pos/predicted_examples.jsonl using the tk-instruct-3b-def-pos model: the predictions I compute match the provided ones only ~60% of the time, resulting in a much lower score of 49 vs. the 54 reported in the paper.

To give one specific example, for task102-87fdccda3ce94464ba5b247a32fb6d74 the input is cob#corn#eat. I used the provided scripts/convert_data_to_s2s.sh script to convert all examples into linearized inputs. In this particular case, doing so returns:

Definition: In this task, you are given concept set (with 3 to 5 concepts) that contain mentions of names of people, places, activities, or things. These concept sets reflect reasonable concept co-occurrences in everyday situations. All concepts given as input are separated by \"#\". Your job is to generate a sentence describing a day-to-day scene using all concepts from a given concept set. Positive Example 1 - Input: mountain#ski#skier. Output: Skier skis down the mountain. Positive Example 2 - Input: call#character#contain#wallpaper. Output: queen of wallpaper containing a portrait called film character . Now complete the following example - Input: cob#corn#eat. Output:
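
For reference, here is a rough reconstruction of how I understand that linearized string to be put together (my own sketch for illustration, not the actual conversion script; the definition text is abbreviated):

# Rough reconstruction of the linearized input above, for illustration only -
# not scripts/convert_data_to_s2s.sh itself.
definition = "In this task, you are given concept set (with 3 to 5 concepts) ..."  # full text as above
pos_examples = [
    ("mountain#ski#skier", "Skier skis down the mountain."),
    ("call#character#contain#wallpaper", "queen of wallpaper containing a portrait called film character ."),
]
instance_input = "cob#corn#eat"

parts = ["Definition: " + definition]
for i, (ex_input, ex_output) in enumerate(pos_examples, start=1):
    parts.append(f"Positive Example {i} - Input: {ex_input}. Output: {ex_output}")
parts.append(f"Now complete the following example - Input: {instance_input}. Output:")
linearized_input = " ".join(parts)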

From what I can tell, this linearized string is the correct input (definition + 2 positive examples + input). I used the following code to get a prediction for it:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the 3B "def + pos" checkpoint and its tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("allenai/tk-instruct-3b-def-pos").to("cuda")
tokenizer = AutoTokenizer.from_pretrained("allenai/tk-instruct-3b-def-pos")

# input is the linearized string shown above
tokenizer.batch_decode(model.generate(tokenizer.encode(input, return_tensors="pt").to(model.device)))

where input is the string from above. However, this gives the output cob corn and eat, whereas the expected output (according to the predictions file) would be a man cobs corn and eats it..

I have also directly queried the model on the huggingface hub (you can do so using this link), which also gives cob corn and eat as output.

Why am I not getting the "correct" prediction for this example (and many other examples)?

yizhongw commented 2 years ago

Sorry for being late on this issue - I just noticed it this morning. I don't have a good guess at the reason yet. Let me test it today or tomorrow and get back to you.

yizhongw commented 2 years ago

Hi @timoschick, I finally figured out the reason - it's the space at the end of the input.

The 3B models were trained on GPUs, and for that training we used src/ni_collator.py to convert each example into an input/output pair. For the example you provided above, here is the collator's encoded input: 'Definition: In this task, you are given concept set (with 3 to 5 concepts) that contain mentions of names of people, places, activities, or things. These concept sets reflect reasonable concept co-occurrences in everyday situations. All concepts given as input are separated by "#". Your job is to generate a sentence describing a day-to-day scene using all concepts from a given concept set.\n\n Positive Example 1 -\nInput: mountain#ski#skier.\n Output: Skier skis down the mountain.\n\n Positive Example 2 -\nInput: call#character#contain#wallpaper.\n Output: queen of wallpaper containing a portrait called film character .\n\nNow complete the following example -\nInput: cob#corn#eat.\nOutput: '

Note that there are \n tokens in the middle and a space at the end. I tried using this as the input (both with and without the \n tokens in the middle), and the model gave the right output.
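
For anyone else hitting this, something roughly along these lines reproduced the expected prediction for me (same checkpoint as in the snippet above; note the \n separators and, crucially, the trailing space after the final "Output:"):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("allenai/tk-instruct-3b-def-pos").to("cuda")
tokenizer = AutoTokenizer.from_pretrained("allenai/tk-instruct-3b-def-pos")

# Collator-style formatting: "\n" separators and a trailing space after "Output:".
prompt = (
    "Definition: In this task, you are given concept set (with 3 to 5 concepts) that contain "
    "mentions of names of people, places, activities, or things. These concept sets reflect "
    "reasonable concept co-occurrences in everyday situations. All concepts given as input are "
    'separated by "#". Your job is to generate a sentence describing a day-to-day scene using '
    "all concepts from a given concept set.\n\n"
    " Positive Example 1 -\nInput: mountain#ski#skier.\n Output: Skier skis down the mountain.\n\n"
    " Positive Example 2 -\nInput: call#character#contain#wallpaper.\n"
    " Output: queen of wallpaper containing a portrait called film character .\n\n"
    "Now complete the following example -\nInput: cob#corn#eat.\nOutput: "  # note the trailing space
)

input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
print(tokenizer.batch_decode(model.generate(input_ids), skip_special_tokens=True))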

That said, I am quite surprised that the model is so sensitive to the trailing space. I suspect the same thing happens for every model <= 3B, since we trained those on GPUs with this collator. The 11B model was trained on TPU without the trailing space.
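
A quick way to confirm the sensitivity is to check whether the trailing space changes the token ids the model actually sees (just a sanity check I ran, not part of the repo scripts):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/tk-instruct-3b-def-pos")

tail = "Now complete the following example -\nInput: cob#corn#eat.\nOutput:"
with_space = tokenizer(tail + " ").input_ids
without_space = tokenizer(tail).input_ids

# If the two id sequences differ, the model is conditioned on literally different
# inputs, which is consistent with the divergent generations above.
print(with_space == without_space)
print(with_space)
print(without_space)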

As for solutions, I think the simplest fix on your end is to just add a space at the end of the input. On our end, maybe we should retrain the models <= 3B?
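
Concretely, on your side something like the following should be enough (a sketch; linearized_input, tokenizer, and model are the same as in your snippet above):

# Workaround sketch: make sure the prompt ends with a single space before encoding.
prompt = linearized_input.rstrip() + " "
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
print(tokenizer.batch_decode(model.generate(input_ids), skip_special_tokens=True))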

cc @danyaljj