Sorry for being late on this issue - I just noticed it this morning. I don't have a good guess for the reason. Let me test it today or tomorrow and get back to you.
Hi @timoschick, I finally figured out the reason - it's the space at the end of the input.
The 3B models were trained on GPUs. For that, we used `src/ni_collator.py` to convert each example into an input/output pair. For the example you provided above, here is the encoded output of the collator:
'Definition: In this task, you are given concept set (with 3 to 5 concepts) that contain mentions of names of people, places, activities, or things. These concept sets reflect reasonable concept co-occurrences in everyday situations. All concepts given as input are separated by "#". Your job is to generate a sentence describing a day-to-day scene using all concepts from a given concept set.\n\n Positive Example 1 -\nInput: mountain#ski#skier.\n Output: Skier skis down the mountain.\n\n Positive Example 2 -\nInput: call#character#contain#wallpaper.\n Output: queen of wallpaper containing a portrait called film character .\n\nNow complete the following example -\nInput: cob#corn#eat.\nOutput: '
You can notice there are `\n` tokens in the middle and a space at the end. I tried using this as the input (both with and without the `\n` tokens in the middle), and the model gave me the right output.
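For illustration, here is a minimal sketch of that prompt format (a paraphrase of the encoded output above, not the exact code in `src/ni_collator.py`):

```python
# Sketch of the collator's prompt format (paraphrased from the encoded output
# above, not the exact src/ni_collator.py implementation). Note the "\n"
# separators and the trailing space after the final "Output:".
def format_example(definition, pos_examples, test_input):
    prompt = f"Definition: {definition}\n\n"
    for i, (ex_in, ex_out) in enumerate(pos_examples, start=1):
        prompt += f" Positive Example {i} -\nInput: {ex_in}\n Output: {ex_out}\n\n"
    prompt += f"Now complete the following example -\nInput: {test_input}\nOutput: "
    return prompt
```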
That said, I am quite surprised that the model is so sensitive to the trailing space. I suspect the same thing happens for all models <= 3B, which we trained on GPUs with this collator. The 11B model was trained on TPUs without this trailing space.
As for solutions, I think the simplest fix on your side is to just add a space at the end of the input. On our side, maybe we should retrain the models <= 3B?
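For example (a minimal sketch; `input_text` is an assumed name for the linearized example string):

```python
# Workaround: ensure the linearized input ends with a single trailing space,
# matching what the GPU-trained (<= 3B) models saw during training.
if not input_text.endswith(" "):
    input_text += " "
```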
cc @danyaljj
I'm unable to reproduce the predictions found in `Tk-Instruct/output/default/tk-instruct-3b-def-pos/predicted_examples.jsonl` using the `tk-instruct-3b-def-pos` model: the predictions I've computed only match the provided ones ~60% of the time, resulting in a much lower score of 49 vs. the 54 reported in the paper.

To give one specific example, for `task102-87fdccda3ce94464ba5b247a32fb6d74` the input is `cob#corn#eat`. I used the provided `scripts/convert_data_to_s2s.sh` script to convert all examples into linearized inputs. In this particular case, doing so returns:

`Definition: In this task, you are given concept set (with 3 to 5 concepts) that contain mentions of names of people, places, activities, or things. These concept sets reflect reasonable concept co-occurrences in everyday situations. All concepts given as input are separated by "#". Your job is to generate a sentence describing a day-to-day scene using all concepts from a given concept set. Positive Example 1 - Input: mountain#ski#skier. Output: Skier skis down the mountain. Positive Example 2 - Input: call#character#contain#wallpaper. Output: queen of wallpaper containing a portrait called film character . Now complete the following example - Input: cob#corn#eat. Output:`
From what I can tell, this is the correct input (definition + 2 positive examples + input). I used the following code to get a prediction for this input:
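(The snippet below is a minimal sketch of such a query using the `transformers` seq2seq API; the exact code and generation settings from the original post were not preserved.)

```python
# Minimal sketch of querying the model (assumed standard transformers usage;
# the original post's exact code and generation arguments are unknown).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/tk-instruct-3b-def-pos")
model = AutoModelForSeq2SeqLM.from_pretrained("allenai/tk-instruct-3b-def-pos")

input_ids = tokenizer(input, return_tensors="pt").input_ids  # `input` is the linearized string
output_ids = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```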
where `input` is the string from above. However, this gives the output `cob corn and eat`, whereas the expected output (according to the predictions file) would be `a man cobs corn and eats it.`

I have also directly queried the model on the Hugging Face Hub (you can do so using this link), which also gives `cob corn and eat` as output. Why am I not getting the "correct" prediction for this example (and many other examples)?