Generate "source", which contains the example QAs and the Question we ask.
Concatenate "source" and "target", where target is the option we want to compare the LLM's output with.
Label ONLY the input_ids at the target part.
Loss is computed using the input_ids of the "target" (labels) and outputs.logits (the LLM's output)
The above steps are done by preprocess().
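For reference, here is a minimal sketch of how that per-option loss can be computed from `outputs.logits` and the masked labels. It is not the exact scoring code in `eval_fewshot.py`; the value `IGNORE_INDEX = -100` is an assumption.

```python
# Minimal sketch (not the exact code in eval_fewshot.py) of scoring one option:
# cross-entropy over the "target" tokens only, using the labels built by preprocess().
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # assumed value of the repo's IGNORE_INDEX

def option_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Average cross-entropy over positions whose label is not IGNORE_INDEX."""
    # Shift so that the logits at position i are scored against the token at position i + 1.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=IGNORE_INDEX,
    )
```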
It seems that a key step is missing before the encodings are fed to the model's forward pass: setting the attention_mask entries corresponding to the "target" to 0.

The current design leaves attention_mask as None. For a CausalLM, passing attention_mask=None is equivalent to an all-ones mask, which means the input_ids of the "target" can be seen by the model. Therefore, at inference time the LLM's output will always be similar or identical to the "target", especially when the candidate answers are provided, which explains why prompting v1.0 for multiple-choice selection performs even worse than the free-answer prompting v2.0.
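A quick way to check the None-vs-all-ones equivalence is the toy sketch below; it uses "gpt2" purely as a stand-in model, not the model evaluated in this repo.

```python
# Sketch: with no padding, attention_mask=None and an all-ones mask give identical
# logits for a Hugging Face causal LM, so every token in input_ids is visible.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")          # stand-in model for illustration
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

enc = tok("Gold answer: Put the objects in groups.", return_tensors="pt")
with torch.no_grad():
    out_none = model(input_ids=enc["input_ids"])     # attention_mask defaults to None
    out_ones = model(
        input_ids=enc["input_ids"],
        attention_mask=torch.ones_like(enc["input_ids"]),
    )
print(torch.allclose(out_none.logits, out_ones.logits))  # True
```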
The original `preprocess()` function is:
```python
def preprocess(
    sources: Sequence[str],
    targets: Sequence[str],
    tokenizer: transformers.PreTrainedTokenizer,
) -> Dict:
    """Preprocess the data by tokenizing."""
    examples = [s + t for s, t in zip(sources, targets)]
    examples_tokenized, sources_tokenized = [_tokenize_fn(strings, tokenizer) for strings in (examples, sources)]
    input_ids = examples_tokenized["input_ids"]
    labels = copy.deepcopy(input_ids)
    for label, source_len in zip(labels, sources_tokenized["input_ids_lens"]):
        label[:source_len] = IGNORE_INDEX
    return dict(input_ids=torch.stack(input_ids).to(device), labels=torch.stack(labels).to(device))
```
After masking out the attention for the input_ids corresponding to the "target" part:
```python
def preprocess(
    sources: Sequence[str],
    targets: Sequence[str],
    tokenizer: transformers.PreTrainedTokenizer,
) -> Dict:
    """Preprocess the data by tokenizing."""
    examples = [s + t for s, t in zip(sources, targets)]
    examples_tokenized, sources_tokenized = [_tokenize_fn(strings, tokenizer) for strings in (examples, sources)]
    input_ids = examples_tokenized["input_ids"]
    labels = copy.deepcopy(input_ids)
    masks = []
    for label, source_len in zip(labels, sources_tokenized["input_ids_lens"]):
        label[:source_len] = IGNORE_INDEX
        # Attend only to the "source" tokens; hide the "target" tokens from the model.
        mask = torch.ones_like(label)
        mask[source_len:] = 0
        masks.append(mask)
    return dict(
        input_ids=torch.stack(input_ids).to(device),
        labels=torch.stack(labels).to(device),
        attention_mask=torch.stack(masks).to(device),
    )
```
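A usage sketch of how the returned attention_mask would then be passed through to the forward call; this assumes `model`, `tokenizer`, `sources`, and `targets` are already in scope and may differ from the actual call site in `eval_fewshot.py`.

```python
# Usage sketch: feed the new attention_mask to the model so the "target" tokens are hidden.
import torch

batch = preprocess(sources, targets, tokenizer)
with torch.no_grad():
    outputs = model(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],  # "target" positions masked out
        labels=batch["labels"],
    )
print(outputs.loss)  # loss computed over the "target" tokens only
```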
I use the following code to output the LLM's prediction, inside the for loop of `eval_fewshot.py` -> `main()`:
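Roughly, the debug print looks like the sketch below. Decoding the "prediction" as the per-position argmax of `outputs.logits` over the target span, as well as the helper names `true_answer` and `target_text`, are assumptions for illustration.

```python
# Sketch of a debug print (assumes `outputs`, `batch`, `tokenizer`, `true_answer`,
# and `target_text` are in scope; the argmax decoding is an assumption).
IGNORE_INDEX = -100

pred_ids = outputs.logits.argmax(dim=-1)            # greedy token predicted at every position
target_positions = batch["labels"] != IGNORE_INDEX  # True at the "target" token positions
# Logits at position i predict the token at position i + 1, so shift by one before decoding.
pred_at_target = pred_ids[:, :-1][target_positions[:, 1:]]

print("-------------------------------------")
print("True Answer")
print(true_answer)
print("-------------------------------------")
print("Target")
print(target_text)
print("-------------------------------------")
print("LLM's Prediction")
print(tokenizer.decode(pred_at_target))
```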
Before

Before masking out the attention to the target input_ids, the LLM's output always aligns with the given target. The following are the first four problems (same question but with 4 different targets) from ARC_challenge_validation.jsonl.

Note: for the same prompt with different targets, the expected output should be the same! But due to the bug described above in the provided codebase, the outputs align with the given targets!
prompt #0: Question: A student compared the speeds at which a large and a small marble rolled down an inclined plane. In order to make the findings more reliable, the student should
Candidate answers: (A) release the marbles at different heights. (B) repeat the experiment several times. (C) tilt the plane at different angles. (D) use two marbles that are the same size.
Gold answer: repeat the experiment several times.
Question: Which is most likely needed when describing the change in position of an object?
Candidate answers: (A) initial speed (B) direction change (C) reference point (D) constant rate
Gold answer: reference point
Question: Juan and LaKeisha roll a few objects down a ramp. They want to see which object rolls the farthest. What should they do so they can repeat their investigation?
Candidate answers: (A) Put the objects in groups. (B) Change the height of the ramp. (C) Choose different objects to roll. (D) Record the details of the investigation.
Gold answer:
-------------------------------------
True Answer
D
-------------------------------------
Target
Put the objects in groups.
-------------------------------------
LLM's Prediction
( the objects in groups.
-------------------------------------
True Answer
D
-------------------------------------
Target
Change the height of the ramp.
-------------------------------------
LLM's Prediction
( the height of the ramp.
-------------------------------------
True Answer
D
-------------------------------------
Target
Choose different objects to roll.
-------------------------------------
LLM's Prediction
( different objects to roll.
-------------------------------------
True Answer
D
Record the details of the investigation.
-------------------------------------
Target
Record the details of the investigation.
-------------------------------------
LLM's Prediction
( the details of the investigation.
After
After masking out the attention to the target, the LLM now truly answers the question, even though it says nonsense (since the max length of the prediction is still fixed by the length of the target!):
prompt #0: Question: A student compared the speeds at which a large and a small marble rolled down an inclined plane. In order to make the findings more reliable, the student should
Candidate answers: (A) release the marbles at different heights. (B) repeat the experiment several times. (C) tilt the plane at different angles. (D) use two marbles that are the same size.
Gold answer: repeat the experiment several times.
Question: Which is most likely needed when describing the change in position of an object?
Candidate answers: (A) initial speed (B) direction change (C) reference point (D) constant rate
Gold answer: reference point
Question: Juan and LaKeisha roll a few objects down a ramp. They want to see which object rolls the farthest. What should they do so they can repeat their investigation?
Candidate answers: (A) Put the objects in groups. (B) Change the height of the ramp. (C) Choose different objects to roll. (D) Record the details of the investigation.
Gold answer:
-------------------------------------
True Answer
D
-------------------------------------
Target
Put the objects in groups.
-------------------------------------
LLM's Prediction
( the details in order
-------------------------------------
True Answer
D
-------------------------------------
Target
Change the height of the ramp.
-------------------------------------
LLM's Prediction
( the details of the details
-------------------------------------
True Answer
D
-------------------------------------
Target
Choose different objects to roll.
-------------------------------------
LLM's Prediction
( different objects in repeat the
-------------------------------------
True Answer
D
Record the details of the investigation.
-------------------------------------
Target
Record the details of the investigation.
-------------------------------------
LLM's Prediction
( the details of the details