zhaoxlpku / HKU-DASC7606-A2


A potential **WRONG design / bug** in the preprocessing! #15

Open Linnore opened 5 months ago

Linnore commented 5 months ago

The current workflow of eval_fewshot.py is:

  1. Generate "source", which contains the example QAs and the question we ask.
  2. Concatenate "source" and "target", where "target" is the candidate option we want to compare the LLM's output against.
  3. Label ONLY the input_ids of the "target" part (the "source" part of the labels is set to IGNORE_INDEX).
  4. The loss is computed from the labels (the "target" input_ids) and outputs.logits (the LLM's output).

The above steps are done by preprocess().
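
To make the label layout concrete, here is a toy illustration (made-up token ids, not from the repo) of what steps 2-3 produce:

```python
import torch

# -100 is the value HF's CrossEntropyLoss ignores by default;
# the repo defines IGNORE_INDEX elsewhere in eval_fewshot.py.
IGNORE_INDEX = -100

source_ids = torch.tensor([12, 7, 99, 3, 41])  # example QAs + the question
target_ids = torch.tensor([8, 15, 2])          # the candidate option

input_ids = torch.cat([source_ids, target_ids])  # step 2: concatenation
labels = input_ids.clone()
labels[: len(source_ids)] = IGNORE_INDEX         # step 3: label only the target

print(input_ids)  # tensor([12,  7, 99,  3, 41,  8, 15,  2])
print(labels)     # tensor([-100, -100, -100, -100, -100,    8,   15,    2])
```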

It seems that a key step is missing before the encodings are fed to the model's forward pass: setting the attention_mask entries corresponding to "target" to 0.

The current design leaves attention_mask as None. For a CausalLM, attention_mask=None is equivalent to an attention_mask of all 1s. This means the input_ids of the "target" can be seen! Therefore, during inference the LLM's output will always be similar or identical to the "target", especially when the candidate answers are provided, which explains why the performance of prompting v1.0 for multiple-choice selection is even worse than the free-answering prompting v2.0.
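
This equivalence is easy to verify. The following sketch (gpt2 is used only as a stand-in model) shows that for an unpadded batch, attention_mask=None and an all-ones mask give identical logits, so every "target" token is attended to:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

enc = tokenizer("Gold answer:", return_tensors="pt")
with torch.no_grad():
    logits_none = model(input_ids=enc["input_ids"]).logits
    logits_ones = model(
        input_ids=enc["input_ids"],
        attention_mask=torch.ones_like(enc["input_ids"]),
    ).logits

print(torch.allclose(logits_none, logits_ones))  # True
```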

The original preprocess function is:


```python
# Excerpt from eval_fewshot.py; IGNORE_INDEX, device, and _tokenize_fn
# are defined elsewhere in the same file.
def preprocess(
    sources: Sequence[str],
    targets: Sequence[str],
    tokenizer: transformers.PreTrainedTokenizer,
) -> Dict:
    """Preprocess the data by tokenizing."""
    examples = [s + t for s, t in zip(sources, targets)]
    examples_tokenized, sources_tokenized = [_tokenize_fn(strings, tokenizer) for strings in (examples, sources)]
    input_ids = examples_tokenized["input_ids"]
    labels = copy.deepcopy(input_ids)
    # Mask out the source part of the labels so the loss only covers the target.
    for label, source_len in zip(labels, sources_tokenized["input_ids_lens"]):
        label[:source_len] = IGNORE_INDEX
    return dict(input_ids=torch.stack(input_ids).to(device), labels=torch.stack(labels).to(device))
```
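
preprocess() relies on a helper _tokenize_fn that is not quoted in this issue; all it needs per string is the token ids and the token count. A minimal sketch of such a helper (the repo's actual version may differ, e.g. in padding and truncation handling):

```python
from typing import Dict, Sequence

import transformers

def _tokenize_fn(strings: Sequence[str], tokenizer: transformers.PreTrainedTokenizer) -> Dict:
    """Tokenize each string and record its token count."""
    tokenized = [tokenizer(s, return_tensors="pt", truncation=True) for s in strings]
    input_ids = [t.input_ids[0] for t in tokenized]
    # With no padding, the length is simply the token count.
    input_ids_lens = [len(ids) for ids in input_ids]
    return dict(input_ids=input_ids, input_ids_lens=input_ids_lens)
```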

After masking the attention positions corresponding to the target part:


```python
def preprocess(
    sources: Sequence[str],
    targets: Sequence[str],
    tokenizer: transformers.PreTrainedTokenizer,
) -> Dict:
    """Preprocess the data by tokenizing."""
    examples = [s + t for s, t in zip(sources, targets)]
    examples_tokenized, sources_tokenized = [_tokenize_fn(strings, tokenizer) for strings in (examples, sources)]
    input_ids = examples_tokenized["input_ids"]
    labels = copy.deepcopy(input_ids)
    masks = []
    for label, source_len in zip(labels, sources_tokenized["input_ids_lens"]):
        label[:source_len] = IGNORE_INDEX
        # Attention mask: 1 over the source, 0 over the target, so the
        # model cannot peek at the target tokens it is being scored on.
        mask = torch.ones_like(label)
        mask[source_len:] = 0
        masks.append(mask)
    return dict(
        input_ids=torch.stack(input_ids).to(device),
        labels=torch.stack(labels).to(device),
        attention_mask=torch.stack(masks).to(device),
    )
```
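
Hypothetical usage (sources, targets, tokenizer, and model are placeholders; as in the original code, torch.stack assumes the batch entries have equal length):

```python
encoding = preprocess(sources, targets, tokenizer)
print(encoding["attention_mask"][0])  # 1s over the source, 0s over the target
outputs = model(**encoding)           # HF causal LMs accept labels directly
```

Note one side effect of this mask: the target tokens are also hidden from each other, so every target position is predicted as if only the source were visible.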

I use the following code to output the LLM's prediction, inside the for loop of eval_fewshot.py -> main():

```python
# Debug snippet inside the for loop of eval_fewshot.py -> main();
# encoding, source_len, answer, problems, and i come from the loop.
# The flattening below assumes a batch size of 1.
with torch.no_grad():
    # task 6
    outputs = model(**encoding)
    log_likelihood = "Write your codes here"

    # Output the prediction.
    label = problems[i]["label"]
    answerKey = problems[i]["answerKey"]

    print("-------------------------------------")
    print("True Answer")
    print(answerKey)
    if answerKey == label:
        print(answer)
    print("-------------------------------------")
    print("Target")
    # Standard causal-LM shift: the logit at position t predicts token t + 1.
    shift_logits = outputs.logits[..., :-1, :].contiguous()
    shift_labels = encoding["labels"][..., 1:].contiguous()
    shift_logits = shift_logits.view(-1, model.config.vocab_size)
    shift_labels = shift_labels.view(-1)
    print(tokenizer.decode(shift_labels[source_len - 1:]))
    print(shift_labels[source_len - 1:])
    print("-------------------------------------")
    print("LLM's Prediction")
    # Greedy (argmax) prediction at each position over the target span.
    prediction = shift_logits.argmax(dim=1)[source_len - 1:]
    print(prediction)
    print(tokenizer.decode(prediction))
```
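
For context on step 4, one common way to reduce the shifted logits and labels above to a per-sequence log-likelihood over the target tokens is sketched below (a sketch only, not necessarily the intended solution to task 6; it reuses shift_logits and shift_labels from the block above and skips the IGNORE_INDEX positions):

```python
log_probs = torch.log_softmax(shift_logits, dim=-1)   # (N, vocab_size)
target_mask = shift_labels.ne(IGNORE_INDEX)           # True only on the target
safe_labels = shift_labels.clamp_min(0)               # avoid indexing with -100
token_logp = log_probs.gather(1, safe_labels.unsqueeze(1)).squeeze(1)
log_likelihood = (token_logp * target_mask).sum()
```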

Before

Before masking out the attention for the target, the LLM's output always aligns with the given target. The following are the first four problems (the same question, but given 4 different targets) of ARC_challenge_validation.jsonl.

Note: for the same prompt with different targets, the expected output should be the same! But due to the mentioned bug in the provided code base, the outputs align with the given targets!

prompt #0: Question: A student compared the speeds at which a large and a small marble rolled down an inclined plane. In order to make the findings more reliable, the student should
Candidate answers: (A) release the marbles at different heights. (B) repeat the experiment several times. (C) tilt the plane at different angles. (D) use two marbles that are the same size.
Gold answer: repeat the experiment several times.

Question: Which is most likely needed when describing the change in position of an object?
Candidate answers: (A) initial speed (B) direction change (C) reference point (D) constant rate
Gold answer: reference point

Question: Juan and LaKeisha roll a few objects down a ramp. They want to see which object rolls the farthest. What should they do so they can repeat their investigation?
Candidate answers: (A) Put the objects in groups. (B) Change the height of the ramp. (C) Choose different objects to roll. (D) Record the details of the investigation.
Gold answer:
-------------------------------------
True Answer
D
-------------------------------------
Target
 Put the objects in groups.
-------------------------------------
LLM's Prediction
 ( the objects in groups.
-------------------------------------
True Answer
D
-------------------------------------
Target
 Change the height of the ramp.
-------------------------------------
LLM's Prediction
 ( the height of the ramp.
-------------------------------------
True Answer
D
-------------------------------------
Target
 Choose different objects to roll.
-------------------------------------
LLM's Prediction
 ( different objects to roll.
-------------------------------------
True Answer
D
Record the details of the investigation.
-------------------------------------
Target
 Record the details of the investigation.
-------------------------------------
LLM's Prediction
 ( the details of the investigation.

After

After masking out the target attention, the LLM now truly answers the question, even though it is saying nonsense (since the prediction length is still fixed by the length of the given target!):

prompt #0: Question: A student compared the speeds at which a large and a small marble rolled down an inclined plane. In order to make the findings more reliable, the student should
Candidate answers: (A) release the marbles at different heights. (B) repeat the experiment several times. (C) tilt the plane at different angles. (D) use two marbles that are the same size.
Gold answer: repeat the experiment several times.

Question: Which is most likely needed when describing the change in position of an object?
Candidate answers: (A) initial speed (B) direction change (C) reference point (D) constant rate
Gold answer: reference point

Question: Juan and LaKeisha roll a few objects down a ramp. They want to see which object rolls the farthest. What should they do so they can repeat their investigation?
Candidate answers: (A) Put the objects in groups. (B) Change the height of the ramp. (C) Choose different objects to roll. (D) Record the details of the investigation.
Gold answer:
-------------------------------------
True Answer
D
-------------------------------------
Target
 Put the objects in groups.
-------------------------------------
LLM's Prediction
 ( the details in order

-------------------------------------
True Answer
D
-------------------------------------
Target
 Change the height of the ramp.
-------------------------------------
LLM's Prediction
 ( the details of the details

-------------------------------------
True Answer
D
-------------------------------------
Target
 Choose different objects to roll.
-------------------------------------
LLM's Prediction
 ( different objects in repeat the
-------------------------------------
True Answer
D
Record the details of the investigation.
-------------------------------------
Target
 Record the details of the investigation.
-------------------------------------
LLM's Prediction
 ( the details of the details
Linnore commented 5 months ago

So perhaps this design is meant to let the LLM overcome the leaked target and give the answer it really believes in... If that is true, then this issue can be closed.