zhaoxlpku / HKU-DASC7606-A2


Question about preprocess function in eval_fewshot.py #9

Closed willcss9109 closed 6 months ago

willcss9109 commented 6 months ago

I am studying the logic flow of the program, and I am confused by the behavior of the preprocess function:

def preprocess(
    sources: Sequence[str],
    targets: Sequence[str],
    tokenizer: transformers.PreTrainedTokenizer,
) -> Dict:
    """Preprocess the data by tokenizing."""
    # Concatenate prompt and answer: the full sequence is what the model sees.
    examples = [s + t for s, t in zip(sources, targets)]
    examples_tokenized, sources_tokenized = [_tokenize_fn(strings, tokenizer) for strings in (examples, sources)]
    input_ids = examples_tokenized["input_ids"]
    labels = copy.deepcopy(input_ids)
    # Mask the prompt portion of each label so only the answer tokens
    # contribute to the loss (IGNORE_INDEX is defined elsewhere in the file).
    for label, source_len in zip(labels, sources_tokenized["input_ids_lens"]):
        label[:source_len] = IGNORE_INDEX
    return dict(input_ids=torch.stack(input_ids).to(device), labels=torch.stack(labels).to(device))

My understanding is that input_ids should be the tokenized input string (prompt) fed to the Phi-1.5 model to generate the output response, and labels should be the desired response from the model. The loss would then be calculated by comparing the labels with the response the model generates.

However, in the current implementation of the preprocess function, the input_ids returned by the function already contain the target answer. This seems to contradict my understanding of how the preprocessing should work.

Is my understanding correct? If not, could you please provide more explanation about this part of the code?

willcss9109 commented 6 months ago

Hello. I have read through the modeling_phi.py file and now understand exactly what the forward function of the class PhiForCausalLM is doing. The preprocess function is implemented correctly. This issue can be closed. Sorry for the question.
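
For anyone else landing here: the resolution hinges on how a causal-LM forward pass treats the labels. The sketch below is a minimal, hypothetical illustration (random logits stand in for the model output), showing why input_ids can safely contain the answer tokens. Inside a HuggingFace-style causal LM such as PhiForCausalLM, logits and labels are shifted by one position so that position i predicts token i+1, and cross-entropy skips every position whose label equals the ignore index, which is exactly what preprocess sets on the prompt span.

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # same sentinel preprocess writes; also torch's default ignore_index

# Toy sequence: 3 prompt tokens followed by 2 answer tokens.
input_ids = torch.tensor([[11, 12, 13, 21, 22]])
labels = input_ids.clone()
labels[0, :3] = IGNORE_INDEX  # mask the prompt, as preprocess does

vocab_size = 100
logits = torch.randn(1, 5, vocab_size)  # stand-in for the model's output logits

# The shift done inside a causal-LM forward: position i predicts token i+1.
shift_logits = logits[:, :-1, :].contiguous()
shift_labels = labels[:, 1:].contiguous()

loss = F.cross_entropy(
    shift_logits.view(-1, vocab_size),
    shift_labels.view(-1),
    ignore_index=IGNORE_INDEX,  # masked prompt positions contribute nothing
)
```

So even though the answer is present in input_ids, the IGNORE_INDEX mask guarantees the loss is computed only over the answer tokens; the prompt merely provides context for the predictions.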