[Question] Label information leakage

yxuansu / OpenAlpaca

OpenAlpaca: A Fully Open-Source Instruction-Following Model Based On OpenLLaMA

Apache License 2.0


[Question] Label information leakage #8

Open Nsigma-Bill opened 1 year ago

Nsigma-Bill commented 1 year ago

I have a question about the function preprocess in datasets/sft_dataset.py. Line 51 reads:

inpt = [1] + s_tokens + t_tokens + [2]

I am a bit confused about why the target (label) tokens are added to the input without being masked. To me, this looks like label information leakage.

Could you clarify this a bit? Thanks!
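
For context on the question above, here is a minimal sketch of how a concatenated prompt-plus-target sequence is commonly paired with labels in causal language model fine-tuning. It illustrates the general technique only and is not taken from the repository: the helper names build_example and training_pairs, the toy token ids, and the prompt masking are assumptions. The value -100 is the default ignore_index of PyTorch's CrossEntropyLoss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_example(s_tokens, t_tokens, bos_id=1, eos_id=2):
    """Hypothetical helper: concatenate prompt and target, mask the prompt in the labels."""
    # Same shape as the line quoted in the question: [1] + s_tokens + t_tokens + [2]
    input_ids = [bos_id] + s_tokens + t_tokens + [eos_id]
    # Assumption: loss is computed only on target tokens and the end-of-sequence token.
    labels = [IGNORE_INDEX] * (1 + len(s_tokens)) + t_tokens + [eos_id]
    return input_ids, labels


def training_pairs(input_ids, labels):
    """List the (visible context, token to predict) pairs a causal LM is trained on."""
    pairs = []
    for i in range(len(input_ids) - 1):
        target = labels[i + 1]  # position i predicts the token at position i + 1
        if target == IGNORE_INDEX:
            continue  # masked positions contribute no loss
        # Under a causal attention mask, position i sees only tokens 0..i.
        pairs.append((input_ids[: i + 1], target))
    return pairs


input_ids, labels = build_example(s_tokens=[10, 11, 12], t_tokens=[20, 21])
pairs = training_pairs(input_ids, labels)
for context, target in pairs:
    print(context, "->", target)
```

In this sketch each target token is predicted from the tokens before it only, so a target token appears in the input solely as context for later predictions. Whether OpenAlpaca masks the prompt positions in the loss, as assumed here, is not established by the quoted line.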