wj-Mcat opened 1 week ago
Hello @wj-Mcat,
The `chosen` field of the binary preference data is used for the supervised fine-tuning (NLL) loss in ORPO.
This is the core reason why ORPO can fine-tune a pre-trained language model directly into an aligned, instruction-following model.
Let me know if you have further questions!
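To make this concrete, here is a minimal scalar sketch of the ORPO objective as described in the paper: an NLL term on the chosen response (the SFT part) plus a log-odds-ratio term that prefers chosen over rejected. The function names and the `lambda_or` weight are illustrative, not taken from the actual repo; real implementations work on batched, length-normalized token log-probs.

```python
import math

def log_odds(avg_logp):
    # odds(y|x) = p / (1 - p); computed in log space for stability.
    # avg_logp is a length-normalized sequence log-probability (< 0).
    return avg_logp - math.log1p(-math.exp(avg_logp))

def orpo_loss(chosen_logp, rejected_logp, lambda_or=0.1):
    # 1) SFT term: plain NLL on the *chosen* response -- this is what
    #    lets ORPO absorb the SFT stage instead of needing a separate one.
    sft_loss = -chosen_logp
    # 2) Odds-ratio term: -log sigmoid(log odds(chosen) - log odds(rejected)),
    #    pushing the odds of the chosen response above the rejected one.
    ratio = log_odds(chosen_logp) - log_odds(rejected_logp)
    or_loss = -math.log(1.0 / (1.0 + math.exp(-ratio)))
    return sft_loss + lambda_or * or_loss
```

So a single `[prompt, chosen, rejected]` row feeds both terms at once: `chosen` drives the NLL term, and the `chosen`/`rejected` pair drives the odds-ratio term.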
Description
Referring to https://github.com/xfactlab/orpo/issues/30, I got the data format for ORPO training, but it is actually the DPO training data format: [prompt, chosen, rejected]. The SFT data format looks like [prompt, response (or the chosen field)], which doesn't contain a rejected field.
So I wonder: how do you combine DPO into the SFT stage? Looking forward to your reply.
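For comparison, hypothetical example rows for the two formats mentioned above (field names follow the question; the text values are made up). The SFT-style pair is already contained in the preference row, which is why no separate SFT dataset is needed:

```python
# Binary-preference format used by ORPO (same shape as DPO data).
preference_row = {
    "prompt": "What does the chosen field do in ORPO?",
    "chosen": "It supplies the target for the NLL (SFT) loss term.",
    "rejected": "It is ignored during training.",
}

# The SFT-style pair is implicit in the same row: drop `rejected`
# and treat `chosen` as the response.
sft_row = {
    "prompt": preference_row["prompt"],
    "response": preference_row["chosen"],
}
```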