wj-Mcat opened 1 week ago
Hello @wj-Mcat,
The `chosen` field of the binary preference data is used for the supervised fine-tuning (NLL) loss in ORPO.
This is the core reason why ORPO can fine-tune a pre-trained language model directly into an aligned, instruction-following model.
Let me know if you have further questions!
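To make this concrete, here is a minimal scalar sketch of the ORPO objective as described in the paper: an NLL term on the chosen response (the SFT part) plus a log-odds-ratio term that prefers chosen over rejected. The function names and the `lambda_or` weight are illustrative, not taken from the actual repo; real implementations work on batched, length-normalized token log-probs.

```python
import math

def log_odds(avg_logp):
    # odds(y|x) = p / (1 - p); computed in log space for stability.
    # avg_logp is a length-normalized sequence log-probability (< 0).
    return avg_logp - math.log1p(-math.exp(avg_logp))

def orpo_loss(chosen_logp, rejected_logp, lambda_or=0.1):
    # 1) SFT term: plain NLL on the *chosen* response -- this is what
    #    lets ORPO absorb the SFT stage instead of needing a separate one.
    sft_loss = -chosen_logp
    # 2) Odds-ratio term: -log sigmoid(log odds(chosen) - log odds(rejected)),
    #    pushing the odds of the chosen response above the rejected one.
    ratio = log_odds(chosen_logp) - log_odds(rejected_logp)
    or_loss = -math.log(1.0 / (1.0 + math.exp(-ratio)))
    return sft_loss + lambda_or * or_loss
```

So a single `[prompt, chosen, rejected]` row feeds both terms at once: `chosen` drives the NLL term, and the `chosen`/`rejected` pair drives the odds-ratio term.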
Description
Referring to https://github.com/xfactlab/orpo/issues/30, I got the data format for ORPO training, but it is actually the DPO training data format: [prompt, chosen, rejected]. The SFT data format looks like [prompt, response (or the chosen field)], which doesn't contain a rejected field.
So I wonder: how do you combine DPO into the SFT stage? Looking forward to your reply.
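For comparison, hypothetical example rows for the two formats mentioned above (field names follow the question; the text values are made up). The SFT-style pair is already contained in the preference row, which is why no separate SFT dataset is needed:

```python
# Binary-preference format used by ORPO (same shape as DPO data).
preference_row = {
    "prompt": "What does the chosen field do in ORPO?",
    "chosen": "It supplies the target for the NLL (SFT) loss term.",
    "rejected": "It is ignored during training.",
}

# The SFT-style pair is implicit in the same row: drop `rejected`
# and treat `chosen` as the response.
sft_row = {
    "prompt": preference_row["prompt"],
    "response": preference_row["chosen"],
}
```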