Closed by kxleee 4 months ago
Hello @kxleee, thanks for the question. Although it is hard to strictly characterize the loss without a reference model in our setting, you can think of the SFT term on the chosen responses in ORPO as providing guidance similar to what a reference model would give explicitly.
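To make that a bit more concrete, here is a rough sketch of how the two terms combine for a single preference pair. It assumes the usual ORPO form (an SFT negative log-likelihood on the chosen response plus a weighted log odds-ratio penalty), takes length-averaged log-probabilities as plain floats, and uses hypothetical names (`log_odds`, `orpo_loss`, `lam`). It is an illustration only, not the code from the repository.

```python
import math


def log_odds(avg_logp: float) -> float:
    """log(p / (1 - p)) for p = exp(avg_logp); avg_logp must be strictly negative."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_loss(avg_logp_chosen: float, avg_logp_rejected: float, lam: float = 0.1):
    """Return (total, sft_term, odds_ratio_term) for one preference pair.

    lam is an illustrative weight, not a recommended setting.
    """
    # SFT term: negative log-likelihood of the chosen response.
    # This is the part that plays the role a reference model would otherwise play.
    sft_term = -avg_logp_chosen

    # Odds-ratio term: -log(sigmoid(z)) where z is the log odds ratio
    # of the chosen response over the rejected one.
    z = log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected)
    odds_ratio_term = math.log1p(math.exp(-z))

    return sft_term + lam * odds_ratio_term, sft_term, odds_ratio_term


total, sft_term, or_term = orpo_loss(-0.5, -2.0)
print(total, sft_term, or_term)
```

Because the SFT term is always present, the model keeps being pulled toward the chosen responses while the odds-ratio term pushes the rejected ones down, so there is no separate frozen model to compare against.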
For a chat model that has already gone through SFT, how do you ensure that its performance does not degrade without a reference model? Thanks.