Updates (24.03.25)
trl/test_orpo_trainer_demo.py
This is the official repository for ORPO: Monolithic Preference Optimization without Reference Model. The detailed results in the paper can be found in:
Model Checkpoints
Our models trained with ORPO can be found in:
And the corresponding logs for the average log probabilities of chosen/rejected responses during training are reported in:
AlpacaEval
MT-Bench
IFEval
IFEval scores are measured with EleutherAI/lm-evaluation-harness, with the chat template applied to the prompts. The scores for Llama-2-Chat (70B), Zephyr-β (7B), and Mixtral-8X7B-Instruct-v0.1 were originally reported in this tweet.
Model Type | Prompt-Strict | Prompt-Loose | Inst-Strict | Inst-Loose |
---|---|---|---|---|
Llama-2-Chat (70B) | 0.4436 | 0.5342 | 0.5468 | 0.6319 |
Zephyr-β (7B) | 0.4233 | 0.4547 | 0.5492 | 0.5767 |
Mixtral-8X7B-Instruct-v0.1 | 0.5213 | 0.5712 | 0.6343 | 0.6823 |
Mistral-ORPO-⍺ (7B) | 0.5009 | 0.5083 | 0.5995 | 0.6163 |
Mistral-ORPO-β (7B) | 0.5287 | 0.5564 | 0.6355 | 0.6619 |
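
For readers unfamiliar with the four columns, the sketch below illustrates how prompt-level and instruction-level accuracy differ. It is a toy illustration, not the lm-evaluation-harness implementation: the function names and the `strict` / `loose` verdict lists are hypothetical, and it assumes the usual IFEval convention that the loose variant re-checks lightly transformed responses (for example with markdown emphasis removed), so any instruction that passes strictly also passes loosely.

```python
from typing import Dict, List


def prompt_level_accuracy(results: List[List[bool]]) -> float:
    """Fraction of prompts for which every instruction was followed."""
    return sum(all(per_prompt) for per_prompt in results) / len(results)


def instruction_level_accuracy(results: List[List[bool]]) -> float:
    """Fraction of individual instructions followed, pooled over all prompts."""
    flat = [ok for per_prompt in results for ok in per_prompt]
    return sum(flat) / len(flat)


def ifeval_summary(strict: List[List[bool]], loose: List[List[bool]]) -> Dict[str, float]:
    """Collect the four scores under the column names used in the table."""
    return {
        "Prompt-Strict": prompt_level_accuracy(strict),
        "Prompt-Loose": prompt_level_accuracy(loose),
        "Inst-Strict": instruction_level_accuracy(strict),
        "Inst-Loose": instruction_level_accuracy(loose),
    }


# Hypothetical verdicts: one inner list per prompt, one bool per instruction.
strict = [[True, True], [True, False], [False], [True]]
# Assumed loose verdicts: the second prompt's failing instruction passes
# once the relaxed response variants are considered.
loose = [[True, True], [True, True], [False], [True]]

scores = ifeval_summary(strict, loose)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Because a single missed instruction fails the whole prompt, the prompt-level score is never higher than the instruction-level score under the same criterion, which matches the pattern in every row of the table.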