We indirectly confirm the quality of the AI-generated preference annotations by conducting PPO training and then checking the quantitative results of the trained policy model (i.e., VLM-RLAIF), as shown in Tables 1, 2, and 3 of the paper. This indirect approach is necessary because directly verifying the quality of AI-generated preference annotations is difficult without a human reviewing every sample.
Furthermore, as illustrated in Figure 12 of the paper, we verify that the trained reward model assigns appropriate scores to generated responses, which demonstrates its effectiveness.
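For reference, a common way to sanity-check a trained reward model is to measure its preference accuracy on held-out pairs, i.e., how often it scores the "chosen" response above the "rejected" one. The sketch below is not the authors' code; `reward_model`, `preference_accuracy`, and the data format are hypothetical placeholders, shown only to illustrate the idea.

```python
# Minimal sketch (assumed, not from the VLM-RLAIF repo) of checking a reward
# model against held-out preference pairs: the fraction of pairs where the
# "chosen" response receives a higher score than the "rejected" one.
from typing import Callable, Dict, List


def preference_accuracy(
    reward_model: Callable[[str, str], float],  # (prompt, response) -> scalar score
    pairs: List[Dict[str, str]],                # each: {"prompt", "chosen", "rejected"}
) -> float:
    """Fraction of pairs where the chosen response outscores the rejected one."""
    correct = 0
    for pair in pairs:
        chosen_score = reward_model(pair["prompt"], pair["chosen"])
        rejected_score = reward_model(pair["prompt"], pair["rejected"])
        correct += chosen_score > rejected_score
    return correct / max(len(pairs), 1)


if __name__ == "__main__":
    # Toy stand-ins for the reward model and the held-out preference data.
    toy_pairs = [
        {"prompt": "Describe the video.", "chosen": "A detailed answer.", "rejected": "Bad."},
    ]
    toy_reward = lambda prompt, response: float(len(response))  # placeholder scorer
    print(f"preference accuracy: {preference_accuracy(toy_reward, toy_pairs):.2f}")
```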
I want to know how you prove that the AI-generated preference annotations are correct, so that they can be used to train the reward model.