yonseivnl / vlm-rlaif

ACL'24 (Oral) Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback
Apache License 2.0

AI-generated preference annotations may be noisy #2

Closed hlchen23 closed 4 months ago

hlchen23 commented 9 months ago

I want to know how you verify that the AI-generated preference annotations are correct, so that they can be used to train the reward model.

dcahn12 commented 4 months ago

We indirectly confirm the quality of the AI-generated preference annotations after PPO training by checking the quantitative results of the trained policy model (i.e., VLM-RLAIF), as shown in Tables 1, 2, and 3 of the paper. This indirect approach is necessary because directly confirming the quality of the AI-generated preference annotations would require a human to review every sample.

Furthermore, as illustrated in Figure 12 of the paper, we verify that the trained reward model assigns appropriate scores to generated responses, which demonstrates its effectiveness.
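
If you want to run a similar sanity check yourself, a minimal sketch (not our released code) is to measure how often the reward model scores the preferred response above the rejected one on held-out preference pairs. The `reward_model`, `tokenizer`, and pair format below are placeholders for whatever scalar-reward model and data you actually load:

```python
import torch

@torch.no_grad()
def preference_accuracy(reward_model, tokenizer, pairs, device="cuda"):
    """pairs: list of dicts with 'prompt', 'chosen', and 'rejected' strings.

    Returns the fraction of pairs where the reward model scores the
    chosen response higher than the rejected one.
    """
    correct = 0
    for pair in pairs:
        scores = []
        for response in (pair["chosen"], pair["rejected"]):
            inputs = tokenizer(
                pair["prompt"] + response,
                return_tensors="pt",
                truncation=True,
            ).to(device)
            # Assumption: the reward model returns a single scalar score
            # for the whole sequence.
            scores.append(reward_model(**inputs).item())
        correct += int(scores[0] > scores[1])
    return correct / len(pairs)
```

A high accuracy on pairs the reward model was not trained on is indirect evidence that the AI-generated preference labels are consistent enough to provide a useful training signal.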

hlchen23 commented 4 months ago

Your message has been received, thank you! (Automatic reply) Chen