yk7333 / d3po

[CVPR 2024] Code for the paper "Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model"
https://arxiv.org/abs/2311.13231
MIT License

Comparison with DiffusionDPO #10

Closed: samedii closed this issue 7 months ago

samedii commented 7 months ago

What are the differences from https://github.com/SalesforceAIResearch/DiffusionDPO, and will you release pretrained weights at some point to make it easier to experiment?

yk7333 commented 7 months ago

Our work and theirs are concurrent, and the objectives we aim to achieve are similar. What sets us apart is the theoretical angle: we derive the final loss from a reinforcement-learning perspective and update the probabilities along the entire denoising trajectory. In the work you mentioned, if I understand correctly, they randomly select a single step of the denoising process to update rather than the whole trajectory. Beyond that, I don't think there is a significant difference between the two methods; both approaches are feasible if you intend to fine-tune diffusion models :)
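
For intuition, here is a minimal, illustrative sketch of the distinction described above. It is not code from this repository or from DiffusionDPO; it assumes per-step log-probabilities of a preferred and a dispreferred denoising trajectory are already available under both the fine-tuned model and a frozen reference model, and the helper name and tensor layout are hypothetical.

```python
import torch
import torch.nn.functional as F

def dpo_diffusion_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                       beta=0.1, per_step=False):
    """DPO-style preference loss for diffusion fine-tuning (illustrative).

    logp_w / logp_l: per-step log-probabilities of the preferred ("winner")
    and dispreferred ("loser") denoising trajectories under the current
    policy, shape (batch, num_denoising_steps).
    ref_logp_w / ref_logp_l: the same quantities under the frozen reference
    model. These tensors and this helper are assumptions, not the repo's API.
    """
    if per_step:
        # Single-timestep variant (as described for DiffusionDPO above):
        # score one randomly chosen denoising step per sample.
        t = torch.randint(0, logp_w.shape[1], (logp_w.shape[0],))
        idx = torch.arange(logp_w.shape[0])
        diff_w = logp_w[idx, t] - ref_logp_w[idx, t]
        diff_l = logp_l[idx, t] - ref_logp_l[idx, t]
    else:
        # Trajectory-level variant (the RL view described above): sum the
        # log-prob differences over every denoising step before forming
        # the preference margin.
        diff_w = (logp_w - ref_logp_w).sum(dim=1)
        diff_l = (logp_l - ref_logp_l).sum(dim=1)
    # Standard DPO objective: -log sigmoid(beta * (margin_w - margin_l)).
    return -F.logsigmoid(beta * (diff_w - diff_l)).mean()
```

With `per_step=False` the preference signal is accumulated over the whole denoising trajectory, which mirrors the trajectory-level view described in this thread; `per_step=True` mirrors updating only a randomly selected step.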

samedii commented 7 months ago

Okay, thanks for the clarification! Looking forward to trying your method!