Closed: samedii closed this issue 7 months ago
Our work and theirs are concurrent, and at the level of ideas the objectives are similar. What sets us apart is the theory: we derive the final loss from a reinforcement-learning perspective and update the probabilities over the entire denoising trajectory. In the work you mentioned, if I understand correctly, they randomly sample a single step of the denoising process to update, rather than updating the whole trajectory. That said, I don't believe there is a significant practical difference between the two methods; both approaches are feasible if you intend to fine-tune diffusion models :)
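To make the contrast concrete, here is a minimal PyTorch sketch of the two update styles. It is purely illustrative and not code from either repository; `scheduler.add_noise`, `model.denoise_distribution`, and `reward_fn` are hypothetical placeholders.

```python
import torch


def single_step_update(model, scheduler, x0, num_steps=50):
    """Single-step style: sample one random timestep per example and
    compute the loss only at that step (illustrative placeholder code)."""
    t = torch.randint(0, num_steps, (x0.shape[0],), device=x0.device)
    noise = torch.randn_like(x0)
    x_t = scheduler.add_noise(x0, noise, t)   # forward-diffuse x0 to step t (hypothetical helper)
    pred_noise = model(x_t, t)                # predict the noise that was added
    return torch.mean((pred_noise - noise) ** 2)


def full_trajectory_update(model, prompt, reward_fn, num_steps=50):
    """Trajectory style: roll out the whole denoising chain, accumulate the
    log-probability of every transition, and weight it by a reward
    (a policy-gradient-style objective; placeholder API names)."""
    x = torch.randn(1, 4, 64, 64)             # start from pure noise
    log_prob = torch.zeros(())
    for t in reversed(range(num_steps)):
        mean, std = model.denoise_distribution(x, t, prompt)  # hypothetical API
        x = (mean + std * torch.randn_like(mean)).detach()    # sample the next latent, treated as a fixed action
        log_prob = log_prob + torch.distributions.Normal(mean, std).log_prob(x).sum()
    reward = reward_fn(x, prompt)             # scalar score of the final sample
    return -(reward * log_prob)               # REINFORCE-style loss over the whole trajectory
```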
Okay, thanks for the clarification! Looking forward to trying your method!
What are the differences with https://github.com/SalesforceAIResearch/DiffusionDPO and will you release pretrained weights at some point to make it easier to experiment?