wyysf-98 / CraftsMan

CraftsMan: High-fidelity Mesh Generation with 3D Native Diffusion and Interactive Geometry Refiner
https://craftsman3d.github.io/
379 stars, 17 forks

Training convergence problem of shape diffusion model #19

Closed: waterbearbee closed this issue 1 month ago

waterbearbee commented 1 month ago

Thanks for your amazing work!

As you mentioned in the Appendix, "For the conditional Latent Set Diffusion Model (LSDM), we train our model on 32x A800 GPUs with a batch size of 32 per GPU for 7 days."

I am replicating your work, but after training the model on 32x A800 GPUs for 1 day the results are still poor. How long does training need to run before the results improve?

Thank you!
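
For reference, the figures quoted above can be put side by side. This is only arithmetic on the numbers in the comment; it assumes no gradient accumulation and the same throughput as the paper's run, neither of which is stated in the thread.

```python
# Figures quoted from the paper's appendix in the comment above.
num_gpus = 32
batch_per_gpu = 32
paper_days = 7
trained_days = 1

# Effective (global) batch size, assuming no gradient accumulation.
global_batch = num_gpus * batch_per_gpu

# Fraction of the paper's wall-clock budget used so far, assuming the
# same hardware and throughput as the paper's run.
budget_fraction = trained_days / paper_days

print(f"global batch size: {global_batch}")
print(f"fraction of paper schedule: {budget_fraction:.1%}")
```

Under these assumptions, a 1-day run covers about one seventh of the reported schedule.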

jinnan-chen commented 1 month ago

Hi, I have a similar issue. Are you using obj-mix rendered images for conditional diffusion training?

waterbearbee commented 1 month ago

> Hi, I have a similar issue. Are you using obj-mix rendered images for conditional diffusion training?

Yes. I have checked the obj-mix rendered images, and they look fine.

waterbearbee commented 1 month ago

> Hi, I have a similar issue. Are you using obj-mix rendered images for conditional diffusion training?

Another observation: the training MSE loss becomes very small after the first epoch, about 0.08, but the visualized results at inference time are still poor.
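
A single averaged MSE value mixes all noise levels together, so it may say little about sample quality. One possible diagnostic, sketched below, is to log the loss per timestep bucket. The function name, the 1000-step schedule, and the sample numbers are all illustrative assumptions, not taken from the CraftsMan code or from any real run.

```python
from collections import defaultdict


def mean_loss_per_bucket(timesteps, losses, num_train_timesteps=1000, num_buckets=10):
    """Average per-sample losses within equal-width timestep buckets."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for t, loss in zip(timesteps, losses):
        # Map timestep t in [0, num_train_timesteps) to a bucket index.
        bucket = min(t * num_buckets // num_train_timesteps, num_buckets - 1)
        sums[bucket] += loss
        counts[bucket] += 1
    # Only buckets that received at least one sample are reported.
    return {b: sums[b] / counts[b] for b in sorted(counts)}


# Made-up values purely for illustration (not from a real training run).
timesteps = [5, 50, 450, 520, 900, 990]
losses = [0.30, 0.20, 0.08, 0.06, 0.02, 0.02]
per_bucket = mean_loss_per_bucket(timesteps, losses)
print(per_bucket)
```

If the per-bucket values differ widely, the overall average is being dominated by some noise levels and is not a reliable progress signal on its own.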

wyysf-98 commented 1 month ago

Hi, can you provide more details about the training? In particular, which VAE config did you use? I did not test the released config in detail, so I can take some time to find the cause and fix the config if possible.

jinnan-chen commented 1 month ago

Should VAE sample_posterior be False during Diffusion training?
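
For readers unfamiliar with the flag, here is a minimal sketch of what a switch like sample_posterior typically toggles in a VAE with a diagonal Gaussian posterior: drawing a random latent versus returning the posterior mean. The function encode_latent and its arguments are hypothetical and are not CraftsMan's actual API; which setting is right for diffusion training here is not answered in this thread.

```python
import math
import random


def encode_latent(mean, logvar, sample_posterior, rng):
    """Return a latent from a diagonal Gaussian posterior N(mean, exp(logvar))."""
    if not sample_posterior:
        # Deterministic: use the posterior mean (its mode).
        return list(mean)
    # Stochastic: reparameterized sample z = mean + std * eps, eps ~ N(0, 1).
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mean, logvar)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
mean = [0.5, -1.0, 2.0]
logvar = [-2.0, -2.0, -2.0]

z_mode = encode_latent(mean, logvar, sample_posterior=False, rng=rng)
z_sample = encode_latent(mean, logvar, sample_posterior=True, rng=rng)
print(z_mode, z_sample)
```

With the flag off, the diffusion model always sees the same latent for a given shape; with it on, the latent is perturbed by the posterior's standard deviation on every pass.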

waterbearbee commented 1 month ago

> Hi, can you provide more details about the training? In particular, which VAE config did you use? I did not test the released config in detail, so I can take some time to find the cause and fix the config if possible.

Thank you very much for your reply. I've partially solved this problem by using a single image as the condition and not introducing camera parameters. In addition, I found that the camera parameters provided in the Objaverse mix dataset were inconsistent with those in the code, which is probably the cause.
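
The thread does not say what the inconsistency was. As one hedged illustration of how such a mismatch can be spotted, the sketch below recovers the camera position from a pose matrix under two common interpretations (camera-to-world and world-to-camera) so it can be compared against the position the training code expects. The helper names and the example matrix are assumptions for illustration only.

```python
def camera_center_c2w(m):
    """Camera position if m is camera-to-world: the translation column."""
    return [m[0][3], m[1][3], m[2][3]]


def camera_center_w2c(m):
    """Camera position if m is world-to-camera [R | t]: C = -R^T t."""
    return [-sum(m[k][i] * m[k][3] for k in range(3)) for i in range(3)]


# Hypothetical world-to-camera pose for a camera located at (0, 2, 0).
pose = [
    [0.0, 1.0, 0.0, -2.0],
    [-1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

as_c2w = camera_center_c2w(pose)
as_w2c = camera_center_w2c(pose)
print("read as camera-to-world:", as_c2w)
print("read as world-to-camera:", as_w2c)
```

Both readings give the same distance from the origin, so checking the radius alone would not reveal the mismatch; the recovered position itself has to be compared with the expected one.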