LTT-O opened 5 months ago
Hi, we combined the CCM map of size 3x64x64 with the mask map of size 1x64x64 to create a latent image of size 4x64x64. Fine-tuning the UNet of Stable Diffusion v2.1 took approximately 3 days on 8 40GB A100 GPUs.
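The channel concatenation described above can be sketched as follows (a minimal illustration with random tensors standing in for the rendered CCM and mask; the variable names are assumptions, not the authors' code):

```python
import torch

# Stand-ins for the rendered maps described above:
# a 3-channel canonical coordinate map and a 1-channel object mask at 64x64.
ccm = torch.rand(1, 3, 64, 64)                    # (batch, C, H, W)
mask = (torch.rand(1, 1, 64, 64) > 0.5).float()   # binary mask

# Concatenate along the channel dimension to form the 4-channel
# "latent image" fed to the Stable Diffusion v2.1 UNet.
latent = torch.cat([ccm, mask], dim=1)
print(latent.shape)  # torch.Size([1, 4, 64, 64])
```

Because the result has 4 channels at 64x64, it matches the shape the pre-trained UNet already expects, so no input-layer surgery is needed for the channel count.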
Thanks for your reply. I'm very curious why you didn't use a VAE: is it because the quality of the CCM reconstructed from it isn't good?
1. Efficiency and stronger generalization ability. Fantasia3D and Latent-NeRF found that a VAE is unnecessary: the rendered image can be treated directly as a latent image. This allows rapid generation of a 3D model matching the text description with SDS loss, without passing gradients through the VAE. It works because the pre-trained latent space resembles a downsampled feature map of the RGB space, preserving the shape of the object rather than being meaningless noise.
We found that fine-tuning the diffusion UNet on CCMs with only a small amount of data gives it 3D perception capabilities without causing catastrophic forgetting. Conversely, our early experiments showed a significant drop in the model's generalization ability when fine-tuning with the VAE, which led us to ultimately abandon it.
2. To train both NeRF and DMTet representations from scratch. The ability to significantly change the initial shape is key to training DMTet from scratch. Fantasia3D found that gradients computed without the VAE can substantially change the initial shape, while gradients computed through the VAE cannot. We follow Fantasia3D and abandon the VAE.
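The "no VAE in the gradient path" idea above can be sketched in a few lines: the SDS gradient is applied directly to the rendered 4-channel image, which is treated as the latent itself. This is a toy illustration under assumed shapes, with a constant standing in for the UNet's noise prediction; it is not the authors' implementation.

```python
import torch

def sds_gradient(eps_pred, noise, weight=1.0):
    # Score Distillation Sampling gradient on the latent itself:
    # grad = w(t) * (eps_theta(z_t, t, y) - eps), with no VAE encode/decode.
    return weight * (eps_pred - noise)

# Pretend the rendered CCM+mask image IS the latent z (4x64x64).
z = torch.rand(1, 4, 64, 64, requires_grad=True)
noise = torch.randn_like(z)
eps_pred = noise + 0.1  # stand-in for the frozen UNet's prediction

grad = sds_gradient(eps_pred, noise)
# Inject the SDS gradient directly at the rendering; in practice it would
# then flow into the NeRF / DMTet parameters through the renderer.
z.backward(gradient=grad)
```

Because the gradient never passes through a VAE encoder, it acts on the rendering at full strength, which is what lets it move the initial shape far enough to train DMTet from scratch.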
Thanks so much for your very detailed explanation!
Hello, I noticed that you use the CCM as the latent image for LDM fine-tuning, but the CCM has 3 channels while the original VAE latent has 4. Did you modify the input layer of the model? Also, is the resolution of the CCM 64?