Closed by diamond0910 1 year ago
Hi, you should sample using the diffusion UNet rather than directly from the VAE latents. The VAE here is mainly for spatial compression; the diffusion UNet is the component that models the image distribution.
Do you mean these default outputs are useless?
For the VAE here, you should look at reconstructions_gs-xxxxxx to assess the training progress. I would say you don't need to look at samples_gs-xxxxx, but it can give you a sense of what images look like if you directly sample from the VAE latents without using the diffusion UNet.
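To make the difference between the two output folders concrete, here is a toy 1-D sketch. It is not code from this repository. The `encode` and `decode` functions are hypothetical stand-ins for a weakly regularized VAE, and a simple Gaussian fit on the latents stands in for the diffusion UNet. It assumes the encoded latents occupy a much narrower region than the standard normal prior, which is common when the KL weight is small but may not match this model exactly.

```python
import random
import statistics

random.seed(0)

# Toy 1-D "dataset": values centered at 10 with standard deviation 2.
data = [random.gauss(10.0, 2.0) for _ in range(5000)]


# Hypothetical stand-in for a weakly regularized VAE (not the repository's model).
# The encoder squeezes the data into a narrow latent range (std about 0.1),
# far tighter than the N(0, 1) prior.
def encode(x):
    return (x - 10.0) / 20.0


def decode(z):
    return 20.0 * z + 10.0


# 1) Reconstructions: encode then decode. This is what you check during VAE training.
recon = [decode(encode(x)) for x in data]
recon_error = max(abs(r - x) for r, x in zip(recon, data))

# 2) Prior samples: decode z ~ N(0, 1) directly, without any latent model.
prior_samples = [decode(random.gauss(0.0, 1.0)) for _ in range(5000)]

# 3) Latent-model samples: a Gaussian fitted to the encoded latents stands in
#    for the diffusion UNet, which learns where the latents actually live.
latents = [encode(x) for x in data]
mu = statistics.mean(latents)
sigma = statistics.stdev(latents)
latent_model_samples = [decode(random.gauss(mu, sigma)) for _ in range(5000)]

data_std = statistics.stdev(data)
prior_std = statistics.stdev(prior_samples)
latent_model_std = statistics.stdev(latent_model_samples)

print("max reconstruction error:", recon_error)
print("data std:", data_std)
print("prior-sample std:", prior_std)
print("latent-model-sample std:", latent_model_std)
```

In this toy setup the reconstructions are essentially exact while the prior samples are spread far wider than the data, and sampling through the fitted latent model brings the spread back in line with the data. That mirrors the situation described above: good reconstructions with poor direct samples do not by themselves indicate a problem with the VAE.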
Hi,
Thank you for your nice work.
I would like to know how long training takes for each stage, including the VAE, uni-modal, and dynamic models.
I trained the VAE for about 3 hours on 4 GPUs, but the sampled images are still poor.
The reconstructed images look OK.