Jonyond-lin closed this issue 4 months ago.
Hi, you should consider training on GPUs. I'm not sure about training time on CPU.
@ziqihuangg So, it's not normal for one epoch of training to take over an hour, right? However, I'm using two A6000 GPUs, and each card has 30GB of VRAM occupied, so the model does appear to be training on CUDA. I'd like to know which part of the code I should adjust to set the CUDA device for the model. Thanks in advance!
Hi, unfortunately I don't have specific information on training on A6000.
My set-up was on V100 GPUs. To specify the number of CUDA devices, you can use the "--gpus" argument (shown below). To choose which GPUs are used, you can set the "CUDA_VISIBLE_DEVICES" environment variable.
```shell
python main.py \
  --logdir 'outputs/512_vae' \
  --base 'configs/512_vae.yaml' \
  -t --gpus 0,1,2,3,
```
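For example, combining both options (a sketch assuming the same config paths as above): restricting visibility to two physical GPUs, which the training script will then see as devices 0 and 1.

```shell
# Expose only the first two physical GPUs to the process;
# inside the script they are renumbered as cuda:0 and cuda:1.
CUDA_VISIBLE_DEVICES=0,1 python main.py \
  --logdir 'outputs/512_vae' \
  --base 'configs/512_vae.yaml' \
  -t --gpus 0,1,
```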
Hello,
Thank you for your great work. I recently encountered an issue while running the training code for Diffuser on GitHub, and I would appreciate your guidance.
During training, I encountered the following error:
I managed to resolve the issue by moving 't' to the CPU. However, I noticed that the training time for a single epoch is quite long, nearly an hour. I am unsure whether this training time is normal, or whether my workaround of moving computation to the CPU is causing the slowdown.
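This kind of error is usually a device mismatch: the sampled timestep tensor `t` lives on a different device than the model or noise schedule. Rather than moving `t` (or anything else) to the CPU, which forces slow CPU execution, the usual fix is to create or move `t` on the model's device. A minimal sketch, assuming a PyTorch setup (the `model` here is a hypothetical stand-in for the denoising network):

```python
import torch

# Stand-in for the denoising model; in practice this is your diffusion model.
model = torch.nn.Linear(4, 4)

# Query the device the model's weights actually live on (CPU or a CUDA card).
device = next(model.parameters()).device

# Sample the diffusion timesteps directly on that device, so indexing the
# noise schedule and calling the model never cross devices.
t = torch.randint(0, 1000, (8,), device=device)

assert t.device == device
```

If `t` already exists on the wrong device, `t = t.to(device)` achieves the same thing without falling back to the CPU.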
Could you please share your typical training time for a single epoch, so I can better understand if my situation is unusual? Additionally, if you suspect that there may be issues with my setup, I would greatly appreciate any suggestions or solutions you can offer.
Thank you very much for your assistance.