Jonyond-lin closed this issue 4 months ago.
Hi, you should consider training on GPUs. I'm not sure about training time on CPU.
@ziqihuangg So, it's not normal for one epoch of training to take over an hour, right? However, I'm using two A6000 GPUs, and each card has 30GB of VRAM occupied, so the model does appear to be training on CUDA. I'd like to know which part of the code I should adjust to set the CUDA device for the model. Thanks in advance!
Hi, unfortunately I don't have specific information on training on A6000.
My set-up was on V100 GPUs. To specify the number of CUDA devices, you can use the "--gpus" argument (shown below). To choose which GPUs are used, you can set the "CUDA_VISIBLE_DEVICES" environment variable.
```shell
python main.py \
  --logdir 'outputs/512_vae' \
  --base 'configs/512_vae.yaml' \
  -t --gpus 0,1,2,3,
```
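For example, combining both options (a sketch assuming the same config paths as above): restricting visibility to two physical GPUs, which the training script will then see as devices 0 and 1.

```shell
# Expose only the first two physical GPUs to the process;
# inside the script they are renumbered as cuda:0 and cuda:1.
CUDA_VISIBLE_DEVICES=0,1 python main.py \
  --logdir 'outputs/512_vae' \
  --base 'configs/512_vae.yaml' \
  -t --gpus 0,1,
```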
Hello,
Thank you for your great work. I recently encountered an issue while running the training code for Diffuser on GitHub, and I would appreciate your guidance.
During training, I encountered the following error:
I managed to resolve the issue by moving 't' to the CPU. However, I noticed that the training time for a single epoch is quite long, nearly an hour. I am unsure whether this training time is normal, or whether my workaround of moving computation to the CPU is causing the slowdown.
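This kind of error is usually a device mismatch: the sampled timestep tensor `t` lives on a different device than the model or noise schedule. Rather than moving `t` (or anything else) to the CPU, which forces slow CPU execution, the usual fix is to create or move `t` on the model's device. A minimal sketch, assuming a PyTorch setup (the `model` here is a hypothetical stand-in for the denoising network):

```python
import torch

# Stand-in for the denoising model; in practice this is your diffusion model.
model = torch.nn.Linear(4, 4)

# Query the device the model's weights actually live on (CPU or a CUDA card).
device = next(model.parameters()).device

# Sample the diffusion timesteps directly on that device, so indexing the
# noise schedule and calling the model never cross devices.
t = torch.randint(0, 1000, (8,), device=device)

assert t.device == device
```

If `t` already exists on the wrong device, `t = t.to(device)` achieves the same thing without falling back to the CPU.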
Could you please share your typical training time for a single epoch, so I can better understand if my situation is unusual? Additionally, if you suspect that there may be issues with my setup, I would greatly appreciate any suggestions or solutions you can offer.
Thank you very much for your assistance.