ziqihuangg / Collaborative-Diffusion

[CVPR 2023] Collaborative Diffusion
https://ziqihuangg.github.io/projects/collaborative-diffusion.html

About GPU #3

Closed jy12he closed 1 year ago

jy12he commented 1 year ago

Hi, how much GPU memory is required? Can I run this on an RTX 3090?

0546trigger commented 1 year ago

I think you can use the model to generate pictures, but training the model needs much more memory. I tried to train this model on one RTX 4090 following the GitHub instructions and got the error 'CUDA out of memory':

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 23.65 GiB total capacity; 21.73 GiB already allocated; 101.81 MiB free; 21.77 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
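As the error text itself suggests, when reserved memory far exceeds allocated memory the problem may be fragmentation rather than a genuine shortage, and capping the allocator's split size can help. This is a standard PyTorch knob, set via an environment variable before launching training (the value 128 below is just an example to tune, and this will not help if the model truly needs more memory than the card has):

```shell
# Cap the size of blocks the caching allocator will split,
# to reduce fragmentation (value in MiB; tune for your workload).
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128

# Then launch your training command as usual in the same shell.
echo "$PYTORCH_CUDA_ALLOC_CONF"
```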

ziqihuangg commented 1 year ago

@0546trigger Hi, for 512x512 resolution, training dynamic diffusers, I set batch size as 8 samples per GPU, and it takes 12GB GPU memory. You can surely reduce the batch size to avoid the CUDA OOM problem.

jy12he commented 1 year ago

OK, I see, thank you! If I also set the batch size to 8 samples per GPU, how much memory will it consume per GPU? @ziqihuangg

0546trigger commented 1 year ago

> @0546trigger Hi, for 512x512 resolution, training dynamic diffusers, I set batch size as 8 samples per GPU. You can surely reduce the batch size to avoid the CUDA OOM problem.

Do you mean to reduce `max_images` in the config files? I set `max_images` to 4 and then got the following error:

RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

ziqihuangg commented 1 year ago

@jy12he Hi, it takes 12GB GPU memory. The setting is: 512x512 resolution, training dynamic diffusers, batch size = 8 samples per GPU. If your RTX3090 has 24GB memory, there should be no problem training at the 512x512 resolution.

ziqihuangg commented 1 year ago

@0546trigger You simply need to modify the parameter `batch_size` in the config.
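For orientation, a hypothetical config fragment showing where such a parameter typically lives; the key names below follow the latent-diffusion config convention this repo builds on, so check the actual config file you are training with for the exact layout:

```yaml
# hypothetical fragment; key names follow the latent-diffusion
# config convention, and may differ in your config file
data:
  params:
    batch_size: 1   # lower values reduce per-GPU memory use
```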

0546trigger commented 1 year ago

Thanks for your patience. I reduced the batch_size to 1 at 512*512 resolution and it works. It takes about 18GB of memory to run this training on my RTX 4090, which is much larger than the number you mentioned above. Did I do something wrong during installation or when setting parameters?

ziqihuangg commented 1 year ago

@0546trigger Which model are you training? Which config file did you use?

0546trigger commented 1 year ago

The VAE model with the default config, except batch_size = 1.

ziqihuangg commented 1 year ago

@0546trigger For 512x512 VAE training, batch_size = 2 works fine on a 32 GB GPU. Each sample takes around 16 GB.

The setting I previously mentioned was "it takes 12GB GPU memory. The setting is: 512x512 resolution, training dynamic diffusers, batch size = 8 samples per GPU." Hope this clarifies, thanks.