Multi-GPU training - Githubissues

JigneshChowdary commented 1 year ago

Hi, I want to train your model on multiple GPUs, but I am getting errors. Can you help me with this?

eslambakr commented 1 year ago

Hi @JigneshChowdary, what error are you facing? I get the following error when running on 4 GPUs, although the code works smoothly on a single GPU: RuntimeError: Unable to find a valid cuDNN algorithm to run convolution

Your help would be much appreciated @xuekt98 Thanks in advance!

xiaoxiaoyuii commented 1 year ago

Hi @eslambakr I encountered the same issue when running my code on 4 GPUs. Did you manage to resolve it? Could you please share how you resolved it? RuntimeError: Unable to find a valid cuDNN algorithm to run convolution

Your help would be much appreciated @xuekt98 Thanks in advance!

xuekt98 commented 1 year ago

Maybe the hyperparameters you are using are not suitable, so the model is too large to fit on one GPU. Try decreasing the batch size.

xiaoxiaoyuii commented 1 year ago

I want to run on 4 GPUs, not 1 GPU. I get the error "RuntimeError: Unable to find a valid cuDNN algorithm to run convolution". Could you please share how you resolved it?

eslambakr commented 1 year ago

Hi @xiaoxiaoyuii, unfortunately I didn't solve it, as I didn't have enough time to work on it. But I think it is doable: any code that runs on a single GPU can be converted to support distributed training. Sorry about that.
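
For anyone attempting that conversion: one core idea in data-parallel training is that each GPU process works on its own slice of the dataset (in PyTorch this is typically handled by `DistributedSampler` alongside `DistributedDataParallel`). Below is a minimal plain-Python sketch of that slicing idea only. It is not the BBDM code and not PyTorch's implementation; the function name `shard_indices` is hypothetical, and it assumes the dataset has at least as many samples as there are processes.

```python
from typing import List


def shard_indices(dataset_len: int, world_size: int, rank: int) -> List[int]:
    """Return the sample indices handled by process `rank` out of `world_size`.

    Assumes dataset_len >= world_size (illustrative helper, hypothetical name).
    """
    indices = list(range(dataset_len))
    # Pad by wrapping around so every rank gets the same number of samples.
    remainder = dataset_len % world_size
    if remainder:
        indices += indices[: world_size - remainder]
    # Strided slice: rank 0 takes positions 0, 4, 8, ...; rank 1 takes 1, 5, 9, ...
    return indices[rank::world_size]


# Show how 10 samples would be split across 4 GPU processes.
for rank in range(4):
    print(rank, shard_indices(10, world_size=4, rank=rank))
```

Note that in this setup every process still loads its own batch onto its own GPU, so the batch size set per process is what determines memory use on each card.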

arminbiglari commented 4 months ago

Hi @xiaoxiaoyuii, I have solved the problem. Sometimes there is not enough VRAM to run the code, so to handle this issue you should decrease the batch size.
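
If it is unclear how far the batch size needs to drop, one option is to halve it until a single training step succeeds. The sketch below shows that loop in plain Python. It is only an illustration under assumptions: `find_workable_batch_size` and `try_step` are hypothetical names, `fake_step` stands in for a real forward/backward pass, and it assumes the failure surfaces as a `RuntimeError` like the cuDNN message quoted in this thread.

```python
from typing import Callable


def find_workable_batch_size(
    try_step: Callable[[int], None],
    start_batch_size: int,
    min_batch_size: int = 1,
) -> int:
    """Halve the batch size until `try_step` runs without a RuntimeError."""
    batch_size = start_batch_size
    while batch_size >= min_batch_size:
        try:
            try_step(batch_size)  # hypothetical: one forward/backward pass
            return batch_size
        except RuntimeError as err:
            print(f"batch_size={batch_size} failed: {err}")
            batch_size //= 2
    raise RuntimeError("no batch size down to the minimum ran successfully")


# Stand-in for a real training step: pretend anything above 4 samples
# per step exhausts GPU memory.
def fake_step(batch_size: int) -> None:
    if batch_size > 4:
        raise RuntimeError(
            "Unable to find a valid cuDNN algorithm to run convolution"
        )


print(find_workable_batch_size(fake_step, start_batch_size=32))
```

In a real PyTorch run, it may also help to call `torch.cuda.empty_cache()` between attempts, and the value found this way is best treated as an upper bound rather than a tuned setting.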

xuekt98 / BBDM

Multi-GPU training #14