CUDA error - Githubissues

Dratlan commented 4 months ago

Sorry to bother you again, I ran into a bug. I'm training video toonify on a single A100 GPU, and this error appears when the first iteration computes the discriminator loss. One thing I noticed is that GPU memory usage peaks at 62.2GB/80GB right before the error shows up. Can you give me some suggestions?

Dratlan commented 4 months ago

After setting os.environ['CUDA_LAUNCH_BLOCKING'] = '1', the error changes to:

Traceback (most recent call last):
  File "/workspace/cpfs-data/code/video_transfer/StyleGANEX/scripts/train.py", line 32, in <module>
    main()
  File "/workspace/cpfs-data/code/video_transfer/StyleGANEX/scripts/train.py", line 28, in main
    coach.train()
  File "/workspace/cpfs-data/code/video_transfer/StyleGANEX/training/coach.py", line 244, in train
    val_loss_dict = self.validate()
  File "/workspace/cpfs-data/code/video_transfer/StyleGANEX/training/coach.py", line 334, in validate
    y_hat, latent = self.net.forward(x1=x, x2=x_tilde, resize=(x.shape[2:]==y.shape[2:]), zero_noise=self.opts.zero_noise,
  File "/workspace/cpfs-data/code/video_transfer/StyleGANEX/models/psp.py", line 117, in forward
    images, result_latent = self.decoder([codes],
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/cpfs-data/code/video_transfer/StyleGANEX/models/stylegan2/model.py", line 617, in forward
    out = conv1(out, latent[:, i], noise=noise1)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/cpfs-data/code/video_transfer/StyleGANEX/models/stylegan2/model.py", line 366, in forward
    out = self.conv(input, style)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/cpfs-data/code/video_transfer/StyleGANEX/models/stylegan2/model.py", line 286, in forward
    out = self.blur(out)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/cpfs-data/code/video_transfer/StyleGANEX/models/stylegan2/model.py", line 86, in forward
    out = upfirdn2d(input, self.kernel, pad=self.pad)
  File "/workspace/cpfs-data/code/video_transfer/StyleGANEX/models/stylegan2/op/upfirdn2d.py", line 17, in upfirdn2d
    return upfirdn2d_native(inputs, kernel, up, down, pad)
  File "/workspace/cpfs-data/code/video_transfer/StyleGANEX/models/stylegan2/op/upfirdn2d.py", line 48, in upfirdn2d_native
    out = F.conv2d(out, w)
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution.

I'll try newer torch versions.

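For anyone reproducing the synchronous traceback above, here is a minimal sketch of how the variable can be set. It assumes the variable has to be in place before CUDA is initialized, so the assignment sits at the very top of the entry script, ahead of any torch import. The helper name launch_blocking_enabled is hypothetical and not part of StyleGANEX.

```python
import os

# Set this at the top of the entry script (e.g. before `import torch`),
# on the assumption that the CUDA runtime reads it when it initializes.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def launch_blocking_enabled() -> bool:
    """Hypothetical helper: report whether synchronous kernel launches were requested."""
    return os.environ.get("CUDA_LAUNCH_BLOCKING") == "1"


if __name__ == "__main__":
    print("CUDA_LAUNCH_BLOCKING enabled:", launch_blocking_enabled())
```

With synchronous launches the traceback should point at the call that actually failed (here F.conv2d) rather than at a later, unrelated line, at the cost of slower training, so it is best removed again after debugging.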
Dratlan commented 4 months ago

It seems to be solved after I switched to a newer CUDA + PyTorch version. It is now CUDA 11.6 + PyTorch 1.13 + cuDNN 8.
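To compare an environment against the combination reported above, a small sketch like the following prints the versions PyTorch was built with. The function name report_versions is hypothetical, and the fallback for a missing torch install is only there so the snippet runs anywhere.

```python
def report_versions() -> dict:
    """Hypothetical helper: collect the PyTorch, CUDA and cuDNN versions in use."""
    info = {"torch": None, "cuda": None, "cudnn": None, "gpu_available": False}
    try:
        import torch
    except ImportError:
        # torch is not installed; return the empty report instead of failing.
        return info

    info["torch"] = torch.__version__
    info["cuda"] = torch.version.cuda  # None on CPU-only builds
    info["cudnn"] = torch.backends.cudnn.version()  # None if cuDNN is unavailable
    info["gpu_available"] = torch.cuda.is_available()
    return info


if __name__ == "__main__":
    for name, value in report_versions().items():
        print(f"{name}: {value}")
```

Posting this output alongside a traceback makes it easier to tell whether a cuDNN error comes from the version combination or from something else, such as memory pressure.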

williamyang1991 / StyleGANEX

CUDA error #28