cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

nhansendev commented 2 years ago

I am trying to run your code on a fresh install of Ubuntu 20.04 with Python 3.9.5 and CUDA 11.6 / cuDNN 8.3.2, but executing main.py fails with the following cuDNN error:

$ python main.py 
2022-01-21 16:02:17,793 INFO services.py:1272 -- View the Ray dashboard at http://127.0.0.1:8265
(pid=36888) A.L.E: Arcade Learning Environment (version +978d2ce)
(pid=36888) [Powered by Stella]
(pid=36874) A.L.E: Arcade Learning Environment (version +978d2ce)
(pid=36874) [Powered by Stella]
(pid=36881) A.L.E: Arcade Learning Environment (version +978d2ce)
(pid=36881) [Powered by Stella]
(pid=36885) A.L.E: Arcade Learning Environment (version +978d2ce)
(pid=36885) [Powered by Stella]
(pid=36882) A.L.E: Arcade Learning Environment (version +978d2ce)
(pid=36882) [Powered by Stella]
(pid=36875) A.L.E: Arcade Learning Environment (version +978d2ce)
(pid=36875) [Powered by Stella]
====================================================================================================
Traceback (most recent call last):
  File "/home/nate/Desktop/Atom/agent57_pytorch/main.py", line 267, in <module>
    main(parser.parse_args())
  File "/home/nate/Desktop/Atom/agent57_pytorch/main.py", line 144, in main
    in_q_weight, ex_q_weight, embed_weight, trained_lifelong_weight, indices, priorities, in_q_loss, ex_q_loss, embed_loss, lifelong_loss = ray.get(finished_learner[0])
  File "/home/nate/miniconda3/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 62, in wrapper
    return func(*args, **kwargs)
  File "/home/nate/miniconda3/lib/python3.9/site-packages/ray/worker.py", line 1495, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::Learner.update_network() (pid=36888, ip=192.168.137.71)
  File "python/ray/_raylet.pyx", line 501, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 451, in ray._raylet.execute_task.function_executor
  File "/home/nate/miniconda3/lib/python3.9/site-packages/ray/_private/function_manager.py", line 563, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
  File "/home/nate/Desktop/Atom/agent57_pytorch/learner.py", line 262, in update_network
    priorities, in_q_loss, ex_q_loss = self.qnet_update(weights, segments)
  File "/home/nate/Desktop/Atom/agent57_pytorch/learner.py", line 308, in qnet_update
    ex_target_qvalues = self.get_qvalues(self.ex_target_q_network, ex_h0, ex_c0)
  File "/home/nate/Desktop/Atom/agent57_pytorch/learner.py", line 371, in get_qvalues
    _, (h, c) = q_network(self.states[t],
  File "/home/nate/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/nate/Desktop/Atom/agent57_pytorch/model.py", line 99, in forward
    x, states = self.lstm(x.unsqueeze(0), states)
  File "/home/nate/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/nate/miniconda3/lib/python3.9/site-packages/torch/nn/modules/rnn.py", line 679, in forward
    result = _VF.lstm(input, hx, self._flat_weights, self.bias, self.num_layers,
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Have you encountered an error like this during development? Are you using an older version of CUDA / cuDNN? Please let me know if you have any suggestions.
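To make the version comparison concrete, here is a minimal sketch of a report that could be attached to an issue like this one. The helper name `collect_versions` is hypothetical and not part of this repository. It prints the CUDA and cuDNN versions that PyTorch itself was built with, which can differ from the system-wide install because prebuilt PyTorch binaries typically bundle their own CUDA runtime and cuDNN.

```python
import platform


def collect_versions():
    """Return a dict describing the Python / PyTorch / CUDA / cuDNN setup.

    Hypothetical helper for bug reports; not part of agent57_pytorch.
    """
    info = {"python": platform.python_version(), "torch": None}
    try:
        import torch  # optional: the report still works without PyTorch
    except ImportError:
        return info

    info["torch"] = torch.__version__
    # CUDA version this PyTorch build was compiled against (None on CPU-only builds).
    info["cuda_built_with"] = torch.version.cuda
    # cuDNN version PyTorch loads (None if cuDNN is unavailable).
    info["cudnn"] = torch.backends.cudnn.version()
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        info["gpu"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```

Two related switches may also help narrow this down, though neither is a confirmed fix here: running with the environment variable CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous so the traceback points closer to the failing call, and setting torch.backends.cudnn.enabled = False makes the LSTM use PyTorch's non-cuDNN code path.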

nhansendev commented 2 years ago

Based on other, similar issues, I think the problem is that not all tensors are being sent to the GPU when CUDA is available. I'm trying to find where ".to(self.device)" might be missing. Could someone confirm whether the code runs on CUDA without changes, or was it only ever run on CPU?

Running on CPU works fine, though it is probably slower than on GPU.

yuta0821 commented 2 years ago

@Obliman Sorry for the late reply, and thanks for your question. I was able to run my code with CUDA 10.0. I hope this helps!

yuta0821 / agent57_pytorch
