RuntimeError: CUDA unknown error - this may be due to an incorrectly set up environment,

wgong commented 1 month ago

Great work!

I came to your repo after reading Prof. Luo's WeChat post. Following the steps in the README, I tried to run scripts/inference.sh but got the following error:

Traceback (most recent call last):
  File "/home/gongai/projects/wgong/PromptFix/scripts/../process_images_json.py", line 195, in <module>
    main()
  File "/home/gongai/projects/wgong/PromptFix/scripts/../process_images_json.py", line 135, in main
    model.eval().cuda()
  File "/home/gongai/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 918, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/home/gongai/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  File "/home/gongai/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  File "/home/gongai/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  [Previous line repeated 1 more time]
  File "/home/gongai/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 833, in _apply
    param_applied = fn(param)
  File "/home/gongai/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 918, in <lambda>
    return self._apply(lambda t: t.cuda(device))
  File "/home/gongai/.local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 298, in _lazy_init
    torch._C._cuda_init()
RuntimeError: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.

Please advise on how to fix it.

My machine runs Ubuntu, with the GPU spec as below.

Thanks
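
For anyone hitting the same traceback: the failure happens inside torch._C._cuda_init(), before any PromptFix code touches the GPU, so it may help to check the CUDA setup on its own first. Below is a minimal sketch, not part of the PromptFix repo. The helper name cuda_diagnostics is hypothetical; it only reads CUDA_VISIBLE_DEVICES and calls standard PyTorch attributes (torch.version.cuda, torch.cuda.is_available(), torch.cuda.device_count()).

```python
import os


def cuda_diagnostics():
    """Collect basic facts about the CUDA setup without raising.

    Hypothetical helper for illustration; not part of PromptFix.
    """
    report = {"cuda_visible_devices": os.environ.get("CUDA_VISIBLE_DEVICES")}
    try:
        import torch
    except (ImportError, OSError):
        # PyTorch is missing or its native libraries failed to load.
        report["torch"] = None
        return report

    report["torch"] = torch.__version__
    # None here means a CPU-only PyTorch build.
    report["built_with_cuda"] = torch.version.cuda
    try:
        report["cuda_available"] = torch.cuda.is_available()
        report["device_count"] = torch.cuda.device_count()
    except Exception as exc:
        # Record the error instead of crashing, so the report is still printed.
        report["cuda_error"] = repr(exc)
    return report


if __name__ == "__main__":
    for key, value in cuda_diagnostics().items():
        print(f"{key}: {value}")
```

If cuda_available comes back False while nvidia-smi still lists the GPU, the problem is likely in the driver or library setup rather than in the inference script.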

wgong commented 1 month ago

I was able to make progress after installing the cuDNN and NCCL shared libraries. See the attached doc promptfix-issue-5.pdf.
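
A rough way to check whether those shared libraries are visible to the system loader is sketched below. This is an assumption-laden illustration, not the procedure from the attached PDF: the helper name find_shared_libs is hypothetical, and ctypes.util.find_library only searches the standard loader locations, so it will not see copies of cuDNN or NCCL that ship inside a pip-installed PyTorch wheel.

```python
import ctypes.util


def find_shared_libs(names=("cudnn", "nccl")):
    """Map each library name to what the system loader finds, or None.

    Hypothetical helper for illustration; a None result only means the
    library is not in the standard search locations.
    """
    return {name: ctypes.util.find_library(name) for name in names}


if __name__ == "__main__":
    for name, found in find_shared_libs().items():
        print(f"{name}: {found if found else 'not found by the system loader'}")
```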

wgong commented 1 month ago

Closing this issue.

yeates / PromptFix
