After 49 iterations, training always stops with the error below.
I am training without setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7.
Traceback (most recent call last):
  File "pytorch_connectomics/scripts/main.py", line 67, in <module>
    main()
  File "pytorch_connectomics/scripts/main.py", line 62, in main
    trainer.train()
  File "/n/home00/nwendt/zebrafish/pytorch_connectomics/connectomics/engine/trainer.py", line 92, in train
    GPUtil.showUtilization(all=True)
  File "/n/home00/nwendt/anaconda3/envs/py3_torch/lib/python3.7/site-packages/GPUtil/GPUtil.py", line 210, in showUtilization
    GPUs = getGPUs()
  File "/n/home00/nwendt/anaconda3/envs/py3_torch/lib/python3.7/site-packages/GPUtil/GPUtil.py", line 102, in getGPUs
    deviceIds = int(vals[i])
ValueError: invalid literal for int() with base 10: 'No devices were found'
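
The traceback suggests the crash comes from the GPU utilization report rather than from the training step itself: GPUtil appears to receive the text 'No devices were found' where it expects a device id. A possible workaround, sketched here under the assumption that the report is only diagnostic, is to guard the call so that a failed query does not abort training. The helper name safe_show_utilization is hypothetical and is not part of pytorch_connectomics or GPUtil.

```python
def safe_show_utilization(show_fn):
    """Run a GPU-monitoring callable; report failure instead of raising.

    show_fn is any zero-argument callable, for example
    lambda: GPUtil.showUtilization(all=True).
    Returns True if the report ran, False if it raised ValueError.
    """
    try:
        show_fn()
        return True
    except ValueError as err:
        # Raised when the device listing cannot be parsed as integers,
        # as in the traceback above.
        print(f"Skipping GPU utilization report: {err}")
        return False


# Hypothetical usage at the call site in trainer.py (assumes GPUtil is imported):
# safe_show_utilization(lambda: GPUtil.showUtilization(all=True))
```

This only hides the symptom. It does not explain why no devices are found, so whether the GPUs are actually visible to the job still needs checking.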