zudi-lin / pytorch_connectomics

PyTorch Connectomics: segmentation toolbox for EM connectomics
http://connectomics.readthedocs.io/
MIT License

GPU-related error when using CPU only (GPUtil-related) #57

Closed: ygCoconut closed this issue 3 years ago

ygCoconut commented 3 years ago

After 49 iterations, training always stops with the error below. I am training on CPU only, without setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7.

Traceback (most recent call last):
  File "pytorch_connectomics/scripts/main.py", line 67, in <module>
    main()
  File "pytorch_connectomics/scripts/main.py", line 62, in main
    trainer.train()
  File "/n/home00/nwendt/zebrafish/pytorch_connectomics/connectomics/engine/trainer.py", line 92, in train
    GPUtil.showUtilization(all=True)
  File "/n/home00/nwendt/anaconda3/envs/py3_torch/lib/python3.7/site-packages/GPUtil/GPUtil.py", line 210, in showUtilization
    GPUs = getGPUs()
  File "/n/home00/nwendt/anaconda3/envs/py3_torch/lib/python3.7/site-packages/GPUtil/GPUtil.py", line 102, in getGPUs
    deviceIds = int(vals[i])
ValueError: invalid literal for int() with base 10: 'No devices were found'

zudi-lin commented 3 years ago

Fixed: https://github.com/zudi-lin/pytorch_connectomics/blob/master/connectomics/engine/trainer.py#L102
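
For anyone hitting the same traceback on an older checkout, here is a minimal sketch of one way to guard the utilization report on CPU-only machines. It is not necessarily what the linked fix does. The helper name `safe_show_utilization` and its `show_fn` / `gpu_available` parameters are hypothetical and exist only for illustration; the only library calls assumed are `torch.cuda.is_available()` and `GPUtil.showUtilization(all=True)` (the call from the traceback).

```python
def safe_show_utilization(show_fn=None, gpu_available=None):
    """Print GPU utilization only when a GPU is usable.

    Returns True if the report was printed, False if it was skipped.
    Both parameters are hypothetical hooks for testing/illustration.
    """
    # Decide whether a GPU is available, unless the caller already told us.
    if gpu_available is None:
        try:
            import torch
            gpu_available = torch.cuda.is_available()
        except ImportError:
            gpu_available = False

    # CPU-only run: skip the report entirely.
    if not gpu_available:
        return False

    # Default reporter: the same call that appears in the traceback.
    if show_fn is None:
        try:
            import GPUtil
        except ImportError:
            return False

        def show_fn():
            GPUtil.showUtilization(all=True)

    try:
        show_fn()
    except ValueError:
        # Raised when the nvidia-smi output cannot be parsed as integers,
        # e.g. the message 'No devices were found'.
        return False
    return True
```

In a training loop, the bare `GPUtil.showUtilization(all=True)` call could then be replaced with `safe_show_utilization()`, so a CPU-only run skips the report instead of crashing.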