RuntimeError: CUDA error: invalid device ordinal CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

zsyOAOA / DifFace

DifFace: Blind Face Restoration with Diffused Error Contraction (TPAMI, 2024)


628 stars 42 forks

Issue #15

Open lajihaonange opened 1 year ago

lajihaonange commented 1 year ago

I ran into this error when running the following command: CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --standalone --nproc_per_node=4 --nnodes=1 main_diffusion.py --gpu_id 0123 --cfg_path configs/training/diffusion_ffhq512.yaml --save_dir myfolder. Could someone help me solve it?

zsyOAOA commented 1 year ago

I have updated the code. Please give it a try: CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --standalone --nproc_per_node=4 --nnodes=1 main_diffusion.py --cfg_path configs/training/diffusion_ffhq512.yaml --save_dir yourfolder

I suggest first training the model on a single GPU, and then moving on to distributed training.
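
For anyone hitting the same error: "invalid device ordinal" generally means a process asked for a GPU index that is not below the number of devices it can see. A rough pre-flight check, assuming the launcher sets LOCAL_RANK (torchrun does) and that each worker uses LOCAL_RANK as its device index, could look like the sketch below. The helper names visible_device_ids and check_local_rank are made up for illustration and are not part of DifFace or PyTorch. The check only counts the entries in CUDA_VISIBLE_DEVICES, so it will not notice ids that do not exist on the machine.

```python
import os


def visible_device_ids(env=None):
    """Return the ids listed in CUDA_VISIBLE_DEVICES, or None if it is unset.

    Hypothetical helper: it only parses the environment variable and does
    not query the driver, so unset means "no restriction known here".
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    return [tok.strip() for tok in raw.split(",") if tok.strip()]


def check_local_rank(local_rank, num_visible):
    """Raise ValueError if local_rank is not a valid index into the visible GPUs."""
    if not 0 <= local_rank < num_visible:
        raise ValueError(
            f"local rank {local_rank} is out of range for "
            f"{num_visible} visible device(s)"
        )
    return local_rank


if __name__ == "__main__":
    ids = visible_device_ids()
    # torchrun exports LOCAL_RANK for every worker; default to 0 otherwise.
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    if ids is None:
        print("CUDA_VISIBLE_DEVICES is not set; nothing to check")
    else:
        check_local_rank(local_rank, len(ids))
        print(f"local rank {local_rank} maps to visible GPU id {ids[local_rank]}")
```

Running this under the same CUDA_VISIBLE_DEVICES and torchrun settings as the training command shows whether every worker has a device to bind to before the real job starts.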

lajihaonange commented 1 year ago

Thank you for your timely reply. Training on a single GPU worked successfully. I will try your new code right now.