zhanghang1989 / PyTorch-Encoding

A CV toolkit for my papers.
https://hangzhang.org/PyTorch-Encoding/
MIT License

RuntimeError: reduce failed to synchronize: cudaErrorAssert: device-side assert triggered #379

Open ravitejarj opened 3 years ago

ravitejarj commented 3 years ago

Hi zhanghang1989, I have a custom dataset with 1000 images. When I try to train on these 1000 images, I get the error below. My training command is:

python train_epoch.py --dataset ade20k --aux --se-loss --model encnet --backbone resnest101 --epochs 180

Can you help me resolve this error?

Error:

Traceback (most recent call last):
  File "train_epoch.py", line 340, in <module>
    main()
  File "train_epoch.py", line 144, in main
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
  File "/home/hbbg/miniconda3/envs/seg-pe/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/hbbg/miniconda3/envs/seg-pe/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/hbbg/miniconda3/envs/seg-pe/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/hbbg/HBXL/PyTorch-Encoding/experiments/segmentation/train_epoch.py", line 329, in main_worker
    training(epoch)
  File "/home/hbbg/HBXL/PyTorch-Encoding/experiments/segmentation/train_epoch.py", line 256, in training
    loss = criterion(outputs, target)
  File "/home/hbbg/miniconda3/envs/seg-pe/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/hbbg/miniconda3/envs/seg-pe/lib/python3.7/site-packages/encoding/nn/loss.py", line 85, in forward
    loss3 = self.bceloss(torch.sigmoid(se_pred), se_target)
  File "/home/hbbg/miniconda3/envs/seg-pe/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/hbbg/miniconda3/envs/seg-pe/lib/python3.7/site-packages/torch/nn/modules/loss.py", line 498, in forward
    return F.binary_cross_entropy(input, target, weight=self.weight, reduction=self.reduction)
  File "/home/hbbg/miniconda3/envs/seg-pe/lib/python3.7/site-packages/torch/nn/functional.py", line 2077, in binary_cross_entropy
    input, target, weight, reduction_enum)
RuntimeError: reduce failed to synchronize: cudaErrorAssert: device-side assert triggered

zhanghang1989 commented 3 years ago

Please check whether num_classes is set properly
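
One way to act on this suggestion is to confirm that every label value in the custom masks falls inside the range the loss expects. CUDA reports kernel assertions asynchronously, so the failing check may belong to an earlier loss term than the binary_cross_entropy call named in the traceback. The sketch below is only an illustration under stated assumptions: find_invalid_labels is a hypothetical helper, not part of PyTorch-Encoding, and the default ignore_index of -1 is an assumption that should be matched to whatever the loss in use actually ignores.

```python
def find_invalid_labels(label_values, num_classes, ignore_index=-1):
    """Return the sorted label values that fall outside [0, num_classes).

    Hypothetical helper for debugging; not part of PyTorch-Encoding.
    ignore_index=-1 is an assumption; set it to the value your loss ignores.
    """
    invalid = set()
    for value in label_values:
        value = int(value)
        if value == ignore_index:
            continue  # ignored pixels never reach the class lookup
        if value < 0 or value >= num_classes:
            invalid.add(value)
    return sorted(invalid)


if __name__ == "__main__":
    # Made-up label values for illustration, checked against 150 classes.
    mask_values = [-1, 0, 3, 149, 150, 255]
    print(find_invalid_labels(mask_values, num_classes=150))  # [150, 255]
```

With a real batch, the label values could be taken from the target tensor, for example torch.unique(target).tolist(). A non-empty result would suggest that either the number of classes configured for the dataset or the mask encoding needs to change before training on the GPU.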