Hi zhanghang1989,
I have a custom dataset with 1000 images. When I try to train on it, I get the error below.
My training command is:
python train_epoch.py --dataset ade20k --aux --se-loss --model encnet --backbone resnest101 --epochs 180
Can you help me resolve this error?
Error:
Traceback (most recent call last):
File "train_epoch.py", line 340, in <module>
main()
File "train_epoch.py", line 144, in main
mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
File "/home/hbbg/miniconda3/envs/seg-pe/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
while not spawn_context.join():
File "/home/hbbg/miniconda3/envs/seg-pe/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/hbbg/miniconda3/envs/seg-pe/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/home/hbbg/HBXL/PyTorch-Encoding/experiments/segmentation/train_epoch.py", line 329, in main_worker
training(epoch)
File "/home/hbbg/HBXL/PyTorch-Encoding/experiments/segmentation/train_epoch.py", line 256, in training
loss = criterion(outputs, target)
File "/home/hbbg/miniconda3/envs/seg-pe/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/hbbg/miniconda3/envs/seg-pe/lib/python3.7/site-packages/encoding/nn/loss.py", line 85, in forward
loss3 = self.bceloss(torch.sigmoid(se_pred), se_target)
File "/home/hbbg/miniconda3/envs/seg-pe/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/hbbg/miniconda3/envs/seg-pe/lib/python3.7/site-packages/torch/nn/modules/loss.py", line 498, in forward
return F.binary_cross_entropy(input, target, weight=self.weight, reduction=self.reduction)
File "/home/hbbg/miniconda3/envs/seg-pe/lib/python3.7/site-packages/torch/nn/functional.py", line 2077, in binary_cross_entropy
input, target, weight, reduction_enum)
RuntimeError: reduce failed to synchronize: cudaErrorAssert: device-side assert triggered
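A device-side assert inside binary_cross_entropy is commonly triggered by target values outside the valid range, which can happen when a custom dataset's masks contain label IDs outside the class range the loader assumes (the --dataset ade20k flag implies 150 classes, 0..149). As a first diagnostic step, it may help to scan the mask labels before training; the helper below is a hypothetical sketch (find_invalid_labels is not part of PyTorch-Encoding), assuming masks are loaded as integer arrays and that -1 is used as the ignore index:

```python
import numpy as np

def find_invalid_labels(mask, nclass, ignore_index=-1):
    """Return label values in `mask` that fall outside [0, nclass)
    and are not the ignore index."""
    values = np.unique(mask)
    return sorted(int(v) for v in values
                  if v != ignore_index and not (0 <= v < nclass))

# Example: label 150 is out of range for ADE20K's 150 classes (valid IDs 0..149)
mask = np.array([[0, 3, 149],
                 [150, -1, 7]])
print(find_invalid_labels(mask, nclass=150))  # prints [150]
```

Running a check like this over every mask in the custom dataset would confirm whether the labels match the class count the SE-loss one-hot target is built from; if any value is out of range, remapping the masks or passing the correct class count should avoid the assert.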