Closed snaka99 closed 1 year ago
I got the same error while training on my own data. How did you fix this problem?
Hi, I can see that this issue is closed. Could you please explain how you solved the problem? I get the same error.
In config/acdc/medformer_2d.yaml I changed the number of classes to match my data, but I still have some issues.
I solved the problem on my side with two changes: 1) I removed all -1 values from the gt data I was using. 2) The number of classes I had originally set in config/mydata/medformer_3d.yaml did not match the actual number of classes in the gt data, so I updated the number of classes in the config file. I hope this helps!
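For anyone hitting the same assert: both changes come down to making sure every value in the ground-truth masks lies in the range 0 to (number of classes - 1), which is what the cross-entropy loss in the traceback expects from its targets. Here is a minimal sketch of such a check, assuming the masks can be loaded as NumPy integer arrays. The helper `find_invalid_labels` and the toy array are hypothetical and not part of the repository, and mapping -1 to background is only one possible cleanup that may not suit your data.

```python
import numpy as np


def find_invalid_labels(label, num_classes, ignore_index=None):
    """Return the sorted label values that fall outside [0, num_classes)."""
    values = np.unique(np.asarray(label))
    if ignore_index is not None:
        # Values equal to the loss's ignore_index are skipped, so allow them.
        values = values[values != ignore_index]
    invalid = values[(values < 0) | (values >= num_classes)]
    return invalid.tolist()


# Toy ground-truth mask (hypothetical): contains a stray -1 and a class id 4.
gt = np.array([[0, 1, 2],
               [3, -1, 4]])

# With 4 classes configured, both -1 and 4 are out of range.
print(find_invalid_labels(gt, num_classes=4))

# One possible cleanup (an assumption): treat -1 as background (class 0),
# and set the class count to the real number of classes in the data (5).
cleaned = np.where(gt == -1, 0, gt)
print(find_invalid_labels(cleaned, num_classes=5))
```

Running a check like this over every mask before training should show whether the labels or the class count in the config need to change. An empty list means the values are consistent with the configured number of classes.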
Hello @yhygao, I am trying to train your medformer 2d model on my own cardiac data, but I am having some trouble. I am using your medformer_2d.yaml from the acdc config. Should I change something in it? When I run it, I get this error:
Traceback (most recent call last):
File "/content/gdrive/.shortcut-targets-by-id/1CdjrP0uBrq3xcbNjQtST6Y_Mx7YdGinp/CBIM-Medical-Image-Segmentation/train.py", line 343, in
best_Dice, best_HD, best_ASD = train_net(net, args, ema_net, fold_idx=fold_idx)
File "/content/gdrive/.shortcut-targets-by-id/1CdjrP0uBrq3xcbNjQtST6Y_Mx7YdGinp/CBIM-Medical-Image-Segmentation/train.py", line 97, in train_net
train_epoch(trainLoader, net, ema_net, optimizer, epoch, writer, criterion, criterion_dl, scaler, args)
File "/content/gdrive/.shortcut-targets-by-id/1CdjrP0uBrq3xcbNjQtST6Y_Mx7YdGinp/CBIM-Medical-Image-Segmentation/train.py", line 209, in train_epoch
loss += args.aux_weight[j] * (criterion(result[j], label.squeeze(1)) + criterion_dl(result[j], label))
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/loss.py", line 1174, in forward
return F.cross_entropy(input, target, weight=self.weight,
File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 3029, in cross_entropy
return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: CUDA error: device-side assert triggered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Exception ignored in atexit callback: <function _MultiProcessingDataLoaderIter._clean_up_worker at 0x7f926feb7eb0>
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1472, in _clean_up_worker
w.join(timeout=_utils.MP_STATUS_CHECK_INTERVAL)
File "/usr/lib/python3.10/multiprocessing/process.py", line 149, in join
res = self._popen.wait(timeout)
File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 40, in wait
if not wait([self.sentinel], timeout):
File "/usr/lib/python3.10/multiprocessing/connection.py", line 931, in wait
ready = selector.select(timeout)
File "/usr/lib/python3.10/selectors.py", line 416, in select
fd_event_list = self._selector.poll(timeout)
KeyboardInterrupt:
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f927b10f4d7 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const, char const, unsigned int, std::string const&) + 0x64 (0x7f927b0d936b in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const, char const, int, bool) + 0x118 (0x7f927b1abb58 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: + 0x12513e5 (0x7f927c4bd3e5 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x4d5a16 (0x7f92e1b2aa16 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #5: + 0x3ee77 (0x7f927b0f4e77 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #6: c10::TensorImpl::~TensorImpl() + 0x1be (0x7f927b0ed69e in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f927b0ed7b9 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #8: + 0x75afc8 (0x7f92e1daffc8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #9: THPVariable_subclass_dealloc(_object*) + 0x305 (0x7f92e1db0355 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #10: python3() [0x622134]
frame #11: python3() [0x53e218]
frame #12: python3() [0x58d328]
frame #13: python3() [0x58d3ff]
frame #14: python3() [0x58d3ff]