yhygao / CBIM-Medical-Image-Segmentation

A PyTorch framework for medical image segmentation
Apache License 2.0
260 stars, 46 forks

training my own data #26

Closed: snaka99 closed this issue 1 year ago

snaka99 commented 1 year ago

Hello @yhygao, I am trying to train your medformer 2d model on my own cardiac data, but I am running into trouble. I am using your medformer_2d.yaml from the acdc config. Should I change anything in it? When I run it, I get this error:

Traceback (most recent call last):
  File "/content/gdrive/.shortcut-targets-by-id/1CdjrP0uBrq3xcbNjQtST6Y_Mx7YdGinp/CBIM-Medical-Image-Segmentation/train.py", line 343, in <module>
    best_Dice, best_HD, best_ASD = train_net(net, args, ema_net, fold_idx=fold_idx)
  File "/content/gdrive/.shortcut-targets-by-id/1CdjrP0uBrq3xcbNjQtST6Y_Mx7YdGinp/CBIM-Medical-Image-Segmentation/train.py", line 97, in train_net
    train_epoch(trainLoader, net, ema_net, optimizer, epoch, writer, criterion, criterion_dl, scaler, args)
  File "/content/gdrive/.shortcut-targets-by-id/1CdjrP0uBrq3xcbNjQtST6Y_Mx7YdGinp/CBIM-Medical-Image-Segmentation/train.py", line 209, in train_epoch
    loss += args.aux_weight[j] * (criterion(result[j], label.squeeze(1)) + criterion_dl(result[j], label))
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/loss.py", line 1174, in forward
    return F.cross_entropy(input, target, weight=self.weight,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 3029, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: CUDA error: device-side assert triggered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception ignored in atexit callback: <function _MultiProcessingDataLoaderIter._clean_up_worker at 0x7f926feb7eb0>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1472, in _clean_up_worker
    w.join(timeout=_utils.MP_STATUS_CHECK_INTERVAL)
  File "/usr/lib/python3.10/multiprocessing/process.py", line 149, in join
    res = self._popen.wait(timeout)
  File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 40, in wait
    if not wait([self.sentinel], timeout):
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/usr/lib/python3.10/selectors.py", line 416, in select
    fd_event_list = self._selector.poll(timeout)
KeyboardInterrupt:
terminate called after throwing an instance of 'c10::Error'
  what(): CUDA error: device-side assert triggered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f927b10f4d7 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f927b0d936b in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f927b1abb58 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x12513e5 (0x7f927c4bd3e5 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x4d5a16 (0x7f92e1b2aa16 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x3ee77 (0x7f927b0f4e77 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #6: c10::TensorImpl::~TensorImpl() + 0x1be (0x7f927b0ed69e in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f927b0ed7b9 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #8: <unknown function> + 0x75afc8 (0x7f92e1daffc8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #9: THPVariable_subclass_dealloc(_object*) + 0x305 (0x7f92e1db0355 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #10: python3() [0x622134]
frame #11: python3() [0x53e218]
frame #12: python3() [0x58d328]
frame #13: python3() [0x58d3ff]
frame #14: python3() [0x58d3ff]
frame #17: python3() [0x6c02ca]
frame #21: __libc_start_main + 0xf3 (0x7f9312e55083 in /lib/x86_64-linux-gnu/libc.so.6)

shbkukuk commented 9 months ago

I got the same error while training on my own data. How did you fix this problem?

anaellezan commented 6 months ago

Hi, I can see that this issue is closed. Could you please explain how you solved the problem? I get the same error.

snaka99 commented 6 months ago

In config/acdc/medformer_2d.yaml I changed the number of classes to match my data, but I still have some issues.

anaellezan commented 6 months ago

I solved my problem myself:
1) I removed all -1 values from the ground-truth (gt) data I was using.
2) The number of classes I had originally set in config/mydata/medformer_3d.yaml did not match the actual number of classes in the gt data, so I updated the number of classes in the config file.
I hope this helps!
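
For anyone hitting the same device-side assert: both fixes above come down to making sure every label value in the gt data lies between 0 and (number of classes - 1), since cross-entropy uses the label as a class index. Below is a minimal sketch of such a check in plain Python. The function name find_invalid_labels and the toy mask are hypothetical and not part of this repository; a real mask would first need to be flattened to a list of integers (for a NumPy array, for example, with .flatten().tolist()).

```python
from collections import Counter
from typing import Dict, Iterable


def find_invalid_labels(label_values: Iterable[int], num_classes: int) -> Dict[int, int]:
    """Return {label_value: count} for every value outside [0, num_classes - 1].

    Hypothetical helper for illustration only; it is not part of the repository.
    """
    counts = Counter(int(v) for v in label_values)
    return {
        value: count
        for value, count in sorted(counts.items())
        if value < 0 or value >= num_classes
    }


# Toy flattened mask: the config says 4 classes (valid labels 0..3),
# but the data also contains a stray -1 and a stray 4.
toy_mask = [0, 1, 1, 2, 3, -1, -1, 4]
invalid = find_invalid_labels(toy_mask, num_classes=4)
print(invalid)  # {-1: 2, 4: 1}
```

An empty result means the labels are consistent with the configured number of classes. Any reported value either needs to be remapped or removed from the gt data, or the number of classes in the config needs to be raised to cover it.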