zhuqinfeng1999 / Samba


Loss becomes NaN during training; the code is basically unchanged. #5

Closed: FeatherWaves666 closed this issue 5 months ago

FeatherWaves666 commented 5 months ago

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. The bug has not been fixed in the latest version.

Describe the bug During training, the loss and decode.loss_ce became NaN. Apart from the dataset path, the source code was not modified.

Reproduction

  1. What command or script did you run?

    python Samba/tools/train.py Samba/configs/samba/samba_upernet-15k_potsdam-512x512_6e4.py --work-dir Samba/output/p --amp
  2. Did you make any modifications on the code or config? Did you understand what you have modified? Apart from the dataset path, the source code was not changed.

  3. What dataset did you use? Potsdam; so far this is the only dataset I have used.

Environment

sys.platform: linux
Python: 3.9.19 (main, Mar 21 2024, 17:11:28) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0: NVIDIA GeForce RTX 4090
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.8, V11.8.89
GCC: gcc (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0
PyTorch: 2.0.0+cu118
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.7.3 (Git Hash 6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.8
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.7
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.8, CUDNN_VERSION=8.7.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.0.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 

TorchVision: 0.15.1+cu118
OpenCV: 4.9.0
MMEngine: 0.10.3
MMSegmentation: 1.2.2+997fe71

Error traceback

If applicable, paste the error traceback here.

2024/04/18 18:15:37 - mmengine - INFO - Iter(train) [  650/15000]  lr: 2.5977e-04  eta: 0:36:53  time: 0.1487  data_time: 0.0087  memory: 4364  loss: 1.4168  decode.loss_ce: 1.0096  decode.acc_seg: 41.0303  aux.loss_ce: 0.4072  aux.acc_seg: 35.1656
2024/04/18 18:15:45 - mmengine - INFO - Iter(train) [  700/15000]  lr: 2.7979e-04  eta: 0:36:41  time: 0.1509  data_time: 0.0097  memory: 4364  loss: 1.4327  decode.loss_ce: 1.0229  decode.acc_seg: 14.2394  aux.loss_ce: 0.4098  aux.acc_seg: 14.0891
2024/04/18 18:15:52 - mmengine - INFO - Iter(train) [  750/15000]  lr: 2.9980e-04  eta: 0:36:26  time: 0.1454  data_time: 0.0087  memory: 4364  loss: 1.3953  decode.loss_ce: 0.9931  decode.acc_seg: 61.1958  aux.loss_ce: 0.4022  aux.acc_seg: 57.2154
2024/04/18 18:15:59 - mmengine - INFO - Iter(train) [  800/15000]  lr: 3.1981e-04  eta: 0:36:06  time: 0.1341  data_time: 0.0080  memory: 4363  loss: nan  decode.loss_ce: nan  decode.acc_seg: 58.5070  aux.loss_ce: 0.4762  aux.acc_seg: 53.7278
2024/04/18 18:16:06 - mmengine - INFO - Iter(train) [  850/15000]  lr: 3.3983e-04  eta: 0:35:44  time: 0.1358  data_time: 0.0076  memory: 4365  loss: nan  decode.loss_ce: nan  decode.acc_seg: 42.6550  aux.loss_ce: 0.4075  aux.acc_seg: 56.2516
2024/04/18 18:16:08 - mmengine - INFO - Exp name: samba_upernet-15k_potsdam-512x512_6e4_20240418_181350
2024/04/18 18:16:13 - mmengine - INFO - Iter(train) [  900/15000]  lr: 3.5984e-04  eta: 0:35:26  time: 0.1427  data_time: 0.0112  memory: 4364  loss: nan  decode.loss_ce: nan  decode.acc_seg: 27.4019  aux.loss_ce: 0.3872  aux.acc_seg: 38.4336
2024/04/18 18:16:20 - mmengine - INFO - Iter(train) [  950/15000]  lr: 3.7985e-04  eta: 0:35:10  time: 0.1369  data_time: 0.0083  memory: 4363  loss: nan  decode.loss_ce: nan  decode.acc_seg: 32.8315  aux.loss_ce: 0.4617  aux.acc_seg: 45.6420
2024/04/18 18:16:27 - mmengine - INFO - Exp name: samba_upernet-15k_potsdam-512x512_6e4_20240418_181350
2024/04/18 18:16:27 - mmengine - INFO - Iter(train) [ 1000/15000]  lr: 3.9987e-04  eta: 0:34:54  time: 0.1326  data_time: 0.0076  memory: 4364  loss: nan  decode.loss_ce: nan  decode.acc_seg: 39.9736  aux.loss_ce: 0.4192  aux.acc_seg: 55.7047

I hope to get an answer. Thank you.

zhuqinfeng1999 commented 5 months ago

Hello. Thank you for your interest and for the detailed information about the issue you are encountering with the Mamba framework. The loss and decode.loss_ce turning into NaN during training is likely due to insufficient support for Automatic Mixed Precision (AMP) in Mamba. You could try disabling AMP by removing the --amp flag from your training command to see whether this resolves the problem.
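
For reference, a minimal sketch of disabling AMP (the command is the one from the report with the --amp flag dropped; the config fragment assumes the usual MMEngine OptimWrapper/AmpOptimWrapper convention that mmseg's tools/train.py --amp toggles, so adapt it to the actual Samba config):

    # Option 1: re-run the reported command without the --amp flag
    #   python Samba/tools/train.py Samba/configs/samba/samba_upernet-15k_potsdam-512x512_6e4.py --work-dir Samba/output/p
    #
    # Option 2 (hypothetical MMEngine-style config fragment): make sure the optimizer
    # wrapper is the plain fp32 OptimWrapper rather than AmpOptimWrapper.
    optim_wrapper = dict(
        type='OptimWrapper',                    # 'AmpOptimWrapper' is what --amp would select
        optimizer=dict(type='AdamW', lr=6e-4),  # placeholder settings; keep your own optimizer config
    )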

Please feel free to communicate with me if you have further issues or need more assistance.

FeatherWaves666 commented 5 months ago

Yes, that solves the problem. Thank you very much. However, I also found another problem: without modifying the parameters, the results I obtained were relatively low, far below those reported in the paper and below earlier classical mmseg models such as FCN (on the Vaihingen dataset, mIoU: 60).

zhuqinfeng1999 commented 5 months ago

Hi. Please check whether you exclude the "clutter" class when calculating the mIoU metric. In both ISPRS datasets (Potsdam and Vaihingen), the clutter category needs to be excluded when computing metrics, as pointed out in the paper. If you check the benchmarks for these two datasets, you will find that the experimental setups are broadly similar. Please check whether your experimental setup matches mine.
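
For illustration only, a small sketch of what excluding clutter means for the metric (the class names follow the standard ISPRS Potsdam/Vaihingen label set; the IoU values are placeholders, not real results):

    # Hypothetical per-class IoUs as printed by the evaluator (placeholder values).
    per_class_iou = {
        'impervious_surface': 0.85,
        'building': 0.90,
        'low_vegetation': 0.72,
        'tree': 0.75,
        'car': 0.80,
        'clutter': 0.30,  # excluded from the benchmark mIoU on both ISPRS datasets
    }

    # mIoU over all six classes vs. mIoU with "clutter" excluded:
    miou_all = sum(per_class_iou.values()) / len(per_class_iou)
    no_clutter = [v for k, v in per_class_iou.items() if k != 'clutter']
    miou_no_clutter = sum(no_clutter) / len(no_clutter)
    print(f'mIoU (all 6 classes):    {miou_all:.4f}')
    print(f'mIoU (clutter excluded): {miou_no_clutter:.4f}')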

Thank you again for your attention, and feel free to communicate with me.

FeatherWaves666 commented 5 months ago

Thanks for your answer. I wish you great success!

zhuqinfeng1999 commented 5 months ago

Thank you for your kind words. I wish you smooth sailing in your academic endeavors as well!