zifuwan / Sigma

[WACV 2025] Python implementation of Sigma: Siamese Mamba Network for Multi-Modal Semantic Segmentation
https://zifuwan.github.io/Sigma/
MIT License
190 stars 19 forks

Problems in Windows #29

Closed Worldseer closed 3 months ago

Worldseer commented 3 months ago

I ran the following command on Windows:

torchrun -m --nproc_per_node=1 train.py -p 29501 -d 0 -n "nyu"

However, it fails with `TypeError: fwd(): incompatible function arguments`.

The full output is as follows:

[2024-08-27 10:26:29,828] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[W socket.cpp:663] [c10d] The client socket has failed to connect to [DESKTOP-863KBCC]:29500 (system error: 10049 - The requested address is not valid in its context.).
27 10:26:33 PyTorch Version 2.1.1+cu118
27 10:26:33 using devices 0
Namespace(devices='0', continue_fpath=None, local_rank=0, port='29501', dataset_name='nyu')
'pwd' is not recognized as an internal or external command,
operable program or batch file.
=======================================
E:\2 FDS Project\multimodal_segment\Sigma\log_final\log_nyudepth\log_NYUDepthv2_sigma_tiny_cromb_conmb_cvssdecoder\tb
=======================================
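'rm' is not recognized as an internal or external command,
operable program or batch file.
'ln' is not recognized as an internal or external command,
operable program or batch file.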
27 10:26:34 Using backbone: V-MAMBA
Failed loading checkpoint form pretrained/vmamba/vssmtiny_dp01_ckpt_epoch_292.pth: invalid load key, 'v'.
27 10:26:35 Using Mamba Decoder
27 10:26:35 Initing weights ...
27 10:26:35 begin trainning:
[00:05<?,?it/s]
27 10:26:41 WRN A exception occurred during Engine initialization, give up running process
Traceback (most recent call last):
  File "E:\anaconda\envs\mamba\lib\runpy.py", line 187, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "E:\anaconda\envs\mamba\lib\runpy.py", line 110, in _get_module_details
    __import__(pkg_name)
  File "E:\2 FDS Project\multimodal_segment\Sigma\train.py", line 164, in <module>
    loss = model(imgs, modal_xs, gts)
  File "E:\anaconda\envs\mamba\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "E:\anaconda\envs\mamba\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "E:\2 FDS Project\multimodal_segment\Sigma\models\builder.py", line 154, in forward
    out = self.encode_decode(rgb, modal_x)
  File "E:\2 FDS Project\multimodal_segment\Sigma\models\builder.py", line 136, in encode_decode
    x = self.backbone(rgb, modal_x)
  File "E:\anaconda\envs\mamba\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "E:\anaconda\envs\mamba\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "E:\2 FDS Project\multimodal_segment\Sigma\models\encoders\dual_vmamba.py", line 110, in forward
    out = self.forward_features(x_rgb, x_e)
  File "E:\2 FDS Project\multimodal_segment\Sigma\models\encoders\dual_vmamba.py", line 85, in forward_features
    outs_rgb = self.vssm(x_rgb) # B x C x H x W
  File "E:\anaconda\envs\mamba\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "E:\anaconda\envs\mamba\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "E:\2 FDS Project\multimodal_segment\Sigma\models\encoders\vmamba.py", line 2201, in forward
    o, x = layer_forward(layer, x) # (B, H, W, C)
  File "E:\2 FDS Project\multimodal_segment\Sigma\models\encoders\vmamba.py", line 2194, in layer_forward
    x = l.blocks(x)
  File "E:\anaconda\envs\mamba\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "E:\anaconda\envs\mamba\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "E:\anaconda\envs\mamba\lib\site-packages\torch\nn\modules\container.py", line 215, in forward
    input = module(input)
  File "E:\anaconda\envs\mamba\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "E:\anaconda\envs\mamba\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "E:\2 FDS Project\multimodal_segment\Sigma\models\encoders\vmamba.py", line 1721, in forward
    return self._forward(input)
  File "E:\2 FDS Project\multimodal_segment\Sigma\models\encoders\vmamba.py", line 1712, in _forward
    x = input + self.drop_path(self.op(self.norm(input)))
  File "E:\anaconda\envs\mamba\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "E:\anaconda\envs\mamba\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "E:\2 FDS Project\multimodal_segment\Sigma\models\encoders\vmamba.py", line 1072, in forward
    y = self.forward_core(x)
  File "E:\2 FDS Project\multimodal_segment\Sigma\models\encoders\vmamba.py", line 1047, in forward_corev2
    return cross_selective_scan(
  File "E:\2 FDS Project\multimodal_segment\Sigma\models\encoders\vmamba.py", line 212, in cross_selective_scan
    ys: torch.Tensor = selective_scan(
  File "E:\2 FDS Project\multimodal_segment\Sigma\models\encoders\vmamba.py", line 210, in selective_scan
    return SelectiveScan.apply(u, delta, A, B, C, D, delta_bias, delta_softplus, nrows)
  File "E:\anaconda\envs\mamba\lib\site-packages\torch\autograd\function.py", line 539, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "E:\anaconda\envs\mamba\lib\site-packages\torch\cuda\amp\autocast_mode.py", line 121, in decorate_fwd
    return fwd(*args, **kwargs)
  File "E:\2 FDS Project\multimodal_segment\Sigma\models\encoders\vmamba.py", line 60, in forward
    out, x, *rest = selective_scan_cuda.fwd(u, delta, A, B, C, D, delta_bias, delta_softplus, nrows)
TypeError: fwd(): incompatible function arguments. The following argument types are supported:
    1. (arg0: torch.Tensor, arg1: torch.Tensor, arg2: torch.Tensor, arg3: torch.Tensor, arg4: torch.Tensor, arg5: Optional[torch.Tensor], arg6: Optional[torch.Tensor], arg7: bool, arg8: int, arg9: bool) -> List[torch.Tensor]

Invoked with: tensor([[[ 0.0181,  0.0379,  0.0385,  ..., -0.0145, -0.0135,  0.0327],
         [ 0.0328,  0.0402,  0.0400,  ...,  0.0821,  0.0816,  0.0808],
         [ 0.1926,  0.1918,  0.1907,  ...,  0.2178,  0.2175,  0.2044],
         ...,
         [ 0.1050,  0.1308,  0.1309,  ...,  0.1220,  0.1205,  0.1044],
         [-0.0431, -0.0488, -0.0488,  ..., -0.0896, -0.0908, -0.0852],
         [ 0.0171,  0.0106,  0.0109,  ...,  0.0592,  0.0576,  0.0117]]],
       device='cuda:0', grad_fn=<ViewBackward0>), tensor([[[ 0.0047,  0.0142,  0.0148,  ...,  0.0191,  0.0191,  0.0012],
         [ 0.0316,  0.0244,  0.0245,  ...,  0.0405,  0.0407,  0.0216],
         [ 0.0078,  0.0135,  0.0139,  ...,  0.0210,  0.0217,  0.0096],
         ...,
         [ 0.0105,  0.0152,  0.0151,  ...,  0.0205,  0.0210,  0.0243],
         [ 0.0046, -0.0038, -0.0039,  ..., -0.0051, -0.0057,  0.0002],
         [-0.0392, -0.0279, -0.0278,  ..., -0.0409, -0.0415, -0.0300]]],
       device='cuda:0', grad_fn=<ViewBackward0>), tensor([[ -1.0000,  -2.0000,  -3.0000,  ..., -14.0000, -15.0000, -16.0000],
        [ -1.0000,  -2.0000,  -3.0000,  ..., -14.0000, -15.0000, -16.0000],
        [ -1.0000,  -2.0000,  -3.0000,  ..., -14.0000, -15.0000, -16.0000],
        ...,
        [ -1.0000,  -2.0000,  -3.0000,  ..., -14.0000, -15.0000, -16.0000],
        [ -1.0000,  -2.0000,  -3.0000,  ..., -14.0000, -15.0000, -16.0000],
        [ -1.0000,  -2.0000,  -3.0000,  ..., -14.0000, -15.0000, -16.0000]],
       device='cuda:0', grad_fn=<NegBackward0>), tensor([[[[ 0.0556,  0.0285,  0.0271,  ...,  0.0469,  0.0470,  0.0397],
          [ 0.0748,  0.0790,  0.0790,  ...,  0.0534,  0.0548,  0.0213],
          [ 0.0074, -0.0072, -0.0077,  ...,  0.0546,  0.0536,  0.0453],
          ...,
          [ 0.0232,  0.0483,  0.0484,  ...,  0.0160,  0.0159,  0.0119],
          [-0.0409, -0.0451, -0.0449,  ..., -0.0818, -0.0810, -0.0563],
          [-0.0919, -0.0792, -0.0798,  ..., -0.0392, -0.0377, -0.0648]],

         [[ 0.0095, -0.0101, -0.0099,  ...,  0.0184,  0.0187,  0.0229],
          [-0.0976, -0.0962, -0.0982,  ..., -0.0904, -0.0905, -0.0943],
          [-0.0593, -0.0689, -0.0693,  ..., -0.0498, -0.0505, -0.0562],
          ...,
          [-0.1448, -0.1397, -0.1401,  ..., -0.0527, -0.0524, -0.0371],
          [ 0.0556,  0.0524,  0.0533,  ...,  0.0480,  0.0480,  0.0623],
          [ 0.0030,  0.0060,  0.0073,  ...,  0.0882,  0.0890,  0.0774]],

         [[-0.1212, -0.1209, -0.1207,  ..., -0.0744, -0.0747, -0.0904],
          [ 0.0182,  0.0174,  0.0181,  ...,  0.0270,  0.0268,  0.0163],
          [-0.2103, -0.2186, -0.2177,  ..., -0.1916, -0.1916, -0.1623],
          ...,
          [ 0.0972,  0.0970,  0.0963,  ...,  0.1327,  0.1338,  0.1309],
          [-0.0560, -0.0842, -0.0841,  ..., -0.0588, -0.0587, -0.0756],
          [ 0.0211,  0.0264,  0.0270,  ...,  0.0387,  0.0385,  0.0148]],

         [[-0.0362, -0.0274, -0.0268,  ..., -0.0424, -0.0431, -0.0446],
          [ 0.0324,  0.0592,  0.0594,  ...,  0.0520,  0.0539,  0.0565],
          [-0.0196, -0.0331, -0.0333,  ..., -0.0574, -0.0571, -0.0602],
          ...,
          [-0.0342, -0.0530, -0.0523,  ...,  0.0309,  0.0294,  0.0268],
          [-0.0030,  0.0059,  0.0056,  ..., -0.0599, -0.0598, -0.0542],
          [ 0.1560,  0.1550,  0.1562,  ...,  0.1505,  0.1499,  0.1442]]]],
       device='cuda:0', grad_fn=<CloneBackward0>), tensor([[[[ 0.0973,  0.0988,  0.0985,  ...,  0.1145,  0.1137,  0.0802],
          [ 0.0021,  0.0492,  0.0509,  ...,  0.0087,  0.0082,  0.0121],
          [-0.0128, -0.0122, -0.0121,  ..., -0.0435, -0.0435, -0.0245],
          ...,
          [ 0.1583,  0.1808,  0.1811,  ...,  0.1023,  0.1024,  0.0754],
          [ 0.0326,  0.0373,  0.0366,  ...,  0.0575,  0.0570,  0.0524],
          [-0.0392, -0.0415, -0.0411,  ..., -0.0388, -0.0395, -0.0505]],

         [[ 0.0101,  0.0103,  0.0095,  ..., -0.0063, -0.0071, -0.0183],
          [-0.0061, -0.0240, -0.0244,  ..., -0.0405, -0.0409, -0.0370],
          [ 0.0613,  0.0506,  0.0504,  ...,  0.0424,  0.0430,  0.0456],
          ...,
          [ 0.0478,  0.0249,  0.0260,  ...,  0.0677,  0.0671,  0.0496],
          [ 0.0333,  0.0341,  0.0352,  ...,  0.0361,  0.0362,  0.0183],
          [ 0.0263,  0.0356,  0.0359,  ...,  0.0191,  0.0187,  0.0179]],

         [[-0.0513, -0.0785, -0.0782,  ..., -0.0769, -0.0767, -0.0696],
          [-0.0323, -0.0411, -0.0412,  ...,  0.0071,  0.0071, -0.0013],
          [-0.0012, -0.0034, -0.0023,  ...,  0.0157,  0.0154,  0.0170],
          ...,
          [ 0.0528,  0.0695,  0.0702,  ...,  0.0594,  0.0583,  0.0729],
          [ 0.0403,  0.0476,  0.0462,  ...,  0.0031,  0.0031,  0.0205],
          [ 0.0501,  0.0616,  0.0628,  ...,  0.0498,  0.0494,  0.0756]],

         [[ 0.0028,  0.0077,  0.0074,  ...,  0.0021,  0.0003,  0.0061],
          [ 0.0077,  0.0167,  0.0178,  ..., -0.0086, -0.0069,  0.0031],
          [ 0.0079,  0.0246,  0.0247,  ...,  0.0072,  0.0062, -0.0061],
          ...,
          [-0.1588, -0.1409, -0.1418,  ..., -0.1760, -0.1771, -0.1520],
          [-0.1543, -0.1603, -0.1612,  ..., -0.1484, -0.1479, -0.1349],
          [-0.0273,  0.0058,  0.0050,  ..., -0.0267, -0.0275, -0.0284]]]],
       device='cuda:0', grad_fn=<CloneBackward0>), Parameter containing:
tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1.
[2024-08-27 10:26:44,882] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 13216) of binary: E:\anaconda\envs\mamba\python.exe
Traceback (most recent call last):
  File "E:\anaconda\envs\mamba\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "E:\anaconda\envs\mamba\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "E:\anaconda\envs\mamba\Scripts\torchrun.exe\__main__.py", line 7, in <module>
  File "E:\anaconda\envs\mamba\lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "E:\anaconda\envs\mamba\lib\site-packages\torch\distributed\run.py", line 806, in main
    run(args)
  File "E:\anaconda\envs\mamba\lib\site-packages\torch\distributed\run.py", line 797, in run
    elastic_launch(
  File "E:\anaconda\envs\mamba\lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "E:\anaconda\envs\mamba\lib\site-packages\torch\distributed\launcher\api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-08-27_10:26:44
  host      : DESKTOP-863KBCC
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 13216)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
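
One way to read the `TypeError` above is to compare the number of arguments the compiled `selective_scan_cuda.fwd` accepts with the number the call in `vmamba.py` passes. The sketch below only parses the signature string copied from the log and lists the argument names from the failing call; the helper name `count_supported_args` is made up for illustration. If the counts differ, that would point to a mismatch between the installed extension build and the repository code, which is an inference from the log and not something confirmed in this thread.

```python
import re

# Signature line copied from the TypeError in the log above.
SUPPORTED = (
    "(arg0: torch.Tensor, arg1: torch.Tensor, arg2: torch.Tensor, "
    "arg3: torch.Tensor, arg4: torch.Tensor, arg5: Optional[torch.Tensor], "
    "arg6: Optional[torch.Tensor], arg7: bool, arg8: int, arg9: bool) "
    "-> List[torch.Tensor]"
)

# Argument names from the failing call shown in the traceback (vmamba.py, line 60).
PASSED = ["u", "delta", "A", "B", "C", "D", "delta_bias", "delta_softplus", "nrows"]


def count_supported_args(signature: str) -> int:
    """Count the positional slots (arg0, arg1, ...) in a pybind11-style signature string."""
    return len(re.findall(r"\barg\d+:", signature))


expected = count_supported_args(SUPPORTED)
print(f"extension expects {expected} arguments, call site passes {len(PASSED)}")
```

Here the extension lists ten slots (ending in an extra `bool`), while the call passes nine values.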
zifuwan commented 3 months ago

Hi, I haven't tried running this on Windows. Porting the code to Windows will likely surface a series of unexpected problems. You could ask GPT about the issues and solve them one by one, but more problems may keep coming up. What I would probably do is set up a dual-boot system or use a cloud service such as AWS GPU instances or Google Colab with a premium subscription.
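
As an illustration of solving one of these issues, the log shows `pwd`, `rm`, and `ln` failing because they are Unix shell commands that the Windows command prompt does not provide. A minimal sketch of portable stand-ins using only the Python standard library follows. The function names `current_dir`, `remove_path`, and `link_or_copy` are hypothetical and not taken from the Sigma repository, and how the repository actually invokes these commands is not shown in this thread.

```python
import os
import shutil
from pathlib import Path


def current_dir() -> str:
    """Portable stand-in for shelling out to `pwd`."""
    return os.getcwd()


def remove_path(path: Path) -> None:
    """Portable stand-in for `rm -rf path`; does nothing if the path is absent."""
    if path.is_symlink() or path.is_file():
        path.unlink()
    elif path.is_dir():
        shutil.rmtree(path)


def link_or_copy(target: Path, link: Path) -> None:
    """Portable stand-in for `ln -s target link`.

    Creating symlinks on Windows can require elevated privileges or
    Developer Mode, so fall back to copying when os.symlink fails.
    """
    remove_path(link)
    try:
        os.symlink(target, link, target_is_directory=target.is_dir())
    except (OSError, NotImplementedError):
        if target.is_dir():
            shutil.copytree(target, link)
        else:
            shutil.copy2(target, link)
```

Falling back to a copy trades disk space for not needing symlink privileges; whether that is acceptable depends on how large the linked directory is.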

Worldseer commented 3 months ago

Thank you very much. I am trying other methods.
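
Separately from the Windows question, the log line `invalid load key, 'v'` means the checkpoint file did not parse as a PyTorch checkpoint and its first byte was `v`. One common cause of that symptom in general, not confirmed for this case, is that the file on disk is a small Git LFS pointer text file or some other incomplete download in place of the real weights. A minimal sketch for checking this, with the hypothetical helper name `looks_like_lfs_pointer`:

```python
from pathlib import Path

# Git LFS pointer files are short text files that begin with this line prefix.
LFS_PREFIX = b"version https://git-lfs.github.com/spec/"


def looks_like_lfs_pointer(path: Path) -> bool:
    """Return True if the file starts with the Git LFS pointer header."""
    with open(path, "rb") as f:
        head = f.read(len(LFS_PREFIX))
    return head == LFS_PREFIX
```

If this returns `False`, comparing the file size against the size published with the pretrained weights is another quick check for a truncated download.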