zifuwan / Sigma

[WACV 2025] Python implementation of Sigma: Siamese Mamba Network for Multi-Modal Semantic Segmentation
https://zifuwan.github.io/Sigma/
MIT License
190 stars · 19 forks

AttributeError: 'vssm_tiny' object has no attribute 'init_weights' #2

Closed wuaodi closed 7 months ago

wuaodi commented 7 months ago

Thanks for your awesome work!

I changed config_pst900.py as below:

""" Settings for network, this would be different for each kind of model"""
C.backbone = 'sigma_tiny' # Remember to change the path below.
C.pretrained_model = C.root_dir + '/pretrained/vmamba/vssmtiny_dp01_ckpt_epoch_292.pth' # C.root_dir + '/pretrained/segformer/mit_b2.pth'
C.decoder = 'MambaDecoder' # 'MLPDecoder'
C.decoder_embed_dim = 512
C.optimizer = 'AdamW'

and ran train.py: CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" torchrun -m --nproc_per_node=1 train.py -p 29501 -d 0 -n "pst"

Then got this error:

15 16:34:14 using devices 0
Namespace(devices='0', continue_fpath=None, local_rank=0, port='29501', dataset_name='pst')
=======================================
/home/ai-i-wuaodi/Sigma/log_final/log_pst900/log_PST900_sigma_tiny_cromb_conmb_cvssdecoder/tb
=======================================
15 16:34:14 Using backbone: V-MAMBA
Successfully load ckpt pretrained/vmamba/vssmtiny_dp01_ckpt_epoch_292.pth
incompatible: _IncompatibleKeys(missing_keys=['outnorm0.weight', 'outnorm0.bias', 'outnorm1.weight', 'outnorm1.bias', 'outnorm2.weight', 'outnorm2.bias', 'outnorm3.weight', 'outnorm3.bias'], unexpected_keys=['classifier.norm.weight', 'classifier.norm.bias', 'classifier.head.weight', 'classifier.head.bias'])
15 16:34:16 Using Mamba Decoder
15 16:34:16 Loading pretrained model: /home/ai-i-wuaodi/Sigma/pretrained/vmamba/vssmtiny_dp01_ckpt_epoch_292.pth
15 16:34:16 WRN A exception occurred during Engine initialization, give up running process
Traceback (most recent call last):
  File "/home/ai-i-wuaodi/anaconda3/envs/sigma/lib/python3.9/runpy.py", line 188, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/home/ai-i-wuaodi/anaconda3/envs/sigma/lib/python3.9/runpy.py", line 111, in _get_module_details
    __import__(pkg_name)
  File "/home/ai-i-wuaodi/Sigma/train.py", line 82, in <module>
    model=segmodel(cfg=config, criterion=criterion, norm_layer=BatchNorm2d)
  File "/home/ai-i-wuaodi/Sigma/models/builder.py", line 112, in __init__
    self.init_weights(cfg, pretrained=cfg.pretrained_model)
  File "/home/ai-i-wuaodi/Sigma/models/builder.py", line 118, in init_weights
    self.backbone.init_weights(pretrained=pretrained)
  File "/home/ai-i-wuaodi/anaconda3/envs/sigma/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1269, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'vssm_tiny' object has no attribute 'init_weights'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2979) of binary: /home/ai-i-wuaodi/anaconda3/envs/sigma/bin/python
Traceback (most recent call last):
  File "/home/ai-i-wuaodi/anaconda3/envs/sigma/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/ai-i-wuaodi/anaconda3/envs/sigma/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/ai-i-wuaodi/anaconda3/envs/sigma/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/ai-i-wuaodi/anaconda3/envs/sigma/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/ai-i-wuaodi/anaconda3/envs/sigma/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ai-i-wuaodi/anaconda3/envs/sigma/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-04-15_16:34:18
  host      : linux
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2979)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Could you give me some ideas on how to solve it? Thanks!

zifuwan commented 7 months ago

Hi, we hardcoded the weight initialization. The log line 15 16:34:16 Loading pretrained model: /home/ai-i-wuaodi/Sigma/pretrained/vmamba/vssmtiny_dp01_ckpt_epoch_292.pth means the VMamba weights were already loaded successfully, so simply set C.pretrained_model = None and you're all good. Thanks.
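Concretely, the suggested change amounts to editing config_pst900.py like this (a sketch based on the snippet quoted above; the other settings stay as they were):

```python
# config_pst900.py -- keep the backbone/decoder settings unchanged,
# only stop the builder from calling backbone.init_weights() again:
C.backbone = 'sigma_tiny'
C.pretrained_model = None  # VMamba weights are already loaded by the hardcoded init
C.decoder = 'MambaDecoder'
C.decoder_embed_dim = 512
C.optimizer = 'AdamW'
```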

wuaodi commented 7 months ago

Thank you so much for your quick reply!

I set C.pretrained_model = None, but got a new error:

(sigma) ai-i-wuaodi@linux:~/Sigma$ CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" torchrun -m --nproc_per_node=1 train.py -p 29501 -d 0 -n "pst"
16 00:15:04 PyTorch Version 1.13.1+cu117
16 00:15:04 using devices 0
Namespace(devices='0', continue_fpath=None, local_rank=0, port='29501', dataset_name='pst')
=======================================
/home/ai-i-wuaodi/Sigma/log_final/log_pst900/log_PST900_sigma_tiny_cromb_conmb_cvssdecoder/tb
=======================================
16 00:15:04 Using backbone: V-MAMBA
Successfully load ckpt pretrained/vmamba/vssmtiny_dp01_ckpt_epoch_292.pth
incompatible: _IncompatibleKeys(missing_keys=['outnorm0.weight', 'outnorm0.bias', 'outnorm1.weight', 'outnorm1.bias', 'outnorm2.weight', 'outnorm2.bias', 'outnorm3.weight', 'outnorm3.bias'], unexpected_keys=['classifier.norm.weight', 'classifier.norm.bias', 'classifier.head.weight', 'classifier.head.bias'])
16 00:15:06 Using Mamba Decoder
16 00:15:06 Initing weights ...
16 00:15:08 begin trainning:
[00:04<?,?it/s]
16 00:15:12 WRN A exception occurred during Engine initialization, give up running process
Traceback (most recent call last):
  File "/home/ai-i-wuaodi/anaconda3/envs/sigma/lib/python3.9/runpy.py", line 188, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/home/ai-i-wuaodi/anaconda3/envs/sigma/lib/python3.9/runpy.py", line 111, in _get_module_details
    __import__(pkg_name)
  File "/home/ai-i-wuaodi/Sigma/train.py", line 164, in <module>
    loss = model(imgs, modal_xs, gts)
  File "/home/ai-i-wuaodi/anaconda3/envs/sigma/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ai-i-wuaodi/Sigma/models/builder.py", line 151, in forward
    out = self.encode_decode(rgb, modal_x)
  File "/home/ai-i-wuaodi/Sigma/models/builder.py", line 134, in encode_decode
    out = self.decode_head.forward(x)
  File "/home/ai-i-wuaodi/Sigma/models/decoders/MambaDecoder.py", line 258, in forward
    x = self.forward_up_features(inputs) # B, H, W, C
  File "/home/ai-i-wuaodi/Sigma/models/decoders/MambaDecoder.py", line 230, in forward_up_features
    x = y + inputs[3 - inx].permute(0, 2, 3, 1).contiguous()
RuntimeError: The size of tensor a (46) must match the size of tensor b (45) at non-singleton dimension 1
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 31291) of binary: /home/ai-i-wuaodi/anaconda3/envs/sigma/bin/python
Traceback (most recent call last):
  File "/home/ai-i-wuaodi/anaconda3/envs/sigma/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/ai-i-wuaodi/anaconda3/envs/sigma/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/ai-i-wuaodi/anaconda3/envs/sigma/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/ai-i-wuaodi/anaconda3/envs/sigma/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/ai-i-wuaodi/anaconda3/envs/sigma/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ai-i-wuaodi/anaconda3/envs/sigma/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-04-16_00:15:18
  host      : linux
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 31291)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

If you could give me some advice, I would be very grateful.

zifuwan commented 7 months ago

Hi, thanks for pointing this out; we have fixed the bug. You can manually resize the feature like this:

# requires: import torch.nn.functional as F
def forward_up_features(self, inputs):  # inputs: list of (B, C, H, W) features
    if not self.deep_supervision:
        for inx, layer_up in enumerate(self.layers_up):
            if inx == 0:
                x = inputs[3 - inx]  # B, 768, 15, 20
                x = x.permute(0, 2, 3, 1).contiguous()  # B, 15, 20, 768
                y = layer_up(x)  # B, 30, 40, 384
            else:
                # interpolate y to the skip feature's size (only the PST900 dataset needs this)
                B, C, H, W = inputs[3 - inx].shape
                y = F.interpolate(y.permute(0, 3, 1, 2).contiguous(), size=(H, W),
                                  mode='bilinear', align_corners=False).permute(0, 2, 3, 1).contiguous()

                x = y + inputs[3 - inx].permute(0, 2, 3, 1).contiguous()
                y = layer_up(x)

        x = self.norm_up(y)

        return x

Only PST900 needs this, since its input resolution differs from the other datasets.
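For reference, the "46 must match 45" mismatch can be reproduced with plain arithmetic. The sketch below assumes PST900's 720×1280 input resolution and a pad-then-halve downsampling in the backbone; both are inferred from the error message, not verified against the code:

```python
import math

def merge(h):
    # patch-merging style downsample: pad odd sizes, then halve (assumed behavior)
    return math.ceil(h / 2)

def expand(h):
    # patch-expanding upsample: double the spatial size
    return 2 * h

h16 = 720 // 16        # stride-16 feature height: 45 (odd)
h32 = merge(h16)       # stride-32 feature height: 23 (after padding 45 -> 46)
h_up = expand(h32)     # decoder upsamples back to 46, but the skip feature is 45
print(h16, h32, h_up)  # 45 23 46 -> hence tensor a (46) vs tensor b (45)
```

Interpolating the upsampled feature to the skip feature's exact (H, W), as in the fix above, removes the off-by-one for any odd feature size.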

wuaodi commented 7 months ago

Everything goes well now! Cool~