wurenkai / UltraLight-VM-UNet

[arXiv] The official code for "UltraLight VM-UNet: Parallel Vision Mamba Significantly Reduces Parameters for Skin Lesion Segmentation".

about ss2d #9

Open cdn940912 opened 2 months ago

cdn940912 commented 2 months ago

I have a question: the `Mamba` imported via `from mamba_ssm import Mamba` is the (1D) SSM, isn't it? It doesn't use SS2D as described in the paper.

wurenkai commented 2 months ago

Hi, thank you very much for asking this question. We previously analyzed the impact of changing the number of channels for both SSM and SS2D, and the results are similar: Mamba quadruple concatenation under SSM yields a large parameter reduction (~93.5%), close to the reduction for SS2D (~93.1%). However, SS2D has more initial parameters than SSM, more GFLOPs, and higher complexity: SS2D takes about 2 minutes to train 1 epoch on ISIC2017, while the SSM used in this paper takes less than half a minute per epoch. We therefore adopted Mamba's SSM, rather than SS2D, as the base building block for the ultra-lightweight model.

The subsection on Mamba parameter analysis in the current arXiv preprint still describes SS2D; we will change it to SSM in the second version within the next two days. Thank you for raising this query!
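For readers following along, here is a minimal sketch of what this means in practice: flatten the 2D feature map into a token sequence, split the channels into four parts, and run each part through a plain `mamba_ssm.Mamba` (SSM) block. The class name and exact details (normalization, adjustment factors) are illustrative assumptions and may differ from the repository code.

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # plain 1D SSM block (requires CUDA)

class PVMSketch(nn.Module):
    """Illustrative sketch of a parallel Vision Mamba block built on the
    1D SSM: channels are split into four quarters and each quarter goes
    through a shared Mamba, keeping d_model (and thus parameters) small."""
    def __init__(self, dim: int):
        super().__init__()
        assert dim % 4 == 0, "channel count must be divisible by 4"
        self.norm = nn.LayerNorm(dim)
        self.mamba = Mamba(d_model=dim // 4, d_state=16, d_conv=4, expand=2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape                               # x: (B, C, H, W)
        tokens = self.norm(x.flatten(2).transpose(1, 2))   # serialize: (B, H*W, C)
        parts = torch.chunk(tokens, 4, dim=-1)             # four channel quarters
        out = torch.cat([self.mamba(p) for p in parts], dim=-1)
        out = out.transpose(1, 2).reshape(b, c, h, w)      # deserialize to 2D
        return out + x                                     # residual connection
```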

fceex49 commented 2 months ago

Hi,

many thanks for your great work!

One doubt regarding SS2D/CSM, which was proposed by VMamba: in your variant of the VSS block you are not using SS2D/CSM. Instead, you flatten the input and feed it into the SSM (S6) directly. (I am guessing you adopted the implementation from the LightM-UNet paper?)

Can this approach really capture the spatial 2D information in images?

In the Vision Mamba (Vim) paper, they also introduced a bidirectional SSM to address spatial understanding.

Could you please give some insight?

Thanks

wurenkai commented 2 months ago

@fceex49 Hi, although Mamba was originally designed for 1D sequences, it can still learn spatial relationships and long-range dependencies between pixels in 2D image data through serialization, deserialization, convolution operations, and residual and adjustment-factor operations. Forward and backward scanning was proposed in Vim to further improve spatial perception, but it also doubles the memory consumption. A lightweight model, by contrast, mainly focuses on lightweight operation to address future mHealth and hardware cost issues, so minimal memory consumption with good performance is its priority. I hope my answer helps you.
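To make the memory point concrete, here is a hypothetical comparison (the merge strategy and variable names are assumptions for illustration, not the repo's or Vim's exact code): a single forward scan versus a Vim-style bidirectional scan, which needs a second SSM over the reversed sequence and roughly doubles the parameters and activations.

```python
import torch
from mamba_ssm import Mamba  # requires a CUDA-capable GPU

dim = 64
x = torch.randn(1, 16 * 16, dim, device="cuda")  # flattened 16x16 feature map

# Single-direction scan, as used by the SSM-based blocks in this repo.
fwd = Mamba(d_model=dim).cuda()
out_single = fwd(x)

# Vim-style bidirectional scan (illustrative): a second Mamba consumes the
# reversed sequence, so parameters and activations roughly double.
bwd = Mamba(d_model=dim).cuda()
out_bi = fwd(x) + bwd(x.flip(1)).flip(1)
```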