wjh892521292 / LKM-UNet

Large Kernel Vision Mamba UNet for Medical Image Segmentation
https://arxiv.org/abs/2403.07332

Inconsistency Between Paper Description (Pixel-level SSM) and Code Implementation (BiPixelMambaLayer) #5

Open Zewbiee opened 1 month ago

Zewbiee commented 1 month ago

Hello, thank you for your inspiring work! However, I'm a little confused about one point.

In the paper's description of the Pixel-level SSM, it states, "when these sub-kernels are sent into a Mamba layer, the local adjacent pixels will be input continuously into SSM." However, from LKMUNet.py#L123, it appears that the BiPixelMambaLayer actually forms each token sequence from pixels sampled with stride self.p, rather than from adjacent pixels.

Could you please explain this discrepancy?

wjh892521292 commented 1 month ago

Sub-kernels are obtained at LKMUNet.py#L123 using the hyperparameter self.p, which denotes the number of sub-kernels along each dimension, not a stride. So the feature map is divided into p x p sub-kernels, and the H/p x W/p pixels within the same sub-kernel are adjacent and input continuously into the SSM.

Zewbiee commented 1 month ago

> So the feature map is divided into p x p sub-kernels, and the H/p x W/p pixels within the same sub-kernel are adjacent and input continuously into the SSM.

If so, I think the code should be

x_div = x.reshape(B, C, self.p, H//self.p, self.p, W//self.p).permute(0, 2, 4, 1, 3, 5).contiguous().view(B*self.p*self.p, C, H//self.p, W//self.p)

instead of

x_div = x.reshape(B, C, H//self.p, self.p, W//self.p, self.p).permute(0, 3, 5, 1, 2, 4).contiguous().view(B*self.p*self.p, C, H//self.p, W//self.p)
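For illustration, here is a minimal sketch of the difference between the two reshapes (my own toy example with a 4 x 4 feature map and p = 2, not the repository code):

```python
import torch

B, C, H, W, p = 1, 1, 4, 4, 2
x = torch.arange(H * W, dtype=torch.float32).view(B, C, H, W)  # pixel (i, j) holds value i*W + j

# proposed reshape: split H and W into p blocks of adjacent pixels, one block per sub-kernel
adjacent = x.reshape(B, C, p, H // p, p, W // p).permute(0, 2, 4, 1, 3, 5) \
            .contiguous().view(B * p * p, C, H // p, W // p)

# reshape as in LKMUNet.py#L123: the block index effectively acts as a stride
strided = x.reshape(B, C, H // p, p, W // p, p).permute(0, 3, 5, 1, 2, 4) \
           .contiguous().view(B * p * p, C, H // p, W // p)

print(adjacent[0, 0])  # [[0., 1.], [4., 5.]]  -> the top-left 2x2 block of adjacent pixels
print(strided[0, 0])   # [[0., 2.], [8., 10.]] -> pixels sampled with stride p across the map
```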

To confirm this, I generated an example image (with shape $300 \times 300 \times 3$):

(example image)

Set self.p = 3. As described in the paper, x_div[0] should then be a solid-color image (with shape $100 \times 100 \times 3$). However, the reshape at LKMUNet.py#L123 gives a sub-image containing all 9 colors.
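Here is a minimal sketch that reproduces this check with a synthetic image of nine solid 100 x 100 color blocks (a toy reproduction of my test, not the actual image above):

```python
import torch

B, C, H, W, p = 1, 3, 300, 300, 3

# synthetic image: a 3 x 3 grid of solid 100 x 100 blocks, one scalar "color id" per block
colors = torch.arange(9, dtype=torch.float32).view(3, 3)
img = colors.repeat_interleave(100, dim=0).repeat_interleave(100, dim=1)  # (300, 300)
x = img.expand(B, C, H, W)                                                # same id on every channel

def num_colors(x_div):
    # number of distinct pixel values in the first sub-kernel
    return x_div[0].unique().numel()

# reshape as in LKMUNet.py#L123
repo = x.reshape(B, C, H // p, p, W // p, p).permute(0, 3, 5, 1, 2, 4) \
        .contiguous().view(B * p * p, C, H // p, W // p)
# proposed reshape
proposed = x.reshape(B, C, p, H // p, p, W // p).permute(0, 2, 4, 1, 3, 5) \
            .contiguous().view(B * p * p, C, H // p, W // p)

print(num_colors(repo))      # 9 -> the first sub-kernel samples pixels from every block
print(num_colors(proposed))  # 1 -> the first sub-kernel is a single solid 100 x 100 block
```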