Open Zewbiee opened 1 month ago
Hello, thank you for your inspiring work! However, I have a little confusion.

In the paper's description of the pixel-level SSM, it states: "when these sub-kernels are sent into a Mamba layer, the local adjacent pixels will be input continuously into SSM." However, from LKMUNet.py#L123, it appears that the BiPixelMambaLayer actually forms a token sequence from pixels sampled with stride self.p.

Sub-kernels are obtained at LKMUNet.py#L123 via the hyperparameter self.p, which denotes the number of sub-kernels along each dimension, not the stride. The feature map should therefore be divided into p x p sub-kernels, and the H/p x W/p pixels within the same sub-kernel are spatially adjacent and should be input continuously into the SSM.

If so, I think the code should be

x_div = x.reshape(B, C, self.p, H//self.p, self.p, W//self.p).permute(0, 2, 4, 1, 3, 5).contiguous().view(B*self.p*self.p, C, H//self.p, W//self.p)

instead of

x_div = x.reshape(B, C, H//self.p, self.p, W//self.p, self.p).permute(0, 3, 5, 1, 2, 4).contiguous().view(B*self.p*self.p, C, H//self.p, W//self.p)

To confirm this, I generated an example image (with shape of $300 \times 300 \times 3$) made of solid-color blocks and set self.p = 3. As described in the paper, x_div[0] should then be a solid-color image (with shape of $100 \times 100 \times 3$). However, the current code at LKMUNet.py#L123 gives an image containing 9 colors.

Could you please explain this discrepancy?
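To make the difference between the two reshapes concrete, here is a minimal, self-contained sketch of the check described above. It uses NumPy (np.transpose standing in for torch's permute) on a small block-colored map; the array sizes here (H = W = 6, p = 2) are illustrative choices, not values from the repository:

```python
import numpy as np

# Build a "solid-color blocks" image: a 6x6 map whose 2x2 grid of
# 3x3 blocks each holds one distinct constant value (p = 2).
B, C, H, W, p = 1, 1, 6, 6, 2
blocks = np.arange(p * p).reshape(p, p)
x = np.kron(blocks, np.ones((H // p, W // p)))[None, None]  # (B, C, H, W)

# Current code: splitting H as (H//p, p) makes the permuted axes the
# *offsets* of a stride-p grid, so each sub-map samples every p-th pixel.
cur = (x.reshape(B, C, H // p, p, W // p, p)
        .transpose(0, 3, 5, 1, 2, 4)
        .reshape(B * p * p, C, H // p, W // p))

# Proposed fix: splitting H as (p, H//p) makes each sub-map one
# spatially contiguous H//p x W//p tile.
fix = (x.reshape(B, C, p, H // p, p, W // p)
        .transpose(0, 2, 4, 1, 3, 5)
        .reshape(B * p * p, C, H // p, W // p))

print(np.unique(fix[0]).size)  # 1     -> solid tile, as the paper describes
print(np.unique(cur[0]).size)  # 4     -> all p*p colors interleaved
```

With the proposed split, the first sub-map is a solid tile; with the current split, it mixes all p x p colors, matching the 9-color result observed with self.p = 3.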