Closed alalbiol closed 1 year ago
Hi,
Thank you very much for your interest in our work. I'm really glad that our work helps.
Your understanding of fig 8 is correct, no spatial information is used for the initial semantic map generation.
For the bidirectional attention, i.e., fig 1 (B), the 1x1 convolution on the 2D semantic maps is essentially the same as a linear layer on sequence inputs: it transforms each token linearly and independently, so you could indeed do it with linear layers and no reshaping. So we actually share the same idea on this point. I initially wanted to incorporate spatial information into the semantic map, which is why I used the 1x1 conv on the 2D semantic map. But after experiments, I found that the spatial information doesn't help, due to the small size of the semantic map.
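To make the equivalence concrete, here is a minimal sketch (not the repo's actual code, and the layer sizes are made up) showing that a 1x1 conv on a (B, C, H, W) map and a linear layer on the flattened token sequence compute the same function once their weights are shared:

```python
import torch
import torch.nn as nn

# A 1x1 conv applies one linear transform to every spatial position,
# so it matches a Linear layer acting on each token of the sequence.
conv = nn.Conv2d(16, 32, kernel_size=1, bias=True)
linear = nn.Linear(16, 32)

# Copy the conv weights into the linear layer so both compute the same map.
with torch.no_grad():
    linear.weight.copy_(conv.weight.view(32, 16))
    linear.bias.copy_(conv.bias)

x = torch.randn(2, 16, 8, 8)                 # (B, C, H, W)
y_conv = conv(x)                             # (B, 32, H, W)

tokens = x.flatten(2).transpose(1, 2)        # (B, H*W, C) token sequence
y_lin = linear(tokens).transpose(1, 2).view(2, 32, 8, 8)

print(torch.allclose(y_conv, y_lin, atol=1e-5))  # True
```

Since no kernel ever spans more than one pixel, the h and w axes are just bookkeeping here; nothing spatial is mixed.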
I'm currently working on this repo to add support for distributed training and more datasets. The new updates will come soon once I finish testing. If you have any suggestions, for example support for particular datasets or new features, please let me know.
Dear Yunhe, I have tested your model on several segmentation tasks with excellent results. I would like to ask you a quick question about your segmentation map.
If I am not wrong, once you compute the semantic map as in fig 8, all the spatial information is lost and you get a global summary of the scene. The upper branch with the softmax selects "where" to look and the lower branch selects "what" to look at. This information is later refined by your network.
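My reading of fig 8 could be sketched roughly like this (the names and shapes are my own, not the paper's): the "where" branch yields spatial attention weights, the "what" branch yields features, and each semantic token is an attention-weighted global average, so h and w are summed out entirely:

```python
import torch

B, C, H, W, N = 2, 64, 16, 16, 8           # N semantic tokens (hypothetical sizes)
feats = torch.randn(B, C, H, W)

where_logits = torch.randn(B, N, H * W)    # stand-in for the "where" branch output
attn = where_logits.softmax(dim=-1)        # softmax over the spatial positions
what = feats.flatten(2)                    # (B, C, H*W), the "what" branch

# Each semantic token is a weighted sum over ALL positions: spatial
# layout is collapsed, leaving a global summary of the scene.
semantic_map = torch.einsum('bnp,bcp->bnc', attn, what)
print(semantic_map.shape)                  # torch.Size([2, 8, 64])
```

The point is only that the output has no h, w axes left, which is what I meant by "a global summary of the scene".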
Interestingly, the bidirectional attention layers use 1x1 convolutions on the semantic maps, so information between adjacent pixels of the semantic map is never mixed. In fact, the w, h dimensions are never used for anything, and everything could be done equivalently without reshaping into the w and h dimensions. In the paper you say that convolutions are not used for the semantic maps, but I really think convolutions shouldn't be used there at all, because they make no sense.
Finally, I really think the proposed approach is very interesting: the semantic maps accumulate global information at each layer, and I believe that is why the algorithm works so well. But I would rewrite this interpretation of the h, w dimensions.