In some early experiments, we tried both the 3D conv decoder (3D FPN + 3D conv head) and the transformer decoder, with a ResNet-like 3D conv encoder as the backbone. We made the following observations:
(1) The 3D conv decoder does not seem to benefit (at least not obviously) from the 3D augmentation techniques. (2) The best recorded performance for the 3D conv decoder (without 3D aug) is around 12.2 mIoU, which is ~0.5 mIoU lower than the transformer decoder (with 3D aug).
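For concreteness, here is a minimal sketch of what such a 3D conv decoder (3D FPN + 3D conv head) could look like. This is purely illustrative: the module name `Conv3DDecoder`, the channel widths, and the class count are assumptions for the sketch, not the actual implementation in this repo.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Conv3DDecoder(nn.Module):
    """Sketch of a 3D FPN + 3D conv head for voxel-wise SSC prediction (assumed design)."""

    def __init__(self, in_channels=(64, 128, 256), mid_channels=64, num_classes=20):
        super().__init__()
        # 1x1x1 lateral convs project each pyramid level to a common width
        self.laterals = nn.ModuleList(
            nn.Conv3d(c, mid_channels, kernel_size=1) for c in in_channels
        )
        # 3x3x3 smoothing convs applied after top-down fusion
        self.smooths = nn.ModuleList(
            nn.Conv3d(mid_channels, mid_channels, kernel_size=3, padding=1)
            for _ in in_channels
        )
        # Plain conv head producing per-voxel class logits
        self.head = nn.Sequential(
            nn.Conv3d(mid_channels, mid_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(mid_channels, num_classes, kernel_size=1),
        )

    def forward(self, feats):
        # feats: list of 3D feature maps, highest resolution first,
        # each of shape (B, C_i, X_i, Y_i, Z_i)
        laterals = [lat(f) for lat, f in zip(self.laterals, feats)]
        # Top-down pathway: upsample the coarser level and add it in
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[2:],
                mode="trilinear", align_corners=False,
            )
        outs = [s(lat) for s, lat in zip(self.smooths, laterals)]
        # Predict on the finest level: (B, num_classes, X, Y, Z) voxel logits
        return self.head(outs[0])
```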
Since this is an important ablation, we will rerun the 3D conv decoder with the latest settings soon.
Hope it helps.
Inspiring and helpful! Thanks for taking the time to reply.
Congratulations on creating such excellent and solid work!
However, I'm wondering about the results achieved without the query-based Transformer decoder, or in other words, the isolated impact of the Transformer Occupancy Decoder. Given that the Dual-path Transformer Encoder guides the voxel features with the BEV features, it seems that the voxel features should already be sufficiently fine-grained for SSC. Additionally, the absence of instance-level annotations could reduce the impact of the Transformer Decoder and its one-to-one matching.
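For context on the one-to-one matching mentioned above, here is a minimal sketch of a DETR-style Hungarian assignment between query predictions and targets. The function name `hungarian_match`, the L1 mask cost, and the cost weights are simplifying assumptions (real matchers typically combine focal/BCE and dice mask costs); this is not this repo's exact matcher.

```python
import torch
from scipy.optimize import linear_sum_assignment

@torch.no_grad()
def hungarian_match(pred_logits, pred_masks, gt_labels, gt_masks,
                    cls_weight=1.0, mask_weight=1.0):
    """One-to-one matching between N query predictions and M targets (assumed costs).

    pred_logits: (N, num_classes) class logits per query
    pred_masks:  (N, V) per-query voxel mask logits, V = flattened voxel count
    gt_labels:   (M,) target class indices
    gt_masks:    (M, V) binary target masks
    Returns (query_indices, target_indices) from the optimal assignment.
    """
    # Classification cost: negative probability of each target's class
    prob = pred_logits.softmax(-1)            # (N, num_classes)
    cost_cls = -prob[:, gt_labels]            # (N, M)

    # Mask cost: mean per-voxel L1 distance as a simple stand-in
    pred = pred_masks.sigmoid()               # (N, V)
    cost_mask = torch.cdist(pred, gt_masks.float(), p=1) / pred.shape[1]  # (N, M)

    cost = cls_weight * cost_cls + mask_weight * cost_mask
    row, col = linear_sum_assignment(cost.cpu().numpy())
    return row, col
```

Under this scheme, queries that fail to match any target are supervised only toward a "no object" class, which is why the lack of instance-level annotations could plausibly dilute the benefit of the matching.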
I would appreciate any insights you may have on this matter.