zhangyp15 / OccFormer

[ICCV 2023] OccFormer: Dual-path Transformer for Vision-based 3D Semantic Occupancy Prediction
https://arxiv.org/abs/2304.05316
Apache License 2.0

What are the results without the query-based Transformer decoder? #1

Closed npurson closed 1 year ago

npurson commented 1 year ago

Congratulations on creating such an excellent and solid work!

However, I'm wondering about the results achieved without the query-based Transformer decoder, or in other words, the isolated impact of the Transformer Occupancy Decoder. Given that the Dual-path Transformer Encoder guides the Voxel Features with the BEV feature, it seems that the voxel features should already possess sufficient fine-grained features for SSC. Additionally, the absence of instance-level annotations could possibly reduce the impact of the Transformer Decoder and one-to-one matching.

I would appreciate any insights you may have on this matter.

zhangyp15 commented 1 year ago
  1. In some early experiments, we tried both the 3dconv decoder (3d fpn + 3d conv head) and the transformer decoder when the encoder was a resnet-like 3dconv encoder. We made the following observations:

    (1) The 3dconv decoder does not seem to benefit (at least not obviously) from the 3D augmentation techniques. (2) The best recorded performance for the 3dconv decoder (w/o 3d aug) is around 12.2 mIoU, which is ~0.5 mIoU lower than the transformer decoder (with 3d aug).

Since this is an important ablation, we will re-run the 3dconv decoder with the latest settings soon.
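For readers unfamiliar with the baseline being discussed: a "3d fpn + 3d conv head" decoder might look roughly like the sketch below. This is a minimal illustration, not the repo's actual code; the channel widths, class count, and module names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Conv3DDecoder(nn.Module):
    """Minimal 3D-FPN + 3D conv head sketch (channels/shapes are illustrative)."""

    def __init__(self, in_channels=(64, 128, 256), mid=64, num_classes=17):
        super().__init__()
        # 1x1x1 lateral convs project each pyramid level to a common width
        self.laterals = nn.ModuleList(nn.Conv3d(c, mid, 1) for c in in_channels)
        # plain per-voxel 3D conv classification head on the fused volume
        self.head = nn.Sequential(
            nn.Conv3d(mid, mid, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(mid, num_classes, 1))

    def forward(self, feats):
        # feats: list of voxel features ordered fine -> coarse (increasing stride)
        fused = self.laterals[-1](feats[-1])
        for lat, f in zip(reversed(self.laterals[:-1]), reversed(feats[:-1])):
            # upsample the coarser fused volume and add the lateral projection
            fused = lat(f) + F.interpolate(
                fused, size=f.shape[2:], mode='trilinear', align_corners=False)
        return self.head(fused)  # per-voxel class logits at the finest resolution
```

Unlike the query-based decoder, this head predicts a dense per-voxel class distribution directly, with no masks or one-to-one matching involved.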

  2. We think the query-based decoder provides a smooth transition towards panoptic occupancy. Also, we ran some preliminary experiments with a panoptic version of OccFormer by importing the panoptic lidarseg labels from nuScenes. However, we found that the instance-level annotations make each mask too sparse, which hinders the learning process.
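The sparsity issue mentioned above can be illustrated with toy numbers: splitting one semantic mask into per-instance masks leaves each matched mask supervising only a tiny fraction of the voxel grid. The grid size and instance counts below are made up for illustration, not nuScenes statistics.

```python
import numpy as np

rng = np.random.default_rng(0)
# toy voxel grid of instance ids: 0 = empty, 1..20 = instances
# (synthetic data, not actual nuScenes panoptic lidarseg labels)
grid = np.zeros((50, 50, 4), dtype=np.int64)
for inst_id in range(1, 21):            # 20 small object instances
    x, y = rng.integers(0, 47, size=2)
    grid[x:x + 3, y:y + 3, :2] = inst_id  # each covers ~18 voxels

# one merged semantic mask vs. twenty per-instance masks
class_mask_ratio = (grid > 0).mean()
inst_ratios = [(grid == i).mean() for i in range(1, 21)]
print(f"semantic mask occupancy:      {class_mask_ratio:.4%}")
print(f"mean per-instance occupancy:  {np.mean(inst_ratios):.4%}")
# each one-to-one matched instance mask supervises far fewer voxels
# than the merged semantic mask, so its loss signal is much sparser
```

This is consistent with the observation that instance-level supervision yields very sparse per-mask targets.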

Hope it helps.

npurson commented 1 year ago

Inspiring and helpful! Thanks for taking the time to reply.