yuantianyuan01 / StreamMapNet

GNU General Public License v3.0

Replacing the BEVFormer backbone with bevpool fails to converge. #19

Open synsin0 opened 9 months ago

synsin0 commented 9 months ago

Thanks for your work. I tried replacing the BEVFormer backbone with the LSSTransform from MapTR, but training does not converge: the loss stalls around 30. I tested without the streaming config and on the old split. Have you tested different view transformers?

yuantianyuan01 commented 8 months ago

I tried the Inverse Perspective Mapping (IPM) backbone used in VectorMapNet and it converged normally. Maybe you can try adjusting some training hyperparameters?

Zhenghao97 commented 8 months ago

@synsin0 Hi, I also used an LSS-BEV method with StreamMapNet's temporal fusion module. Note that the default vehicle direction in LSS-BEV is heading north rather than east (the StreamMapNet default), so you may want to check this point.

Zhenghao97 commented 8 months ago

@yuantianyuan01 Hi, tianyuan. I used your BEV fusion method with Lift-Splat-Shoot. The problem I ran into is that when the 4 warmup epochs (non-temporal) end and temporal fusion starts, the network's performance degrades sharply. Have you seen this, and do you have any suggestions? I'd appreciate it.

synsin0 commented 8 months ago

@Zhenghao97 Could you share the config of your LSS-based StreamMapNet with me? Thanks a lot! I also observed a sudden drop after the warmup epochs in BEVDet, so it may be common across the BEVDet series.

synsin0 commented 8 months ago

BEVDet does not provide a solution for the temporal warmup epochs. Besides, it trains for 20 epochs with CBGS, so training continues for a very long time after the warmup.

Zhenghao97 commented 8 months ago

@synsin0 I have solved the sharp degradation problem. In my experiments, temporal fusion accelerates convergence and stabilizes training. You just need to replace the GRU module with a simple residual block for the BEV fusion.

synsin0 commented 8 months ago

@Zhenghao97 Hi, I still fail to converge with the bevpool setup. I think the fault may be that my bevpool is still incompatible with StreamMapNet's downstream modules (I rotated the BEV feature (B, C, 50, 100) from north to east, but it did not help). I would be sincerely grateful if you could share the related code for the bevpool module and the BEV fusion module!

Zhenghao97 commented 8 months ago

@synsin0 About the vehicle heading direction: you need to keep it the same between the BEV feature and the GT label. In the warp phase, I rotate the BEV feature to head east. After warping, I restore the warped BEV feature to head north. The GT label always heads north.
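For anyone trying to reproduce the rotate, warp, restore round trip described here, a minimal NumPy sketch follows. It assumes a (C, H, W) layout in which "heading north" means the vehicle's forward direction points to row 0 and "heading east" means it points toward the last column; the function names are hypothetical, and whether an extra flip is needed depends on your lateral-axis convention.

```python
import numpy as np

def north_to_east(bev):
    """Rotate a (C, H, W) BEV feature 90 degrees clockwise in the H-W plane.

    Assumption: forward points to row 0 before and to the last column after.
    """
    return np.rot90(bev, k=-1, axes=(1, 2))

def east_to_north(bev):
    """Inverse of north_to_east (90 degrees counterclockwise)."""
    return np.rot90(bev, k=1, axes=(1, 2))

# Tiny feature map: 4 rows along the forward axis, 2 columns laterally.
feat_north = np.zeros((1, 4, 2), dtype=np.float32)
feat_north[0, 0, 0] = 1.0  # marker at the front-left cell

feat_east = north_to_east(feat_north)   # shape becomes (1, 2, 4)
# ... the temporal warp would be applied here, in the east-heading frame ...
restored = east_to_north(feat_east)     # back to (1, 4, 2)
```

Placing a single marker cell like this and checking where it lands is a cheap way to confirm the rotation direction before training.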

About the temporal fusion module: sorry, I can't release the code because of my company's safety policy. However, the implementation is not complex. After warping, you just need to concatenate the two features along the channel dimension and apply a channel-wise conv fusion. For details, you can refer to the Fast-BEV paper from SenseTime.
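Since the code itself is not available, here is a rough NumPy sketch of what "concat along the channel, then conv fusion" could look like. It is not the commenter's implementation: the 1x1 convolution, the ReLU, the residual connection, and the name `fuse_bev` are all assumptions made for illustration.

```python
import numpy as np

def fuse_bev(cur, warped_prev, weight, bias):
    """Fuse the current BEV feature with the warped previous one.

    cur, warped_prev: (C, H, W) arrays in the same coordinate frame.
    weight: (C, 2C) matrix acting as a 1x1 convolution (assumption).
    bias: (C,) vector.
    """
    stacked = np.concatenate([cur, warped_prev], axis=0)        # (2C, H, W)
    mixed = np.einsum("oc,chw->ohw", weight, stacked)           # 1x1 conv
    mixed = mixed + bias[:, None, None]
    return cur + np.maximum(mixed, 0.0)                         # residual + ReLU

rng = np.random.default_rng(0)
C, H, W = 4, 3, 5
cur = rng.normal(size=(C, H, W))
prev = rng.normal(size=(C, H, W))
fused = fuse_bev(cur, prev, rng.normal(scale=0.1, size=(C, 2 * C)), np.zeros(C))
```

With zero weights the block reduces to the identity on `cur`, which is one reason a residual form can be gentler than a gated update when temporal fusion is switched on after warmup.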

yuantianyuan01 commented 8 months ago

@synsin0 Sorry for the late response. I suggest you check the coordinate system of your BEV feature. Popular BEV-based models (e.g. BEVFormer, BEVDet) build their BEV coordinates with the x-axis pointing right and the y-axis pointing down, keeping the vehicle heading rightward. Our code, however, reverses the y-axis of the BEV feature to make it consistent with the ego coordinate frame. Note that simply rotating the BEV feature may not be enough; you may also need to flip it (maybe along the y-axis, depending on your implementation).
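To illustrate why a rotation alone may not be enough, here is a small NumPy sketch. It assumes a (C, H, W) layout where the y-axis maps to the row (H) axis; `flip_y` is a hypothetical helper, and the axis to flip in a real pipeline depends on the implementation.

```python
import numpy as np

def flip_y(bev):
    """Reverse the row (H) axis of a (C, H, W) BEV feature."""
    return np.flip(bev, axis=1)

bev = np.arange(6, dtype=np.float32).reshape(1, 2, 3)
flipped = flip_y(bev)

# A 180-degree rotation is the only non-trivial in-plane rotation that keeps
# the (2, 3) shape, and it still differs from the flip: a flip is a mirror
# image, which no rotation can produce for an asymmetric feature map.
rot180 = np.rot90(bev, k=2, axes=(1, 2))
rotation_matches_flip = np.array_equal(flipped, rot180)
```

If a model trains but predicts mirrored maps, or does not converge at all, comparing a marker cell before and after the view transformer against the GT frame is a quick way to tell a missing flip from a missing rotation.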

Zhenghao97 commented 8 months ago

@synsin0 I found that the layernorm in the GRU module severely interferes with the temporal performance of Lift-Splat. After turning it off, the GRU module worked well. You can try it too.

@yuantianyuan01 Hi, tianyuan, I visualized the fused temporal feature from the GRU module with and without layernorm. With Lift-Splat, the layernorm leads to an abnormal distribution of the fused temporal feature. It is strange, since this layer worked well in your paper. Could you give some suggestions? I'd appreciate it.
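For readers who want to run the same with/without comparison, below is a simplified NumPy sketch of a GRU-style BEV fusion with a switch for the layernorm. It is not the StreamMapNet module: the 1x1 convolutions, the placement of the layernorm on the output, and all names (`gru_fuse`, `use_layernorm`, the `params` keys) are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1x1(weight, x):
    """Apply a (C_out, C_in) matrix to a (C_in, H, W) feature map."""
    return np.einsum("oc,chw->ohw", weight, x)

def layer_norm_channels(x, eps=1e-5):
    """Normalize each BEV cell over its channel axis (no learned affine)."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def gru_fuse(hidden, cur, params, use_layernorm=False):
    """One GRU-style update of the BEV memory `hidden` with the frame `cur`."""
    xh = np.concatenate([cur, hidden], axis=0)
    z = sigmoid(conv1x1(params["wz"], xh))            # update gate
    r = sigmoid(conv1x1(params["wr"], xh))            # reset gate
    xrh = np.concatenate([cur, r * hidden], axis=0)
    cand = np.tanh(conv1x1(params["wh"], xrh))        # candidate state
    out = (1.0 - z) * hidden + z * cand
    return layer_norm_channels(out) if use_layernorm else out

rng = np.random.default_rng(0)
C, H, W = 4, 3, 5
params = {k: rng.normal(scale=0.1, size=(C, 2 * C)) for k in ("wz", "wr", "wh")}
hidden = rng.normal(size=(C, H, W))
cur = rng.normal(size=(C, H, W))
plain = gru_fuse(hidden, cur, params, use_layernorm=False)
normed = gru_fuse(hidden, cur, params, use_layernorm=True)
```

Plotting histograms of `plain` and `normed` per channel is one way to see the distribution shift described above.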

yuantianyuan01 commented 8 months ago

I guess one reason could be the different nature of the features extracted by LSS and BEVFormer. Transformer-based BEV extractors inherently apply layernorm to each BEV query.

KSonPham commented 7 months ago

Hi @yuantianyuan01, could you tell me in which part of the code you reverse the y-axis of the BEV feature? Thanks

PeggyPeppa commented 7 months ago

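@synsin0 I found that the layernorm in the GRU module severely interferes with the temporal performance of Lift-Splat. After turning it off, the GRU module worked well. You can try it too.

@yuantianyuan01 Hi, tianyuan, I visualized the fused temporal feature from the GRU module with and without layernorm. With Lift-Splat, the layernorm leads to an abnormal distribution of the fused temporal feature. It is strange, since this layer worked well in your paper. Could you give some suggestions? I'd appreciate it.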

Hi @Zhenghao97, could you tell me the y-axis direction of LSS? Can I just apply rot90 or transpose x and y of the LSS BEV feature to align it with StreamMapNet?

Zhenghao97 commented 6 months ago

@PeggyPeppa Yes, that's exactly what I did. A good idea is to turn on only CAM_FRONT and CAM_LEFT_FRONT while debugging. This helps you see where things end up in the BEV, and take care to keep the coordinate systems aligned throughout your network's forward pass.
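A small sketch of this debugging trick, assuming the multi-camera input is an array of shape (N_cam, C, H, W) and that the camera order is known. The helper `mask_cameras` and the camera order are hypothetical; the names below follow nuScenes, which calls the front-left camera CAM_FRONT_LEFT.

```python
import numpy as np

def mask_cameras(images, camera_names, keep):
    """Zero out every camera image whose name is not in `keep`.

    images: (N_cam, C, H, W) array; camera_names: list of N_cam strings.
    """
    mask = np.array([name in keep for name in camera_names], dtype=images.dtype)
    return images * mask[:, None, None, None]

# Assumed camera order, for illustration only.
camera_names = [
    "CAM_FRONT", "CAM_FRONT_RIGHT", "CAM_FRONT_LEFT",
    "CAM_BACK", "CAM_BACK_LEFT", "CAM_BACK_RIGHT",
]
images = np.ones((6, 3, 2, 2), dtype=np.float32)
debug_images = mask_cameras(images, camera_names, {"CAM_FRONT", "CAM_FRONT_LEFT"})
```

With only these two cameras active, the BEV feature should light up in the front and front-left region of the grid; if the activation appears elsewhere, the coordinate frames are not aligned.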

Wolfybox commented 4 months ago

I guess one reason could be the different nature of the features extracted by LSS and BEVFormer. Transformer-based BEV extractors inherently apply LN to each BEV query.

Agreed. To add to this: LSS uses batch norm, so during the warm-up the BEV features are normalized differently than they would be under LN. Once the GRU is involved, these BEV features are suddenly normalized along the channel dimension, which abruptly changes their distribution. With BEVFormer (which also applies LayerNorm), the normalization behavior is consistent throughout training, so the feature distribution stays relatively stable. Therefore, besides removing the LN in the GRU, I guess adding LN to LSS would also alleviate the performance drop.
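The argument can be checked on toy data with a short NumPy sketch. It only illustrates the two normalization axes and is not a reproduction of either model; the helper names and the synthetic features are assumptions, and the learned affine parameters and running statistics of real BatchNorm/LayerNorm layers are left out.

```python
import numpy as np

def batch_norm_like(x, eps=1e-5):
    """Normalize each channel of a (B, C, H, W) tensor over batch and space."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def layer_norm_like(x, eps=1e-5):
    """Normalize each BEV cell of a (B, C, H, W) tensor over its channels."""
    mean = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
# Synthetic features whose channels have different scales and a shared offset.
scales = np.arange(1, 9, dtype=np.float64).reshape(1, 8, 1, 1)
raw = rng.normal(size=(2, 8, 4, 4)) * scales + 3.0

bn = batch_norm_like(raw)          # what batch-norm-style features look like
ln_after_bn = layer_norm_like(bn)  # what a channel-wise LN then does to them

# Per-cell channel means: generally non-zero after BN, forced to zero by LN.
cell_mean_bn = bn.mean(axis=1)
cell_mean_ln = ln_after_bn.mean(axis=1)
```

Downstream layers that were trained on `bn`-style statistics during warm-up would suddenly see `ln_after_bn`-style statistics once the LN in the fusion module is active, which is the distribution jump suspected here.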

The above is only a theoretical analysis, since I have not run the experiments myself. Also, after reading through the discussion, I could not find the warm-up logic in the latest code. Would you mind pointing out where it is?