I've been following this series of papers; they are quite impressive! Regarding the implementation of the TopoMLP code, I have several questions:
I noticed that the lane detection head does not use DETR's encoder module and only uses the second-to-last image feature map extracted by the backbone.
In contrast, the traffic elements detection head utilizes the entire DETR architecture and all four layers of features. What caused this difference?
In petr_transformer.py, line 89, there is a parameter named self.cross. From my observation, enabling this switch allows attention interaction between images from different cameras, yet it seems this functionality is not used during the default training process. Why is that?![fig2](https://github.com/wudongming97/TopoMLP/assets/141025598/a9b9d177-43cf-4f05-a3b5-2b03ece163fe)
In lane_head.py, line 300, there is a section of code that generates the distribution of depth-direction coordinates. The calculation involving self.position_range[3] - self.depth_start puzzles me: position_range[3] appears to be the maximum range of the x-axis in BEV space, so why is the difference between the BEV x-axis range and the depth start used as the depth range along the image frustum direction?![fig3](https://github.com/wudongming97/TopoMLP/assets/141025598/4cac0c90-251b-4635-9826-ff381d0c423a)
Looking forward to your response. Thank you very much.
Our centerline detection follows PETR in using a single-scale feature, while our traffic element detection follows Deformable DETR in using multi-scale features.
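To illustrate the difference, here is a minimal sketch of the two feature-selection patterns. The feature shapes and variable names are illustrative assumptions, not TopoMLP's actual code: the PETR-style head takes one backbone level, while the Deformable-DETR-style head flattens all four levels into a single multi-scale token sequence.

```python
import torch

# Hypothetical FPN outputs from the backbone: four feature maps at
# progressively halved resolutions (shapes are illustrative only).
feats = [torch.randn(1, 256, 64 // 2**i, 64 // 2**i) for i in range(4)]

# PETR-style centerline head: a single scale, here the second-to-last map.
lane_feat = feats[-2]

# Deformable-DETR-style traffic head: flatten all four levels into one
# token sequence; per-level spatial shapes are kept for deformable sampling.
spatial_shapes = [f.shape[-2:] for f in feats]
mlvl_tokens = torch.cat([f.flatten(2).transpose(1, 2) for f in feats], dim=1)
```

The single-scale choice keeps the centerline head light, whereas the multi-scale sequence lets the traffic head handle elements of very different sizes.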
This setting does not work well, so you can ignore it.
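For what it's worth, the idea behind such a switch can be sketched as follows. This is a hypothetical illustration, not the actual petr_transformer.py code: with the switch off, each camera's tokens attend only within their own image; with it on, all views are merged into one sequence so tokens from different cameras can attend to each other.

```python
import torch
import torch.nn as nn

num_views, hw, c = 6, 100, 256          # illustrative sizes
x = torch.randn(num_views, hw, c)       # per-camera feature tokens
attn = nn.MultiheadAttention(c, num_heads=8, batch_first=True)

# cross disabled: views form the batch dim, so attention stays per-camera.
per_view, _ = attn(x, x, x)

# cross enabled: merge all views into one long sequence, allowing
# attention interaction between tokens from different cameras.
joint = x.reshape(1, num_views * hw, c)
cross_view, _ = attn(joint, joint, joint)
```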
Here we follow the PETR setting and use the maximum range of the x-axis in BEV space to represent the maximum depth range. This is an approximation.
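A sketch of the PETR-style depth discretization may make this concrete. The values below (`depth_num`, `depth_start`, and `depth_max` standing in for `self.position_range[3]`) are assumed for illustration: bins are spaced linearly-increasingly, so they widen with distance, and the far end is bounded by the BEV x-axis maximum as the reply describes.

```python
import torch

depth_num = 64
depth_start = 1.0
depth_max = 51.2   # stands in for self.position_range[3] (BEV x-axis max)

# Linear-increasing discretization (LID): bin width grows with depth,
# so near depths get finer resolution than far depths.
index = torch.arange(depth_num, dtype=torch.float32)
bin_size = (depth_max - depth_start) / (depth_num * (1 + depth_num))
coords_d = depth_start + bin_size * index * (index + 1)
```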