Domain Shift Issue

Also, we observe a domain shift effect between the overlap region and the non-overlap region. We assume this is due to camera lens distortion and the strong inductive bias of deep neural networks. To gauge the degree of this inductive bias, we visualize the t-SNE results of features from the central (non-overlap, blue) and edge (overlap, red) regions. As clearly seen in the first row (DETR3D), the features of the two groups form distinguishable clusters. Our method starts from this observation.
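As a rough illustration of the kind of visualization described above, the following is a minimal sketch (not the authors' code) that plots t-SNE embeddings of two pre-extracted feature sets; the function name and the assumption that features are already available as NumPy arrays are mine.

```python
# Minimal sketch: t-SNE of central (non-overlap) vs. edge (overlap) region
# features, assuming the features have already been extracted as (N, C) arrays.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_region_tsne(central_feats: np.ndarray, edge_feats: np.ndarray):
    # Embed both groups jointly so their relative layout is comparable.
    feats = np.concatenate([central_feats, edge_feats], axis=0)
    emb = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(feats)

    n = len(central_feats)
    plt.scatter(emb[:n, 0], emb[:n, 1], c="blue", s=4, label="central (non-overlap)")
    plt.scatter(emb[n:, 0], emb[n:, 1], c="red", s=4, label="edge (overlap)")
    plt.legend()
    plt.show()
```

If the two scatters separate into distinct clusters, as in the DETR3D row, the detector is treating overlap-region features as a different domain.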
Main Architecture

This is the main architecture of ORA3D. Our model takes multi-view camera inputs and outputs a set of 3D bounding boxes for objects in the scene. It consists of two main modules. First, the Stereo Matching Network for Weak Depth Supervision, in which a depth estimation head is trained to predict a dense depth map of the overlap region; using the disparity estimates as supervision, we regularize the network to fully exploit the geometric potential of binocular images and improve overall detection accuracy. Second, the Adversarial Overlap Region Discriminator, which minimizes the representation gap between non-overlap and overlap regions.
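The following is a minimal PyTorch-style sketch of how these two auxiliary objectives could be combined with the detection loss. The function name, loss choices, and weighting coefficients are hypothetical placeholders, not the authors' exact implementation.

```python
# Sketch only: combine detection loss with (1) weak depth supervision on the
# overlap region and (2) an adversarial term against an overlap discriminator.
import torch
import torch.nn.functional as F

def total_loss(det_loss, pred_depth, stereo_depth, overlap_logits,
               lambda_depth=0.1, lambda_adv=0.01):
    # Weak depth supervision: push the predicted dense depth of the overlap
    # region toward depth derived from stereo disparity estimates.
    depth_loss = F.smooth_l1_loss(pred_depth, stereo_depth)

    # Adversarial term: the feature extractor is trained to fool the
    # discriminator, i.e. make overlap-region features look like
    # non-overlap ones (labels flipped to the "non-overlap" class, 0).
    adv_loss = F.binary_cross_entropy_with_logits(
        overlap_logits, torch.zeros_like(overlap_logits))

    return det_loss + lambda_depth * depth_loss + lambda_adv * adv_loss
```

The discriminator itself would be trained in alternation with the detector, in the usual adversarial fashion.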
Multi-view Camera System

To begin, in this work we utilize multi-view images around the ego-vehicle captured by a multi-view camera system. Recently, multi-view camera systems have become an economical and balanced alternative, as they resolve some of the weaknesses of monocular and stereo vision systems for the 3D object detection task and can potentially replace LiDAR sensors. In a multi-view system, adjacent cameras have a strong association. We regard this association as coming from the overlap regions, and it can be extended into a geometric guide.
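To make the overlap notion concrete, here is a minimal sketch (assuming a simple pinhole camera model; the helper names and camera dictionaries are hypothetical) that checks whether a 3D point in the ego frame projects into two adjacent cameras, i.e. lies in their overlap region.

```python
# Sketch: a point visible in both adjacent cameras lies in the overlap region.
import numpy as np

def in_view(point_ego, K, T_cam_from_ego, img_w, img_h):
    """K: 3x3 intrinsics, T_cam_from_ego: 4x4 ego-to-camera extrinsics."""
    p = T_cam_from_ego @ np.append(point_ego, 1.0)   # transform to camera frame
    if p[2] <= 0:                                    # behind the camera
        return False
    uv = K @ (p[:3] / p[2])                          # perspective projection
    return 0 <= uv[0] < img_w and 0 <= uv[1] < img_h

def in_overlap(point_ego, cam_a, cam_b):
    """cam_*: dicts with keys 'K', 'T', 'w', 'h' for two adjacent cameras."""
    return all(in_view(point_ego, c["K"], c["T"], c["w"], c["h"])
               for c in (cam_a, cam_b))
```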
Limitations of Previous Work

Next, I will explain the limitations of the existing landmark work in camera-only multi-view 3D object detection, DETR3D. DETR3D introduces a promising multi-view detection pipeline that processes six images concurrently in an end-to-end manner. Even though DETR3D performs reasonably well, we found that the network, without explicit guidance, does not fully exploit the geometric potential of multi-view camera systems. Specifically, its understanding of the scene can be limited to that of a monocular detection network, resulting in multiple false positives in the overlap regions, as shown in this figure. The pink dotted line represents the field of view of the front-right camera, and the blue dotted line represents the field of view of the front camera; the two fields of view overlap. Multiple false positives occur in this overlap area, causing performance degradation compared to other areas. In this paper, we therefore focus on addressing the overlap region issue in order to boost detection accuracy.