Keyword: lidar
Pseudo-Stereo for Monocular 3D Object Detection in Autonomous Driving
Abstract
Pseudo-LiDAR 3D detectors have made remarkable progress in monocular 3D detection by enhancing the capability of perceiving depth with depth estimation networks and reusing LiDAR-based 3D detection architectures. Advanced stereo 3D detectors can also localize 3D objects accurately. The gap in image-to-image generation for stereo views is much smaller than that in image-to-LiDAR generation. Motivated by this, we propose a Pseudo-Stereo 3D detection framework with three novel virtual-view generation methods, namely image-level generation, feature-level generation, and feature clone, for detecting 3D objects from a single image. Our analysis of depth-aware learning shows that, in our framework, the depth loss is effective only for feature-level virtual-view generation, whereas the estimated depth map is effective at both the image level and the feature level. We propose a disparity-wise dynamic convolution, with dynamic kernels sampled from the disparity feature map, to adaptively filter the features of a single image when generating virtual image features, which eases the feature degradation caused by depth estimation errors. As of submission (November 18, 2021), our Pseudo-Stereo 3D detection framework ranks 1st on car, pedestrian, and cyclist among published monocular 3D detectors on the KITTI-3D benchmark. The code is released at https://github.com/revisitq/Pseudo-Stereo-3D.
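The disparity-wise dynamic convolution is concrete enough to sketch. Below is a minimal PyTorch illustration of the general idea, per-pixel kernels predicted from a disparity feature map and used to filter single-image features; the module name, head design, and softmax normalization are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DisparityWiseDynamicConv(nn.Module):
    """Sketch: per-pixel k x k kernels predicted from disparity features
    adaptively filter the single-image features (shapes assumed)."""
    def __init__(self, disp_ch, k=3):
        super().__init__()
        self.k = k
        # predict one k x k kernel per spatial location from disparity features
        self.kernel_head = nn.Conv2d(disp_ch, k * k, kernel_size=1)

    def forward(self, img_feat, disp_feat):
        b, c, h, w = img_feat.shape
        kernels = F.softmax(self.kernel_head(disp_feat), dim=1)    # (B, k*k, H, W)
        patches = F.unfold(img_feat, self.k, padding=self.k // 2)  # (B, C*k*k, H*W)
        patches = patches.view(b, c, self.k * self.k, h * w)
        kernels = kernels.view(b, 1, self.k * self.k, h * w)
        out = (patches * kernels).sum(dim=2)                       # (B, C, H*W)
        return out.view(b, c, h, w)
```

A virtual (second-view) feature map could then be produced by applying this filtering to the features of the single input image.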
A Versatile Multi-View Framework for LiDAR-based 3D Object Detection with Guidance from Panoptic Segmentation
Authors: Hamidreza Fazlali, Yixuan Xu, Yuan Ren, Bingbing Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
3D object detection using LiDAR data is an indispensable component of autonomous driving systems. Yet only a few LiDAR-based 3D object detection methods leverage segmentation information to further guide the detection process. In this paper, we propose a novel multi-task framework that jointly performs 3D object detection and panoptic segmentation. In our method, the 3D object detection backbone in the Bird's-Eye-View (BEV) plane is augmented by the injection of Range-View (RV) feature maps from the 3D panoptic segmentation backbone. This enables the detection backbone to leverage multi-view information and address the shortcomings of each projection view. Furthermore, foreground semantic information is incorporated to ease the detection task by highlighting the locations of each object class in the feature maps. Finally, a new center density heatmap, generated from the instance-level information, further guides the detection backbone by suggesting likely box center locations. Our method works with any BEV-based 3D object detection method and, as shown by extensive experiments on the nuScenes dataset, provides significant performance gains. Notably, the proposed method, based on a single-stage CenterPoint 3D object detection network, achieves state-of-the-art performance on the nuScenes 3D detection benchmark with 67.3 NDS.
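As a hedged illustration of the center density heatmap idea, the sketch below splats a Gaussian at each instance center (e.g., derived from the panoptic instance masks) onto the BEV grid; the grid conventions, the sigma, and max-aggregation are assumptions rather than the paper's exact recipe.

```python
import numpy as np

def center_density_heatmap(centers_xy, grid_size, voxel, origin, sigma=1.0):
    """Splat a Gaussian at each instance center onto a BEV grid of shape
    (H, W); `voxel` is the cell size, `origin` the world position of cell (0, 0)."""
    H, W = grid_size
    heat = np.zeros((H, W), dtype=np.float32)
    ys, xs = np.mgrid[0:H, 0:W]
    for cx, cy in centers_xy:
        gx = (cx - origin[0]) / voxel   # world -> grid coordinates
        gy = (cy - origin[1]) / voxel
        g = np.exp(-((xs - gx) ** 2 + (ys - gy) ** 2) / (2 * sigma ** 2))
        heat = np.maximum(heat, g)      # keep the strongest peak per cell
    return heat
```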
Real-Time Hybrid Mapping of Populated Indoor Scenes using a Low-Cost Monocular UAV
Authors: Stuart Golodetz, Madhu Vankadari, Aluna Everitt, Sangyun Shin, Andrew Markham, Niki Trigoni
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Abstract
Unmanned aerial vehicles (UAVs) have been used for many applications in recent years, from urban search and rescue, to agricultural surveying, to autonomous underground mine exploration. However, deploying UAVs in tight, indoor spaces, especially close to humans, remains a challenge. One solution, when only a limited payload is required, is to use micro-UAVs, which pose less risk to humans and typically cost less to replace after a crash. However, micro-UAVs can only carry a limited sensor suite, e.g. a monocular camera instead of a stereo pair or LiDAR, complicating tasks like dense mapping and markerless multi-person 3D human pose estimation, which are needed to operate in tight environments around people. Monocular approaches to such tasks exist, and dense monocular mapping approaches have been successfully deployed for UAV applications. However, despite many recent works on both marker-based and markerless multi-UAV single-person motion capture, markerless single-camera multi-person 3D human pose estimation remains a much earlier-stage technology, and we are not aware of existing attempts to deploy it in an aerial context. In this paper, we present what is thus, to our knowledge, the first system to perform simultaneous mapping and multi-person 3D human pose estimation from a monocular camera mounted on a single UAV. In particular, we show how to loosely couple state-of-the-art monocular depth estimation and monocular 3D human pose estimation approaches to reconstruct a hybrid map of a populated indoor scene in real time. We validate our component-level design choices via extensive experiments on the large-scale ScanNet and GTA-IM datasets. To evaluate our system-level performance, we also construct a new Oxford Hybrid Mapping dataset of populated indoor scenes.
Keyword: loop detection
There is no result
Keyword: autonomous driving
A Versatile Multi-View Framework for LiDAR-based 3D Object Detection with Guidance from Panoptic Segmentation
Authors: Hamidreza Fazlali, Yixuan Xu, Yuan Ren, Bingbing Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
3D object detection using LiDAR data is an indispensable component of autonomous driving systems. Yet only a few LiDAR-based 3D object detection methods leverage segmentation information to further guide the detection process. In this paper, we propose a novel multi-task framework that jointly performs 3D object detection and panoptic segmentation. In our method, the 3D object detection backbone in the Bird's-Eye-View (BEV) plane is augmented by the injection of Range-View (RV) feature maps from the 3D panoptic segmentation backbone. This enables the detection backbone to leverage multi-view information and address the shortcomings of each projection view. Furthermore, foreground semantic information is incorporated to ease the detection task by highlighting the locations of each object class in the feature maps. Finally, a new center density heatmap, generated from the instance-level information, further guides the detection backbone by suggesting likely box center locations. Our method works with any BEV-based 3D object detection method and, as shown by extensive experiments on the nuScenes dataset, provides significant performance gains. Notably, the proposed method, based on a single-stage CenterPoint 3D object detection network, achieves state-of-the-art performance on the nuScenes 3D detection benchmark with 67.3 NDS.
Safety-aware metrics for object detectors in autonomous driving
Abstract
We argue that object detectors in the safety-critical domain should prioritize the detection of objects that are most likely to interfere with the actions of the autonomous actor. In particular, this applies to objects that can impact the actor's safety and reliability. In the context of autonomous driving, we propose new object detection metrics that reward the correct identification of objects that are most likely to interact with the subject vehicle (i.e., the actor) and that may affect its driving decisions. To achieve this, we build a criticality model that rewards the detection of objects based on their proximity, orientation, and relative velocity with respect to the subject vehicle. We then apply our model to the recent autonomous driving dataset nuScenes and compare eight different object detectors. Results show that, in several settings, the object detectors that perform best according to the nuScenes ranking are not the preferable ones when the focus shifts to safety and reliability.
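The abstract does not give the criticality model's exact form, so the following is only a plausible sketch of a score built from proximity, orientation, and relative velocity with respect to the subject vehicle; the combination rule and weights are invented for illustration.

```python
import numpy as np

def criticality(rel_pos, rel_vel, heading_diff, w=(0.5, 0.3, 0.2)):
    """Illustrative criticality score for one detected object, expressed in
    the subject vehicle's frame. All terms and weights are assumptions."""
    dist = np.linalg.norm(rel_pos)
    proximity = 1.0 / (1.0 + dist)                     # closer objects matter more
    closing = max(0.0, -np.dot(rel_pos, rel_vel) / (dist + 1e-6))  # approach rate
    facing = 0.5 * (1.0 + np.cos(heading_diff))        # heading toward the actor
    return w[0] * proximity + w[1] * np.tanh(closing) + w[2] * facing
```

Such a score can then weight true and false positives in standard detection metrics, so that missing a nearby, approaching object costs more than missing a distant, receding one.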
Intrinsically-Motivated Reinforcement Learning: A Brief Introduction
Abstract
Reinforcement learning (RL) is one of the three basic paradigms of machine learning. It has demonstrated impressive performance in complex tasks such as Go and StarCraft, and is increasingly applied in smart manufacturing and autonomous driving. However, RL consistently suffers from the exploration-exploitation dilemma. In this paper, we investigate the problem of improving exploration in RL and introduce intrinsically-motivated RL. In sharp contrast to classic exploration strategies, intrinsically-motivated RL utilizes the intrinsic learning motivation to provide sustainable exploration incentives. We carefully classify the existing intrinsic reward methods and analyze their practical drawbacks. Moreover, we propose a new intrinsic reward method via Rényi state entropy maximization, which overcomes the drawbacks of the preceding methods and provides powerful exploration incentives. Finally, extensive simulations demonstrate that the proposed module achieves superior performance with higher efficiency and robustness.
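For intuition, state-entropy exploration bonuses are commonly estimated with k-nearest-neighbor distances; the sketch below shows that generic estimator as a stand-in for the paper's Rényi-entropy variant, whose exact estimator is not given in the abstract.

```python
import numpy as np

def knn_entropy_bonus(states, k=5):
    """Particle-based state-entropy bonus: a larger distance to the k-th
    nearest neighbor indicates a less-visited state and earns a larger reward.
    `states` is an (N, dim) batch of visited states with N > k."""
    d = np.linalg.norm(states[:, None, :] - states[None, :, :], axis=-1)
    kth = np.sort(d, axis=1)[:, k]     # column 0 is the zero self-distance
    return np.log(kth + 1.0)           # one intrinsic reward per state
```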
Differentiable Control Barrier Functions for Vision-based End-to-End Autonomous Driving
Authors: Wei Xiao, Tsun-Hsuan Wang, Makram Chahine, Alexander Amini, Ramin Hasani, Daniela Rus
Abstract
Guaranteeing the safety of perception-based learning systems is challenging due to the absence of ground-truth state information, unlike in state-aware control scenarios. In this paper, we introduce a safety-guaranteed learning framework for vision-based end-to-end autonomous driving. To this end, we design a learning system equipped with differentiable control barrier functions (dCBFs) that is trained end-to-end by gradient descent. Our models are composed of conventional neural network architectures and dCBFs. They are interpretable at scale, achieve strong test performance with limited training data, and come with safety guarantees in a series of autonomous driving scenarios such as lane keeping and obstacle avoidance. We evaluated our framework in a sim-to-real environment and tested it on a real autonomous car, achieving safe lane following and avoidance of both Augmented Reality (AR) obstacles and real parked vehicles.
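The CBF condition itself is easy to show on a toy system. For a 1D integrator $\dot{x} = u$ with barrier $h(x) = d - x$, enforcing $\dot{h} + \alpha h \ge 0$ reduces to a closed-form clamp; this illustrates only the classical constraint, not the paper's learned, differentiable dCBF layers.

```python
def cbf_safe_control(x, u_nom, d=1.0, alpha=2.0):
    """Toy CBF filter for x_dot = u with h(x) = d - x (stay below x = d).
    The condition h_dot + alpha*h >= 0 becomes -u + alpha*(d - x) >= 0."""
    u_max = alpha * (d - x)      # largest control satisfying the barrier
    return min(u_nom, u_max)     # minimally modify the nominal control
```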
Pedestrian Stop and Go Forecasting with Hybrid Feature Fusion
Authors: Dongxu Guo, Taylor Mordan, Alexandre Alahi
Abstract
Forecasting pedestrians' future motions is essential for autonomous driving systems to safely navigate in urban areas. However, existing prediction algorithms often overly rely on past observed trajectories and tend to fail around abrupt dynamic changes, such as when pedestrians suddenly start or stop walking. We suggest that predicting these highly non-linear transitions should form a core component to improve the robustness of motion prediction algorithms. In this paper, we introduce the new task of pedestrian stop and go forecasting. Considering the lack of suitable existing datasets for it, we release TRANS, a benchmark for explicitly studying the stop and go behaviors of pedestrians in urban traffic. We build it from several existing datasets annotated with pedestrians' walking motions, in order to have various scenarios and behaviors. We also propose a novel hybrid model that leverages pedestrian-specific and scene features from several modalities, both video sequences and high-level attributes, and gradually fuses them to integrate multiple levels of context. We evaluate our model and several baselines on TRANS, and set a new benchmark for the community to work on pedestrian stop and go forecasting.
Keyword: mapping
Fast Neural Architecture Search for Lightweight Dense Prediction Networks
Authors: Lam Huynh, Esa Rahtu, Jiri Matas, Janne Heikkila
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Dense prediction is a class of computer vision problems that aim to map every pixel of the input image to some predicted value. Depending on the problem, the output values can be either continuous or discrete. For instance, monocular depth estimation and image super-resolution are often formulated as regression, while semantic segmentation is a dense classification, i.e., discrete, problem. More specifically, monocular depth estimation produces a dense depth map from a single image for use in various applications including robotics, scene understanding, and augmented reality. Single image super-resolution (SISR) is a low-level vision task that generates a high-resolution image from its low-resolution counterpart. SISR is widely utilized in medical and surveillance imaging, where images with more precise details can provide invaluable information. Finally, semantic segmentation predicts a dense map of semantic categories from a given image, which is crucial for image understanding tasks.
Second-order Symmetric Non-negative Latent Factor Analysis
Abstract
Precise representation of a large-scale undirected network is the basis for understanding the relations within a massive entity set. The undirected network representation task can be efficiently addressed by a symmetric non-negative latent factor (SNLF) model, whose objective is clearly non-convex. However, existing SNLF models commonly adopt a first-order optimizer that cannot handle the non-convex objective well, resulting in inaccurate representations. Higher-order learning algorithms are expected to make a breakthrough, but their computational efficiency is greatly limited by the direct manipulation of the Hessian matrix, which can be huge in undirected network representation tasks. To address this issue, this study incorporates an efficient second-order method into SNLF, thereby establishing a second-order symmetric non-negative latent factor analysis model for undirected networks, built on two ideas: a) incorporating a mapping strategy into the SNLF model to form an unconstrained model, and b) training the unconstrained model with a specially designed second-order method that acquires a proper second-order step efficiently. Empirical studies indicate that the proposed model outperforms state-of-the-art models in representation accuracy with affordable computational burden.
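The two ideas can be sketched with generic tools: reparameterize X = Y * Y (elementwise) so non-negativity holds automatically, then hand the unconstrained objective to an off-the-shelf second-order optimizer. SciPy's Newton-CG below is a stand-in for the paper's specially designed step; the objective scaling and initialization are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def snlf_second_order(A, rank, iters=100):
    """Fit a symmetric non-negative factorization A ~ X X^T with X = Y*Y,
    trained by a generic second-order method on the unconstrained Y."""
    n = A.shape[0]

    def f(y):
        Y = y.reshape(n, rank)
        E = (Y * Y) @ (Y * Y).T - A
        return 0.25 * np.sum(E * E)

    def grad(y):
        Y = y.reshape(n, rank)
        X = Y * Y
        E = X @ X.T - A
        return (2.0 * (E @ X) * Y).ravel()   # chain rule through X = Y*Y

    y0 = 0.1 * np.random.rand(n * rank)
    res = minimize(f, y0, jac=grad, method='Newton-CG',
                   options={'maxiter': iters})
    Y = res.x.reshape(n, rank)
    return Y * Y                              # non-negative factor X
```

Newton-CG approximates Hessian-vector products by finite differences of the gradient, so no explicit Hessian is ever formed, which is exactly the efficiency concern the abstract raises.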
Teaching Robots to Span the Space of Functional Expressive Motion
Authors: Arjun Sripathy, Andreea Bobu, Zhongyu Li, Koushil Sreenath, Daniel S. Brown, Anca D. Dragan
Abstract
Our goal is to enable robots to perform functional tasks in emotive ways, be it in response to their users' emotional states or expressive of their confidence levels. Prior work has proposed learning an independent cost function from user feedback for each target emotion, so that the robot may optimize it alongside task- and environment-specific objectives for any situation it encounters. However, this approach is inefficient when modeling multiple emotions and unable to generalize to new ones. In this work, we leverage the fact that emotions are not independent of each other: they are related through a latent space of Valence-Arousal-Dominance (VAD). Our key idea is to learn, from user labels, a model of how trajectories map onto VAD. Considering the distance between a trajectory's mapping and a target VAD allows this single model to represent cost functions for all emotions. As a result, 1) all user feedback can contribute to learning about every emotion; 2) the robot can generate trajectories for any emotion in the space instead of only a few predefined ones; and 3) the robot can respond emotively to user-generated natural language by mapping it to a target VAD. We introduce a method that interactively learns to map trajectories to this latent space and test it in simulation and in a user study. In experiments, we use a simple vacuum robot as well as the Cassie biped.
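The key idea reduces to a few lines: one learned trajectory-to-VAD map serves every emotion, with the cost being the distance to a target VAD point. The sketch below only restates that structure; `vad_model` and the trajectory features are placeholders.

```python
import numpy as np

def emotive_cost(traj_feat, vad_model, target_vad):
    """One model for all emotions: map the trajectory into VAD space and
    penalize distance to the target emotion's (V, A, D) coordinates."""
    vad = vad_model(traj_feat)                  # learned map, shape (3,)
    return np.linalg.norm(vad - np.asarray(target_vad))
```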
Plan Your Target and Learn Your Skills: Transferable State-Only Imitation Learning via Decoupled Policy Optimization
Authors: Minghuan Liu, Zhengbang Zhu, Yuzheng Zhuang, Weinan Zhang, Jianye Hao, Yong Yu, Jun Wang
Abstract
Recent progress in state-only imitation learning extends the scope of applicability of imitation learning to real-world settings by relieving the need to observe expert actions. However, existing solutions only learn to extract a state-to-action mapping policy from the data, without considering how the expert plans to reach the target. This hinders the ability to leverage demonstrations and limits the flexibility of the policy. In this paper, we introduce Decoupled Policy Optimization (DePO), which explicitly decouples the policy into a high-level state planner and an inverse dynamics model. With embedded decoupled policy gradient and generative adversarial training, DePO enables knowledge transfer to different action spaces or state transition dynamics, and can generalize the planner to out-of-demonstration state regions. Our in-depth experimental analysis shows the effectiveness of DePO in learning a generalized target state planner while achieving the best imitation performance. We demonstrate the appealing usage of DePO for transferring across different tasks by pre-training, and the potential for co-training agents with various skills.
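Structurally, the decoupled policy factors action selection into two learned parts, as in the placeholder sketch below (`planner` and `inverse_dynamics` are assumed interfaces, not DePO's actual API):

```python
def depo_act(state, planner, inverse_dynamics):
    """Decoupled policy: a high-level state planner proposes the next
    target state; an inverse dynamics model recovers the action for it."""
    target_state = planner(state)                 # what to reach next
    return inverse_dynamics(state, target_state)  # how to reach it
```

Because only the inverse dynamics model touches the action space, the planner can transfer across agents with different actuation, which is the transfer property the abstract claims.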
Feature Transformation for Cross-domain Few-shot Remote Sensing Scene Classification
Authors: Qiaoling Chen, Zhihao Chen, Wei Luo
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Effectively classifying remote sensing scenes remains a challenge due to the increasing spatial resolution of remote imaging and the large variance between remote sensing images. Existing research has greatly improved the performance of remote sensing scene classification (RSSC). However, these methods are not applicable to cross-domain few-shot problems, where the target domain has very limited training samples available and a data distribution different from the source domain. To improve the model's applicability, we propose a feature-wise transformation module (FTM) in this paper. FTM transfers the feature distribution learned on the source domain to that of the target domain by a very simple affine operation with negligible additional parameters. Moreover, FTM can be effectively learned on the target domain with only few training samples available and is agnostic to specific network structures. Experiments on RSSC and land-cover mapping tasks verify its ability to handle cross-domain few-shot problems. Compared with direct finetuning, FTM achieves better performance and possesses better transferability and fine-grained discriminability. Code will be publicly available.
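Since FTM is described as a very simple affine operation with negligible extra parameters, a plausible minimal form is a per-channel affine on feature maps (FiLM-style); the exact granularity and its placement in the backbone are assumptions.

```python
import torch
import torch.nn as nn

class FTM(nn.Module):
    """Sketch of a feature-wise transformation module: a channel-wise
    affine learned on the target domain over frozen source features."""
    def __init__(self, channels):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, x):            # x: (B, C, H, W) source-trained features
        return self.gamma * x + self.beta
```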
Freeform Body Motion Generation from Speech
Authors: Jing Xu, Wei Zhang, Yalong Bai, Qibin Sun, Tao Mei
Subjects: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Abstract
People naturally produce spontaneous body motions to enhance their speech while giving talks. Body motion generation from speech is inherently difficult due to the non-deterministic mapping from speech to body motions. Most existing works map speech to motion deterministically by conditioning on certain styles, leading to sub-optimal results. Motivated by studies in linguistics, we decompose co-speech motion into two complementary parts: pose modes and rhythmic dynamics. Accordingly, we introduce a novel freeform motion generation model (FreeMo) with a two-stream architecture, i.e., a pose mode branch for primary posture generation and a rhythmic motion branch for rhythmic dynamics synthesis. On the one hand, diverse pose modes are generated by conditional sampling in a latent space, guided by speech semantics. On the other hand, rhythmic dynamics are synced with the speech prosody. Extensive experiments demonstrate superior performance over several baselines in terms of motion diversity, quality, and synchronization with speech. Code and pre-trained models will be publicly available through https://github.com/TheTempAccount/Co-Speech-Motion-Generation.
Robust Event-Based Control: Bridge Time-Domain Triggering and Frequency-Domain Uncertainties
Authors: Shiqi Zhang, Zhongkui Li
Subjects: Systems and Control (eess.SY); Optimization and Control (math.OC)
Abstract
This paper considers the robustness of event-triggered control of general linear systems against additive or multiplicative frequency-domain uncertainties. It is revealed that in static or dynamic event-triggering mechanisms, the sampling errors are images of affine operators acting on the sampled outputs. Though not belonging to $\mathcal{RH}_\infty$, these operators are finite-gain $\mathcal{L}_2$ stable, with operator norms depending on the triggering conditions and the norm bound of the uncertainties. This characterization is further extended to the general integral quadratic constraint (IQC)-based triggering mechanism. As long as the triggering condition characterizes an $\mathcal{L}_2$-to-$\mathcal{L}_2$ mapping relationship (in other words, small-gain-type constraints) between the sampled outputs and the sampling errors, the robust event-triggered controller design problem can be transformed into the standard $H_\infty$ synthesis problem for a linear system of the same order as the controlled plant. Algorithms are provided to construct the robust controllers for the static, dynamic, and IQC-based event-triggering cases.
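For concreteness, the classical static triggering rule already has the small-gain form the abstract alludes to; the bound below is illustrative, not necessarily the paper's exact condition.

```latex
% Static event triggering of small-gain type: between events, the sampling
% error e(t) = y(t_k) - y(t) stays below a fraction of the sampled output,
% inducing a finite-gain L2 bound from sampled outputs to sampling errors.
\|e(t)\| \le \sigma \|y(t_k)\|, \quad t \in [t_k, t_{k+1})
\quad \Longrightarrow \quad
\|e\|_{\mathcal{L}_2} \le \sigma \,\|\hat{y}\|_{\mathcal{L}_2},
\qquad \hat{y}(t) := y(t_k) \ \text{on} \ [t_k, t_{k+1}).
```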
AutoMap: Automatic Medical Code Mapping for Clinical Prediction Model Deployment
Authors: Zhenbang Wu, Cao Xiao, Lucas M Glass, David M Liebovitz, Jimeng Sun
Abstract
Given a deep learning model trained on data from a source site, how can the model be deployed to a target hospital automatically? How can heterogeneous medical coding systems across different hospitals be accommodated? Standard approaches rely on existing medical code mapping tools, which have significant practical limitations. To tackle this problem, we propose AutoMap to automatically map medical codes across different EHR systems in a coarse-to-fine manner: (1) Ontology-level Alignment: we leverage the ontology structure to learn a coarse alignment between the source and target medical coding systems; (2) Code-level Refinement: we refine the alignment at a fine-grained code level for the downstream tasks using a teacher-student framework. We evaluate AutoMap using several deep learning models with two real-world EHR datasets: eICU and MIMIC-III. Results show that AutoMap achieves relative improvements of up to 3.9% (AUC-ROC) and 8.7% (AUC-PR) for mortality prediction, and up to 4.7% (AUC-ROC) and 3.7% (F1) for length-of-stay estimation. Further, we show that AutoMap can provide accurate mapping across coding systems. Lastly, we demonstrate that AutoMap can adapt to two challenging scenarios: (1) mapping between completely different coding systems and (2) mapping between completely different hospitals.
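A hedged sketch of the coarse, ontology-level step: if source and target code (or code-group) embeddings are available, an alignment can be read off cosine similarity. The embedding source and the teacher-student refinement stage are omitted here.

```python
import numpy as np

def coarse_align(src_emb, tgt_emb):
    """Match each source code group to its most similar target group by
    cosine similarity over (N, d) embedding matrices (a simplification)."""
    s = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    t = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    return np.argmax(s @ t.T, axis=1)   # index of the best target per source
```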
Real-Time Hybrid Mapping of Populated Indoor Scenes using a Low-Cost Monocular UAV
Authors: Stuart Golodetz, Madhu Vankadari, Aluna Everitt, Sangyun Shin, Andrew Markham, Niki Trigoni
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Abstract
Unmanned aerial vehicles (UAVs) have been used for many applications in recent years, from urban search and rescue, to agricultural surveying, to autonomous underground mine exploration. However, deploying UAVs in tight, indoor spaces, especially close to humans, remains a challenge. One solution, when only a limited payload is required, is to use micro-UAVs, which pose less risk to humans and typically cost less to replace after a crash. However, micro-UAVs can only carry a limited sensor suite, e.g. a monocular camera instead of a stereo pair or LiDAR, complicating tasks like dense mapping and markerless multi-person 3D human pose estimation, which are needed to operate in tight environments around people. Monocular approaches to such tasks exist, and dense monocular mapping approaches have been successfully deployed for UAV applications. However, despite many recent works on both marker-based and markerless multi-UAV single-person motion capture, markerless single-camera multi-person 3D human pose estimation remains a much earlier-stage technology, and we are not aware of existing attempts to deploy it in an aerial context. In this paper, we present what is thus, to our knowledge, the first system to perform simultaneous mapping and multi-person 3D human pose estimation from a monocular camera mounted on a single UAV. In particular, we show how to loosely couple state-of-the-art monocular depth estimation and monocular 3D human pose estimation approaches to reconstruct a hybrid map of a populated indoor scene in real time. We validate our component-level design choices via extensive experiments on the large-scale ScanNet and GTA-IM datasets. To evaluate our system-level performance, we also construct a new Oxford Hybrid Mapping dataset of populated indoor scenes.
Keyword: localization
Truncation Error Analysis for an Accurate Nonlocal Manifold Poisson Model with Dirichlet Boundary
Abstract
In this work, we introduce a class of nonlocal models that accurately approximate the Poisson model on manifolds embedded in high-dimensional Euclidean spaces with Dirichlet boundary. In contrast to existing nonlocal Poisson models, which utilize a volumetric boundary constraint to reduce the truncation error to its local counterpart, we rely on the Poisson equation itself along the boundary to explicitly express the second-order normal derivative through geometry-based terms, so as to create a new model with $\mathcal{O}(\delta)$ truncation error along the $2\delta$-boundary layer and $\mathcal{O}(\delta^2)$ in the interior, where $\delta$ is the nonlocal interaction horizon. Our focus is on the construction of this nonlocal model and the analysis of its truncation error. The control on the truncation error is currently optimal among nonlocal models and is sufficient to attain the second-order localization rate that will be derived in our subsequent work.
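With assumed notation ($\mathcal{L}_\delta$ for the nonlocal operator and $\Delta_{\mathcal{M}}$ for the manifold Laplacian), the truncation-error structure claimed above reads:

```latex
\left| \mathcal{L}_{\delta} u(x) - \Delta_{\mathcal{M}} u(x) \right| =
\begin{cases}
\mathcal{O}(\delta), & \operatorname{dist}(x, \partial\mathcal{M}) \le 2\delta, \\
\mathcal{O}(\delta^{2}), & \text{otherwise}.
\end{cases}
```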
Uncertainty Estimation for Heatmap-based Landmark Localization
Authors: Lawrence Schobs, Andrew J. Swift, Haiping Lu
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Abstract
Automatic anatomical landmark localization has made great strides by leveraging deep learning methods in recent years. The ability to quantify the uncertainty of these predictions is a vital ingredient needed to see these methods adopted in clinical use, where it is imperative that erroneous predictions are caught and corrected. We propose Quantile Binning, a data-driven method to categorise predictions by uncertainty with estimated error bounds. This framework can be applied to any continuous uncertainty measure, allowing straightforward identification of the best subset of predictions with accompanying estimated error bounds. We facilitate easy comparison between uncertainty measures by constructing two evaluation metrics derived from Quantile Binning. We demonstrate this framework by comparing and contrasting three uncertainty measures (a baseline, the current gold standard, and a proposed method combining aspects of the two), across two datasets (one easy, one hard) and two heatmap-based landmark localization model paradigms (U-Net and patch-based). We conclude by illustrating how filtering out gross mispredictions caught in our Quantile Bins significantly improves the proportion of predictions under an acceptable error threshold, and offer recommendations on which uncertainty measure to use and how to use it.
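The core binning step of Quantile Binning is straightforward and sketched below; estimating per-bin error bounds from a validation set is omitted, and `n_bins` is a free choice.

```python
import numpy as np

def quantile_bins(uncertainty, n_bins=5):
    """Partition predictions into equal-mass bins by their continuous
    uncertainty score; bin 0 holds the most confident predictions."""
    edges = np.quantile(uncertainty, np.linspace(0, 1, n_bins + 1))
    bins = np.digitize(uncertainty, edges[1:-1])   # use interior edges only
    return np.clip(bins, 0, n_bins - 1)            # guard boundary values
```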
Keyword: SLAM
There is no result
Keyword: Visual inertial
There is no result
Keyword: livox
There is no result
Keyword: loam
There is no result
Keyword: Visual inertial odometry
There is no result