Abstract
Implicit representations are widely used for object reconstruction due to their efficiency and flexibility. In 2021, a novel structure named neural implicit map has been invented for incremental reconstruction. A neural implicit map alleviates the problem of inefficient memory cost of previous online 3D dense reconstruction while producing better quality. % However, the neural implicit map suffers the limitation that it does not support remapping as the frames of scans are encoded into a deep prior after generating the neural implicit map. This means, that neither this generation process is invertible, nor a deep prior is transformable. The non-remappable property makes it not possible to apply loop-closure techniques. % We present a neural implicit map based transformation algorithm to fill this gap. As our neural implicit map is transformable, our model supports remapping for this special map of latent features. % Experiments show that our remapping module is capable to well-transform neural implicit maps to new poses. Embedded into a SLAM framework, our mapping model is able to tackle the remapping of loop closures and demonstrates high-quality surface reconstruction. % Our implementation is available at github\footnote{\url{https://github.com/Jarrome/IMT_Mapping}} for the research community.
Efficient WiFi LiDAR SLAM for Autonomous Robots in Large Environments
Authors: Khairuldanial Ismail, Ran Liu, Zhenghong Qin, Achala Athukorala, Billy Pik Lik Lau, Muhammad Shalihan, Chau Yuen, U-Xuan Tan
Abstract
Autonomous robots operating in indoor and GPS denied environments can use LiDAR for SLAM instead. However, LiDARs do not perform well in geometrically-degraded environments, due to the challenge of loop closure detection and computational load to perform scan matching. Existing WiFi infrastructure can be exploited for localization and mapping with low hardware and computational cost. Yet, accurate pose estimation using WiFi is challenging as different signal values can be measured at the same location due to the unpredictability of signal propagation. Therefore, we introduce the use of WiFi fingerprint sequence for pose estimation (i.e. loop closure) in SLAM. This approach exploits the spatial coherence of location fingerprints obtained while a mobile robot is moving. This has better capability of correcting odometry drift. The method also incorporates LiDAR scans and thus, improving computational efficiency for large and geometrically-degraded environments while maintaining the accuracy of LiDAR SLAM. We conducted experiments in an indoor environment to illustrate the effectiveness of the method. The results are evaluated based on Root Mean Square Error (RMSE) and it has achieved an accuracy of 0.88m for the test environment.
Keyword: odometry
Effective Solid State LiDAR Odometry Using Continuous-time Filter Registration
Authors: Xin Zheng, Jianke Zhu
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Abstract
Solid-state LiDARs are more compact and cheaper than the conventional mechanical multi-line spinning LiDARs, which have become increasingly popular in autonomous driving recently. However, there are several challenges for these new LiDAR sensors, including severe motion distortions, small field of view and sparse point cloud, which hinder them from being widely used in LiDAR odometry. To tackle these problems, we present an effective continuous-time LiDAR odometry (ECTLO) method for the Risley prism-based LiDARs with non-repetitive scanning patterns. To account for the noisy data, a filter-based point-to-plane Gaussian Mixture Model is used for robust registration. Moreover, a LiDAR-only continuous-time motion model is employed to relieve the inevitable distortions. To facilitate the implicit data association in parallel, we maintain all map points within a single range image. Extensive experiments have been conducted on various testbeds using the solid-state LiDARs with different scanning patterns, whose promising results demonstrate the efficacy of our proposed approach.
Efficient WiFi LiDAR SLAM for Autonomous Robots in Large Environments
Authors: Khairuldanial Ismail, Ran Liu, Zhenghong Qin, Achala Athukorala, Billy Pik Lik Lau, Muhammad Shalihan, Chau Yuen, U-Xuan Tan
Abstract
Autonomous robots operating in indoor and GPS denied environments can use LiDAR for SLAM instead. However, LiDARs do not perform well in geometrically-degraded environments, due to the challenge of loop closure detection and computational load to perform scan matching. Existing WiFi infrastructure can be exploited for localization and mapping with low hardware and computational cost. Yet, accurate pose estimation using WiFi is challenging as different signal values can be measured at the same location due to the unpredictability of signal propagation. Therefore, we introduce the use of WiFi fingerprint sequence for pose estimation (i.e. loop closure) in SLAM. This approach exploits the spatial coherence of location fingerprints obtained while a mobile robot is moving. This has better capability of correcting odometry drift. The method also incorporates LiDAR scans and thus, improving computational efficiency for large and geometrically-degraded environments while maintaining the accuracy of LiDAR SLAM. We conducted experiments in an indoor environment to illustrate the effectiveness of the method. The results are evaluated based on Root Mean Square Error (RMSE) and it has achieved an accuracy of 0.88m for the test environment.
Keyword: livox
There is no result
Keyword: loam
There is no result
Keyword: lidar
Effective Solid State LiDAR Odometry Using Continuous-time Filter Registration
Authors: Xin Zheng, Jianke Zhu
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Abstract
Solid-state LiDARs are more compact and cheaper than the conventional mechanical multi-line spinning LiDARs, which have become increasingly popular in autonomous driving recently. However, there are several challenges for these new LiDAR sensors, including severe motion distortions, small field of view and sparse point cloud, which hinder them from being widely used in LiDAR odometry. To tackle these problems, we present an effective continuous-time LiDAR odometry (ECTLO) method for the Risley prism-based LiDARs with non-repetitive scanning patterns. To account for the noisy data, a filter-based point-to-plane Gaussian Mixture Model is used for robust registration. Moreover, a LiDAR-only continuous-time motion model is employed to relieve the inevitable distortions. To facilitate the implicit data association in parallel, we maintain all map points within a single range image. Extensive experiments have been conducted on various testbeds using the solid-state LiDARs with different scanning patterns, whose promising results demonstrate the efficacy of our proposed approach.
Efficient WiFi LiDAR SLAM for Autonomous Robots in Large Environments
Authors: Khairuldanial Ismail, Ran Liu, Zhenghong Qin, Achala Athukorala, Billy Pik Lik Lau, Muhammad Shalihan, Chau Yuen, U-Xuan Tan
Abstract
Autonomous robots operating in indoor and GPS denied environments can use LiDAR for SLAM instead. However, LiDARs do not perform well in geometrically-degraded environments, due to the challenge of loop closure detection and computational load to perform scan matching. Existing WiFi infrastructure can be exploited for localization and mapping with low hardware and computational cost. Yet, accurate pose estimation using WiFi is challenging as different signal values can be measured at the same location due to the unpredictability of signal propagation. Therefore, we introduce the use of WiFi fingerprint sequence for pose estimation (i.e. loop closure) in SLAM. This approach exploits the spatial coherence of location fingerprints obtained while a mobile robot is moving. This has better capability of correcting odometry drift. The method also incorporates LiDAR scans and thus, improving computational efficiency for large and geometrically-degraded environments while maintaining the accuracy of LiDAR SLAM. We conducted experiments in an indoor environment to illustrate the effectiveness of the method. The results are evaluated based on Root Mean Square Error (RMSE) and it has achieved an accuracy of 0.88m for the test environment.
Keyword: loop detection
There is no result
Keyword: nerf
There is no result
Keyword: mapping
On the Influence of Enforcing Model Identifiability on Learning dynamics of Gaussian Mixture Models
Abstract
A common way to learn and analyze statistical models is to consider operations in the model parameter space. But what happens if we optimize in the parameter space and there is no one-to-one mapping between the parameter space and the underlying statistical model space? Such cases frequently occur for hierarchical models which include statistical mixtures or stochastic neural networks, and these models are said to be singular. Singular models reveal several important and well-studied problems in machine learning like the decrease in convergence speed of learning trajectories due to attractor behaviors. In this work, we propose a relative reparameterization technique of the parameter space, which yields a general method for extracting regular submodels from singular models. Our method enforces model identifiability during training and we study the learning dynamics for gradient descent and expectation maximization for Gaussian Mixture Models (GMMs) under relative parameterization, showing faster experimental convergence and a improved manifold shape of the dynamics around the singularity. Extending the analysis beyond GMMs, we furthermore analyze the Fisher information matrix under relative reparameterization and its influence on the generalization error, and show how the method can be applied to more complex models like deep neural networks.
The Role of Depth, Width, and Activation Complexity in the Number of Linear Regions of Neural Networks
Authors: Alexis Goujon, Arian Etemadi, Michael Unser
Abstract
Many feedforward neural networks generate continuous and piecewise-linear (CPWL) mappings. Specifically, they partition the input domain into regions on which the mapping is an affine function. The number of these so-called linear regions offers a natural metric to characterize the expressiveness of CPWL mappings. Although the precise determination of this quantity is often out of reach, bounds have been proposed for specific architectures, including the well-known ReLU and Maxout networks. In this work, we propose a more general perspective and provide precise bounds on the maximal number of linear regions of CPWL networks based on three sources of expressiveness: depth, width, and activation complexity. Our estimates rely on the combinatorial structure of convex partitions and highlight the distinctive role of depth which, on its own, is able to exponentially increase the number of regions. We then introduce a complementary stochastic framework to estimate the average number of linear regions produced by a CPWL network architecture. Under reasonable assumptions, the expected density of linear regions along any 1D path is bounded by the product of depth, width, and a measure of activation complexity (up to a scaling factor). This yields an identical role to the three sources of expressiveness: no exponential growth with depth is observed anymore.
Learning Using Privileged Information for Zero-Shot Action Recognition
Abstract
Zero-Shot Action Recognition (ZSAR) aims to recognize video actions that have never been seen during training. Most existing methods assume a shared semantic space between seen and unseen actions and intend to directly learn a mapping from a visual space to the semantic space. This approach has been challenged by the semantic gap between the visual space and semantic space. This paper presents a novel method that uses object semantics as privileged information to narrow the semantic gap and, hence, effectively, assist the learning. In particular, a simple hallucination network is proposed to implicitly extract object semantics during testing without explicitly extracting objects and a cross-attention module is developed to augment visual feature with the object semantics. Experiments on the Olympic Sports, HMDB51 and UCF101 datasets have shown that the proposed method outperforms the state-of-the-art methods by a large margin.
Statistical and Neural Methods for Cross-lingual Entity Label Mapping in Knowledge Graphs
Authors: Gabriel Amaral, Mārcis Pinnis, Inguna Skadiņa, Odinaldo Rodrigues, Elena Simperl
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract
Knowledge bases such as Wikidata amass vast amounts of named entity information, such as multilingual labels, which can be extremely useful for various multilingual and cross-lingual applications. However, such labels are not guaranteed to match across languages from an information consistency standpoint, greatly compromising their usefulness for fields such as machine translation. In this work, we investigate the application of word and sentence alignment techniques coupled with a matching algorithm to align cross-lingual entity labels extracted from Wikidata in 10 languages. Our results indicate that mapping between Wikidata's main labels stands to be considerably improved (up to $20$ points in F1-score) by any of the employed methods. We show how methods relying on sentence embeddings outperform all others, even across different scripts. We believe the application of such techniques to measure the similarity of label pairs, coupled with a knowledge base rich in high-quality entity labels, to be an excellent asset to machine translation.
An Algorithm for the SE(3)-Transformation on Neural Implicit Maps for Remapping Functions
Authors: Yijun Yuan, Andreas Nuechter
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Abstract
Implicit representations are widely used for object reconstruction due to their efficiency and flexibility. In 2021, a novel structure named neural implicit map has been invented for incremental reconstruction. A neural implicit map alleviates the problem of inefficient memory cost of previous online 3D dense reconstruction while producing better quality. % However, the neural implicit map suffers the limitation that it does not support remapping as the frames of scans are encoded into a deep prior after generating the neural implicit map. This means, that neither this generation process is invertible, nor a deep prior is transformable. The non-remappable property makes it not possible to apply loop-closure techniques. % We present a neural implicit map based transformation algorithm to fill this gap. As our neural implicit map is transformable, our model supports remapping for this special map of latent features. % Experiments show that our remapping module is capable to well-transform neural implicit maps to new poses. Embedded into a SLAM framework, our mapping model is able to tackle the remapping of loop closures and demonstrates high-quality surface reconstruction. % Our implementation is available at github\footnote{\url{https://github.com/Jarrome/IMT_Mapping}} for the research community.
Efficient WiFi LiDAR SLAM for Autonomous Robots in Large Environments
Authors: Khairuldanial Ismail, Ran Liu, Zhenghong Qin, Achala Athukorala, Billy Pik Lik Lau, Muhammad Shalihan, Chau Yuen, U-Xuan Tan
Abstract
Autonomous robots operating in indoor and GPS denied environments can use LiDAR for SLAM instead. However, LiDARs do not perform well in geometrically-degraded environments, due to the challenge of loop closure detection and computational load to perform scan matching. Existing WiFi infrastructure can be exploited for localization and mapping with low hardware and computational cost. Yet, accurate pose estimation using WiFi is challenging as different signal values can be measured at the same location due to the unpredictability of signal propagation. Therefore, we introduce the use of WiFi fingerprint sequence for pose estimation (i.e. loop closure) in SLAM. This approach exploits the spatial coherence of location fingerprints obtained while a mobile robot is moving. This has better capability of correcting odometry drift. The method also incorporates LiDAR scans and thus, improving computational efficiency for large and geometrically-degraded environments while maintaining the accuracy of LiDAR SLAM. We conducted experiments in an indoor environment to illustrate the effectiveness of the method. The results are evaluated based on Root Mean Square Error (RMSE) and it has achieved an accuracy of 0.88m for the test environment.
Keyword: localization
Scalable Temporal Localization of Sensitive Activities in Movies and TV Episodes
Abstract
To help customers make better-informed viewing choices, video-streaming services try to moderate their content and provide more visibility into which portions of their movies and TV episodes contain age-appropriate material (e.g., nudity, sex, violence, or drug-use). Supervised models to localize these sensitive activities require large amounts of clip-level labeled data which is hard to obtain, while weakly-supervised models to this end usually do not offer competitive accuracy. To address this challenge, we propose a novel Coarse2Fine network designed to make use of readily obtainable video-level weak labels in conjunction with sparse clip-level labels of age-appropriate activities. Our model aggregates frame-level predictions to make video-level classifications and is therefore able to leverage sparse clip-level labels along with video-level labels. Furthermore, by performing frame-level predictions in a hierarchical manner, our approach is able to overcome the label-imbalance problem caused due to the rare-occurrence nature of age-appropriate content. We present comparative results of our approach using 41,234 movies and TV episodes (~3 years of video-content) from 521 sub-genres and 250 countries making it by far the largest-scale empirical analysis of age-appropriate activity localization in long-form videos ever published. Our approach offers 107.2% relative mAP improvement (from 5.5% to 11.4%) over existing state-of-the-art activity-localization approaches.
CDNet: Contrastive Disentangled Network for Fine-Grained Image Categorization of Ocular B-Scan Ultrasound
Authors: Ruilong Dan, Yunxiang Li, Yijie Wang, Gangyong Jia, Ruiquan Ge, Juan Ye, Qun Jin, Yaqi Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Precise and rapid categorization of images in the B-scan ultrasound modality is vital for diagnosing ocular diseases. Nevertheless, distinguishing various diseases in ultrasound still challenges experienced ophthalmologists. Thus a novel contrastive disentangled network (CDNet) is developed in this work, aiming to tackle the fine-grained image categorization (FGIC) challenges of ocular abnormalities in ultrasound images, including intraocular tumor (IOT), retinal detachment (RD), posterior scleral staphyloma (PSS), and vitreous hemorrhage (VH). Three essential components of CDNet are the weakly-supervised lesion localization module (WSLL), contrastive multi-zoom (CMZ) strategy, and hyperspherical contrastive disentangled loss (HCD-Loss), respectively. These components facilitate feature disentanglement for fine-grained recognition in both the input and output aspects. The proposed CDNet is validated on our ZJU Ocular Ultrasound Dataset (ZJUOUSD), consisting of 5213 samples. Furthermore, the generalization ability of CDNet is validated on two public and widely-used chest X-ray FGIC benchmarks. Quantitative and qualitative results demonstrate the efficacy of our proposed CDNet, which achieves state-of-the-art performance in the FGIC task. Code is available at: https://github.com/ZeroOneGame/CDNet-for-OUS-FGIC .
Efficient WiFi LiDAR SLAM for Autonomous Robots in Large Environments
Authors: Khairuldanial Ismail, Ran Liu, Zhenghong Qin, Achala Athukorala, Billy Pik Lik Lau, Muhammad Shalihan, Chau Yuen, U-Xuan Tan
Abstract
Autonomous robots operating in indoor and GPS denied environments can use LiDAR for SLAM instead. However, LiDARs do not perform well in geometrically-degraded environments, due to the challenge of loop closure detection and computational load to perform scan matching. Existing WiFi infrastructure can be exploited for localization and mapping with low hardware and computational cost. Yet, accurate pose estimation using WiFi is challenging as different signal values can be measured at the same location due to the unpredictability of signal propagation. Therefore, we introduce the use of WiFi fingerprint sequence for pose estimation (i.e. loop closure) in SLAM. This approach exploits the spatial coherence of location fingerprints obtained while a mobile robot is moving. This has better capability of correcting odometry drift. The method also incorporates LiDAR scans and thus, improving computational efficiency for large and geometrically-degraded environments while maintaining the accuracy of LiDAR SLAM. We conducted experiments in an indoor environment to illustrate the effectiveness of the method. The results are evaluated based on Root Mean Square Error (RMSE) and it has achieved an accuracy of 0.88m for the test environment.
DGMIL: Distribution Guided Multiple Instance Learning for Whole Slide Image Classification
Abstract
Multiple Instance Learning (MIL) is widely used in analyzing histopathological Whole Slide Images (WSIs). However, existing MIL methods do not explicitly model the data distribution, and instead they only learn a bag-level or instance-level decision boundary discriminatively by training a classifier. In this paper, we propose DGMIL: a feature distribution guided deep MIL framework for WSI classification and positive patch localization. Instead of designing complex discriminative network architectures, we reveal that the inherent feature distribution of histopathological image data can serve as a very effective guide for instance classification. We propose a cluster-conditioned feature distribution modeling method and a pseudo label-based iterative feature space refinement strategy so that in the final feature space the positive and negative instances can be easily separated. Experiments on the CAMELYON16 dataset and the TCGA Lung Cancer dataset show that our method achieves new SOTA for both global classification and positive patch localization tasks.
SwarmHawk: Self-Sustaining Multi-Agent System for Landing on a Moving Platform through an Agent Supervision
Authors: Ayush Gupta, Ekaterina Dorzhieva, Ahmed Baza, Mert Alper, Aleksey Fedoseev, Dzmitry Tsetserukou
Abstract
Heterogeneous teams of mobile robots and UAVs are offering a substantial benefit in an autonomous exploration of the environment. Nevertheless, although joint exploration scenarios for such systems are widely discussed, they are still suffering from low adaptability to changes in external conditions and faults of swarm agents during the UAV docking. We propose a novel vision-based drone swarm docking system for robust landing on a moving platform when one of the agents lost its position signal. The proposed SwarmHawk system relies on vision-based detection for the mobile platform tracking and navigation of its agents. Each drone of the swarm carries an RGB camera and AprilTag3 QR-code marker on board. SwarmHawk can switch between two modes of operation, acting as a homogeneous swarm in case of global UAV localization or assigning leader drones to navigate its neighbors in case of a camera fault in one of the drones or global localization failure. Two experiments were performed to evaluate SwarmHawk's performance under the global and local localization with static and moving platforms. The experimental results revealed a sufficient accuracy in the swarm landing task on a static mobile platform (error of 4.2 cm in homogeneous formation and 1.9 cm in leader-follower formation) and on moving platform (error of 6.9 cm in homogeneous formation and 4.7 cm in leader-follower formation). Moreover, the drones showed a good landing on a platform moving along a complex trajectory (average error of 19.4 cm) in leader-follower formation. The proposed SwarmHawk technology can be potentially applied in various swarm scenarios, including complex environment exploration, inspection, and drone delivery.
Keyword: transformer
IRISformer: Dense Vision Transformers for Single-Image Inverse Rendering in Indoor Scenes
Abstract
Indoor scenes exhibit significant appearance variations due to myriad interactions between arbitrarily diverse object shapes, spatially-changing materials, and complex lighting. Shadows, highlights, and inter-reflections caused by visible and invisible light sources require reasoning about long-range interactions for inverse rendering, which seeks to recover the components of image formation, namely, shape, material, and lighting. In this work, our intuition is that the long-range attention learned by transformer architectures is ideally suited to solve longstanding challenges in single-image inverse rendering. We demonstrate with a specific instantiation of a dense vision transformer, IRISformer, that excels at both single-task and multi-task reasoning required for inverse rendering. Specifically, we propose a transformer architecture to simultaneously estimate depths, normals, spatially-varying albedo, roughness and lighting from a single image of an indoor scene. Our extensive evaluations on benchmark datasets demonstrate state-of-the-art results on each of the above tasks, enabling applications like object insertion and material editing in a single unconstrained real image, with greater photorealism than prior works. Code and data are publicly released at https://github.com/ViLab-UCSD/IRISformer.
GAAMA 2.0: An Integrated System that Answers Boolean and Extractive Question
Authors: Scott McCarley, Mihaela Bornea, Sara Rosenthal, Anthony Ferritto, Md Arafat Sultan, Avirup Sil, Radu Florian
Abstract
Recent machine reading comprehension datasets include extractive and boolean questions but current approaches do not offer integrated support for answering both question types. We present a multilingual machine reading comprehension system and front-end demo that handles boolean questions by providing both a YES/NO answer and highlighting supporting evidence, and handles extractive questions by highlighting the answer in the passage. Our system, GAAMA 2.0, is ranked first on the Tydi QA leaderboard at the time of this writing. We contrast two different implementations of our approach. The first includes several independent stacks of transformers allowing easy deployment of each component. The second is a single stack of transformers utilizing adapters to reduce GPU memory footprint in a resource-constrained environment.
Abstract
Vision Transformers (ViT) have recently demonstrated exemplary performance on a variety of vision tasks and are being used as an alternative to CNNs. Their design is based on a self-attention mechanism that processes images as a sequence of patches, which is quite different compared to CNNs. Hence it is interesting to study if ViTs are vulnerable to backdoor attacks. Backdoor attacks happen when an attacker poisons a small part of the training data for malicious purposes. The model performance is good on clean test images, but the attacker can manipulate the decision of the model by showing the trigger at test time. To the best of our knowledge, we are the first to show that ViTs are vulnerable to backdoor attacks. We also find an intriguing difference between ViTs and CNNs - interpretation algorithms effectively highlight the trigger on test images for ViTs but not for CNNs. Based on this observation, we propose a test-time image blocking defense for ViTs which reduces the attack success rate by a large margin. Code is available here: https://github.com/UCDvision/backdoor_transformer.git
Rectify ViT Shortcut Learning by Visual Saliency
Authors: Chong Ma, Lin Zhao, Yuzhong Chen, David Weizhong Liu, Xi Jiang, Tuo Zhang, Xintao Hu, Dinggang Shen, Dajiang Zhu, Tianming Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Shortcut learning is common but harmful to deep learning models, leading to degenerated feature representations and consequently jeopardizing the model's generalizability and interpretability. However, shortcut learning in the widely used Vision Transformer framework is largely unknown. Meanwhile, introducing domain-specific knowledge is a major approach to rectifying the shortcuts, which are predominated by background related factors. For example, in the medical imaging field, eye-gaze data from radiologists is an effective human visual prior knowledge that has the great potential to guide the deep learning models to focus on meaningful foreground regions of interest. However, obtaining eye-gaze data is time-consuming, labor-intensive and sometimes even not practical. In this work, we propose a novel and effective saliency-guided vision transformer (SGT) model to rectify shortcut learning in ViT with the absence of eye-gaze data. Specifically, a computational visual saliency model is adopted to predict saliency maps for input image samples. Then, the saliency maps are used to distil the most informative image patches. In the proposed SGT, the self-attention among image patches focus only on the distilled informative ones. Considering this distill operation may lead to global information lost, we further introduce, in the last encoder layer, a residual connection that captures the self-attention across all the image patches. The experiment results on four independent public datasets show that our SGT framework can effectively learn and leverage human prior knowledge without eye gaze data and achieves much better performance than baselines. Meanwhile, it successfully rectifies the harmful shortcut learning and significantly improves the interpretability of the ViT model, demonstrating the promise of transferring human prior knowledge derived visual saliency in rectifying shortcut learning
Multi-Contextual Predictions with Vision Transformer for Video Anomaly Detection
Authors: Joo-Yeon Lee, Woo-Jeoung Nam, Seong-Whan Lee
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Video Anomaly Detection(VAD) has been traditionally tackled in two main methodologies: the reconstruction-based approach and the prediction-based one. As the reconstruction-based methods learn to generalize the input image, the model merely learns an identity function and strongly causes the problem called generalizing issue. On the other hand, since the prediction-based ones learn to predict a future frame given several previous frames, they are less sensitive to the generalizing issue. However, it is still uncertain if the model can learn the spatio-temporal context of a video. Our intuition is that the understanding of the spatio-temporal context of a video plays a vital role in VAD as it provides precise information on how the appearance of an event in a video clip changes. Hence, to fully exploit the context information for anomaly detection in video circumstances, we designed the transformer model with three different contextual prediction streams: masked, whole and partial. By learning to predict the missing frames of consecutive normal frames, our model can effectively learn various normality patterns in the video, which leads to a high reconstruction error at the abnormal cases that are unsuitable to the learned context. To verify the effectiveness of our approach, we assess our model on the public benchmark datasets: USCD Pedestrian 2, CUHK Avenue and ShanghaiTech and evaluate the performance with the anomaly score metric of reconstruction error. The results demonstrate that our proposed approach achieves a competitive performance compared to the existing video anomaly detection methods.
Bootstrapped Transformer for Offline Reinforcement Learning
Authors: Kerong Wang, Hanye Zhao, Xufang Luo, Kan Ren, Weinan Zhang, Dongsheng Li
Abstract
Offline reinforcement learning (RL) aims at learning policies from previously collected static trajectory data without interacting with the real environment. Recent works provide a novel perspective by viewing offline RL as a generic sequence generation problem, adopting sequence models such as Transformer architecture to model distributions over trajectories, and repurposing beam search as a planning algorithm. However, the training datasets utilized in general offline RL tasks are quite limited and often suffer from insufficient distribution coverage, which could be harmful to training sequence generation models yet has not drawn enough attention in the previous works. In this paper, we propose a novel algorithm named Bootstrapped Transformer, which incorporates the idea of bootstrapping and leverages the learned model to self-generate more offline data to further boost the sequence model training. We conduct extensive experiments on two offline RL benchmarks and demonstrate that our model can largely remedy the existing offline RL training limitations and beat other strong baseline methods. We also analyze the generated pseudo data and the revealed characteristics may shed some light on offline RL training. The codes are available at https://seqml.github.io/bootorl.
Using Transfer Learning for Code-Related Tasks
Authors: Antonio Mastropaolo, Nathan Cooper, David Nader Palacio, Simone Scalabrino, Denys Poshyvanyk, Rocco Oliveto, Gabriele Bavota
Abstract
Deep learning (DL) techniques have been used to support several code-related tasks such as code summarization and bug-fixing. In particular, pre-trained transformer models are on the rise, also thanks to the excellent results they achieved in Natural Language Processing (NLP) tasks. The basic idea behind these models is to first pre-train them on a generic dataset using a self-supervised task (e.g, filling masked words in sentences). Then, these models are fine-tuned to support specific tasks of interest (e.g, language translation). A single model can be fine-tuned to support multiple tasks, possibly exploiting the benefits of transfer learning. This means that knowledge acquired to solve a specific task (e.g, language translation) can be useful to boost performance on another task (e.g, sentiment classification). While the benefits of transfer learning have been widely studied in NLP, limited empirical evidence is available when it comes to code-related tasks. In this paper, we assess the performance of the Text-To-Text Transfer Transformer (T5) model in supporting four different code-related tasks: (i) automatic bug-fixing, (ii) injection of code mutants, (iii) generation of assert statements, and (iv) code summarization. We pay particular attention in studying the role played by pre-training and multi-task fine-tuning on the model's performance. We show that (i) the T5 can achieve better performance as compared to state-of-the-art baselines; and (ii) while pre-training helps the model, not all tasks benefit from a multi-task fine-tuning.
Local Slot Attention for Vision-and-Language Navigation
Abstract
Vision-and-language navigation (VLN), a frontier study aiming to pave the way for general-purpose robots, has been a hot topic in the computer vision and natural language processing community. The VLN task requires an agent to navigate to a goal location following natural language instructions in unfamiliar environments. Recently, transformer-based models have gained significant improvements on the VLN task. Since the attention mechanism in the transformer architecture can better integrate inter- and intra-modal information of vision and language. However, there exist two problems in current transformer-based models. 1) The models process each view independently without taking the integrity of the objects into account. 2) During the self-attention operation in the visual modality, the views that are spatially distant can be inter-weaved with each other without explicit restriction. This kind of mixing may introduce extra noise instead of useful information. To address these issues, we propose 1) A slot-attention based module to incorporate information from segmentation of the same object. 2) A local attention mask mechanism to limit the visual attention span. The proposed modules can be easily plugged into any VLN architecture and we use the Recurrent VLN-Bert as our base model. Experiments on the R2R dataset show that our model has achieved the state-of-the-art results.
Holistic Transformer: A Joint Neural Network for Trajectory Prediction and Decision-Making of Autonomous Vehicles
Abstract
Trajectory prediction and behavioral decision-making are two important tasks for autonomous vehicles that require good understanding of the environmental context; behavioral decisions are better made by referring to the outputs of trajectory predictions. However, most current solutions perform these two tasks separately. Therefore, a joint neural network that combines multiple cues is proposed and named as the holistic transformer to predict trajectories and make behavioral decisions simultaneously. To better explore the intrinsic relationships between cues, the network uses existing knowledge and adopts three kinds of attention mechanisms: the sparse multi-head type for reducing noise impact, feature selection sparse type for optimally using partial prior knowledge, and multi-head with sigmoid activation type for optimally using posteriori knowledge. Compared with other trajectory prediction models, the proposed model has better comprehensive performance and good interpretability. Perceptual noise robustness experiments demonstrate that the proposed model has good noise robustness. Thus, simultaneous trajectory prediction and behavioral decision-making combining multiple cues can reduce computational costs and enhance semantic relationships between scenes and agents.
Entity-Graph Enhanced Cross-Modal Pretraining for Instance-level Product Retrieval
Abstract
Our goal in this research is to study a more realistic environment in which we can conduct weakly-supervised multi-modal instance-level product retrieval for fine-grained product categories. We first contribute the Product1M datasets, and define two real practical instance-level retrieval tasks to enable the evaluations on the price comparison and personalized recommendations. For both instance-level tasks, how to accurately pinpoint the product target mentioned in the visual-linguistic data and effectively decrease the influence of irrelevant contents is quite challenging. To address this, we exploit to train a more effective cross-modal pertaining model which is adaptively capable of incorporating key concept information from the multi-modal data, by using an entity graph whose node and edge respectively denote the entity and the similarity relation between entities. Specifically, a novel Entity-Graph Enhanced Cross-Modal Pretraining (EGE-CMP) model is proposed for instance-level commodity retrieval, that explicitly injects entity knowledge in both node-based and subgraph-based ways into the multi-modal networks via a self-supervised hybrid-stream transformer, which could reduce the confusion between different object contents, thereby effectively guiding the network to focus on entities with real semantic. Experimental results well verify the efficacy and generalizability of our EGE-CMP, outperforming several SOTA cross-modal baselines like CLIP, UNITER and CAPTURE.
CtrlFormer: Learning Transferable State Representation for Visual Control via Transformer
Authors: Yao Mu, Shoufa Chen, Mingyu Ding, Jianyu Chen, Runjian Chen, Ping Luo
Abstract
Transformer has achieved great successes in learning vision and language representation, which is general across various downstream tasks. In visual control, learning transferable state representation that can transfer between different control tasks is important to reduce the training sample size. However, porting Transformer to sample-efficient visual control remains a challenging and unsolved problem. To this end, we propose a novel Control Transformer (CtrlFormer), possessing many appealing benefits that prior arts do not have. Firstly, CtrlFormer jointly learns self-attention mechanisms between visual tokens and policy tokens among different control tasks, where multitask representation can be learned and transferred without catastrophic forgetting. Secondly, we carefully design a contrastive reinforcement learning paradigm to train CtrlFormer, enabling it to achieve high sample efficiency, which is important in control problems. For example, in the DMControl benchmark, unlike recent advanced methods that failed by producing a zero score in the "Cartpole" task after transfer learning with 100k samples, CtrlFormer can achieve a state-of-the-art score with only 100k samples while maintaining the performance of previous tasks. The code and models are released in our project homepage.
SimA: Simple Softmax-free Attention for Vision Transformers
Abstract
Recently, vision transformers have become very popular. However, deploying them in many applications is computationally expensive partly due to the Softmax layer in the attention block. We introduce a simple but effective, Softmax-free attention block, SimA, which normalizes query and key matrices with simple $\ell_1$-norm instead of using Softmax layer. Then, the attention block in SimA is a simple multiplication of three matrices, so SimA can dynamically change the ordering of the computation at the test time to achieve linear computation on the number of tokens or the number of channels. We empirically show that SimA applied to three SOTA variations of transformers, DeiT, XCiT, and CvT, results in on-par accuracy compared to the SOTA models, without any need for Softmax layer. Interestingly, changing SimA from multi-head to single-head has only a small effect on the accuracy, which simplifies the attention block further. The code is available here: $\href{https://github.com/UCDvision/sima}{\text{This https URL}}$
Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks
Abstract
We propose Unified-IO, a model that performs a large variety of AI tasks spanning classical computer vision tasks, including pose estimation, object detection, depth estimation and image generation, vision-and-language tasks such as region captioning and referring expression comprehension, to natural language processing tasks such as question answering and paraphrasing. Developing a single unified model for such a large variety of tasks poses unique challenges due to the heterogeneous inputs and outputs pertaining to each task, including RGB images, per-pixel maps, binary masks, bounding boxes, and language. We achieve this unification by homogenizing every supported input and output into a sequence of discrete vocabulary tokens. This common representation across all tasks allows us to train a single transformer-based architecture, jointly on over 80 diverse datasets in the vision and language fields. Unified-IO is the first model capable of performing all 7 tasks on the GRIT benchmark and produces strong results across 16 diverse benchmarks like NYUv2-Depth, ImageNet, VQA2.0, OK-VQA, Swig, VizWizGround, BoolQ, and SciTail, with no task or benchmark specific fine-tuning. Demos for Unified-IO are available at https://unified-io.allenai.org.
Keyword: autonomous driving
GMI-DRL: Empowering Multi-GPU Deep Reinforcement Learning with GPU Spatial Multiplexing
Abstract
With the increasing popularity of robotics in industrial control and autonomous driving, deep reinforcement learning (DRL) raises the attention of various fields. However, DRL computation on the modern powerful GPU platform is still inefficient due to its heterogeneous workloads and interleaved execution paradigm. To this end, we propose GMI-DRL, a systematic design to accelerate multi-GPU DRL via GPU spatial multiplexing. We introduce a novel design of resource-adjustable GPU multiplexing instances (GMIs) to match the actual needs of DRL tasks, an adaptive GMI management strategy to simultaneously achieve high GPU utilization and computation throughput, and a highly efficient inter-GMI communication support to meet the demands of various DRL communication patterns. Comprehensive experiments reveal that GMI-DRL outperforms state-of-the-art NVIDIA Isaac Gym with NCCL (up to 2.81X) and Horovod (up to 2.34X) support in training throughput on the latest DGX-A100 platform. Our work provides an initial user experience with GPU spatial multiplexing in processing heterogeneous workloads with a mixture of computation and communication.
Effective Solid State LiDAR Odometry Using Continuous-time Filter Registration
Authors: Xin Zheng, Jianke Zhu
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Abstract
Solid-state LiDARs are more compact and cheaper than the conventional mechanical multi-line spinning LiDARs, which have become increasingly popular in autonomous driving recently. However, there are several challenges for these new LiDAR sensors, including severe motion distortions, small field of view and sparse point cloud, which hinder them from being widely used in LiDAR odometry. To tackle these problems, we present an effective continuous-time LiDAR odometry (ECTLO) method for the Risley prism-based LiDARs with non-repetitive scanning patterns. To account for the noisy data, a filter-based point-to-plane Gaussian Mixture Model is used for robust registration. Moreover, a LiDAR-only continuous-time motion model is employed to relieve the inevitable distortions. To facilitate the implicit data association in parallel, we maintain all map points within a single range image. Extensive experiments have been conducted on various testbeds using the solid-state LiDARs with different scanning patterns, whose promising results demonstrate the efficacy of our proposed approach.
SafeRL-Kit: Evaluating Efficient Reinforcement Learning Methods for Safe Autonomous Driving
Authors: Linrui Zhang, Qin Zhang, Li Shen, Bo Yuan, Xueqian Wang
Abstract
Safe reinforcement learning (RL) has achieved significant success on risk-sensitive tasks and shown promise in autonomous driving (AD) as well. Considering the distinctiveness of this community, efficient and reproducible baselines are still lacking for safe AD. In this paper, we release SafeRL-Kit to benchmark safe RL methods for AD-oriented tasks. Concretely, SafeRL-Kit contains several latest algorithms specific to zero-constraint-violation tasks, including Safety Layer, Recovery RL, off-policy Lagrangian method, and Feasible Actor-Critic. In addition to existing approaches, we propose a novel first-order method named Exact Penalty Optimization (EPO) and sufficiently demonstrate its capability in safe AD. All algorithms in SafeRL-Kit are implemented (i) under the off-policy setting, which improves sample efficiency and can better leverage past logs; (ii) with a unified learning framework, providing off-the-shelf interfaces for researchers to incorporate their domain-specific knowledge into fundamental safe RL methods. Conclusively, we conduct a comparative evaluation of the above algorithms in SafeRL-Kit and shed light on their efficacy for safe autonomous driving. The source code is available at \href{ https://github.com/zlr20/saferl_kit}{this https URL}.
Improving Diversity of Multiple Trajectory Prediction based on Map-adaptive Lane Loss
Authors: Sanmin Kim, Hyeongseok Jeon, Junwon Choi, Dongsuk Kum
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Prior arts in the field of motion predictions for autonomous driving tend to focus on finding a trajectory that is close to the ground truth trajectory. Such problem formulations and approaches, however, frequently lead to loss of diversity and biased trajectory predictions. Therefore, they are unsuitable for real-world autonomous driving where diverse and road-dependent multimodal trajectory predictions are critical for safety. To this end, this study proposes a novel loss function, \textit{Lane Loss}, that ensures map-adaptive diversity and accommodates geometric constraints. A two-stage trajectory prediction architecture with a novel trajectory candidate proposal module, \textit{Trajectory Prediction Attention (TPA)}, is trained with Lane Loss encourages multiple trajectories to be diversely distributed, covering feasible maneuvers in a map-aware manner. Furthermore, considering that the existing trajectory performance metrics are focusing on evaluating the accuracy based on the ground truth future trajectory, a quantitative evaluation metric is also suggested to evaluate the diversity of predicted multiple trajectories. The experiments performed on the Argoverse dataset show that the proposed method significantly improves the diversity of the predicted trajectories without sacrificing the prediction accuracy.
A Human-Centric Method for Generating Causal Explanations in Natural Language for Autonomous Vehicle Motion Planning
Authors: Balint Gyevnar, Massimiliano Tamborski, Cheng Wang, Christopher G. Lucas, Shay B. Cohen, Stefano V. Albrecht
Abstract
Inscrutable AI systems are difficult to trust, especially if they operate in safety-critical settings like autonomous driving. Therefore, there is a need to build transparent and queryable systems to increase trust levels. We propose a transparent, human-centric explanation generation method for autonomous vehicle motion planning and prediction based on an existing white-box system called IGP2. Our method integrates Bayesian networks with context-free generative rules and can give causal natural language explanations for the high-level driving behaviour of autonomous vehicles. Preliminary testing on simulated scenarios shows that our method captures the causes behind the actions of autonomous vehicles and generates intelligible explanations with varying complexity.
VectorMapNet: End-to-end Vectorized HD Map Learning
Authors: Yicheng Liu, Yue Wang, Yilun Wang, Hang Zhao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Abstract
Autonomous driving systems require a good understanding of surrounding environments, including moving obstacles and static High-Definition (HD) semantic maps. Existing methods approach the semantic map problem by offline manual annotations, which suffer from serious scalability issues. More recent learning-based methods produce dense rasterized segmentation predictions which do not include instance information of individual map elements and require heuristic post-processing that involves many hand-designed components, to obtain vectorized maps. To that end, we introduce an end-to-end vectorized HD map learning pipeline, termed VectorMapNet. VectorMapNet takes onboard sensor observations and predicts a sparse set of polylines primitives in the bird's-eye view to model the geometry of HD maps. Based on this pipeline, our method can explicitly model the spatial relation between map elements and generate vectorized maps that are friendly for downstream autonomous driving tasks without the need for post-processing. In our experiments, VectorMapNet achieves strong HD map learning performance on nuScenes dataset, surpassing previous state-of-the-art methods by 14.2 mAP. Qualitatively, we also show that VectorMapNet is capable of generating comprehensive maps and capturing more fine-grained details of road geometry. To the best of our knowledge, VectorMapNet is the first work designed toward end-to-end vectorized HD map learning problems.
New submissions for Mon, 20 Jun 22
Keyword: SLAM
An Algorithm for the SE(3)-Transformation on Neural Implicit Maps for Remapping Functions
Efficient WiFi LiDAR SLAM for Autonomous Robots in Large Environments
Keyword: odometry
Effective Solid State LiDAR Odometry Using Continuous-time Filter Registration
Efficient WiFi LiDAR SLAM for Autonomous Robots in Large Environments
Keyword: livox
There is no result
Keyword: loam
There is no result
Keyword: lidar
Effective Solid State LiDAR Odometry Using Continuous-time Filter Registration
Efficient WiFi LiDAR SLAM for Autonomous Robots in Large Environments
Keyword: loop detection
There is no result
Keyword: nerf
There is no result
Keyword: mapping
On the Influence of Enforcing Model Identifiability on Learning dynamics of Gaussian Mixture Models
The Role of Depth, Width, and Activation Complexity in the Number of Linear Regions of Neural Networks
Learning Using Privileged Information for Zero-Shot Action Recognition
Statistical and Neural Methods for Cross-lingual Entity Label Mapping in Knowledge Graphs
An Algorithm for the SE(3)-Transformation on Neural Implicit Maps for Remapping Functions
Efficient WiFi LiDAR SLAM for Autonomous Robots in Large Environments
Keyword: localization
Scalable Temporal Localization of Sensitive Activities in Movies and TV Episodes
CDNet: Contrastive Disentangled Network for Fine-Grained Image Categorization of Ocular B-Scan Ultrasound
Efficient WiFi LiDAR SLAM for Autonomous Robots in Large Environments
DGMIL: Distribution Guided Multiple Instance Learning for Whole Slide Image Classification
SwarmHawk: Self-Sustaining Multi-Agent System for Landing on a Moving Platform through an Agent Supervision
Keyword: transformer
IRISformer: Dense Vision Transformers for Single-Image Inverse Rendering in Indoor Scenes
GAAMA 2.0: An Integrated System that Answers Boolean and Extractive Question
Backdoor Attacks on Vision Transformers
Rectify ViT Shortcut Learning by Visual Saliency
Multi-Contextual Predictions with Vision Transformer for Video Anomaly Detection
Bootstrapped Transformer for Offline Reinforcement Learning
Using Transfer Learning for Code-Related Tasks
Local Slot Attention for Vision-and-Language Navigation
Holistic Transformer: A Joint Neural Network for Trajectory Prediction and Decision-Making of Autonomous Vehicles
Entity-Graph Enhanced Cross-Modal Pretraining for Instance-level Product Retrieval
CtrlFormer: Learning Transferable State Representation for Visual Control via Transformer
SimA: Simple Softmax-free Attention for Vision Transformers
Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks
Keyword: autonomous driving
GMI-DRL: Empowering Multi-GPU Deep Reinforcement Learning with GPU Spatial Multiplexing
Effective Solid State LiDAR Odometry Using Continuous-time Filter Registration
SafeRL-Kit: Evaluating Efficient Reinforcement Learning Methods for Safe Autonomous Driving
Improving Diversity of Multiple Trajectory Prediction based on Map-adaptive Lane Loss
A Human-Centric Method for Generating Causal Explanations in Natural Language for Autonomous Vehicle Motion Planning
VectorMapNet: End-to-end Vectorized HD Map Learning