Abstract
Lidar odometry has attracted considerable attention as a robust localization method for autonomous robots operating in complex GNSS-denied environments. However, achieving reliable and efficient performance on heterogeneous platforms in large-scale environments remains an open challenge due to the limitations of onboard computation and memory resources needed for autonomous operation. In this work, we present LOCUS 2.0, a robust and computationally efficient lidar odometry system for real-time underground 3D mapping. LOCUS 2.0 includes a novel normals-based Generalized Iterative Closest Point (GICP) formulation that reduces the computation time of point cloud alignment, an adaptive voxel grid filter that maintains the desired computation load regardless of the environment's geometry, and a sliding-window map approach that bounds memory consumption. The proposed approach is shown to be suitable for deployment on heterogeneous robotic platforms involved in large-scale explorations under severe computation and memory constraints. We demonstrate LOCUS 2.0, a key element of the CoSTAR team's entry in the DARPA Subterranean Challenge, across various underground scenarios. We release LOCUS 2.0 as an open-source library and also release a lidar-based odometry dataset recorded in challenging, large-scale underground environments. The dataset features legged and wheeled platforms in multiple environments including fog, dust, darkness, and geometrically degenerate surroundings, with a total of 11 h of operation and 16 km of distance traveled.
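To make the adaptive voxel grid filter concrete, here is a minimal numpy sketch: downsample each scan to one point per voxel, then nudge the leaf size with a proportional update so the output point count tracks a target. The feedback rule and gain are illustrative assumptions, not the LOCUS 2.0 implementation.

```python
import numpy as np

def voxel_downsample(points, leaf):
    """Keep one representative point per occupied voxel of edge length `leaf`."""
    keys = np.floor(points / leaf).astype(np.int64)
    _, idx = np.unique(keys, axis=0, return_index=True)
    return points[np.sort(idx)]

def adaptive_voxel_filter(points, target_n, leaf, gain=0.2):
    """Filter the scan, then adapt the leaf size so the next scan's output
    stays near `target_n` points regardless of the scene's geometry
    (a sketch of the idea, not the LOCUS 2.0 code)."""
    filtered = voxel_downsample(points, leaf)
    leaf *= 1.0 + gain * (len(filtered) - target_n) / target_n  # proportional update
    return filtered, max(leaf, 1e-3)
```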
Keyword: livox
There is no result
Keyword: loam
There is no result
Keyword: lidar
LOCUS 2.0: Robust and Computationally Efficient Lidar Odometry for Real-Time Underground 3D Mapping
Authors: Andrzej Reinke, Matteo Palieri, Benjamin Morrell, Yun Chang, Kamak Ebadi, Luca Carlone, Ali-akbar Agha-mohammadi
Abstract
Lidar odometry has attracted considerable attention as a robust localization method for autonomous robots operating in complex GNSS-denied environments. However, achieving reliable and efficient performance on heterogeneous platforms in large-scale environments remains an open challenge due to the limitations of onboard computation and memory resources needed for autonomous operation. In this work, we present LOCUS 2.0, a robust and computationally efficient lidar odometry system for real-time underground 3D mapping. LOCUS 2.0 includes a novel normals-based Generalized Iterative Closest Point (GICP) formulation that reduces the computation time of point cloud alignment, an adaptive voxel grid filter that maintains the desired computation load regardless of the environment's geometry, and a sliding-window map approach that bounds memory consumption. The proposed approach is shown to be suitable for deployment on heterogeneous robotic platforms involved in large-scale explorations under severe computation and memory constraints. We demonstrate LOCUS 2.0, a key element of the CoSTAR team's entry in the DARPA Subterranean Challenge, across various underground scenarios. We release LOCUS 2.0 as an open-source library and also release a lidar-based odometry dataset recorded in challenging, large-scale underground environments. The dataset features legged and wheeled platforms in multiple environments including fog, dust, darkness, and geometrically degenerate surroundings, with a total of 11 h of operation and 16 km of distance traveled.
Robust 3D Object Detection in Cold Weather Conditions
Authors: Aldi Piroli, Vinzenz Dallabetta, Marc Walessa, Daniel Meissner, Johannes Kopp, Klaus Dietmayer
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract
Adverse weather conditions can negatively affect LiDAR-based object detectors. In this work, we focus on the phenomenon of vehicle gas exhaust condensation in cold weather conditions. This everyday effect can influence the estimation of object sizes and orientations and introduce ghost object detections, compromising the reliability of state-of-the-art object detectors. We propose to solve this problem by using data augmentation and a novel training loss term. To effectively train deep neural networks, a large set of labeled data is needed. In the case of adverse weather conditions, this process can be extremely laborious and expensive. We address this issue in two steps: First, we present a gas exhaust data generation method based on 3D surface reconstruction and sampling, which allows us to generate large sets of gas exhaust clouds from a small pool of labeled data. Second, we introduce a point cloud augmentation process that can be used to add gas exhaust to datasets recorded in good weather conditions. Finally, we formulate a new training loss term that leverages the augmented point cloud to increase object detection robustness by penalizing predictions that include noise. In contrast to other works, our method can be used with both grid-based and point-based detectors. Moreover, since our approach does not require any network architecture changes, inference times remain unchanged. Experimental results on real data show that our proposed method greatly increases robustness to gas exhaust and noisy data.
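The augmentation step lends itself to a small sketch. Assuming the generation stage produced a bank of exhaust clouds (`exhaust_bank` is a hypothetical name), one plausible way to inject them into clear-weather scans, together with the per-point noise labels the loss term needs, is:

```python
import numpy as np

def augment_with_exhaust(scan, exhaust_bank, rng=None):
    """Append one pre-generated gas-exhaust cloud (N, 3) to a clear-weather
    scan and return per-point labels (1 = exhaust noise) for a
    noise-penalizing loss. Placement ranges are illustrative only."""
    rng = rng or np.random.default_rng()
    cloud = exhaust_bank[rng.integers(len(exhaust_bank))].copy()
    cloud += rng.uniform([-20.0, -20.0, 0.0], [20.0, 20.0, 0.5])  # drop it in the scene
    labels = np.concatenate([np.zeros(len(scan)), np.ones(len(cloud))])
    return np.vstack([scan, cloud]), labels
```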
Keyword: loop detection
There is no result
Keyword: autonomous driving
Towards Model Generalization for Monocular 3D Object Detection
Abstract
Monocular 3D object detection (Mono3D) has achieved tremendous improvements with emerging large-scale autonomous driving datasets and the rapid development of deep learning techniques. However, due to severe domain gaps (e.g., differences in the field of view (FOV), pixel size, and object size across datasets), Mono3D detectors have difficulty generalizing, leading to drastic performance degradation on unseen domains. To solve these issues, we combine the position-invariant transform and multi-scale training with the pixel-size depth strategy to construct an effective unified camera-generalized paradigm (CGP). It fully considers discrepancies in the FOV and pixel size of images captured by different cameras. Moreover, we further investigate the obstacles in quantitative metrics under cross-dataset inference through an exhaustive systematic study. We discern that the size bias of predictions leads to a colossal failure. Hence, we propose the 2D-3D geometry-consistent object scaling strategy (GCOS) to bridge the gap via instance-level augmentation. Our method, called DGMono3D, achieves remarkable performance on all evaluated datasets and surpasses the SoTA unsupervised domain adaptation scheme even without utilizing data from the target domain.
Collaborative 3D Object Detection for Automatic Vehicle Systems via Learnable Communications
Authors: Junyong Wang, Yuan Zeng, Yi Gong
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract
Accurate detection of objects in 3D point clouds is a key problem in autonomous driving systems. Collaborative perception can incorporate information from spatially diverse sensors and provide significant benefits for improving the perception accuracy of autonomous driving systems. In this work, we consider an autonomous vehicle that uses local point cloud data and combines information from neighboring infrastructures through wireless links for cooperative 3D object detection. However, information sharing among the vehicle and infrastructures under predefined communication schemes may result in communication congestion and/or bring limited performance improvement. To this end, we propose a novel collaborative 3D object detection framework that consists of three components: feature learning networks that map point clouds into feature maps; an efficient communication block that propagates compact and fine-grained query feature maps from the vehicle to supporting infrastructures and optimizes attention weights between query and key to refine support feature maps; and a region proposal network that fuses local feature maps and weighted support feature maps for 3D object detection. We evaluate the performance of the proposed framework using a synthetic cooperative dataset created in two complex driving scenarios: a roundabout and a T-junction. Experimental results and bandwidth usage analysis demonstrate that our approach can save communication and computation costs and significantly improve detection performance under different detection difficulties in all scenarios.
Real-Time Trajectory Planning for Autonomous Driving with Gaussian Process and Incremental Refinement
Authors: Cheng Jie, Chen Yingbing, Zhang Qingwen, Gan Lu, Liu Ming
Abstract
Real-time kinodynamic trajectory planning in dynamic environments is critical yet challenging for autonomous driving. In this letter, we propose an efficient trajectory planning system for autonomous driving in complex dynamic scenarios through iterative and incremental path-speed optimization. Exploiting the decoupled structure of the planning problem, a path planner based on Gaussian process first generates a continuous arc-length parameterized path in the Frenét frame, considering static obstacle avoidance and curvature constraints. We theoretically prove that it is a good generalization of the well-known jerk optimal solution. An efficient s-t graph search method is introduced to find a speed profile along the generated path to deal with dynamic environments. Finally, the path and speed are optimized incrementally and iteratively to ensure kinodynamic feasibility. Various simulated scenarios with both static obstacles and dynamic agents verify the effectiveness and robustness of our proposed method. Experimental results show that our method can run at 20 Hz. The source code is released as an open-source package.
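The s-t graph search step can be illustrated with a toy dynamic program over a discretized station-time grid; the resolution, stage cost, and monotone-progress assumption below are mine, not the paper's exact formulation.

```python
import numpy as np

def st_speed_search(blocked, dt=0.5, ds=1.0, v_max=15.0):
    """Toy DP over an s-t grid: pick a monotone station index per time step,
    skipping cells occupied by predicted obstacles. blocked[t, s] is True if
    station s*ds is occupied at time t*dt."""
    T, S = blocked.shape
    max_step = max(1, int(v_max * dt / ds))     # farthest stations reachable per step
    cost = np.full((T, S), np.inf)
    parent = np.full((T, S), -1, dtype=int)
    cost[0, 0] = 0.0
    for t in range(1, T):
        for s in range(S):
            if blocked[t, s]:
                continue
            lo = max(0, s - max_step)
            prev = cost[t - 1, lo:s + 1]
            j = int(np.argmin(prev))
            if np.isfinite(prev[j]):
                cost[t, s] = prev[j] + 0.01 * (S - 1 - s)  # stage cost rewards progress
                parent[t, s] = lo + j
    s = int(np.argmin(cost[-1]))
    if not np.isfinite(cost[-1, s]):
        return None                              # no collision-free profile found
    profile = [s]
    for t in range(T - 1, 0, -1):
        s = parent[t, s]
        profile.append(s)
    return [i * ds for i in reversed(profile)]   # station sequence s(t)
```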
Memory based neural networks for end-to-end autonomous driving
Abstract
Recent works in end-to-end control for autonomous driving have investigated the use of vision-based exteroceptive perception. Inspired by such results, we propose a new end-to-end memory-based neural architecture for robot steering and throttle control. We describe and compare this architecture with previous approaches using fundamental error metrics (MAE, MSE) and several external metrics based on their performance on simulated test circuits. The presented work demonstrates the advantages of using internal memory for better generalization capabilities of the model, allowing it to drive in a broader range of circuits and situations. We analyze the algorithm in a wide range of environments and conclude that the proposed pipeline is robust to varying camera configurations. All the present work, including datasets, network model architectures, weights, simulator, and comparison software, is open source and easy to replicate and extend.
Learning to Drive Using Sparse Imitation Reinforcement Learning
Authors: Yuci Han, Alper Yilmaz
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
In this paper, we propose Sparse Imitation Reinforcement Learning (SIRL), a hybrid end-to-end control policy that combines sparse expert driving knowledge with a reinforcement learning (RL) policy for the autonomous driving (AD) task in the CARLA simulation environment. The sparse expert is designed based on hand-crafted rules, which are suboptimal but provide a risk-averse strategy by enforcing experience for critical scenarios such as pedestrian and vehicle avoidance and traffic light detection. As has been demonstrated, training an RL agent from scratch is data-inefficient and time-consuming, particularly for the urban driving task, due to the complexity of situations stemming from the vast size of the state space. Our SIRL strategy addresses these problems by fusing the output distribution of the sparse expert policy and the RL policy to generate a composite driving policy. With the guidance of the sparse expert during the early training stage, the SIRL strategy accelerates the training process, keeps the RL exploration from causing catastrophic outcomes, and ensures safe exploration. To some extent, the SIRL agent imitates the driving expert's behavior. At the same time, it continuously gains knowledge during training, so it keeps improving beyond the sparse expert and can surpass both the sparse expert and a traditional RL agent. We experimentally validate the efficacy of the proposed SIRL approach in a complex urban scenario within the CARLA simulator. Besides, we compare the SIRL agent's risk-averse exploration and high learning efficiency with the traditional RL approach. We additionally demonstrate the SIRL agent's ability to generalize, transferring its driving skill to unseen environments.
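The fusion of the two output distributions suggests a short sketch; the linear decay schedule below is an assumption, since the abstract says only that the distributions are fused.

```python
import numpy as np

def composite_policy(p_expert, p_rl, step, warmup_steps=50_000):
    """Mix the sparse expert's action distribution with the RL policy's;
    expert guidance fades as training progresses (schedule is a guess)."""
    w = max(0.0, 1.0 - step / warmup_steps)
    p = w * np.asarray(p_expert) + (1.0 - w) * np.asarray(p_rl)
    return p / p.sum()
```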
Keyword: mapping
VPAIR -- Aerial Visual Place Recognition and Localization in Large-scale Outdoor Environments
Authors: Michael Schleiss, Fahmi Rouatbi, Daniel Cremers
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Visual Place Recognition and Visual Localization are essential components in navigation and mapping for autonomous vehicles, especially in GNSS-denied navigation scenarios. Recent work has focused on ground or close-to-ground applications such as self-driving cars, indoor scenarios, and low-altitude drone flights. However, applications such as Urban Air Mobility require operations in large-scale outdoor environments at medium to high altitudes. We present a new dataset named VPAIR. The dataset was recorded on board a light aircraft flying at an altitude of more than 300 meters above ground, capturing images with a downward-facing camera. Each image is paired with a high-resolution reference render including dense depth information and 6-DoF reference poses. The dataset covers a trajectory more than one hundred kilometers long over various types of challenging landscapes, e.g., urban, farmland, and forests. Experiments on this dataset illustrate the challenges introduced by the change in perspective to a bird's eye view, such as in-plane rotations.
Utilizing Language-Image Pretraining for Efficient and Robust Bilingual Word Alignment
Abstract
Word translation without parallel corpora has become feasible, rivaling the performance of supervised methods. Recent findings have shown that the accuracy and robustness of unsupervised word translation (UWT) can be improved by making use of visual observations, which are universal representations across languages. In this work, we investigate the potential of using not only visual observations but also pretrained language-image models to enable more efficient and robust UWT. Specifically, we develop a novel UWT method dubbed Word Alignment using Language-Image Pretraining (WALIP), which leverages visual observations via the shared embedding space of images and texts provided by CLIP models (Radford et al., 2021). WALIP has a two-step procedure. First, we retrieve word pairs with high similarity confidence, computed using our proposed image-based fingerprints, which define the initial pivot for the word alignment. Second, we apply our robust Procrustes algorithm to estimate the linear mapping between two embedding spaces, which iteratively corrects and refines the estimated alignment. Our extensive experiments show that WALIP improves upon the state-of-the-art performance of bilingual word alignment for a few language pairs across different word embeddings and displays great robustness to the dissimilarity of language pairs or training corpora for the two word embeddings.
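The Procrustes step is classical and worth sketching: the orthogonal map comes from an SVD, and a robust variant can refit on the best-aligned pairs. The keep-ratio loop below is a guess at the refinement, not WALIP's exact algorithm.

```python
import numpy as np

def procrustes(X, Y):
    """W = argmin ||W X - Y||_F over orthogonal W, via SVD of Y X^T.
    Columns of X, Y are paired word vectors in the two embedding spaces."""
    U, _, Vt = np.linalg.svd(Y @ X.T)
    return U @ Vt

def robust_procrustes(X, Y, iters=5, keep=0.8):
    """Iterative refinement: refit on the most consistent pairs each round.
    The keep ratio and residual criterion are illustrative assumptions."""
    idx = np.arange(X.shape[1])
    W = np.eye(X.shape[0])
    for _ in range(iters):
        W = procrustes(X[:, idx], Y[:, idx])
        resid = np.linalg.norm(W @ X - Y, axis=0)
        idx = np.argsort(resid)[: int(keep * X.shape[1])]  # best-aligned pairs
    return W
```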
Fast GPU bounding boxes on tree-structured scenes
Abstract
Computation of bounding boxes is a fundamental problem in high performance rendering, as it is an input to visibility culling and binning operations. In a scene description structured as a tree, clip nodes and blend nodes entail intersection and union of bounding boxes, respectively. These are straightforward to compute on the CPU using a sequential algorithm, but an efficient, parallel GPU algorithm is more elusive. This paper presents a fast and practical solution, with a new algorithm for the classic parentheses matching problem at its core. The core algorithm is presented abstractly (in terms of a PRAM abstraction), then with a concrete mapping to the thread, workgroup, and dispatch levels of real GPU hardware. The algorithm is implemented portably using compute shaders, and performance results show a dramatic speedup over a sequential CPU version, and indeed a reasonable fraction of maximum theoretical throughput of the GPU hardware. The immediate motivating application is 2D rendering, but the algorithms generalize to other domains, and the core parentheses matching problem has other applications including parsing.
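The sequential CPU reference mentioned in the abstract is easy to state; below is a sketch of the post-order reduction over a scene tree whose blend nodes union child boxes and whose clip nodes intersect them (the GPU parentheses-matching algorithm itself is not reproduced here).

```python
from dataclasses import dataclass, field

INF = float("inf")
EMPTY = (INF, INF, -INF, -INF)          # (x0, y0, x1, y1)

def union(a, b):
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def intersect(a, b):
    box = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    return box if box[0] <= box[2] and box[1] <= box[3] else EMPTY

@dataclass
class Node:
    kind: str                            # "leaf", "blend" (union), or "clip" (intersection)
    bbox: tuple = EMPTY                  # set for leaves
    children: list = field(default_factory=list)

def scene_bbox(node):
    """Sequential CPU reference: post-order bounding-box reduction."""
    if node.kind == "leaf":
        return node.bbox
    boxes = [scene_bbox(c) for c in node.children]
    out = boxes[0]
    for b in boxes[1:]:
        out = union(out, b) if node.kind == "blend" else intersect(out, b)
    return out

# e.g. scene_bbox(Node("clip", children=[Node("leaf", bbox=(0, 0, 4, 4)),
#                                        Node("leaf", bbox=(2, 2, 8, 8))]))  -> (2, 2, 4, 4)
```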
LOCUS 2.0: Robust and Computationally Efficient Lidar Odometry for Real-Time Underground 3D Mapping
Authors: Andrzej Reinke, Matteo Palieri, Benjamin Morrell, Yun Chang, Kamak Ebadi, Luca Carlone, Ali-akbar Agha-mohammadi
Abstract
Lidar odometry has attracted considerable attention as a robust localization method for autonomous robots operating in complex GNSS-denied environments. However, achieving reliable and efficient performance on heterogeneous platforms in large-scale environments remains an open challenge due to the limitations of onboard computation and memory resources needed for autonomous operation. In this work, we present LOCUS 2.0, a robust and computationally efficient lidar odometry system for real-time underground 3D mapping. LOCUS 2.0 includes a novel normals-based Generalized Iterative Closest Point (GICP) formulation that reduces the computation time of point cloud alignment, an adaptive voxel grid filter that maintains the desired computation load regardless of the environment's geometry, and a sliding-window map approach that bounds memory consumption. The proposed approach is shown to be suitable for deployment on heterogeneous robotic platforms involved in large-scale explorations under severe computation and memory constraints. We demonstrate LOCUS 2.0, a key element of the CoSTAR team's entry in the DARPA Subterranean Challenge, across various underground scenarios. We release LOCUS 2.0 as an open-source library and also release a lidar-based odometry dataset recorded in challenging, large-scale underground environments. The dataset features legged and wheeled platforms in multiple environments including fog, dust, darkness, and geometrically degenerate surroundings, with a total of 11 h of operation and 16 km of distance traveled.
NFL: Robust Learned Index via Distribution Transformation
Authors: Shangyu Wu, Yufei Cui, Jinghuan Yu, Xuan Sun, Tei-Wei Kuo, Chun Jason Xue
Abstract
Recent works on learned indexes open a new direction for the indexing field. The key insight of the learned index is to approximate the mapping between keys and positions with piece-wise linear functions. Such methods require partitioning the key space for a better approximation. Although many heuristics have been proposed to improve the approximation quality, the bottleneck is that the segmentation overheads can hinder overall performance. This paper tackles the approximation problem by applying a distribution transformation to the keys before constructing the learned index. A two-stage Normalizing-Flow-based Learned index framework (NFL) is proposed, which first transforms the original complex key distribution into a near-uniform distribution, then builds a learned index leveraging the transformed keys. For effective distribution transformation, we propose a Numerical Normalizing Flow (Numerical NF). Based on the characteristics of the transformed keys, we propose a robust After-Flow Learned Index (AFLI). To validate the performance, comprehensive evaluations are conducted on both synthetic and real-world workloads, which show that the proposed NFL produces the highest throughput and the lowest tail latency compared to state-of-the-art learned indexes.
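As a toy version of the two-stage idea: fit a monotone transform on quantiles (a crude stand-in for the numerical normalizing flow), then index the transformed keys with a trivial linear model plus a worst-case error bound for the final local search. All modeling choices here are illustrative, not NFL's.

```python
import numpy as np

class ToyFlowIndex:
    """Sketch: (1) quantile-fitted monotone transform maps keys toward uniform;
    (2) a linear position predictor with a worst-case error bound."""
    def fit(self, keys, knots=64):
        self.keys = np.sort(np.asarray(keys, dtype=float))
        n = len(self.keys)
        q = np.linspace(0.0, 1.0, knots)
        xk = np.quantile(self.keys, q)
        self.xk, uniq = np.unique(xk, return_index=True)  # keep interp monotone
        self.yk = q[uniq]
        pred = np.interp(self.keys, self.xk, self.yk) * (n - 1)
        self.err = int(np.ceil(np.abs(pred - np.arange(n)).max())) + 1  # +1 for truncation
        return self

    def lookup(self, key):
        n = len(self.keys)
        pos = int(np.interp(key, self.xk, self.yk) * (n - 1))
        lo, hi = max(0, pos - self.err), min(n, pos + self.err + 1)
        return lo + int(np.searchsorted(self.keys[lo:hi], key))  # short bounded search
```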
Approximation speed of quantized vs. unquantized ReLU neural networks and beyond
Authors: Antoine Gonon (DANTE, ARIC), Nicolas Brisebarre (ARIC), Rémi Gribonval (DANTE), Elisa Riccietti (DANTE)
Subjects: Information Theory (cs.IT); Neural and Evolutionary Computing (cs.NE)
Abstract
We consider general approximation families encompassing ReLU neural networks. On the one hand, we introduce a new property, which we call $\infty$-encodability, which lays out a framework that we use (i) to guarantee that ReLU networks can be uniformly quantized and still have approximation speeds comparable to unquantized ones, and (ii) to prove that ReLU networks share a common limitation with many other approximation families: the approximation speed of a set C is bounded from above by an encoding complexity of C (a complexity well-known for many C's). The property of $\infty$-encodability allows us to unify and generalize known results in which it was implicitly used. On the other hand, we give lower and upper bounds on the Lipschitz constant of the mapping that associates the weights of a network with the function they represent in L^p. It is given in terms of the width and depth of the network and a bound on the weights' norm, and it is based on well-known upper bounds on the Lipschitz constants of the functions represented by ReLU networks. This allows us to recover known results, to establish new bounds on covering numbers, and to characterize the accuracy of naive uniform quantization of ReLU networks.
On statistic alignment for domain adaptation in structural health monitoring
Authors: Jack Poole, Paul Gardner, Nikolaos Dervilis, Lawrence Bull, Keith Worden
Abstract
The practical application of structural health monitoring (SHM) is often limited by the availability of labelled data. Transfer learning - specifically in the form of domain adaptation (DA) - gives rise to the possibility of leveraging information from a population of physical or numerical structures, by inferring a mapping that aligns the feature spaces. Typical DA methods rely on nonparametric distance metrics, which require sufficient data to perform density estimation. In addition, these methods can be prone to performance degradation under class imbalance. To address these issues, statistic alignment (SA) is discussed, with a demonstration of how these methods can be made robust to class imbalance, including a special case of class imbalance called a partial DA scenario. SA is demonstrated to facilitate damage localisation with no target labels in a numerical case study, outperforming other state-of-the-art DA methods. It is then shown to be capable of aligning the feature spaces of a real heterogeneous population, the Z24 and KW51 bridges, with only 220 samples used from the KW51 bridge. Finally, in scenarios where more complex mappings are required for knowledge transfer, SA is shown to be a vital pre-processing tool, increasing the performance of established DA methods.
Keyword: localization
VPAIR -- Aerial Visual Place Recognition and Localization in Large-scale Outdoor Environments
Authors: Michael Schleiss, Fahmi Rouatbi, Daniel Cremers
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Visual Place Recognition and Visual Localization are essential components in navigation and mapping for autonomous vehicles, especially in GNSS-denied navigation scenarios. Recent work has focused on ground or close-to-ground applications such as self-driving cars, indoor scenarios, and low-altitude drone flights. However, applications such as Urban Air Mobility require operations in large-scale outdoor environments at medium to high altitudes. We present a new dataset named VPAIR. The dataset was recorded on board a light aircraft flying at an altitude of more than 300 meters above ground, capturing images with a downward-facing camera. Each image is paired with a high-resolution reference render including dense depth information and 6-DoF reference poses. The dataset covers a trajectory more than one hundred kilometers long over various types of challenging landscapes, e.g., urban, farmland, and forests. Experiments on this dataset illustrate the challenges introduced by the change in perspective to a bird's eye view, such as in-plane rotations.
TransforMatcher: Match-to-Match Attention for Semantic Correspondence
Authors: Seungwook Kim, Juhong Min, Minsu Cho
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Establishing correspondences between images remains a challenging task, especially under large appearance changes due to different viewpoints or intra-class variations. In this work, we introduce a strong semantic image matching learner, dubbed TransforMatcher, which builds on the success of transformer networks in vision domains. Unlike existing convolution- or attention-based schemes for correspondence, TransforMatcher performs global match-to-match attention for precise match localization and dynamic refinement. To handle a large number of matches in a dense correlation map, we develop a light-weight attention architecture to consider the global match-to-match interactions. We also propose to utilize a multi-channel correlation map for refinement, treating the multi-level scores as features instead of a single score to fully exploit the richer layer-wise semantics. In experiments, TransforMatcher sets a new state of the art on SPair-71k while performing on par with existing SOTA methods on the PF-PASCAL dataset.
Ranking-Based Siamese Visual Tracking
Authors: Feng Tang, Qiang Ling
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Current Siamese-based trackers mainly formulate visual tracking as two independent subtasks: classification and localization. They learn the classification subnetwork by processing each sample separately and neglect the relationship among positive and negative samples. Moreover, such a tracking paradigm takes only the classification confidence of proposals for the final prediction, which may yield a misalignment between classification and localization. To resolve these issues, this paper proposes a ranking-based optimization algorithm to explore the relationship among different proposals. To this end, we introduce two ranking losses, a classification one and an IoU-guided one, as optimization constraints. The classification ranking loss ensures that positive samples rank higher than hard negative ones, i.e., distractors, so that the trackers can select the foreground samples successfully without being fooled by the distractors. The IoU-guided ranking loss aims to align classification confidence scores with the Intersection over Union (IoU) of the corresponding localization prediction for positive samples, enabling well-localized predictions to be represented by high classification confidence. Specifically, the proposed two ranking losses are compatible with most Siamese trackers and incur no additional computation for inference. Extensive experiments on seven tracking benchmarks, including OTB100, UAV123, TC128, VOT2016, NFS30, GOT-10k and LaSOT, demonstrate the effectiveness of the proposed ranking-based optimization algorithm. The code and raw results are available at https://github.com/sansanfree/RBO.
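Both constraints have natural pairwise-hinge sketches, shown below; the paper's exact loss forms may differ.

```python
import numpy as np

def ranking_losses(scores, labels, ious, margin=0.5, iou_gap=0.05):
    """Pairwise hinge versions of the two constraints. scores: classification
    confidences; labels: 1 = positive proposal, 0 = (hard) negative;
    ious: IoU with ground truth."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    # classification ranking: every positive should out-score every negative
    cls_rank = (np.maximum(0.0, margin - (pos[:, None] - neg[None, :])).mean()
                if pos.size and neg.size else 0.0)
    # IoU-guided ranking: better-localized positives should score higher
    iou = ious[labels == 1]
    want = iou[:, None] - iou[None, :] > iou_gap          # pairs with a clear IoU gap
    iou_rank = (np.maximum(0.0, margin - (pos[:, None] - pos[None, :]))[want].mean()
                if want.any() else 0.0)
    return cls_rank + iou_rank
```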
LOCUS 2.0: Robust and Computationally Efficient Lidar Odometry for Real-Time Underground 3D Mapping
Authors: Andrzej Reinke, Matteo Palieri, Benjamin Morrell, Yun Chang, Kamak Ebadi, Luca Carlone, Ali-akbar Agha-mohammadi
Abstract
Lidar odometry has attracted considerable attention as a robust localization method for autonomous robots operating in complex GNSS-denied environments. However, achieving reliable and efficient performance on heterogeneous platforms in large-scale environments remains an open challenge due to the limitations of onboard computation and memory resources needed for autonomous operation. In this work, we present LOCUS 2.0, a robust and computationally efficient lidar odometry system for real-time underground 3D mapping. LOCUS 2.0 includes a novel normals-based Generalized Iterative Closest Point (GICP) formulation that reduces the computation time of point cloud alignment, an adaptive voxel grid filter that maintains the desired computation load regardless of the environment's geometry, and a sliding-window map approach that bounds memory consumption. The proposed approach is shown to be suitable for deployment on heterogeneous robotic platforms involved in large-scale explorations under severe computation and memory constraints. We demonstrate LOCUS 2.0, a key element of the CoSTAR team's entry in the DARPA Subterranean Challenge, across various underground scenarios. We release LOCUS 2.0 as an open-source library and also release a lidar-based odometry dataset recorded in challenging, large-scale underground environments. The dataset features legged and wheeled platforms in multiple environments including fog, dust, darkness, and geometrically degenerate surroundings, with a total of 11 h of operation and 16 km of distance traveled.
OnePose: One-Shot Object Pose Estimation without CAD Models
Abstract
We propose a new method named OnePose for object pose estimation. Unlike existing instance-level or category-level methods, OnePose does not rely on CAD models and can handle objects in arbitrary categories without instance- or category-specific network training. OnePose draws on ideas from visual localization and only requires a simple RGB video scan of the object to build a sparse SfM model of the object. Then, this model is registered to new query images with a generic feature matching network. To mitigate the slow runtime of existing visual localization methods, we propose a new graph attention network that directly matches 2D interest points in the query image with the 3D points in the SfM model, resulting in efficient and robust pose estimation. Combined with a feature-based pose tracker, OnePose is able to stably detect and track the 6D poses of everyday household objects in real time. We also collected a large-scale dataset that consists of 450 sequences of 150 objects.
Keyword: transformer
Simple Recurrence Improves Masked Language Models
Authors: Tao Lei, Ran Tian, Jasmijn Bastings, Ankur P. Parikh
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Abstract
In this work, we explore whether modeling recurrence into the Transformer architecture can both be beneficial and efficient, by building an extremely simple recurrent module into the Transformer. We compare our model to baselines following the training and evaluation recipe of BERT. Our results confirm that recurrence can indeed improve Transformer models by a consistent margin, without requiring low-level performance optimizations, and while keeping the number of parameters constant. For example, our base model achieves an absolute improvement of 2.1 points averaged across 10 tasks and also demonstrates increased stability in fine-tuning over a range of learning rates.
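One plausible reading of "extremely simple recurrent module" is a single-layer cell scanned over positions and added residually beside the Transformer block; the PyTorch sketch below is a guess at such a design, not the paper's exact cell.

```python
import torch
import torch.nn as nn

class SimpleRecurrence(nn.Module):
    """A deliberately tiny recurrent cell added beside a Transformer layer."""
    def __init__(self, d_model):
        super().__init__()
        self.cell = nn.Linear(2 * d_model, d_model)

    def forward(self, x):                          # x: (seq_len, batch, d_model)
        h = torch.zeros_like(x[0])
        states = []
        for t in range(x.size(0)):                 # sequential scan over positions
            h = torch.tanh(self.cell(torch.cat([x[t], h], dim=-1)))
            states.append(h)
        return x + torch.stack(states)             # residual: recurrence refines, not replaces
```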
Towards Opening the Black Box of Neural Machine Translation: Source and Target Interpretations of the Transformer
Authors: Javier Ferrando, Gerard I. Gállego, Belen Alastruey, Carlos Escolano, Marta R. Costa-jussà
Abstract
In Neural Machine Translation (NMT), each token prediction is conditioned on the source sentence and the target prefix (what has been previously translated at a decoding step). However, previous work on interpretability in NMT has focused solely on source sentence token attributions. Therefore, we lack a full understanding of the influence of every input token (source sentence and target prefix) on the model predictions. In this work, we propose an interpretability method that tracks complete input token attributions. Our method, which can be extended to any encoder-decoder Transformer-based model, allows us to better comprehend the inner workings of current NMT models. We apply the proposed method to both bilingual and multilingual Transformers and present insights into their behaviour.
TransforMatcher: Match-to-Match Attention for Semantic Correspondence
Authors: Seungwook Kim, Juhong Min, Minsu Cho
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Establishing correspondences between images remains a challenging task, especially under large appearance changes due to different viewpoints or intra-class variations. In this work, we introduce a strong semantic image matching learner, dubbed TransforMatcher, which builds on the success of transformer networks in vision domains. Unlike existing convolution- or attention-based schemes for correspondence, TransforMatcher performs global match-to-match attention for precise match localization and dynamic refinement. To handle a large number of matches in a dense correlation map, we develop a light-weight attention architecture to consider the global match-to-match interactions. We also propose to utilize a multi-channel correlation map for refinement, treating the multi-level scores as features instead of a single score to fully exploit the richer layer-wise semantics. In experiments, TransforMatcher sets a new state of the art on SPair-71k while performing on par with existing SOTA methods on the PF-PASCAL dataset.
FlexiBERT: Are Current Transformer Architectures too Homogeneous and Rigid?
Authors: Shikhar Tuli, Bhishma Dedhia, Shreshth Tuli, Niraj K. Jha
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Abstract
The existence of a plethora of language models makes the problem of selecting the best one for a custom task challenging. Most state-of-the-art methods leverage transformer-based models (e.g., BERT) or their variants. Training such models and exploring their hyperparameter space, however, is computationally expensive. Prior work proposes several neural architecture search (NAS) methods that employ performance predictors (e.g., surrogate models) to address this issue; however, analysis has been limited to homogeneous models that use fixed dimensionality throughout the network, which leads to sub-optimal architectures. To address this limitation, we propose a suite of heterogeneous and flexible models, namely FlexiBERT, that have varied encoder layers with a diverse set of possible operations and different hidden dimensions. For better-posed surrogate modeling in this expanded design space, we propose a new graph-similarity-based embedding scheme. We also propose a novel NAS policy, called BOSHNAS, that leverages this new scheme, Bayesian modeling, and second-order optimization to quickly train and use a neural surrogate model to converge to the optimal architecture. A comprehensive set of experiments shows that the proposed policy, when applied to the FlexiBERT design space, pushes the performance frontier upwards compared to traditional models. FlexiBERT-Mini, one of our proposed models, has 3% fewer parameters than BERT-Mini and achieves an 8.9% higher GLUE score. A FlexiBERT model with performance equivalent to the best homogeneous model is 2.6x smaller. FlexiBERT-Large, another proposed model, achieves state-of-the-art results, outperforming the baseline models by at least 5.7% on the GLUE benchmark.
Workflow Discovery from Dialogues in the Low Data Regime
Authors: Amine El Hattami, Stefania Raimondo, Issam Laradji, David Vazquez, Pau Rodriguez, Chris Pal
Abstract
Text-based dialogues are now widely used to solve real-world problems. In cases where solution strategies are already known, they can sometimes be codified into workflows and used to guide humans or artificial agents through the task of helping clients. We are interested in the situation where a formal workflow may not yet exist, but we wish to discover the steps of actions that have been taken to resolve problems. We examine a novel transformer-based approach for this situation, and we present experiments where we summarize dialogues in the Action-Based Conversations Dataset (ABCD) with workflows. Since the ABCD dialogues were generated using known workflows to guide agents, we can evaluate our ability to extract such workflows using ground-truth sequences of action steps, organized as workflows. We propose and evaluate an approach that conditions models on the set of allowable action steps, and we show that using this strategy we can improve workflow discovery (WD) performance. Our conditioning approach also improves zero-shot and few-shot WD performance when transferring learned models to entirely new domains (i.e., the MultiWOZ setting). Further, a modified variant of our architecture achieves state-of-the-art performance on the related but different problems of Action State Tracking (AST) and Cascading Dialogue Success (CDS) on the ABCD.
SCVRL: Shuffled Contrastive Video Representation Learning
Authors: Michael Dorkenwald, Fanyi Xiao, Biagio Brattoli, Joseph Tighe, Davide Modolo
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
We propose SCVRL, a novel contrastive-based framework for self-supervised learning for videos. Unlike previous contrastive-learning-based methods that mostly focus on learning visual semantics (e.g., CVRL), SCVRL is capable of learning both semantic and motion patterns. To that end, we reformulate the popular shuffling pretext task within a modern contrastive learning paradigm. We show that our transformer-based network has a natural capacity to learn motion in self-supervised settings and achieves strong performance, outperforming CVRL on four benchmarks.
ItemSage: Learning Product Embeddings for Shopping Recommendations at Pinterest
Authors: Paul Baltescu, Haoyu Chen, Nikil Pancha, Andrew Zhai, Jure Leskovec, Charles Rosenberg
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Abstract
Learned embeddings for products are an important building block for web-scale e-commerce recommendation systems. At Pinterest, we build a single set of product embeddings called ItemSage to provide relevant recommendations in all shopping use cases including user, image and search based recommendations. This approach has led to significant improvements in engagement and conversion metrics, while reducing both infrastructure and maintenance cost. While most prior work focuses on building product embeddings from features coming from a single modality, we introduce a transformer-based architecture capable of aggregating information from both text and image modalities and show that it significantly outperforms single modality baselines. We also utilize multi-task learning to make ItemSage optimized for several engagement types, leading to a candidate generation system that is efficient for all of the engagement objectives of the end-to-end recommendation system. Extensive offline experiments are conducted to illustrate the effectiveness of our approach and results from online A/B experiments show substantial gains in key business metrics (up to +7% gross merchandise value/user and +11% click volume).
PERT: A New Solution to Pinyin to Character Conversion Task
Abstract
The Pinyin to Character conversion (P2C) task is the key task of the Input Method Engine (IME) in commercial input software for Asian languages such as Chinese, Japanese, and Thai. It is usually treated as a sequence labelling task and resolved with a language model, i.e., an n-gram model or an RNN. However, the low capacity of the n-gram model or RNN limits performance. This paper introduces a new solution named PERT, which stands for bidirectional Pinyin Encoder Representations from Transformers. It achieves significant performance improvements over the baselines. Furthermore, we combine PERT with an n-gram model under a Markov framework and improve performance further. Lastly, an external lexicon is incorporated into PERT to resolve the OOD issue of IMEs.
BabyBear: Cheap inference triage for expensive language models
Authors: Leila Khalili, Yao You, John Bohannon
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Performance (cs.PF)
Abstract
Transformer language models provide superior accuracy over previous models but they are computationally and environmentally expensive. Borrowing the concept of model cascading from computer vision, we introduce BabyBear, a framework for cascading models for natural language processing (NLP) tasks to minimize cost. The core strategy is inference triage, exiting early when the least expensive model in the cascade achieves a sufficiently high-confidence prediction. We test BabyBear on several open source data sets related to document classification and entity recognition. We find that for common NLP tasks a high proportion of the inference load can be accomplished with cheap, fast models that have learned by observing a deep learning model. This allows us to reduce the compute cost of large-scale classification jobs by more than 50% while retaining overall accuracy. For named entity recognition, we save 33% of the deep learning compute while maintaining an F1 score higher than 95% on the CoNLL benchmark.
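The core triage loop is simple enough to state directly; the confidence threshold below is task-specific and would be tuned to preserve overall accuracy.

```python
def triage(x, cheap_model, expensive_model, threshold=0.9):
    """Inference triage: answer with the cheap model when it is confident,
    otherwise escalate. Both models are assumed to return (label, confidence)."""
    label, confidence = cheap_model(x)
    if confidence >= threshold:
        return label, "cheap"            # early exit saves deep-model compute
    return expensive_model(x)[0], "expensive"
```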
UMSNet: An Universal Multi-sensor Network for Human Activity Recognition
Authors: Jialiang Wang, Haotian Wei, Yi Wang, Shu Yang, Chi Li
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract
Human activity recognition (HAR) based on multimodal sensors has become a rapidly growing branch of biometric recognition and artificial intelligence. However, how to fully mine multimodal time series data and effectively learn accurate behavioral features has always been a hot topic in this field. Practical applications also require a well-generalized framework that can quickly process a variety of raw sensor data and learn better feature representations. This paper proposes a universal multi-sensor network (UMSNet) for human activity recognition. In particular, we propose a new lightweight sensor residual block (the LSR block), which improves performance by reducing the number of activation functions and normalization layers and adding an inverted bottleneck structure and grouped convolution. A Transformer is then used to extract the relationships among series features to perform classification and recognition of human activities. Our framework has a clear structure and can be directly applied to various types of multi-modal time series classification (TSC) tasks after simple specialization. Extensive experiments show that the proposed UMSNet outperforms other state-of-the-art methods on two popular multi-sensor human activity recognition datasets (i.e., the HHAR dataset and the MHEALTH dataset).
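Reconstructing the LSR block from the description alone, one plausible layout is an inverted bottleneck with a grouped convolution and a single activation and normalization layer; channel sizes, kernel width, and group count below are assumptions.

```python
import torch.nn as nn

class LSRBlock(nn.Module):
    """A guess at the lightweight sensor residual block: inverted bottleneck,
    grouped conv, one activation, one norm."""
    def __init__(self, channels, expand=4, groups=4):
        super().__init__()
        mid = expand * channels
        self.body = nn.Sequential(
            nn.Conv1d(channels, mid, kernel_size=1),                      # expand
            nn.Conv1d(mid, mid, kernel_size=3, padding=1, groups=groups), # grouped conv
            nn.GELU(),                                                    # single activation
            nn.Conv1d(mid, channels, kernel_size=1),                      # project back
            nn.BatchNorm1d(channels),                                     # single norm
        )

    def forward(self, x):                 # x: (batch, channels, time)
        return x + self.body(x)
```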
Meta Policy Learning for Cold-Start Conversational Recommendation
Abstract
Conversational recommender systems (CRS) explicitly solicit users' preferences for improved recommendations on the fly. Most existing CRS solutions employ reinforcement learning methods to train a single policy for a population of users. However, for users new to the system, such a global policy becomes ineffective for producing conversational recommendations, i.e., the cold-start challenge. In this paper, we study CRS policy learning for cold-start users via meta reinforcement learning. We propose to learn a meta policy and adapt it to new users with only a few trials of conversational recommendations. To facilitate policy adaptation, we design three synergetic components. First is a meta-exploration policy dedicated to identifying user preferences via exploratory conversations. Second is a Transformer-based state encoder that models both the positive and negative feedback a user gives during the conversation. And third is an adaptive item recommender based on the embedded states. Extensive experiments on three datasets demonstrate the advantage of our solution in serving new users, compared with a rich set of state-of-the-art CRS solutions.
Symbolic Expression Transformer: A Computer Vision Approach for Symbolic Regression
Authors: Jiachen Li, Ye Yuan, Hong-Bin Shen
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract
Symbolic Regression (SR) is a type of regression analysis that automatically finds the mathematical expression that best fits the data. Currently, SR still relies largely on various search strategies, so that a sample-specific model has to be optimized for every expression, which significantly limits the model's generalization and efficiency. Inspired by the fact that human beings can infer a mathematical expression from its curve, we propose the Symbolic Expression Transformer (SET), a sample-agnostic model for SR from the perspective of computer vision. Specifically, the collected data are represented as images, and an image captioning model is employed to translate images into symbolic expressions. A large-scale dataset without overlap between training and testing sets in the image domain is released. Our results demonstrate the effectiveness of SET and suggest a promising direction for image-based models to solve the challenging SR problem.
Unsupervised Difference Learning for Noisy Rigid Image Alignment
Authors: Yu-Xuan Chen, Dagan Feng, Hong-Bin Shen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Rigid image alignment is a fundamental task in computer vision, but traditional algorithms are either too sensitive to noise or time-consuming. Recent unsupervised image alignment methods based on spatial transformer networks show improved performance on clean images but do not achieve satisfactory performance on noisy images due to their heavy reliance on pixel value comparisons. To handle such challenging applications, we report a new unsupervised difference learning (UDL) strategy and apply it to rigid image alignment. UDL exploits the quantitative properties of regression tasks and converts the original unsupervised problem into a pseudo-supervised problem. Under the new UDL-based image alignment pipeline, rotation can be accurately estimated on both clean and noisy images, and translations can then be easily solved. Experimental results on both natural and cryo-EM images demonstrate the efficacy of our UDL-based unsupervised rigid image alignment method.
Community Question Answering Entity Linking via Leveraging Auxiliary Data
Authors: Yuhan Li, Wei Shen, Jianbo Gao, Yadong Wang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Abstract
Community Question Answering (CQA) platforms contain plenty of CQA texts (i.e., questions and the answers corresponding to them) where named entities appear ubiquitously. In this paper, we define a new task of CQA entity linking (CQAEL): linking the textual entity mentions detected in CQA texts with their corresponding entities in a knowledge base. This task can facilitate many downstream applications, including expert finding and knowledge base enrichment. Traditional entity linking methods mainly focus on linking entities in news documents and are suboptimal for this new task of CQAEL, since they cannot effectively leverage the various informative auxiliary data available on a CQA platform to aid entity linking, such as parallel answers and two types of meta-data (i.e., topic tags and users). To remedy this crucial issue, we propose a novel transformer-based framework to effectively harness the knowledge delivered by different kinds of auxiliary data to promote linking performance. We validate the superiority of our framework through extensive experiments over a newly released CQAEL data set against state-of-the-art entity linking methods.
Multi-Level Modeling Units for End-to-End Mandarin Speech Recognition
Authors: Yuting Yang, Binbin Du, Yuke Li
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Abstract
The choice of modeling units affects the performance of acoustic modeling and plays an important role in automatic speech recognition (ASR). In Mandarin scenarios, Chinese characters represent meaning but are not directly related to pronunciation. Thus, considering only the written form of Chinese characters as modeling units is insufficient to capture speech features. In this paper, we present a novel method involving multi-level modeling units, which integrates multi-level information for Mandarin speech recognition. Specifically, the encoder block uses syllables as modeling units, and the decoder block deals with character modeling units. During inference, the input feature sequences are converted into syllable sequences by the encoder block and then converted into Chinese characters by the decoder block. This process is conducted by a unified end-to-end model without introducing additional conversion models. By introducing the InterCE auxiliary task, our method achieves competitive results, with CERs of 4.1%/4.6% and 4.6%/5.2% on the widely used AISHELL-1 benchmark without a language model, using the Conformer and Transformer backbones respectively.
Analysing the Greek Parliament Records with Emotion Classification
Abstract
In this project, we tackle emotion classification for the Greek language, presenting and releasing a new dataset in Greek. We fine-tune and assess Transformer-based masked language models that were pre-trained on monolingual and multilingual resources, and we present the results per emotion and by aggregating at the sentiment and subjectivity levels. The potential of the presented resources is investigated by detecting and studying the emotion of 'disgust' in the Greek Parliament records. We: (a) locate the months with the highest values from 1989 to the present, (b) rank the Greek political parties based on the presence of this emotion in their speeches, and (c) study the emotional context shift of words used to stigmatise people.
RetroMAE: Pre-training Retrieval-oriented Transformers via Masked Auto-Encoder
Abstract
Pre-trained models have demonstrated superior power on many important tasks. However, designing effective pre-training strategies that promote the models' usability in dense retrieval remains an open problem. In this paper, we propose a novel pre-training framework for dense retrieval based on the Masked Auto-Encoder, known as RetroMAE. Our proposed framework is highlighted for the following critical designs: 1) a MAE-based pre-training workflow, where the input sentence is polluted on both the encoder and decoder sides with different masks, and the original sentence is reconstructed based on both the sentence embedding and the masked sentence; 2) asymmetric model architectures, with a large-scale expressive transformer for sentence encoding and an extremely simplified transformer for sentence reconstruction; 3) asymmetric masking ratios, with moderate masking on the encoder side (15%) and an aggressive masking ratio on the decoder side (50-90%). We pre-train a BERT-like encoder on English Wikipedia and BookCorpus, where it notably outperforms the existing pre-trained models on a wide range of dense retrieval benchmarks, such as MS MARCO, open-domain question answering, and BEIR.
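Design 3) is easy to sketch: two independent mask patterns over the same sentence, moderate on the encoder side and aggressive on the decoder side.

```python
import numpy as np

def asymmetric_masks(num_tokens, enc_ratio=0.15, dec_ratio=0.6, seed=None):
    """Draw independent encoder and decoder mask patterns over one sentence.
    Returns boolean arrays where True marks a masked position."""
    rng = np.random.default_rng(seed)
    enc_mask = rng.random(num_tokens) < enc_ratio   # moderate masking
    dec_mask = rng.random(num_tokens) < dec_ratio   # aggressive masking
    return enc_mask, dec_mask
```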
Privacy-Preserving Image Classification Using Vision Transformer
Abstract
In this paper, we propose a privacy-preserving image classification method based on the combined use of encrypted images and the vision transformer (ViT). The proposed method allows us not only to apply images without visual information to ViT models for both training and testing but also to maintain high classification accuracy. ViT utilizes patch embedding and position embedding for image patches, so this architecture is shown to reduce the influence of block-wise image transformation. In an experiment, the proposed method for privacy-preserving image classification is demonstrated to outperform state-of-the-art methods in terms of classification accuracy and robustness against various attacks.
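A common block-wise transformation in this line of work is a keyed block shuffle, which ViT's patch embedding tolerates comparatively well; the sketch below is one such transform, not necessarily the paper's exact scheme.

```python
import numpy as np

def blockwise_encrypt(img, block=16, key=0):
    """Shuffle fixed-size pixel blocks of an (H, W, C) image with a secret key."""
    h, w, c = img.shape
    gh, gw = h // block, w // block
    tiles = (img[:gh * block, :gw * block]
             .reshape(gh, block, gw, block, c)
             .swapaxes(1, 2)
             .reshape(gh * gw, block, block, c))
    perm = np.random.default_rng(key).permutation(gh * gw)   # key-derived permutation
    return (tiles[perm]
            .reshape(gh, gw, block, block, c)
            .swapaxes(1, 2)
            .reshape(gh * block, gw * block, c))
```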
PoeLM: A Meter- and Rhyme-Controllable Language Model for Unsupervised Poetry Generation
Abstract
Formal verse poetry imposes strict constraints on the meter and rhyme scheme of poems. Most prior work on generating this type of poetry uses existing poems for supervision, which are difficult to obtain for most languages and poetic forms. In this work, we propose an unsupervised approach to generate poems following any given meter and rhyme scheme, without requiring any poetic text for training. Our method works by splitting a regular, non-poetic corpus into phrases, prepending control codes that describe the length and end rhyme of each phrase, and training a transformer language model on the augmented corpus. During inference, we build control codes for the desired meter and rhyme scheme and condition our language model on them to generate formal verse poetry. Experiments in Spanish and Basque show that our approach is able to generate valid poems, which are often comparable in quality to those written by humans.
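The control-code construction can be sketched directly; the code format and the crude three-letter rhyme key below are illustrative assumptions, as the paper defines its own scheme.

```python
def add_control_codes(phrase, n_syllables):
    """Prepend length and end-rhyme control codes to a corpus phrase."""
    last_word = phrase.rstrip(".,;:!?").split()[-1].lower()
    return f"<len:{n_syllables}> <rhyme:{last_word[-3:]}> {phrase}"

# add_control_codes("the woods are lovely dark and deep", 8)
# -> '<len:8> <rhyme:eep> the woods are lovely dark and deep'
```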
ASSET: Autoregressive Semantic Scene Editing with Transformers at High Resolutions
Authors: Difan Liu, Sandesh Shetty, Tobias Hinz, Matthew Fisher, Richard Zhang, Taesung Park, Evangelos Kalogerakis
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Abstract
We present ASSET, a neural architecture for automatically modifying an input high-resolution image according to a user's edits on its semantic segmentation map. Our architecture is based on a transformer with a novel attention mechanism. Our key idea is to sparsify the transformer's attention matrix at high resolutions, guided by dense attention extracted at lower image resolutions. While previous attention mechanisms are computationally too expensive for handling high-resolution images or are overly constrained within specific image regions hampering long-range interactions, our novel attention mechanism is both computationally efficient and effective. Our sparsified attention mechanism is able to capture long-range interactions and context, leading to synthesizing interesting phenomena in scenes, such as reflections of landscapes onto water or flora consistent with the rest of the landscape, that were not possible to generate reliably with previous convnets and transformer approaches. We present qualitative and quantitative results, along with user studies, demonstrating the effectiveness of our method.
TALM: Tool Augmented Language Models
Authors: Aaron Parisi, Yao Zhao, Noah Fiedel
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Abstract
Transformer based language models (LMs) demonstrate increasing performance with scale across a wide variety of tasks. Scale alone however cannot enable models to solve tasks that require access to ephemeral, changing, or private data that was unavailable at training time. Many useful tasks may also benefit from LMs being able to access APIs that read or modify state. In this work, we present Tool Augmented Language Models (TALM), combining a text-only approach to augment language models with non-differentiable tools, and an iterative "self-play" technique to bootstrap performance starting from few tool demonstrations. TALM exhibits strong performance on both a knowledge-heavy QA task and a reasoning oriented math task with simple tools. At a given model scale, TALM significantly outperforms non-augmented LMs. We further demonstrate that TALM successfully performs out-of-distribution inferences on both QA and math tasks, where non-augmented LMs fail. Our results suggest that Tool Augmented Language Models are a promising direction to enrich LMs' capabilities, with less dependence on scale.
History Compression via Language Models in Reinforcement Learning
Authors: Fabian Paischer, Thomas Adler, Vihang Patil, Angela Bitto-Nemling, Markus Holzleitner, Sebastian Lehner, Hamid Eghbal-zadeh, Sepp Hochreiter
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
Abstract
In a partially observable Markov decision process (POMDP), an agent typically uses a representation of the past to approximate the underlying MDP. We propose to utilize a frozen Pretrained Language Transformer (PLT) for history representation and compression to improve sample efficiency. To avoid training of the Transformer, we introduce FrozenHopfield, which automatically associates observations with original token embeddings. To form these associations, a modern Hopfield network stores the original token embeddings, which are retrieved by queries that are obtained by a random but fixed projection of observations. Our new method, HELM, enables actor-critic network architectures that contain a pretrained language Transformer for history representation as a memory module. Since a representation of the past need not be learned, HELM is much more sample efficient than competitors. On Minigrid and Procgen environments HELM achieves new state-of-the-art results. Our code is available at https://github.com/ml-jku/helm.
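FrozenHopfield as described reduces to a few lines: a fixed random projection forms the query, and softmax-weighted retrieval over the stored token embeddings returns the association; shapes and the inverse temperature below are illustrative.

```python
import numpy as np

def frozen_hopfield(obs, token_embeddings, projection, beta=8.0):
    """Associate an observation with pretrained token embeddings.
    obs: (obs_dim,); token_embeddings: (vocab, d_model);
    projection: (d_model, obs_dim), random but fixed."""
    q = projection @ obs                       # project observation into embedding space
    sims = token_embeddings @ q                # dot-product similarities
    w = np.exp(beta * (sims - sims.max()))     # numerically stable softmax
    w /= w.sum()
    return w @ token_embeddings                # convex combination = retrieved embedding
```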
Keyword: SLAM
There is no result
Keyword: odometry
LOCUS 2.0: Robust and Computationally Efficient Lidar Odometry for Real-Time Underground 3D Mapping
Keyword: livox
There is no result
Keyword: loam
There is no result
Keyword: lidar
LOCUS 2.0: Robust and Computationally Efficient Lidar Odometry for Real-Time Underground 3D Mapping
Robust 3D Object Detection in Cold Weather Conditions
Keyword: loop detection
There is no result
Keyword: autonomous driving
Towards Model Generalization for Monocular 3D Object Detection
Collaborative 3D Object Detection for Automatic Vehicle Systems via Learnable Communications
Real-Time Trajectory Planning for Autonomous Driving with Gaussian Process and Incremental Refinement
Memory based neural networks for end-to-end autonomous driving
Learning to Drive Using Sparse Imitation Reinforcement Learning
Keyword: mapping
VPAIR -- Aerial Visual Place Recognition and Localization in Large-scale Outdoor Environments
Utilizing Language-Image Pretraining for Efficient and Robust Bilingual Word Alignment
Fast GPU bounding boxes on tree-structured scenes
LOCUS 2.0: Robust and Computationally Efficient Lidar Odometry for Real-Time Underground 3D Mapping
NFL: Robust Learned Index via Distribution Transformation
Approximation speed of quantized vs. unquantized ReLU neural networks and beyond
On statistic alignment for domain adaptation in structural health monitoring
Keyword: localization
VPAIR -- Aerial Visual Place Recognition and Localization in Large-scale Outdoor Environments
TransforMatcher: Match-to-Match Attention for Semantic Correspondence
Ranking-Based Siamese Visual Tracking
LOCUS 2.0: Robust and Computationally Efficient Lidar Odometry for Real-Time Underground 3D Mapping
OnePose: One-Shot Object Pose Estimation without CAD Models
Keyword: transformer
Simple Recurrence Improves Masked Language Models
Towards Opening the Black Box of Neural Machine Translation: Source and Target Interpretations of the Transformer
TransforMatcher: Match-to-Match Attention for Semantic Correspondence
FlexiBERT: Are Current Transformer Architectures too Homogeneous and Rigid?
Workflow Discovery from Dialogues in the Low Data Regime
SCVRL: Shuffled Contrastive Video Representation Learning
ItemSage: Learning Product Embeddings for Shopping Recommendations at Pinterest
PERT: A New Solution to Pinyin to Character Conversion Task
BabyBear: Cheap inference triage for expensive language models
UMSNet: An Universal Multi-sensor Network for Human Activity Recognition
Meta Policy Learning for Cold-Start Conversational Recommendation
Symbolic Expression Transformer: A Computer Vision Approach for Symbolic Regression
Unsupervised Difference Learning for Noisy Rigid Image Alignment
Community Question Answering Entity Linking via Leveraging Auxiliary Data
Multi-Level Modeling Units for End-to-End Mandarin Speech Recognition
Analysing the Greek Parliament Records with Emotion Classification
RetroMAE: Pre-training Retrieval-oriented Transformers via Masked Auto-Encoder
Privacy-Preserving Image Classification Using Vision Transformer
PoeLM: A Meter- and Rhyme-Controllable Language Model for Unsupervised Poetry Generation
ASSET: Autoregressive Semantic Scene Editing with Transformers at High Resolutions
TALM: Tool Augmented Language Models
History Compression via Language Models in Reinforcement Learning