New submissions for Wed, 29 Jun 22
Keyword: SLAM
There is no result
Keyword: odometry
Position-Agnostic Autonomous Navigation in Vineyards with Deep Reinforcement Learning
Abstract
Precision agriculture is rapidly attracting research aimed at efficiently introducing automation and robotics solutions to support agricultural activities. Robotic navigation in vineyards and orchards offers competitive advantages in autonomously monitoring and easily accessing crops for harvesting, spraying and performing other necessary, time-consuming tasks. Nowadays, autonomous navigation algorithms exploit expensive sensors which also incur a heavy computational cost for data processing. Moreover, vineyard rows represent a challenging outdoor scenario where GPS and Visual Odometry techniques often struggle to provide reliable positioning information. In this work, we combine Edge AI with Deep Reinforcement Learning to propose a cutting-edge, lightweight solution to the problem of autonomous vineyard navigation that does not rely on precise localization data and replaces task-tailored algorithms with a flexible learning-based approach. We train an end-to-end sensorimotor agent which directly maps noisy depth images and position-agnostic robot state information to velocity commands and guides the robot to the end of a row, continuously adjusting its heading for a collision-free central trajectory. Our extensive experimentation in realistic simulated vineyards demonstrates the effectiveness of our solution and the generalization capabilities of our agent.
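As a rough illustration of the end-to-end sensorimotor mapping described above (a noisy depth image plus a position-agnostic state vector in, linear and angular velocity commands out), a minimal PyTorch policy network could look like the sketch below. The layer sizes, input resolution and two-branch fusion are assumptions for illustration, not the authors' architecture.

```python
import torch
import torch.nn as nn

class DepthStatePolicy(nn.Module):
    """Illustrative actor: fuses a depth image with a robot-state vector
    and outputs (linear, angular) velocity commands."""

    def __init__(self, state_dim=4):
        super().__init__()
        # small CNN over a single-channel depth image (assumed 64x64 input)
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        with torch.no_grad():
            feat_dim = self.encoder(torch.zeros(1, 1, 64, 64)).shape[1]
        self.head = nn.Sequential(
            nn.Linear(feat_dim + state_dim, 128), nn.ReLU(),
            nn.Linear(128, 2), nn.Tanh(),          # normalized velocities in [-1, 1]
        )

    def forward(self, depth, state):
        z = self.encoder(depth)
        return self.head(torch.cat([z, state], dim=1))

policy = DepthStatePolicy()
v = policy(torch.rand(1, 1, 64, 64), torch.rand(1, 4))  # -> tensor of shape (1, 2)
```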
Keyword: livox
There is no result
Keyword: loam
There is no result
Keyword: lidar
Accurate and Real-time Pseudo Lidar Detection: Is Stereo Neural Network Really Necessary?
Authors: Haitao Meng, Changcai Li, Gang Chen, Alois Knoll
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
The proposal of the Pseudo-Lidar representation has significantly narrowed the gap between visual-based and active Lidar-based 3D object detection. However, current research focuses exclusively on pushing the accuracy of Pseudo-Lidar by taking advantage of complex and time-consuming neural networks, and seldom explores the intrinsic characteristics of the Pseudo-Lidar representation to uncover opportunities for improvement. In this paper, we dive deep into the Pseudo-Lidar representation and argue that the performance of 3D object detection does not fully depend on high-precision stereo depth estimation. We demonstrate that, even with unreliable depth estimation, proper data processing and refinement can achieve comparable 3D object detection accuracy. With this finding, we further show the possibility of utilizing fast but inaccurate stereo matching algorithms in the Pseudo-Lidar system to achieve low-latency responsiveness. In the experiments, we develop a system with a less powerful stereo matching predictor and adopt the proposed refinement schemes to improve the accuracy. The evaluation on the KITTI benchmark shows that the presented system achieves accuracy competitive with state-of-the-art approaches with only 23 ms of computation, showing it is a suitable candidate for deployment in real in-vehicle applications.
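For readers unfamiliar with the representation, the core of any Pseudo-Lidar pipeline is back-projecting a (possibly noisy) depth map into a 3D point cloud using the camera intrinsics; the refinement schemes discussed above operate on such a cloud. A minimal NumPy sketch of that conversion follows; the intrinsic values in the example are placeholders, not taken from the paper.

```python
import numpy as np

def depth_to_pseudo_lidar(depth, fx, fy, cx, cy):
    """Back-project an HxW depth map (meters) into an (N, 3) point cloud in the
    camera frame: X = (u - cx) * d / fx, Y = (v - cy) * d / fy, Z = d."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]          # drop invalid (zero-depth) pixels

# toy example with placeholder KITTI-like intrinsics
cloud = depth_to_pseudo_lidar(np.random.uniform(1, 80, (375, 1242)),
                              fx=721.5, fy=721.5, cx=609.6, cy=172.9)
```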
Keyword: loop detection
There is no result
Keyword: nerf
There is no result
Keyword: mapping
BeamsNet: A data-driven Approach Enhancing Doppler Velocity Log Measurements for Autonomous Underwater Vehicle Navigation
Authors: Nadav Cohen, Itzik Klein
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Signal Processing (eess.SP); Systems and Control (eess.SY)
Abstract
Autonomous underwater vehicles (AUVs) perform various applications such as seafloor mapping and underwater structure health monitoring. Commonly, an inertial navigation system aided by a Doppler velocity log (DVL) is used to provide the vehicle's navigation solution. In such a fusion, the DVL provides the velocity vector of the AUV, which determines the navigation solution's accuracy and helps estimate the navigation states. This paper proposes BeamsNet, an end-to-end deep learning framework that regresses the estimated DVL velocity vector, improves the accuracy of the velocity vector estimate, and could replace the model-based approach. Two versions of BeamsNet, differing in their input to the network, are suggested. The first uses the current DVL beam measurements and inertial sensor data, while the other utilizes only DVL data, taking the current and past DVL measurements for the regression process. Both simulation and sea experiments were conducted to validate the proposed learning approach relative to the model-based approach. The sea experiments were carried out with the Snapir AUV in the Mediterranean Sea, collecting approximately four hours of DVL and inertial sensor data. Our results show that the proposed approach achieves an improvement of more than 60% in estimating the DVL velocity vector.
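For context, the model-based baseline that BeamsNet aims to improve maps the four slanted DVL beam velocities to the vehicle body velocity through the known beam geometry and a least-squares solve. The NumPy sketch below illustrates that classical step; the Janus beam angles used are typical values and not necessarily those of the Snapir AUV.

```python
import numpy as np

def dvl_beam_matrix(pitch_deg=20.0):
    """Direction matrix H (4x3) of a Janus-configured DVL: beam i measures
    b_i = h_i . v_body, with beams tilted `pitch_deg` from the vertical."""
    t = np.deg2rad(pitch_deg)
    yaws = np.deg2rad([45.0, 135.0, 225.0, 315.0])
    return np.stack([[np.cos(p) * np.sin(t), np.sin(p) * np.sin(t), np.cos(t)]
                     for p in yaws])

def beams_to_velocity(beams, H):
    """Model-based estimate: least-squares inversion of b = H v."""
    v, *_ = np.linalg.lstsq(H, beams, rcond=None)
    return v

H = dvl_beam_matrix()
v_true = np.array([1.0, 0.2, -0.05])
beams = H @ v_true + 0.01 * np.random.randn(4)   # noisy beam measurements
print(beams_to_velocity(beams, H))                # ~ v_true
```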
Towards Global-Scale Crowd+AI Techniques to Map and Assess Sidewalks for People with Disabilities
Authors: Maryam Hosseini, Mikey Saugstad, Fabio Miranda, Andres Sevtsuk, Claudio T. Silva, Jon E. Froehlich
Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Abstract
There is a lack of data on the location, condition, and accessibility of sidewalks across the world, which not only impacts where and how people travel but also fundamentally limits interactive mapping tools and urban analytics. In this paper, we describe initial work in semi-automatically building a sidewalk network topology from satellite imagery using hierarchical multi-scale attention models, inferring surface materials from street-level images using active learning-based semantic segmentation, and assessing sidewalk condition and accessibility features using Crowd+AI. We close with a call to create a database of labeled satellite and streetscape scenes for sidewalks and sidewalk accessibility issues along with standardized benchmarks.
3D Multi-Object Tracking with Differentiable Pose Estimation
Authors: Dominik Schmauser, Zeju Qiu, Norman Müller, Matthias Nießner
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
We propose a novel approach for joint 3D multi-object tracking and reconstruction from RGB-D sequences in indoor environments. To this end, we detect and reconstruct objects in each frame while predicting dense correspondence mappings into a normalized object space. We leverage those correspondences to inform a graph neural network to solve for the optimal, temporally-consistent 7-DoF pose trajectories of all objects. The novelty of our method is two-fold: first, we propose a new graph-based approach for differentiable pose estimation over time to learn optimal pose trajectories; second, we present a joint formulation of reconstruction and pose estimation along the time axis for robust and geometrically consistent multi-object tracking. In order to validate our approach, we introduce a new synthetic dataset comprising 2381 unique indoor sequences with a total of 60k rendered RGB-D images for multi-object tracking with moving objects and camera positions derived from the synthetic 3D-FRONT dataset. We demonstrate that our method improves the accumulated MOTA score for all test sequences by 24.8% over existing state-of-the-art methods. In several ablations on synthetic and real-world sequences, we show that our graph-based, fully end-to-end-learnable approach yields a significant boost in tracking performance.
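Given per-object correspondences between observed points and a normalized object space, one standard closed-form way to recover a 7-DoF pose (scale, rotation, translation) is the Umeyama alignment. The paper learns this step differentiably through a graph network instead, but the NumPy sketch below shows what a single-frame, non-learned solve of the same problem looks like.

```python
import numpy as np

def umeyama_7dof(src, dst):
    """Closed-form similarity transform (s, R, t) with dst ~ s * R @ src + t.
    src, dst: (N, 3) corresponding points (e.g., normalized-space vs. camera-space)."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    src_c, dst_c = src - mu_s, dst - mu_d
    cov = dst_c.T @ src_c / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1                      # reflection correction
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / src_c.var(0).sum()
    t = mu_d - s * R @ mu_s
    return s, R, t

# toy check: pure scale + translation should be recovered exactly
src = np.random.randn(50, 3)
dst = 1.5 * src + np.array([0.3, -0.2, 1.0])
print(umeyama_7dof(src, dst))             # s ~ 1.5, R ~ I, t ~ (0.3, -0.2, 1.0)
```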
Physical Layer Abstraction Model for RadioWeaves
Authors: Rimalapudi Sarvendranath, Unnikrishnan Kunnath Ganesan, Zakir Hussain Shaik, Erik G. Larsson
Abstract
RadioWeaves, in which distributed antennas with integrated radio and compute resources serve a large number of users, is envisioned to provide high data rates in next generation wireless systems. In this paper, we develop a physical layer abstraction model to evaluate the performance of different RadioWeaves deployment scenarios. This model helps speed up system-level simulators of the RadioWeaves and is made up of two blocks. The first block generates a vector of signal-to-interference-plus-noise ratios (SINRs) corresponding to each coherence block, and the second block predicts the packet error rate corresponding to the SINRs generated. The vector of SINRs generated depends on different parameters such as the number of users, user locations, antenna configurations, and precoders. We have also considered different antenna gain patterns, such as omni-directional and directional microstrip patch antennas. Our model exploits the benefits of exponential effective SINR mapping (EESM). We study the robustness and accuracy of the EESM for RadioWeaves.
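The exponential effective SINR mapping (EESM) mentioned above compresses a vector of per-block SINRs gamma_n into a single effective SINR, SINR_eff = -beta * ln((1/N) * sum_n exp(-gamma_n / beta)), which is then looked up against an AWGN error-rate curve. A minimal sketch follows; beta is a calibration parameter fitted per modulation and coding scheme, and the values used here are placeholders.

```python
import numpy as np

def eesm(sinrs_db, beta):
    """Exponential Effective SINR Mapping.
    sinrs_db: per-subcarrier/coherence-block SINRs in dB; beta: calibrated scaling."""
    sinrs = 10 ** (np.asarray(sinrs_db) / 10)           # to linear scale
    eff = -beta * np.log(np.mean(np.exp(-sinrs / beta)))
    return 10 * np.log10(eff)                            # back to dB

print(eesm([8.0, 12.0, 5.0, 15.0], beta=3.0))            # single effective SINR in dB
```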
Primitive Graph Learning for Unified Vector Mapping
Authors: Lei Wang, Min Dai, Jianan He, Jingwei Huang, Mingwei Sun
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Large-scale vector mapping is important for transportation, city planning, and survey and census. We propose GraphMapper, a unified framework for end-to-end vector map extraction from satellite images. Our key idea is a novel unified representation of shapes of different topologies named "primitive graph", which is a set of shape primitives and their pairwise relationship matrix. We then convert vector shape prediction, regularization, and topology reconstruction into a single primitive graph learning problem. Specifically, GraphMapper is a generic primitive graph learning network based on global shape context modelling through multi-head attention. An embedding space sorting method is developed for accurate primitive relationship modelling. We empirically demonstrate the effectiveness of GraphMapper on two challenging mapping tasks, building footprint regularization and road network topology reconstruction. Our model outperforms state-of-the-art methods by 8-10% in both tasks on public benchmarks. All code will be publicly available.
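To make the "primitive graph" idea concrete, a polygon such as a building footprint can be written as a set of vertex primitives plus a pairwise relationship matrix. The sketch below is only an illustration of such a container; the actual primitive features and relation types learned by GraphMapper are not specified here.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class PrimitiveGraph:
    """A set of shape primitives and their pairwise relationship matrix."""
    primitives: np.ndarray   # (N, D) primitive features, e.g. vertex coordinates
    relations: np.ndarray    # (N, N) relation matrix, e.g. 1 if connected by an edge

def polygon_to_primitive_graph(vertices):
    """Encode a closed polygon as vertex primitives with a cyclic adjacency."""
    n = len(vertices)
    rel = np.zeros((n, n), dtype=int)
    for i in range(n):
        rel[i, (i + 1) % n] = rel[(i + 1) % n, i] = 1
    return PrimitiveGraph(np.asarray(vertices, dtype=float), rel)

g = polygon_to_primitive_graph([(0, 0), (4, 0), (4, 3), (0, 3)])  # a rectangular footprint
```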
Show Me Your Face, And I'll Tell You How You Speak
Abstract
When we speak, the prosody and content of the speech can be inferred from the movement of our lips. In this work, we explore the task of lip-to-speech synthesis, i.e., learning to generate speech given only the lip movements of a speaker. We focus on learning accurate lip-to-speech mappings for multiple speakers in unconstrained, large-vocabulary settings. We capture the speaker's voice identity through their facial characteristics, i.e., age, gender and ethnicity, and condition on them along with the lip movements to generate speaker-identity-aware speech. To this end, we present a novel method, "Lip2Speech", with key design choices to achieve accurate lip-to-speech synthesis in unconstrained scenarios. We also perform various experiments and extensive evaluation using quantitative and qualitative metrics and human evaluation.
Taxonomy and evolution predicting using deep learning in images
Authors: Jiewen Xiao, Wenbin Liao, Ming Zhang, Jing Wang, Jianxin Wang, Yihua Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Molecular and morphological characters, as important parts of biological taxonomy, are contradictory but need to be integrated. Organism image recognition and bioinformatics are emerging, active research areas, but a gap remains between them. In this work, a multi-branching recognition framework mediated by genetic information bridges this barrier, establishing a link between the macro-morphology and micro-molecular information of mushrooms. A novel multi-perspective structure is proposed to fuse the feature images from three branching models, which significantly improves recognition accuracy by about 10%, to more than 90%. Further, genetic information is incorporated into the mushroom image recognition task by using genetic distance embeddings as the representation space for predicting image distances and identifying species. Semantic overfitting in traditional classification tasks and the granularity of fine-grained image recognition are also discussed in depth for the first time. The generalizability of the model was investigated in fine-grained scenarios using zero-shot learning tasks, which can predict the taxonomic and evolutionary information of unseen samples. We present the first method to map images to DNA: an encoder maps images to genetic distances, and a pre-trained decoder then decodes the DNA, reaching a total test accuracy of 87.45% for DNA prediction on 37 species. This study creates a novel recognition framework by systematically studying the mushroom image recognition problem, bridging the gap between macroscopic biological information and microscopic molecular information, and will provide a new reference for intelligent biometrics in the future.
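One way to read the "genetic distance embedding" idea is that each species receives a coordinate in a space whose pairwise distances approximate genetic distances, an image encoder regresses onto those coordinates, and an unseen image is assigned to the species with the nearest embedding. The sketch below (classical multidimensional scaling plus nearest-neighbour assignment) is an assumed, simplified reading, not the paper's encoder-decoder pipeline.

```python
import numpy as np

def species_embeddings(genetic_dist, dim=8):
    """Classical MDS: turn a (S, S) genetic-distance matrix into per-species coordinates
    whose Euclidean distances approximate the genetic distances."""
    n = len(genetic_dist)
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (genetic_dist ** 2) @ J
    w, v = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:dim]
    return v[:, idx] * np.sqrt(np.clip(w[idx], 0, None))

def predict_species(image_embedding, species_coords):
    """Assign the species whose genetic-distance coordinate is nearest to the
    embedding regressed from an image (encoder not shown, assumed given)."""
    return int(np.argmin(np.linalg.norm(species_coords - image_embedding, axis=1)))
```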
Keyword: localization
How Many Events do You Need? Event-based Visual Place Recognition Using Sparse But Varying Pixels
Abstract
Event cameras continue to attract interest due to desirable characteristics such as high dynamic range, low latency, virtually no motion blur, and high energy efficiency. One of the potential applications of event camera research lies in visual place recognition for robot localization, where a query observation has to be matched to the corresponding reference place in the database. In this letter, we explore the distinctiveness of event streams from a small subset of pixels (in the tens or hundreds). We demonstrate that the absolute difference in the number of events at those pixel locations accumulated into event frames can be sufficient for the place recognition task, when pixels that display large variations in the reference set are used. Using such sparse (over image coordinates) but varying (variance over the number of events per pixel location) pixels enables frequent and computationally cheap updates of the location estimates. Furthermore, when event frames contain a constant number of events, our method takes full advantage of the event-driven nature of the sensory stream and displays promising robustness to changes in velocity. We evaluate our proposed approach on the Brisbane-Event-VPR dataset in an outdoor driving scenario, as well as the newly contributed indoor QCR-Event-VPR dataset that was captured with a DAVIS346 camera mounted on a mobile robotic platform. Our results show that our approach achieves competitive performance when compared to several baseline methods on those datasets, and is particularly well suited for compute- and energy-constrained platforms such as interplanetary rovers.
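The matching step described above is simple enough to write down directly: pick the reference pixels whose event counts vary most across the reference set, then match a query event frame to the reference place with the smallest sum of absolute differences over just those pixels. A hedged NumPy sketch, not the authors' code:

```python
import numpy as np

def select_varying_pixels(ref_frames, k=100):
    """ref_frames: (P, H, W) per-place event-count frames. Return indices of the
    k pixels whose counts vary most across the reference set."""
    var = ref_frames.reshape(len(ref_frames), -1).var(axis=0)
    return np.argsort(var)[-k:]

def match_place(query_frame, ref_frames, pixel_idx):
    """Return the reference place with the smallest sum of absolute differences
    in event counts over the selected sparse pixels."""
    q = query_frame.ravel()[pixel_idx]
    refs = ref_frames.reshape(len(ref_frames), -1)[:, pixel_idx]
    return int(np.argmin(np.abs(refs - q).sum(axis=1)))
```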
Improving Worst Case Visual Localization Coverage via Place-specific Sub-selection in Multi-camera Systems
Authors: Stephen Hausler, Ming Xu, Sourav Garg, Punarjay Chakravarty, Shubham Shrivastava, Ankit Vora, Michael Milford
Abstract
6-DoF visual localization systems utilize principled approaches rooted in 3D geometry to perform accurate camera pose estimation of images to a map. Current techniques use hierarchical pipelines and learned 2D feature extractors to improve scalability and increase performance. However, despite gains in typical recall@0.25m type metrics, these systems still have limited utility for real-world applications like autonomous vehicles because of their 'worst' areas of performance - the locations where they provide insufficient recall at a certain required error tolerance. Here we investigate the utility of using 'place-specific configurations', where a map is segmented into a number of places, each with its own configuration for modulating the pose estimation step, in this case selecting a camera within a multi-camera system. On the Ford AV benchmark dataset, we demonstrate substantially improved worst-case localization performance compared to using off-the-shelf pipelines - minimizing the percentage of the dataset which has low recall at a certain error tolerance, as well as improved overall localization performance. Our proposed approach is particularly applicable to the crowdsharing model of autonomous vehicle deployment, where a fleet of AVs are regularly traversing a known route.
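A hedged sketch of the place-specific configuration idea: segment the route into places, record for each place which camera gave the best recall on a tuning traverse, and at query time let the current place index select the camera before running the standard localization pipeline. The function names and the recall criterion below are illustrative assumptions, not the authors' implementation.

```python
def build_place_camera_table(recall_per_place_camera):
    """recall_per_place_camera: dict {place_id: {camera_id: recall}} measured on a
    tuning traverse. Returns {place_id: best camera_id}."""
    return {p: max(cams, key=cams.get) for p, cams in recall_per_place_camera.items()}

def localize(query_images, place_id, table, localize_with_camera):
    """Pick the camera configured for this place, then run the usual 6-DoF pipeline
    (passed in as `localize_with_camera`, assumed given)."""
    cam = table[place_id]
    return localize_with_camera(query_images[cam], cam)

table = build_place_camera_table({0: {"front": 0.91, "rear": 0.84},
                                  1: {"front": 0.62, "rear": 0.88}})
# table -> {0: "front", 1: "rear"}
```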
Position-Agnostic Autonomous Navigation in Vineyards with Deep Reinforcement Learning
Abstract
Precision agriculture is rapidly attracting research aimed at efficiently introducing automation and robotics solutions to support agricultural activities. Robotic navigation in vineyards and orchards offers competitive advantages in autonomously monitoring and easily accessing crops for harvesting, spraying and performing other necessary, time-consuming tasks. Nowadays, autonomous navigation algorithms exploit expensive sensors which also incur a heavy computational cost for data processing. Moreover, vineyard rows represent a challenging outdoor scenario where GPS and Visual Odometry techniques often struggle to provide reliable positioning information. In this work, we combine Edge AI with Deep Reinforcement Learning to propose a cutting-edge, lightweight solution to the problem of autonomous vineyard navigation that does not rely on precise localization data and replaces task-tailored algorithms with a flexible learning-based approach. We train an end-to-end sensorimotor agent which directly maps noisy depth images and position-agnostic robot state information to velocity commands and guides the robot to the end of a row, continuously adjusting its heading for a collision-free central trajectory. Our extensive experimentation in realistic simulated vineyards demonstrates the effectiveness of our solution and the generalization capabilities of our agent.
Keyword: transformer
DeepPERF: A Deep Learning-Based Approach For Improving Software Performance
Authors: Spandan Garg, Roshanak Zilouchian Moghaddam, Colin B. Clement, Neel Sundaresan, Chen Wu
Abstract
Improving software performance is an important yet challenging part of the software development cycle. Today, the majority of performance inefficiencies are identified and patched by performance experts. Recent advancements in deep learning approaches and the widespread availability of open source data create a great opportunity to automate the identification and patching of performance problems. In this paper, we present DeepPERF, a transformer-based approach to suggest performance improvements for C# applications. We pretrain DeepPERF on English and source code corpora and then finetune it for the task of generating performance improvement patches for C# applications. Our evaluation shows that our model can generate the same performance improvement suggestion as the developer fix in ~53% of the cases, getting ~34% of them verbatim, in our expert-verified dataset of performance changes made by C# developers. Additionally, we evaluate DeepPERF on 50 open source C# repositories on GitHub using both benchmark and unit tests and find that our model is able to suggest valid performance improvements that can improve both CPU usage and memory allocations. So far we have submitted 19 pull requests with 28 different performance optimizations, and 11 of these PRs have been approved by the project owners.
TTS-CGAN: A Transformer Time-Series Conditional GAN for Biosignal Data Augmentation
Authors: Xiaomin Li, Anne Hee Hiong Ngu, Vangelis Metsis
Abstract
Signal measurement appearing in the form of time series is one of the most common types of data used in medical machine learning applications. Such datasets are often small in size, expensive to collect and annotate, and might involve privacy issues, which hinders our ability to train large, state-of-the-art deep learning models for biomedical applications. For time-series data, the suite of data augmentation strategies we can use to expand the size of the dataset is limited by the need to maintain the basic properties of the signal. Generative Adversarial Networks (GANs) can be utilized as another data augmentation tool. In this paper, we present TTS-CGAN, a transformer-based conditional GAN model that can be trained on existing multi-class datasets and generate class-specific synthetic time-series sequences of arbitrary length. We elaborate on the model architecture and design strategies. Synthetic sequences generated by our model are indistinguishable from real ones, and can be used to complement or replace real signals of the same type, thus achieving the goal of data augmentation. To evaluate the quality of the generated data, we modify the wavelet coherence metric to be able to compare the similarity between two sets of signals, and also conduct a case study where a mix of synthetic and real data are used to train a deep learning model for sequence classification. Together with other visualization techniques and qualitative evaluation approaches, we demonstrate that TTS-CGAN generated synthetic data are similar to real data, and that our model performs better than the other state-of-the-art GAN models built for time-series data generation.
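To make "class-conditional time-series generation" concrete, a minimal conditional generator concatenates a noise vector with a class embedding and decodes it into a (channels x length) sequence. The sketch below is a generic cGAN generator for illustration only; it is not the transformer architecture of TTS-CGAN, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class ConditionalSeqGenerator(nn.Module):
    """Toy conditional generator: (noise, class label) -> multichannel time series."""

    def __init__(self, n_classes, noise_dim=64, channels=3, seq_len=128):
        super().__init__()
        self.label_emb = nn.Embedding(n_classes, noise_dim)
        self.net = nn.Sequential(
            nn.Linear(2 * noise_dim, 256), nn.ReLU(),
            nn.Linear(256, channels * seq_len), nn.Tanh(),
        )
        self.channels, self.seq_len = channels, seq_len

    def forward(self, z, labels):
        x = torch.cat([z, self.label_emb(labels)], dim=1)
        return self.net(x).view(-1, self.channels, self.seq_len)

g = ConditionalSeqGenerator(n_classes=5)
fake = g(torch.randn(8, 64), torch.randint(0, 5, (8,)))  # (8, 3, 128) synthetic signals
```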
Tiny-Sepformer: A Tiny Time-Domain Transformer Network for Speech Separation
Authors: Jian Luo, Jianzong Wang, Ning Cheng, Edward Xiao, Xulong Zhang, Jing Xiao
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Abstract
Time-domain Transformer neural networks have proven their superiority in speech separation tasks. However, these models usually have a large number of network parameters, thus often encountering the problem of GPU memory explosion. In this paper, we propose Tiny-Sepformer, a tiny version of the Transformer network for speech separation. We present two techniques to reduce the model parameters and memory consumption: (1) a Convolution-Attention (CA) block, splitting the vanilla Transformer into two paths, multi-head attention and 1D depthwise separable convolution, and (2) parameter sharing, sharing the layer parameters within the CA block. In our experiments, Tiny-Sepformer greatly reduces the model size and achieves comparable separation performance to the vanilla Sepformer on the WSJ0-2/3Mix datasets.
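The Convolution-Attention block described in (1) can be pictured as two parallel paths over the same input, one multi-head self-attention and one 1D depthwise separable convolution, whose outputs are merged. The PyTorch sketch below is a hedged reading of that description; the exact merge, normalization and parameter-sharing details of Tiny-Sepformer may differ.

```python
import torch
import torch.nn as nn

class ConvAttentionBlock(nn.Module):
    """Two parallel paths: multi-head self-attention and a 1D depthwise separable conv."""

    def __init__(self, dim=128, heads=4, kernel=5):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.dw = nn.Conv1d(dim, dim, kernel, padding=kernel // 2, groups=dim)  # depthwise
        self.pw = nn.Conv1d(dim, dim, 1)                                        # pointwise
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                     # x: (batch, time, dim)
        a, _ = self.attn(x, x, x)
        c = self.pw(self.dw(x.transpose(1, 2))).transpose(1, 2)
        return self.norm(x + a + c)           # residual merge of both paths (assumed)

block = ConvAttentionBlock()
y = block(torch.randn(2, 200, 128))           # (2, 200, 128)
```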
Cross-Forgery Analysis of Vision Transformers and CNNs for Deepfake Image Detection
Abstract
Deepfake generation techniques are evolving at a rapid pace, making it possible to create realistic manipulated images and videos and endangering the serenity of modern society. The continual emergence of new and varied techniques brings with it a further problem to be faced, namely the ability of deepfake detection models to update themselves promptly in order to identify manipulations carried out using even the most recent methods. This is an extremely complex problem to solve, as training a model requires large amounts of data, which are difficult to obtain if the deepfake generation method is too recent. Moreover, continuously retraining a network would be unfeasible. In this paper, we ask ourselves if, among the various deep learning techniques, there is one that is able to generalise the concept of deepfake to such an extent that it does not remain tied to one or more specific deepfake generation methods used in the training set. We compared a Vision Transformer with an EfficientNetV2 in a cross-forgery context based on the ForgeryNet dataset. From our experiments, it emerges that EfficientNetV2 has a greater tendency to specialize, often obtaining better results on the generation methods seen during training, while Vision Transformers exhibit superior generalization, making them more competent even on images generated with new methodologies.
Accurate and fast identification of minimally prepared bacteria phenotypes using Raman spectroscopy assisted by machine learning
Authors: Benjamin Lundquist Thomsen, Jesper B. Christensen, Olga Rodenko, Iskander Usenov, Rasmus Birkholm Grønnemose, Thomas Emil Andersen, Mikael Lassen
Subjects: Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an); Optics (physics.optics)
Abstract
The worldwide increase of antimicrobial resistance (AMR) is a serious threat to human health. To avert the spread of AMR, fast and reliable diagnostic tools that facilitate optimal antibiotic stewardship are an unmet need. In this regard, Raman spectroscopy promises rapid label- and culture-free identification and antimicrobial susceptibility testing (AST) in a single step. However, even though many Raman-based bacteria-identification and AST studies have demonstrated impressive results, some shortcomings must be addressed. To bridge the gap between proof-of-concept studies and clinical application, we have developed machine learning techniques, in combination with a novel data-augmentation algorithm, for fast identification of minimally prepared bacteria phenotypes and the distinction of methicillin-resistant (MR) from methicillin-susceptible (MS) bacteria. For this, we have implemented a spectral transformer model for hyper-spectral Raman images of bacteria. We show that our model outperforms standard convolutional neural network models on a multitude of classification problems, both in terms of accuracy and in terms of training time. We attain more than 96$\%$ classification accuracy on a dataset consisting of 15 different classes and 95.6$\%$ classification accuracy for six MR-MS bacteria species. More importantly, our results are obtained using only fast and easy-to-produce training and test data.
Long Range Language Modeling via Gated State Spaces
Abstract
State space models have been shown to be effective at modeling long-range dependencies, especially on sequence classification tasks. In this work we focus on autoregressive sequence modeling over English books, Github source code and ArXiv mathematics articles. Based on recent developments around the effectiveness of gated activation functions, we propose a new layer named Gated State Space (GSS) and show that it trains significantly faster than the diagonal version of S4 (i.e. DSS) on TPUs, is fairly competitive with several well-tuned Transformer-based baselines and exhibits zero-shot generalization to longer inputs while being straightforward to implement. Finally, we show that leveraging self-attention to model local dependencies improves the performance of GSS even further.
Robustifying Vision Transformer without Retraining from Scratch by Test-Time Class-Conditional Feature Alignment
Abstract
Vision Transformers (ViT) are becoming more popular in image processing. In this work, we investigate the effectiveness of test-time adaptation (TTA) on ViT, a technique that has emerged to correct a model's predictions during test time by itself. First, we benchmark various test-time adaptation approaches on ViT-B16 and ViT-L16. We show that TTA is effective on ViT and that the prior convention (sensibly selecting modulation parameters) is not necessary when using a proper loss function. Based on this observation, we propose a new test-time adaptation method called class-conditional feature alignment (CFA), which minimizes both the class-conditional distribution differences and the whole distribution differences of the hidden representations between the source and target in an online manner. Experiments on image classification tasks with common corruptions (CIFAR-10-C, CIFAR-100-C, and ImageNet-C) and domain adaptation (digits datasets and ImageNet-Sketch) show that CFA stably outperforms the existing baselines on various datasets. We also verify that CFA is model-agnostic by experimenting on ResNet, MLP-Mixer, and several ViT variants (ViT-AugReg, DeiT, and BeiT). Using a BeiT backbone, CFA achieves a 19.8% top-1 error rate on ImageNet-C, outperforming the existing test-time adaptation baseline of 44.0%. This is a state-of-the-art result among TTA methods that do not need to alter the training phase.
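As a rough picture of class-conditional feature alignment, one can precompute per-class and overall feature statistics on source data and, at test time, penalize the discrepancy between those statistics and the ones computed on the (pseudo-labelled) test batch. The sketch below is an assumed simplification of that objective, not the exact CFA loss; in practice only lightweight modulation parameters (e.g., layer-norm affine weights) would be updated with such a loss in an online manner.

```python
import torch

def cfa_style_loss(features, pseudo_labels, src_class_means, src_mean, src_var):
    """Illustrative alignment loss: pull per-class and overall feature statistics of a
    test batch toward precomputed source statistics (a rough reading of CFA).
    features: (B, D) hidden representations; src_class_means: {class_id: (D,) tensor}."""
    loss = ((features.mean(0) - src_mean) ** 2).sum() \
         + ((features.var(0) - src_var) ** 2).sum()
    for c, mu_c in src_class_means.items():
        mask = pseudo_labels == c
        if mask.any():
            loss = loss + ((features[mask].mean(0) - mu_c) ** 2).sum()
    return loss
```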
Continual Learning with Transformers for Image Classification
Authors: Beyza Ermis, Giovanni Zappella, Martin Wistuba, Aditya Rawal, Cedric Archambeau
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Abstract
In many real-world scenarios, data to train machine learning models become available over time. However, neural network models struggle to continually learn new concepts without forgetting what has been learnt in the past. This phenomenon is known as catastrophic forgetting and it is often difficult to prevent due to practical constraints, such as the amount of data that can be stored or the limited computational resources that can be used. Moreover, training large neural networks, such as Transformers, from scratch is very costly and requires a vast amount of training data, which might not be available in the application domain of interest. A recent trend indicates that dynamic architectures based on an expansion of the parameters can reduce catastrophic forgetting efficiently in continual learning, but this needs complex tuning to balance the growing number of parameters, and such architectures barely share any information across tasks. As a result, they struggle to scale to a large number of tasks without significant overhead. In this paper, we validate in the computer vision domain a recent solution called Adaptive Distillation of Adapters (ADA), which was developed to perform continual learning using pre-trained Transformers and Adapters on text classification tasks. We empirically demonstrate on different classification tasks that this method maintains good predictive performance without retraining the model or increasing the number of model parameters over time. Besides, it is significantly faster at inference time compared to state-of-the-art methods.
SSL-Lanes: Self-Supervised Learning for Motion Forecasting in Autonomous Driving
Authors: Prarthana Bhattacharyya, Chengjie Huang, Krzysztof Czarnecki
Abstract
Self-supervised learning (SSL) is an emerging technique that has been successfully employed to train convolutional neural networks (CNNs) and graph neural networks (GNNs) for more transferable, generalizable, and robust representation learning. However, its potential in motion forecasting for autonomous driving has rarely been explored. In this study, we report the first systematic exploration and assessment of incorporating self-supervision into motion forecasting. We first propose to investigate four novel self-supervised learning tasks for motion forecasting with theoretical rationale and quantitative and qualitative comparisons on the challenging large-scale Argoverse dataset. Secondly, we point out that our auxiliary SSL-based learning setup not only outperforms forecasting methods which use transformers, complicated fusion mechanisms and sophisticated online dense goal candidate optimization algorithms in terms of performance accuracy, but also has low inference time and architectural complexity. Lastly, we conduct several experiments to understand why SSL improves motion forecasting. Code is open-sourced at \url{https://github.com/AutoVision-cloud/SSL-Lanes}.
Keyword: autonomous driving
SSL-Lanes: Self-Supervised Learning for Motion Forecasting in Autonomous Driving
Authors: Prarthana Bhattacharyya, Chengjie Huang, Krzysztof Czarnecki
Abstract
Self-supervised learning (SSL) is an emerging technique that has been successfully employed to train convolutional neural networks (CNNs) and graph neural networks (GNNs) for more transferable, generalizable, and robust representation learning. However, its potential in motion forecasting for autonomous driving has rarely been explored. In this study, we report the first systematic exploration and assessment of incorporating self-supervision into motion forecasting. We first propose to investigate four novel self-supervised learning tasks for motion forecasting with theoretical rationale and quantitative and qualitative comparisons on the challenging large-scale Argoverse dataset. Secondly, we point out that our auxiliary SSL-based learning setup not only outperforms forecasting methods which use transformers, complicated fusion mechanisms and sophisticated online dense goal candidate optimization algorithms in terms of performance accuracy, but also has low inference time and architectural complexity. Lastly, we conduct several experiments to understand why SSL improves motion forecasting. Code is open-sourced at \url{https://github.com/AutoVision-cloud/SSL-Lanes}.
Verifiable Goal Recognition for Autonomous Driving with Occlusions
Authors: Cillian Brewitt, Massimiliano Tamborski, Stefano V. Albrecht
Abstract
When used in autonomous driving, goal recognition allows the future behaviour of other vehicles to be more accurately predicted. A recent goal recognition method for autonomous vehicles, GRIT, has been shown to be fast, accurate, interpretable and verifiable. In autonomous driving, vehicles can encounter novel scenarios that were unseen during training, and the environment is partially observable due to occlusions. However, GRIT can only operate in fixed frame scenarios, with full observability. We present a novel goal recognition method named Goal Recognition with Interpretable Trees under Occlusion (OGRIT), which solves these shortcomings of GRIT. We demonstrate that OGRIT can generalise between different scenarios and handle missing data due to occlusions, while still being fast, accurate, interpretable and verifiable.
Pedestrian 3D Bounding Box Prediction
Authors: Saeed Saadatnejad, Yi Zhou Ju, Alexandre Alahi
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Abstract
Safety is still the main issue in autonomous driving, and for autonomous vehicles to be globally deployed, they need to predict pedestrians' motions sufficiently far in advance. While there is a lot of research on coarse-grained (human center) and fine-grained (human body keypoints) prediction, we focus on 3D bounding boxes, which are reasonable estimates of humans for autonomous vehicles without modeling complex motion details. This gives the flexibility to predict over longer horizons in real-world settings. We introduce this new problem and present a simple yet effective model for pedestrians' 3D bounding box prediction. This method follows an encoder-decoder architecture based on recurrent neural networks, and our experiments show its effectiveness on both the synthetic (JTA) and real-world (NuScenes) datasets. The learned representation contains useful information to enhance the performance of other tasks, such as action anticipation. Our code is available online: https://github.com/vita-epfl/bounding-box-prediction
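As an illustration of the encoder-decoder recurrent architecture mentioned above, a minimal GRU model that reads a history of 3D boxes (e.g., center, size and yaw, seven values per frame) and autoregressively rolls out future boxes could look like the sketch below; it is not the authors' model, and the box parameterization and horizon are assumptions.

```python
import torch
import torch.nn as nn

class BoxSeq2Seq(nn.Module):
    """Toy encoder-decoder GRU: past 3D boxes (x, y, z, w, l, h, yaw) -> future boxes."""

    def __init__(self, box_dim=7, hidden=64, horizon=12):
        super().__init__()
        self.horizon = horizon
        self.encoder = nn.GRU(box_dim, hidden, batch_first=True)
        self.decoder = nn.GRUCell(box_dim, hidden)
        self.out = nn.Linear(hidden, box_dim)

    def forward(self, past):                    # past: (batch, T_obs, box_dim)
        _, h = self.encoder(past)
        h, box = h.squeeze(0), past[:, -1]
        preds = []
        for _ in range(self.horizon):           # autoregressive roll-out
            h = self.decoder(box, h)
            box = box + self.out(h)             # predict a residual over the last box
            preds.append(box)
        return torch.stack(preds, dim=1)        # (batch, horizon, box_dim)

model = BoxSeq2Seq()
future = model(torch.randn(4, 8, 7))            # (4, 12, 7) predicted future boxes
```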