New submissions for Wed, 8 Jun 22
Keyword: SLAM
Object Scan Context: Object-centric Spatial Descriptor for Place Recognition within 3D Point Cloud Map
Abstract
Place recognition endows a SLAM algorithm with the ability to eliminate accumulated errors and to relocalize itself. Existing point cloud-based place recognition methods typically match global descriptors that are lidar-centric. Such methods have two major defects: place recognition fails when the two point clouds are captured far apart, and only the rotation angle can be recovered, not the offsets along the X and Y directions. To solve these two problems, we propose a novel global descriptor built around the Main Object; in this way, descriptors no longer depend on the observation position. We analyze theoretically why this method solves both problems, and conduct extensive experiments on KITTI and in several extreme scenarios, which show that our method has clear advantages over traditional methods.
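The abstract does not spell out the descriptor's construction; as a rough illustration of the object-centric idea, the sketch below builds a Scan Context-style polar max-height grid centered on an object centroid instead of the lidar origin, so the descriptor does not depend on where the sensor stood. All names and parameters here are illustrative assumptions, not the authors' code.

```python
import numpy as np

def object_scan_context(points, object_center, num_rings=20, num_sectors=60, max_range=80.0):
    """Toy object-centric descriptor: a polar max-height grid centered on an
    object centroid rather than on the lidar origin (points beyond max_range
    fall into the outer ring in this toy)."""
    rel = points[:, :2] - object_center[:2]            # XY offsets from the object
    r = np.linalg.norm(rel, axis=1)
    theta = np.arctan2(rel[:, 1], rel[:, 0]) % (2 * np.pi)
    ring = np.minimum((r / max_range * num_rings).astype(int), num_rings - 1)
    sector = np.minimum((theta / (2 * np.pi) * num_sectors).astype(int), num_sectors - 1)
    desc = np.zeros((num_rings, num_sectors))
    np.maximum.at(desc, (ring, sector), points[:, 2])  # max z per cell, Scan Context-style
    return desc

# A rotation about the object appears as a circular shift along the sector
# axis, so matching would search over column shifts of the descriptor.
scan = np.random.rand(1000, 3) * [100.0, 100.0, 3.0]
print(object_scan_context(scan, object_center=np.array([50.0, 50.0, 0.0])).shape)
```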
Robot Self-Calibration Using Actuated 3D Sensors
Authors: Arne Peters
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Abstract
Both robot and hand-eye calibration have been subjects of research for decades. While current approaches manage to identify the parameters of a robot's kinematic model precisely and robustly, they still rely on external devices such as calibration objects, markers, and/or external sensors. Instead of trying to fit the recorded measurements to a model of a known object, this paper treats robot calibration as an offline SLAM problem, where scanning poses are linked to a fixed point in space by a moving kinematic chain. As such, the presented framework allows robot calibration using nothing but an arbitrary eye-in-hand depth sensor, thus enabling fully autonomous self-calibration without any external tools. The new approach utilizes a modified version of the Iterative Closest Point algorithm to run bundle adjustment on multiple 3D recordings, estimating the optimal parameters of the kinematic model. A detailed evaluation of the system is shown on a real robot with various attached 3D sensors. The presented results show that the system reaches precision comparable to a dedicated external tracking system at a fraction of its cost.
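To make the "fixed point observed through the kinematic chain" formulation concrete, here is a toy numerical sketch: a 2-DoF planar arm with unknown joint-encoder offsets observes one fixed point from several poses, and the offsets are recovered by least squares on the constraint that all corrected poses must agree on that point. The arm model, the offsets, and the solver choice are our illustration, not the paper's setup.

```python
import numpy as np
from scipy.optimize import least_squares

LINKS = np.array([0.5, 0.4])                  # known link lengths (assumed)

def fk(angles):
    """Forward kinematics of a toy 2-DoF planar arm."""
    a1, a2 = angles
    return np.array([LINKS[0] * np.cos(a1) + LINKS[1] * np.cos(a1 + a2),
                     LINKS[0] * np.sin(a1) + LINKS[1] * np.sin(a1 + a2)])

rng = np.random.default_rng(0)
true_offsets = np.array([0.03, -0.02])        # unknown joint-encoder biases
target = np.array([0.6, 0.3])                 # one fixed point in space

# Eight true joint configurations whose end effector reaches the fixed point;
# the "encoder readings" are corrupted by the unknown offsets.
true_qs = [least_squares(lambda a: fk(a) - target, rng.uniform(0.1, 1.2, 2)).x
           for _ in range(8)]
readings = [q - true_offsets for q in true_qs]

def residuals(offsets):
    # Bundle-adjustment-style cost: all corrected poses must agree on where
    # the fixed point is (its position itself is not assumed known).
    pts = np.array([fk(q + offsets) for q in readings])
    return (pts - pts.mean(axis=0)).ravel()

est = least_squares(residuals, x0=np.zeros(2)).x
print("estimated offsets:", est, "true:", true_offsets)
```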
Keyword: odometry
There is no result
Keyword: livox
There is no result
Keyword: loam
There is no result
Keyword: lidar
Physics and semantic informed multi-sensor calibration via optimization theory and self-supervised learning
Authors: Shmuel Y. Hayoun, Meir Halachmi, Doron Serebro, Kfir Twizer, Elinor Medezinski, Liron Korkidi, Moshik Cohen, Itai Orr
Abstract
Achieving safe and reliable autonomous driving depends heavily on an accurate and robust perception system, which cannot be fully realized without precisely calibrated sensors. Environmental and operational conditions, as well as improper maintenance, can produce calibration errors that inhibit sensor fusion and, consequently, degrade perception performance. Traditionally, sensor calibration is performed in a controlled environment with one or more known targets. Such a procedure can only be carried out between drives and requires manual operation, a tedious task if it must be conducted regularly. This has sparked recent interest in online targetless methods, capable of yielding a set of geometric transformations based on perceived environmental features. However, the required redundancy in sensing modalities makes this task even more challenging, as the features captured by each modality and their distinctiveness may vary. We present a holistic approach to performing joint calibration of a camera-lidar-radar trio. Leveraging prior knowledge and physical properties of these sensing modalities together with semantic information, we propose two targetless calibration methods within a cost-minimization framework: one via direct online optimization, and the other via self-supervised learning (SSL).
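As a minimal sketch of the direct-optimization route (a planar toy problem of our own devising, not the paper's camera-lidar-radar trio), the snippet below recovers an extrinsic rotation and translation by minimizing the misalignment between features observed by two sensors in their own frames:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
feat_cam = rng.uniform(-10, 10, size=(40, 2))          # features in camera frame
theta_true, t_true = 0.1, np.array([0.8, -0.3])        # ground-truth extrinsic
R_true = np.array([[np.cos(theta_true), -np.sin(theta_true)],
                   [np.sin(theta_true),  np.cos(theta_true)]])
feat_lidar = feat_cam @ R_true.T + t_true + rng.normal(0, 0.02, (40, 2))

def cost(x):
    # Sum of squared distances between transformed camera features and the
    # corresponding lidar features; the minimizer is the extrinsic estimate.
    theta, tx, ty = x
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return np.sum((feat_cam @ R.T + [tx, ty] - feat_lidar) ** 2)

res = minimize(cost, x0=np.zeros(3))
print("theta, t:", res.x, "(true: 0.1, 0.8, -0.3)")
```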
SpikiLi: A Spiking Simulation of LiDAR based Real-time Object Detection for Autonomous Driving
Authors: Sambit Mohapatra, Thomas Mesquida, Mona Hodaei, Senthil Yogamani, Heinrich Gotzig, Patrick Mader
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Abstract
Spiking Neural Networks are a recent neural network design approach that promises tremendous improvements in power efficiency, computational efficiency, and processing latency. They achieve this by using asynchronous spike-based data flow and event-based signal generation and processing, and by modifying the neuron model to resemble biological neurons more closely. While some initial works have shown significant evidence of applicability to common deep learning tasks, their application to complex real-world tasks has remained limited. In this work, we first illustrate the applicability of spiking neural networks to a complex deep learning task, namely lidar-based 3D object detection for automated driving. Secondly, we give a step-by-step demonstration of simulating spiking behavior using a pre-trained convolutional neural network. We closely model essential aspects of spiking neural networks in simulation and achieve equivalent run-time and accuracy on a GPU. When the model is realized on neuromorphic hardware, we expect significantly improved power efficiency.
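Simulating spiking behavior from a pre-trained CNN typically rests on rate coding, where an integrate-and-fire neuron's firing rate approximates a ReLU activation. A minimal stand-alone illustration of that correspondence (our toy, not the paper's simulator):

```python
def if_neuron_rate(input_current, steps=200, threshold=1.0):
    """Simulate an integrate-and-fire neuron and return its firing rate.
    For a constant non-negative input, the rate approximates ReLU(input)."""
    v, spikes = 0.0, 0
    for _ in range(steps):
        v += input_current          # integrate the input each timestep
        if v >= threshold:          # fire, then reset by subtraction
            spikes += 1
            v -= threshold
    return spikes / steps

for x in [-0.5, 0.0, 0.3, 0.7]:
    print(f"input {x:+.1f} -> rate {if_neuron_rate(x):.2f} (ReLU: {max(x, 0):.1f})")
```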
Object Scan Context: Object-centric Spatial Descriptor for Place Recognition within 3D Point Cloud Map
Abstract
Place recognition endows a SLAM algorithm with the ability to eliminate accumulated errors and to relocalize itself. Existing point cloud-based place recognition methods typically match global descriptors that are lidar-centric. Such methods have two major defects: place recognition fails when the two point clouds are captured far apart, and only the rotation angle can be recovered, not the offsets along the X and Y directions. To solve these two problems, we propose a novel global descriptor built around the Main Object; in this way, descriptors no longer depend on the observation position. We analyze theoretically why this method solves both problems, and conduct extensive experiments on KITTI and in several extreme scenarios, which show that our method has clear advantages over traditional methods.
Keyword: loop detection
There is no result
Keyword: nerf
There is no result
Keyword: mapping
MIRNF: Medical Image Registration via Neural Fields
Abstract
Image registration is widely used in medical image analysis to provide spatial correspondences between two images. Recently, learning-based methods utilizing convolutional neural networks (CNNs) have been proposed for solving image registration problems. Learning-based methods tend to be much faster than traditional optimization-based methods, but the accuracy improvements gained from complex CNN-based methods are modest. Here we introduce a new deep neural network-based image registration framework, named \textbf{MIRNF}, which represents the correspondence mapping with a continuous function implemented via Neural Fields. MIRNF outputs either a deformation vector or a velocity vector given a 3D coordinate as input. To ensure the mapping is diffeomorphic, the velocity vector output from MIRNF is integrated using a Neural ODE solver to derive the correspondences between the two images. Furthermore, we propose a hybrid coordinate sampler along with a cascaded architecture to achieve high-similarity mapping performance and low-distortion deformation fields. We conduct experiments on two 3D MR brain scan datasets, showing that our proposed framework provides state-of-the-art registration performance while maintaining comparable optimization time.
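A minimal sketch of the neural-field formulation: an MLP maps a 3D coordinate to a velocity, and the velocity field is integrated to obtain a (near-)diffeomorphic warp. We substitute fixed-step Euler integration for the Neural ODE solver the paper uses; the architecture sizes are placeholders.

```python
import torch
import torch.nn as nn

class VelocityField(nn.Module):
    """Toy neural field: maps a 3D coordinate to a velocity vector."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3, hidden), nn.Tanh(),
                                 nn.Linear(hidden, hidden), nn.Tanh(),
                                 nn.Linear(hidden, 3))
    def forward(self, x):
        return self.net(x)

def integrate(field, coords, steps=8):
    """Fixed-step Euler integration of the velocity field (a stand-in for the
    paper's Neural ODE solver). Composing many small displacement steps keeps
    the overall map closer to diffeomorphic than one big displacement."""
    x = coords
    for _ in range(steps):
        x = x + field(x) / steps
    return x

field = VelocityField()
coords = torch.rand(1024, 3)            # sampled voxel coordinates
warped = integrate(field, coords)       # correspondences in the moving image
loss = ((warped - coords) ** 2).mean()  # placeholder for a similarity loss
loss.backward()
print(warped.shape)
```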
Examining the Implementation of Digital Health to Strengthen the COVID-19 Pandemic Response and Recovery and Scale up Equitable Vaccine Access in African Countries
Authors: Olufunto A Olusanya, Brianna White, Chad A Melton, Arash Shaban-Nejad
Abstract
The COVID-19 pandemic has profoundly impacted the world, having taken the lives of over 6 million individuals. Accordingly, this pandemic has caused a shift in conversations surrounding the burden of diseases worldwide, welcoming insights from multidisciplinary fields including digital health and artificial intelligence. Africa faces a heavy disease burden that exacerbates the current COVID-19 pandemic and limits the scope of public health preparedness, response, containment, and case management. Herein, we examined the potential impact of transformative digital health technologies in mitigating the global health crisis with reference to African countries. Furthermore, we proposed recommendations for scaling up digital health technologies and artificial intelligence-based platforms to tackle the transmission of SARS-CoV-2 and enable equitable vaccine access. Challenges related to the pandemic are numerous. Rapid response and management strategies - that is, contact tracing, case surveillance, diagnostic testing intensity, and most recently vaccine distribution mapping - can overwhelm a fragile health care delivery system. Although the challenges are vast, digital health technologies can play an essential role in achieving a sustainable, resilient recovery and building back better. With them, it is plausible that African nations would be better equipped to rapidly identify, diagnose, and manage individuals infected with COVID-19 and other diseases, as well as future outbreaks and pandemics.
EEG-based Emotion Recognition with Spatial and Functional Brain Mapping of CNS and PNS Signals
Abstract
Emotion plays a significant role in our daily life. Emotion recognition is widely used in health care and human-computer interaction. Emotion is the result of coordinated activities of cortical and subcortical neural processes, which correlate with specific physiological responses. However, existing emotion recognition techniques fail to combine various physiological signals into one integrated feature representation. Meanwhile, many researchers have ignored the problem of over-fitting: models reporting high accuracy that is actually inflated by improper pre-processing. In this paper, sigmoid baseline filtering is conducted to solve the over-fitting problem at its source. To construct a physiology-based algorithm, a 3D spatial and functional brain mapping is proposed based on human physiological mechanisms and the international electrode system, which combines the signals of the central and peripheral nervous systems. By combining the baseline filtering, 3D brain mapping, and a simple 4D-CNN, a novel emotion recognition model is proposed. Experimental results demonstrate that the performance of the proposed model is comparable to state-of-the-art algorithms.
DeepOPF-AL: Augmented Learning for Solving AC-OPF Problems with Multiple Load-Solution Mappings
Authors: Xiang Pan, Wanjun Huang, Minghua Chen, Steven H. Low
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Abstract
The existence of multiple load-solution mappings in non-convex AC-OPF problems poses a fundamental challenge to deep neural network (DNN) schemes. As the training dataset may contain a mixture of data points corresponding to different load-solution mappings, the DNN can fail to learn a legitimate mapping and generate inferior solutions. We propose DeepOPF-AL as an augmented-learning approach to tackle this issue. The idea is to train a DNN to learn a unique mapping from an augmented input, i.e., (load, initial point), to the solution generated by an iterative OPF solver with the load and initial point as inputs. We then apply the learned augmented mapping to solve AC-OPF problems much faster than conventional solvers. Simulation results over IEEE test cases show that DeepOPF-AL achieves noticeably better optimality and similar feasibility and speedup performance, as compared to a recent DNN scheme, with the same DNN size yet elevated training complexity.
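A minimal sketch of the augmented input: the network consumes the load concatenated with an initial point, so the (load, initial point) -> solution map is unique even when load -> solution is not. The dimensions and data below are placeholders; real targets would come from running an iterative OPF solver initialized at the given point.

```python
import torch
import torch.nn as nn

n_load, n_sol = 10, 10                  # placeholder problem dimensions

class AugmentedOPFNet(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_load + n_sol, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_sol))
    def forward(self, load, init_point):
        # The augmentation: condition on the solver's starting point too.
        return self.net(torch.cat([load, init_point], dim=-1))

model = AugmentedOPFNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
# Placeholder batch: in practice `solution` is what the iterative solver
# returns for `load` when started from `init_point`.
load, init_point = torch.rand(32, n_load), torch.rand(32, n_sol)
solution = torch.rand(32, n_sol)
loss = nn.functional.mse_loss(model(load, init_point), solution)
opt.zero_grad(); loss.backward(); opt.step()
```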
Localizing Semantic Patches for Accelerating Image Classification
Authors: Chuanguang Yang, Zhulin An, Yongjun Xu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Existing works often focus on reducing architecture redundancy to accelerate image classification but ignore the spatial redundancy of the input image. This paper proposes an efficient image classification pipeline to solve this problem. We first pinpoint task-aware regions over the input image with a lightweight patch proposal network called AnchorNet. We then feed these localized semantic patches, which carry much less spatial redundancy, into a general classification network. Unlike popular deep CNN designs, we carefully design the receptive field of AnchorNet without intermediate convolutional paddings. This ensures an exact mapping from a high-level spatial location to the specific input image patch, making the contribution of each patch interpretable. Moreover, AnchorNet is compatible with any downstream architecture. Experimental results on ImageNet show that our method outperforms SOTA dynamic inference methods with lower inference cost. Our code is available at https://github.com/winycg/AnchorNet.
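The two-stage pipeline can be sketched as follows: a cheap proposal network scores coarse locations, the best-scoring patch is cropped from the full image, and only that patch reaches the heavy classifier. The layer shapes and the single-patch policy are our simplifications (AnchorNet itself avoids intermediate paddings and can propose multiple patches).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

proposer = nn.Sequential(nn.Conv2d(3, 16, 7, stride=4, padding=3), nn.ReLU(),
                         nn.Conv2d(16, 1, 3, padding=1))       # 1 score per cell
classifier = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                           nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                           nn.Linear(32, 1000))                # heavy downstream net

x = torch.rand(1, 3, 224, 224)
scores = proposer(x)[0, 0]                    # (56, 56) grid of patch scores
iy, ix = divmod(scores.argmax().item(), scores.shape[1])
cy, cx = iy * 4, ix * 4                       # map the grid cell back to pixels
patch = x[:, :, max(cy - 48, 0):cy + 48, max(cx - 48, 0):cx + 48]
patch = F.interpolate(patch, size=(96, 96))   # fixed-size semantic patch
logits = classifier(patch)                    # only the patch is classified
print(patch.shape, logits.shape)
```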
Keyword: localization
Tight basis cycle representatives for persistent homology of large data sets
Abstract
Persistent homology (PH) is a popular tool for topological data analysis that has found applications across diverse areas of research. It provides a rigorous method to compute robust topological features in discrete experimental observations that often contain various sources of uncertainties. Although powerful in theory, PH suffers from high computation cost that precludes its application to large data sets. Additionally, most analyses using PH are limited to computing the existence of nontrivial features. Precise localization of these features is not generally attempted because, by definition, localized representations are not unique and because of even higher computation cost. For scientific applications, such a precise location is a sine qua non for determining functional significance. Here, we provide a strategy and algorithms to compute tight representative boundaries around nontrivial robust features in large data sets. To showcase the efficiency of our algorithms and the precision of computed boundaries, we analyze three data sets from different scientific fields. In the human genome, we found an unexpected effect on loops through chromosome 13 and the sex chromosomes, upon impairment of chromatin loop formation. In a distribution of galaxies in the universe, we found statistically significant voids. In protein homologs with significantly different topology, we found voids attributable to ligand-interaction, mutation, and differences between species.
TadML: A fast temporal action detection with Mechanics-MLP
Authors: Bowen Deng, Dongchang Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Temporal Action Detection (TAD) is a crucial but challenging task in video understanding. It aims to detect both the type and the start-end frames of each action instance in a long, untrimmed video. Most current models adopt both RGB and optical-flow streams for the TAD task, so original RGB frames must be converted into optical-flow frames at additional computation and time cost, which is an obstacle to real-time processing. Many models also adopt two-stage strategies, which slow down inference and require complicated tuning of proposal generation. By comparison, we propose a one-stage anchor-free temporal localization method using the RGB stream only, in which a novel Newtonian \emph{Mechanics-MLP} architecture is established. It achieves accuracy comparable to all existing state-of-the-art models while surpassing their inference speed by a large margin: the typical inference speed in this paper is an astounding 4.44 videos per second on THUMOS14. In applications, because there is no need to compute optical flow, inference is even faster. This also shows that \emph{MLP} architectures have great potential in downstream tasks such as TAD. The source code is available at \url{https://github.com/BonedDeng/TadML}
Keyword: transformer
A Bird's-Eye Tutorial of Graph Attention Architectures
Authors: Kaustubh D. Dhole, Carl Yang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract
Graph Neural Networks (GNNs) have made tremendous strides in performance on graph-structured problems, especially in the domains of natural language processing, computer vision, and recommender systems. Inspired by the success of the transformer architecture, there has been an ever-growing body of work on attention variants of GNNs attempting to advance the state of the art on many of these problems. Incorporating "attention" into graph mining has been viewed as a way to overcome the noisiness, heterogeneity, and complexity associated with graph-structured data, as well as to encode soft inductive bias. It is hence crucial and advantageous to study these variants from a bird's-eye view to assess their strengths and weaknesses. We provide a systematic and focused tutorial centered around attention-based GNNs in the hope of benefiting researchers dealing with graph-structured problems. Our tutorial looks at GNN variants from the point of view of the attention function and iteratively builds the reader's understanding of different graph attention variants.
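For readers new to the family the tutorial surveys, here is a minimal single-head graph attention layer in the spirit of the original GAT (scores from concatenated projected endpoints, masked to the graph's edges); a didactic sketch, not code from the tutorial:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    """Minimal single-head graph attention layer."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.a = nn.Linear(2 * out_dim, 1, bias=False)
    def forward(self, h, adj):
        z = self.W(h)                                     # (N, out_dim)
        n = z.size(0)
        pairs = torch.cat([z.unsqueeze(1).expand(n, n, -1),
                           z.unsqueeze(0).expand(n, n, -1)], dim=-1)
        e = F.leaky_relu(self.a(pairs).squeeze(-1), 0.2)  # (N, N) raw scores
        e = e.masked_fill(adj == 0, float('-inf'))        # attend over edges only
        alpha = torch.softmax(e, dim=-1)                  # normalize per node
        return alpha @ z                                  # attention-weighted sum

h = torch.rand(5, 8)
adj = torch.eye(5) + torch.bernoulli(torch.full((5, 5), 0.4))  # random graph + self-loops
print(GATLayer(8, 16)(h, (adj > 0).float()).shape)
```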
DETR++: Taming Your Multi-Scale Detection Transformer
Authors: Chi Zhang, Lijuan Liu, Xiaoxue Zang, Frederick Liu, Hao Zhang, Xinying Song, Jindong Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract
Convolutional Neural Networks (CNNs) have dominated the field of detection ever since the success of AlexNet in ImageNet classification [12]. With the sweeping success of Transformers [27] in natural language processing, Carion et al. [2] introduced the Transformer-based detection method DETR. However, due to the quadratic complexity of the self-attention mechanism in the Transformer, DETR has not been able to incorporate multi-scale features as existing CNN-based detectors do, leading to inferior results in small object detection. To mitigate this issue and further improve the performance of DETR, in this work we investigate different methods to incorporate multi-scale features and find that a Bi-directional Feature Pyramid (BiFPN) works best with DETR in further raising detection precision. With this discovery, we propose DETR++, a new architecture that improves detection results by 1.9% AP on MS COCO 2017, 11.5% AP on RICO icon detection, and 9.1% AP on RICO layout extraction over existing baselines.
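The fusion mechanism the authors found to work best is BiFPN's "fast normalized fusion": learned non-negative weights combine resized feature maps before a convolution. A minimal sketch of one such fusion node (our own, not DETR++ code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiFPNFuse(nn.Module):
    """Fast normalized fusion from BiFPN: ReLU-constrained learned weights
    combine resized feature maps, followed by a conv."""
    def __init__(self, channels, n_inputs):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_inputs))
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
    def forward(self, feats):
        size = feats[0].shape[-2:]                 # fuse at the first map's scale
        feats = [F.interpolate(f, size=size, mode='nearest') for f in feats]
        w = F.relu(self.w)
        w = w / (w.sum() + 1e-4)                   # normalized non-negative weights
        fused = sum(wi * fi for wi, fi in zip(w, feats))
        return self.conv(F.silu(fused))

p3, p4, p5 = (torch.rand(1, 64, s, s) for s in (32, 16, 8))
print(BiFPNFuse(64, 3)([p3, p4, p5]).shape)        # multi-scale maps fused at P3 size
```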
Structured Context Transformer for Generic Event Boundary Detection
Abstract
Generic Event Boundary Detection (GEBD) aims to detect moments that humans naturally perceive as event boundaries. In this paper, we present the Structured Context Transformer (SC-Transformer) to solve the GEBD task, which can be trained in an end-to-end fashion. Specifically, we use a backbone convolutional neural network (CNN) to extract the features of each video frame. To capture the temporal context of each frame, we design the structured context transformer (SC-Transformer) by re-partitioning the input frame sequence; notably, the overall computational complexity of SC-Transformer is linear in the video length. After that, group similarities are computed to capture the differences between frames. Then, a lightweight fully convolutional network determines the event boundaries based on the grouped similarity maps. To remedy the ambiguities of boundary annotations, a Gaussian kernel is adopted to preprocess the ground-truth event boundaries, which further boosts accuracy. Extensive experiments conducted on the challenging Kinetics-GEBD and TAPOS datasets demonstrate the effectiveness of the proposed method compared to state-of-the-art methods.
DiMS: Distilling Multiple Steps of Iterative Non-Autoregressive Transformers
Authors: Sajad Norouzi, Rasa Hosseinzadeh, Felipe Perez, Maksims Volkovs
Abstract
The computational benefits of iterative non-autoregressive transformers decrease as the number of decoding steps increases. As a remedy, we introduce Distill Multiple Steps (DiMS), a simple yet effective distillation technique that decreases the number of steps required to reach a certain translation quality. The distilled model enjoys the computational benefits of early iterations while preserving the enhancements from several iterative steps. DiMS relies on two models, namely a student and a teacher. The student is optimized to predict the output of the teacher after multiple decoding steps, while the teacher follows the student via a slow-moving average. The moving average keeps the teacher's knowledge updated and enhances the quality of the labels provided by the teacher. During inference, the student is used for translation and no additional computation is added. We verify the effectiveness of DiMS on various models, obtaining improvements of up to 7 BLEU points on distilled and 12 BLEU points on raw WMT datasets for single-step translation. We release our code at https://github.com/layer6ai-labs/DiMS.
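The student-teacher mechanics described above can be sketched compactly: the student learns to match in one step what the teacher produces after several decoding steps, while the teacher trails the student as an exponential moving average. The "decoder" below is a stand-in linear layer; only the training signal is the point.

```python
import copy
import torch
import torch.nn as nn

student = nn.Linear(16, 16)                    # stand-in for the NAR decoder
teacher = copy.deepcopy(student)
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

def decode_steps(model, x, steps):
    for _ in range(steps):                     # iterative refinement stand-in
        x = torch.tanh(model(x))
    return x

x = torch.rand(8, 16)
with torch.no_grad():
    target = decode_steps(teacher, x, steps=4)   # multi-step teacher output
loss = nn.functional.mse_loss(decode_steps(student, x, steps=1), target)
opt.zero_grad(); loss.backward(); opt.step()

ema = 0.999                                      # slow-moving average update
with torch.no_grad():
    for ps, pt in zip(student.parameters(), teacher.parameters()):
        pt.mul_(ema).add_(ps, alpha=1 - ema)
```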
OCHADAI at SemEval-2022 Task 2: Adversarial Training for Multilingual Idiomaticity Detection
Abstract
We propose a multilingual adversarial training model for determining whether a sentence contains an idiomatic expression. Given that a key challenge with this task is the limited size of annotated data, our model relies on pre-trained contextual representations from different multilingual state-of-the-art transformer-based language models (i.e., multilingual BERT and XLM-RoBERTa) and on adversarial training, a training method for further enhancing model generalization and robustness. Without relying on any human-crafted features, knowledge bases, or additional datasets other than the target datasets, our model achieved competitive results, ranking 6th in the SubTask A (zero-shot) setting and 15th in the SubTask A (one-shot) setting.
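Adversarial training for text models is commonly done in embedding space; the abstract does not spell out the variant used, so the sketch below shows one standard choice (FGM-style perturbation of the embedding matrix) with a toy classifier. All names here are ours.

```python
import torch
import torch.nn as nn

def fgm_adversarial_step(model, embedding: nn.Embedding, loss_fn, batch, eps=1e-2):
    """Clean loss + adversarial loss from a gradient-direction perturbation of
    the embedding weights; gradients accumulate for a later optimizer step."""
    loss = loss_fn(model, batch)
    loss.backward(retain_graph=True)             # gradients w.r.t. embeddings
    grad = embedding.weight.grad
    backup = embedding.weight.data.clone()
    norm = grad.norm()
    if norm > 0:                                 # climb the loss surface
        embedding.weight.data.add_(eps * grad / norm)
    adv_loss = loss_fn(model, batch)             # loss under the perturbation
    adv_loss.backward()
    embedding.weight.data = backup               # restore clean embeddings
    return loss.item(), adv_loss.item()

emb = nn.Embedding(100, 32)
clf = nn.Linear(32, 2)
model = lambda ids: clf(emb(ids).mean(dim=1))    # toy mean-pooled classifier
def loss_fn(m, batch):
    ids, labels = batch
    return nn.functional.cross_entropy(m(ids), labels)

batch = (torch.randint(0, 100, (4, 12)), torch.randint(0, 2, (4,)))
print(fgm_adversarial_step(model, emb, loss_fn, batch))
```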
An Empirical Study of IoT Security Aspects at Sentence-Level in Developer Textual Discussions
Abstract
IoT is a rapidly emerging paradigm that now encompasses almost every aspect of our modern life. As such, ensuring the security of IoT devices is crucial. IoT devices differ from traditional computing platforms, which can make the design and implementation of proper security measures challenging. We observed that IoT developers discuss their security-related challenges in developer forums like Stack Overflow (SO). However, we find that IoT security discussions can also be buried inside non-security discussions on SO. In this paper, we aim to understand the challenges IoT developers face while applying security practices and techniques to IoT devices. We have two goals: (1) develop a model that can automatically find security-related IoT discussions in SO, and (2) study the model output to learn about the security-related challenges of IoT developers. First, we download 53K posts from SO that contain discussions about IoT. Second, we manually label 5,919 sentences from these posts as security-related (1) or not (0). Third, we use this benchmark to investigate a suite of deep learning transformer models; we call the best-performing model SecBot. Fourth, we apply SecBot to the entire set of posts and find around 30K security-related sentences. Fifth, we apply topic modeling to the security-related sentences, then label and categorize the topics. Sixth, we analyze the evolution of the topics on SO. We found that (1) SecBot is based on retraining the deep learning model RoBERTa and offers the best F1-score of 0.935; (2) there are six error categories in samples misclassified by SecBot, which was mostly wrong when the keywords/contexts were ambiguous (e.g., a gateway can be a security gateway or a simple gateway); (3) there are 9 security topics grouped into three categories: Software, Hardware, and Network; and (4) the highest number of topics belongs to software security, followed by network security.
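The classifier component (a RoBERTa retrained for binary sentence labels) follows a standard fine-tuning recipe; a minimal sketch using Hugging Face transformers, with toy data standing in for the paper's 5,919-sentence benchmark:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
opt = torch.optim.AdamW(model.parameters(), lr=2e-5)

sentences = ["How do I rotate TLS certificates on my gateway?",
             "The LED blinks twice after boot."]
labels = torch.tensor([1, 0])                 # security-related (1) or not (0)

batch = tok(sentences, padding=True, truncation=True, return_tensors="pt")
out = model(**batch, labels=labels)           # HF models return loss when labels given
out.loss.backward()
opt.step(); opt.zero_grad()
print(out.logits.argmax(dim=-1).tolist())     # predicted labels after one step
```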
Dual Swin-Transformer based Mutual Interactive Network for RGB-D Salient Object Detection
Authors: Chao Zeng, Sam Kwong
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Salient Object Detection is the task of predicting the human-attended region in a given scene. Fusing depth information has been proven effective for this task. The main challenge is how to aggregate the complementary information from the RGB and depth modalities. However, conventional deep models rely heavily on CNN feature extractors, and long-range contextual dependencies are usually ignored. In this work, we propose the Dual Swin-Transformer based Mutual Interactive Network. We adopt the Swin-Transformer as the feature extractor for both the RGB and depth modalities to model long-range dependencies in visual inputs. Before the two branches of features are fused into one, attention-based modules are applied to enhance the features from each modality. We design a self-attention-based cross-modality interaction module and a gated modality attention module to leverage the complementary information between the two modalities. For saliency decoding, we create different stages enhanced with dense connections and keep a decoding memory while the multi-level encoding features are considered simultaneously. Considering the inaccurate depth map issue, we collect the RGB features of early stages into a skip convolution module to give more guidance from the RGB modality to the final saliency prediction. In addition, we add edge supervision to regularize the feature learning process. Comprehensive experiments on five standard RGB-D SOD benchmark datasets over four evaluation metrics demonstrate the superiority of the proposed DTMINet.
Wavelet Prior Attention Learning in Axial Inpainting Network
Authors: Chenjie Cao, Chengrong Wang, Yuntao Zhang, Yanwei Fu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Image inpainting is the task of filling masked or unknown regions of an image with visually realistic content, and it has been remarkably improved by Deep Neural Networks (DNNs) recently. Essentially, as an inverse problem, inpainting faces the underlying challenge of reconstructing semantically coherent results without texture artifacts. Many previous efforts have exploited attention mechanisms and prior knowledge, such as edges and semantic segmentation. However, these works are still limited in practice by an avalanche of learnable prior parameters and a prohibitive computational burden. To this end, we propose a novel model, Wavelet prior attention learning in Axial Inpainting Network (WAIN), whose generator contains an encoder, a decoder, and two key components: Wavelet image Prior Attention (WPA) and stacked multi-layer Axial-Transformers (ATs). In particular, the WPA guides the high-level feature aggregation in the multi-scale frequency domain, alleviating texture artifacts. Stacked ATs employ unmasked clues to help model reasonable features along the horizontal and vertical axes together with low-level features, improving semantic coherence. Extensive quantitative and qualitative experiments on the Celeba-HQ and Places2 datasets validate that WAIN achieves state-of-the-art performance over the competitors. The codes and models will be released.
Signal Propagation in Transformers: Theoretical Perspectives and the Role of Rank Collapse
Authors: Lorenzo Noci, Sotiris Anagnostidis, Luca Biggio, Antonio Orvieto, Sidak Pal Singh, Aurelien Lucchi
Abstract
Transformers have achieved remarkable success in several domains, ranging from natural language processing to computer vision. Nevertheless, it has been recently shown that stacking self-attention layers - the distinctive architectural component of Transformers - can result in rank collapse of the tokens' representations at initialization. The question of whether and how rank collapse affects training is still largely unanswered, and its investigation is necessary for a more comprehensive understanding of this architecture. In this work, we shed new light on the causes and effects of this phenomenon. First, we show that rank collapse of the tokens' representations hinders training by causing the gradients of the queries and keys to vanish at initialization. Furthermore, we provide a thorough description of the origin of rank collapse and discuss how to prevent it via an appropriate depth-dependent scaling of the residual branches. Finally, our analysis unveils that specific architectural hyperparameters affect the gradients of queries and values differently, leading to disproportionate gradient norms. This suggests an explanation for the widespread use of adaptive methods for Transformers' optimization.
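The collapse phenomenon and the flavor of the remedy are easy to observe numerically. In the hedged sketch below, stacking attention-only mixing layers drives token representations toward rank one, whereas routing the attention branch through a residual connection scaled by 1/sqrt(depth) (the kind of depth-dependent scaling the paper advocates) preserves a much higher rank. The construction is our illustration, not the paper's exact model.

```python
import torch

def effective_rank(x, tol=1e-5):
    s = torch.linalg.svdvals(x)
    return int((s > tol * s[0]).sum())

def run(depth, residual_scale=None):
    torch.manual_seed(0)
    x = torch.randn(32, 64)                        # 32 tokens, width 64
    for _ in range(depth):
        q, k = torch.randn(64, 64) / 8, torch.randn(64, 64) / 8
        attn = torch.softmax((x @ q) @ (x @ k).T / 8.0, dim=-1)
        if residual_scale is None:
            x = attn @ x                           # attention only: rank collapses
        else:
            x = x + residual_scale * (attn @ x)    # scaled residual branch
    return effective_rank(x)

depth = 64
print("attention only:         rank", run(depth))
print("1/sqrt(depth) residual: rank", run(depth, residual_scale=depth ** -0.5))
```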
Fooling Explanations in Text Classifiers
Authors: Adam Ivankay, Ivan Girardi, Chiara Marchiori, Pascal Frossard
Abstract
State-of-the-art text classification models are becoming increasingly reliant on deep neural networks (DNNs). Due to their black-box nature, faithful and robust explanation methods need to accompany classifiers for deployment in real-life scenarios. However, it has been shown in vision applications that explanation methods are susceptible to local, imperceptible perturbations that can significantly alter the explanations without changing the predicted classes. We show here that the existence of such perturbations extends to text classifiers as well. Specifically, we introduce TextExplanationFooler (TEF), a novel explanation attack algorithm that alters text input samples imperceptibly so that the outcome of widely used explanation methods changes considerably while classifier predictions remain unchanged. We evaluate TEF's attribution robustness estimation performance on five sequence classification datasets, utilizing three DNN architectures and three transformer architectures for each dataset. TEF can significantly decrease the correlation between unchanged and perturbed input attributions, which shows that all models and explanation methods are susceptible to TEF perturbations. Moreover, we evaluate how the perturbations transfer to other model architectures and attribution methods, and show that TEF perturbations are also effective in scenarios where the target model and explanation method are unknown. Finally, we introduce a semi-universal attack that is able to compute fast, computationally light perturbations with no knowledge of the attacked classifier or explanation method. Overall, our work shows that explanations in text classifiers are very fragile and that users need to carefully address their robustness before relying on them in critical applications.
Rites de Passage: Elucidating Displacement to Emplacement of Refugees
Authors: Aparup Khatua, Wolfgang Nejdl
Subjects: Computers and Society (cs.CY); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract
Social media deliberations allow us to explore refugee-related issues. AI-based studies have investigated refugee issues mostly around specific events and have considered unimodal approaches. In contrast, we employ a multimodal architecture for probing refugee journeys from their home to host nations. We draw insights from Arnold van Gennep's anthropological work 'Les Rites de Passage', which systematically analyzed an individual's transition from one group or society to another. Based on Gennep's separation-transition-incorporation framework, we identify four phases of refugee journeys: Arrival of Refugees, Temporal Stay at Asylums, Rehabilitation, and Integration of Refugees into the host nation. We collected 0.23 million multimodal tweets from April 2020 to March 2021 to test this proposed framework. We find that a combination of transformer-based language models and state-of-the-art image recognition models, such as the fusion of BERT+LSTM and InceptionV4, can outperform unimodal models. Subsequently, to test the practical implications of our proposed model in real time, we considered 0.01 million multimodal tweets related to the 2022 Ukrainian refugee crisis. An F1-score of 71.88% for this 2022 crisis confirms the generalizability of our proposed framework.
RAAT: Relation-Augmented Attention Transformer for Relation Modeling in Document-Level Event Extraction
Authors: Yuan Liang, Zhuoxuan Jiang, Di Yin, Bo Ren
Abstract
In the document-level event extraction (DEE) task, event arguments always scatter across sentences (the across-sentence issue), and multiple events may lie in one document (the multi-event issue). In this paper, we argue that the relation information of event arguments is of great significance for addressing the above two issues, and we propose a new DEE framework that can model the relation dependencies, called Relation-augmented Document-level Event Extraction (ReDEE). More specifically, this framework features a novel and tailored transformer, named the Relation-augmented Attention Transformer (RAAT). RAAT is scalable to capture multi-scale and multi-amount argument relations. To further leverage relation information, we introduce a separate event relation prediction task and adopt a multi-task learning method to explicitly enhance event extraction performance. Extensive experiments demonstrate the effectiveness of the proposed method, which achieves state-of-the-art performance on two public datasets. Our code is available at https://github.com/TencentYoutuResearch/RAAT.
Tutel: Adaptive Mixture-of-Experts at Scale
Authors: Changho Hwang, Wei Cui, Yifan Xiong, Ziyue Yang, Ze Liu, Han Hu, Zilong Wang, Rafael Salas, Jithin Jose, Prabhat Ram, Joe Chau, Peng Cheng, Fan Yang, Mao Yang, Yongqiang Xiong
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Abstract
In recent years, Mixture-of-Experts (MoE) has emerged as a promising technique for deep learning that can scale model capacity to trillion-plus parameters while reducing computing cost via sparse computation. While MoE opens a new frontier of exceedingly large models, its implementation over thousands of GPUs has been limited by the mismatch between the dynamic nature of MoE and the static parallelism/pipelining of the system. We present Tutel, a highly scalable stack design and implementation for MoE with dynamically adaptive parallelism and pipelining. Tutel delivers adaptive parallelism switching and adaptive pipelining at runtime, which achieve up to 1.74x and 2.00x single-MoE-layer speedup, respectively. We also propose a novel two-dimensional hierarchical algorithm for MoE communication speedup that outperforms the previous state of the art by up to 20.7x over 2,048 GPUs. Aggregating all techniques, Tutel finally delivers 4.96x and 5.75x speedup of a single MoE layer on 16 GPUs and 2,048 GPUs, respectively, over Fairseq, Meta's Facebook AI Research Sequence-to-Sequence Toolkit (Tutel is now partially adopted by Fairseq). Tutel source code is available in public: https://github.com/microsoft/tutel . Our evaluation shows that Tutel efficiently and effectively runs a real-world MoE-based model named SwinV2-MoE, built upon Swin Transformer V2, a state-of-the-art computer vision architecture. On efficiency, Tutel accelerates SwinV2-MoE, achieving up to 1.55x and 2.11x speedup in training and inference over Fairseq, respectively. On effectiveness, the SwinV2-MoE model achieves superior accuracy in both pre-training and downstream computer vision tasks such as COCO object detection compared with its dense counterpart, indicating the readiness of Tutel for end-to-end real-world model training and inference. SwinV2-MoE is open sourced at https://github.com/microsoft/Swin-Transformer .
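For readers unfamiliar with the computation Tutel optimizes, here is a minimal top-2 gated MoE layer (dense, single-device, our own sketch; Tutel's contribution is making the dispatch and communication of exactly this pattern adaptive and fast across thousands of GPUs):

```python
import torch
import torch.nn as nn

class Top2MoE(nn.Module):
    """Minimal top-2 gated mixture-of-experts layer."""
    def __init__(self, dim, num_experts=4, hidden=128):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts))
    def forward(self, x):                        # x: (tokens, dim)
        scores = torch.softmax(self.gate(x), dim=-1)
        top_w, top_i = scores.topk(2, dim=-1)    # route each token to 2 experts
        top_w = top_w / top_w.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):    # sparse dispatch: each
            mask = (top_i == e)                      # expert sees only its tokens
            for slot in range(2):
                sel = mask[:, slot]
                if sel.any():
                    out[sel] += top_w[sel, slot:slot + 1] * expert(x[sel])
        return out

x = torch.rand(64, 32)
print(Top2MoE(32)(x).shape)
```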
Can CNNs Be More Robust Than Transformers?
Abstract
The recent success of Vision Transformers is shaking the decade-long dominance of Convolutional Neural Networks (CNNs) in image recognition. Specifically, in terms of robustness on out-of-distribution samples, recent research finds that Transformers are inherently more robust than CNNs, regardless of different training setups. Moreover, it is believed that such superiority of Transformers should largely be credited to their self-attention-like architectures per se. In this paper, we question that belief by closely examining the design of Transformers. Our findings lead to three highly effective architecture designs for boosting robustness, yet simple enough to be implemented in several lines of code, namely a) patchifying input images, b) enlarging the kernel size, and c) reducing activation and normalization layers. Bringing these components together, we are able to build pure CNN architectures, without any attention-like operations, that are as robust as, or even more robust than, Transformers. We hope this work can help the community better understand the design of robust neural architectures. The code is publicly available at https://github.com/UCSC-VLAA/RobustCNN.
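A hedged sketch of the three design recipes named above: a patchified stem (a), a large-kernel depthwise convolution (b), and a block with a single normalization and a single activation layer (c). This is a schematic block of our own, not the released RobustCNN code.

```python
import torch
import torch.nn as nn

class RobustBlock(nn.Module):
    def __init__(self, dim, kernel=11):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, kernel, padding=kernel // 2, groups=dim)  # b) large kernel
        self.norm = nn.BatchNorm2d(dim)            # c) the block's only norm layer
        self.pw1 = nn.Conv2d(dim, 4 * dim, 1)
        self.act = nn.GELU()                       # c) the block's only activation
        self.pw2 = nn.Conv2d(4 * dim, dim, 1)
    def forward(self, x):
        return x + self.pw2(self.act(self.pw1(self.norm(self.dw(x)))))

stem = nn.Conv2d(3, 64, kernel_size=8, stride=8)   # a) patchify: 8x8 patch embedding
net = nn.Sequential(stem, RobustBlock(64), RobustBlock(64))
print(net(torch.rand(1, 3, 224, 224)).shape)        # (1, 64, 28, 28)
```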
Fast Unsupervised Brain Anomaly Detection and Segmentation with Diffusion Models
Authors: Walter H. L. Pinaya, Mark S. Graham, Robert Gray, Pedro F Da Costa, Petru-Daniel Tudosiu, Paul Wright, Yee H. Mah, Andrew D. MacKinnon, James T. Teo, Rolf Jager, David Werring, Geraint Rees, Parashkev Nachev, Sebastien Ourselin, M. Jorge Cardoso
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Quantitative Methods (q-bio.QM)
Abstract
Deep generative models have emerged as promising tools for detecting arbitrary anomalies in data, dispensing with the necessity for manual labelling. Recently, autoregressive transformers have achieved state-of-the-art performance for anomaly detection in medical imaging. Nonetheless, these models still have some intrinsic weaknesses, such as requiring images to be modelled as 1D sequences, the accumulation of errors during the sampling process, and the significant inference times associated with transformers. Denoising diffusion probabilistic models are a class of non-autoregressive generative models recently shown to produce excellent samples in computer vision (surpassing Generative Adversarial Networks) and to achieve log-likelihoods that are competitive with transformers while having fast inference times. Diffusion models can be applied to the latent representations learnt by autoencoders, making them easily scalable and great candidates for application to high-dimensional data such as medical images. Here, we propose a method based on diffusion models to detect and segment anomalies in brain imaging. By training the models on healthy data and then exploring their diffusion and reverse steps across the Markov chain, we can identify anomalous areas in the latent space and hence identify anomalies in pixel space. Our diffusion models achieve competitive performance compared with autoregressive approaches across a series of experiments with 2D CT and MRI data involving synthetic and real pathological lesions, with much reduced inference times, making their usage clinically viable.
A COVID-19 Search Engine (CO-SE) with Transformer-based Architecture
Abstract
Coronavirus disease (COVID-19) is an infectious disease caused by the SARS-CoV-2 virus. Due to the growing literature on COVID-19, it is hard to obtain precise, up-to-date information about the virus. Practitioners, front-line workers, and researchers require expert-specific methods to stay current with scientific knowledge and research findings; however, so many research papers are being written on the subject that it is hard to keep up with the most recent research. This problem motivates us to propose the design of the COVID-19 Search Engine (CO-SE), an algorithmic system that finds relevant documents for each user query and answers complex questions by searching a large corpus of publications. The CO-SE has a retriever component, based on a TF-IDF vectorizer, that retrieves the relevant documents from the system. It also has a reader component, consisting of a Transformer-based model, that reads the paragraphs and finds the answers to the query in the retrieved documents. The proposed model outperforms previous models, obtaining an exact-match ratio score of 71.45% and a semantic answer similarity score of 78.55%. It also performs well on other benchmark datasets, demonstrating the generalizability of the proposed approach.
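A minimal sketch of the retriever-reader pattern described above: TF-IDF retrieval over a corpus, then a Transformer QA model reads the top hit. The QA checkpoint and the toy corpus below are generic placeholders, not the paper's.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import pipeline

corpus = ["COVID-19 is caused by the SARS-CoV-2 virus.",
          "Vaccines reduce the severity of infection.",
          "TF-IDF weighs terms by frequency and rarity."]
vec = TfidfVectorizer().fit(corpus)
doc_matrix = vec.transform(corpus)
reader = pipeline("question-answering",
                  model="distilbert-base-cased-distilled-squad")

def answer(query, k=1):
    sims = cosine_similarity(vec.transform([query]), doc_matrix)[0]
    top = sims.argsort()[::-1][:k]                # retriever: top-k documents
    context = " ".join(corpus[i] for i in top)    # reader sees only retrieved text
    return reader(question=query, context=context)

print(answer("What causes COVID-19?"))
```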
Keyword: autonomous driving
Physics and semantic informed multi-sensor calibration via optimization theory and self-supervised learning
Authors: Shmuel Y. Hayoun, Meir Halachmi, Doron Serebro, Kfir Twizer, Elinor Medezinski, Liron Korkidi, Moshik Cohen, Itai Orr
Abstract
Achieving safe and reliable autonomous driving depends heavily on an accurate and robust perception system, which cannot be fully realized without precisely calibrated sensors. Environmental and operational conditions, as well as improper maintenance, can produce calibration errors that inhibit sensor fusion and, consequently, degrade perception performance. Traditionally, sensor calibration is performed in a controlled environment with one or more known targets. Such a procedure can only be carried out between drives and requires manual operation, a tedious task if it must be conducted regularly. This has sparked recent interest in online targetless methods, capable of yielding a set of geometric transformations based on perceived environmental features. However, the required redundancy in sensing modalities makes this task even more challenging, as the features captured by each modality and their distinctiveness may vary. We present a holistic approach to performing joint calibration of a camera-lidar-radar trio. Leveraging prior knowledge and physical properties of these sensing modalities together with semantic information, we propose two targetless calibration methods within a cost-minimization framework: one via direct online optimization, and the other via self-supervised learning (SSL).
Driving in Real Life with Inverse Reinforcement Learning
Authors: Tung Phan-Minh, Forbes Howington, Ting-Sheng Chu, Sang Uk Lee, Momchil S. Tomov, Nanxiang Li, Caglayan Dicle, Samuel Findler, Francisco Suarez-Ruiz, Robert Beaudoin, Bo Yang, Sammy Omari, Eric M. Wolff
Abstract
In this paper, we introduce the first learning-based planner to drive a car in dense, urban traffic using Inverse Reinforcement Learning (IRL). Our planner, DriveIRL, generates a diverse set of trajectory proposals, filters these trajectories with a lightweight and interpretable safety filter, and then uses a learned model to score each remaining trajectory. The best trajectory is then tracked by the low-level controller of our self-driving vehicle. We train our trajectory scoring model on a 500+ hour real-world dataset of expert driving demonstrations in Las Vegas within the maximum entropy IRL framework. DriveIRL's benefits include: a simple design due to only learning the trajectory scoring function, relatively interpretable features, and strong real-world performance. We validated DriveIRL on the Las Vegas Strip and demonstrated fully autonomous driving in heavy traffic, including scenarios involving cut-ins, abrupt braking by the lead vehicle, and hotel pickup/dropoff zones. Our dataset will be made public to help further research in this area.
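The propose -> filter -> score pipeline can be sketched end to end with stand-ins for each stage (all toy: the real trajectory generator, safety rules, and IRL-trained scorer are the paper's; the heuristics below are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def propose_trajectories(n=50, horizon=20):
    # Toy proposals: cumulative headings and speeds yield diverse XY paths.
    steer = rng.normal(0, 0.05, (n, horizon)).cumsum(axis=1)
    step = 1.0 + rng.normal(0, 0.1, (n, horizon))
    return np.stack([(step * np.cos(steer)).cumsum(axis=1),
                     (step * np.sin(steer)).cumsum(axis=1)], axis=-1)

def safety_filter(trajs, obstacle=np.array([10.0, 4.0]), clearance=2.0):
    # Interpretable rule: drop proposals that pass too close to an obstacle.
    d = np.linalg.norm(trajs - obstacle, axis=-1).min(axis=1)
    return trajs[d > clearance]

def learned_score(trajs):
    # Stand-in for the IRL-trained scorer: prefer smooth, progressing paths.
    progress = trajs[:, -1, 0]
    jerk = np.abs(np.diff(trajs, n=2, axis=1)).sum(axis=(1, 2))
    return progress - 0.5 * jerk

candidates = safety_filter(propose_trajectories())
best = candidates[np.argmax(learned_score(candidates))]
print(f"{len(candidates)} safe proposals; best ends at {best[-1].round(2)}")
```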