Keyword: SLAM
Neural Rendering for Stereo 3D Reconstruction of Deformable Tissues in Robotic Surgery
Authors: Yuehao Wang, Yonghao Long, Siu Hin Fan, Qi Dou
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Reconstruction of soft tissues from endoscopic stereo videos in robotic surgery is important for many applications, such as intra-operative navigation and image-guided robotic surgery automation. Previous works on this task mainly rely on SLAM-based approaches, which struggle to handle complex surgical scenes. Inspired by recent progress in neural rendering, we present a novel framework for deformable tissue reconstruction from binocular captures in robotic surgery under the single-viewpoint setting. Our framework adopts dynamic neural radiance fields to represent deformable surgical scenes in MLPs and to optimize shapes and deformations in a learning-based manner. In addition to non-rigid deformations, tool occlusion and poor 3D cues from a single viewpoint pose particular challenges for soft tissue reconstruction. To overcome these difficulties, we present a series of strategies: tool mask-guided ray casting, stereo depth-cueing ray marching, and stereo depth-supervised optimization. In experiments on DaVinci robotic surgery videos, our method significantly outperforms the current state-of-the-art reconstruction method in handling various complex non-rigid deformations. To the best of our knowledge, this is the first work leveraging neural rendering for surgical scene 3D reconstruction, and it demonstrates remarkable potential. Code is available at: https://github.com/med-air/EndoNeRF.
Keyword: odometry
There is no result
Keyword: livox
There is no result
Keyword: loam
There is no result
Keyword: lidar
Boosting 3D Object Detection by Simulating Multimodality on Point Clouds
Authors: Wu Zheng, Mingxuan Hong, Li Jiang, Chi-Wing Fu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
This paper presents a new approach to boost a single-modality (LiDAR) 3D object detector by teaching it to simulate the features and responses of a multi-modality (LiDAR-image) detector. The approach needs LiDAR-image data only when training the single-modality detector; once trained, it only needs LiDAR data at inference. We design a novel framework to realize the approach: response distillation to focus on the crucial response samples and avoid the background samples; sparse-voxel distillation to learn voxel semantics and relations from the estimated crucial voxels; fine-grained voxel-to-point distillation to better attend to features of small and distant objects; and instance distillation to further enhance deep-feature consistency. Experimental results on the nuScenes dataset show that our approach outperforms all SOTA LiDAR-only 3D detectors and even surpasses the baseline LiDAR-image detector on the key NDS metric, filling 72% of the mAP gap between the single- and multi-modality detectors.
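As a rough illustration of the response-distillation idea described above, the sketch below matches student and teacher class distributions only on samples flagged as crucial; the mask, temperature, and tensor shapes are illustrative assumptions, not the paper's actual formulation.

```python
# Hedged sketch of response distillation restricted to "crucial" samples
# (mask, temperature and shapes are assumptions, not the paper's code).
import torch
import torch.nn.functional as F

def response_distillation_loss(student_logits, teacher_logits, crucial_mask, T=2.0):
    """student_logits, teacher_logits: (num_samples, num_classes);
    crucial_mask: boolean (num_samples,) selecting the crucial response samples."""
    s = F.log_softmax(student_logits[crucial_mask] / T, dim=-1)
    t = F.softmax(teacher_logits[crucial_mask] / T, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * (T * T)
```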
BoxGraph: Semantic Place Recognition and Pose Estimation from 3D LiDAR
Authors: Georgi Pramatarov, Daniele De Martini, Matthew Gadd, Paul Newman
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Abstract
This paper is about extremely robust and lightweight localisation using LiDAR point clouds based on instance segmentation and graph matching. We model 3D point clouds as fully-connected graphs of semantically identified components where each vertex corresponds to an object instance and encodes its shape. Optimal vertex association across graphs allows for full 6-Degree-of-Freedom (DoF) pose estimation and place recognition by measuring similarity. This representation is very concise, condensing the size of maps by a factor of 25 against the state-of-the-art, requiring only 3kB to represent a 1.4MB laser scan. We verify the efficacy of our system on the SemanticKITTI dataset, where we achieve a new state-of-the-art in place recognition, with an average of 88.4% recall at 100% precision where the next closest competitor follows with 64.9%. We also show accurate metric pose estimation performance - estimating 6-DoF pose with median errors of 10 cm and 0.33 deg.
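To make the vertex-association step concrete, here is a generic sketch that matches per-instance shape descriptors between two scans with the Hungarian algorithm; the descriptor format and the distance-based similarity score are assumptions for illustration, not the paper's exact matching formulation.

```python
# Generic sketch: optimal vertex association between two instance graphs via the
# Hungarian algorithm (descriptor format and scoring are illustrative assumptions).
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def match_vertices(desc_a, desc_b):
    """desc_a: (na, d), desc_b: (nb, d) per-instance shape descriptors.
    Returns matched index pairs and a similarity score (higher = more similar)."""
    cost = cdist(desc_a, desc_b)               # pairwise descriptor distances
    rows, cols = linear_sum_assignment(cost)   # optimal one-to-one assignment
    similarity = -cost[rows, cols].sum()
    return list(zip(rows.tolist(), cols.tolist())), similarity

pairs, score = match_vertices(np.random.rand(5, 8), np.random.rand(6, 8))
```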
HRFuser: A Multi-resolution Sensor Fusion Architecture for 2D Object Detection
Authors: Tim Broedermann (1), Christos Sakaridis (1), Dengxin Dai (2), Luc Van Gool (1 and 3) ((1) ETH Zurich, (2) MPI for Informatics, (3) KU Leuven)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract
Besides standard cameras, autonomous vehicles typically include multiple additional sensors, such as lidars and radars, which help acquire richer information for perceiving the content of the driving scene. While several recent works focus on fusing certain pairs of sensors - such as camera and lidar or camera and radar - by using architectural components specific to the examined setting, a generic and modular sensor fusion architecture is missing from the literature. In this work, we focus on 2D object detection, a fundamental high-level task which is defined on the 2D image domain, and propose HRFuser, a multi-resolution sensor fusion architecture that scales straightforwardly to an arbitrary number of input modalities. The design of HRFuser is based on state-of-the-art high-resolution networks for image-only dense prediction and incorporates a novel multi-window cross-attention block as the means to perform fusion of multiple modalities at multiple resolutions. Even though cameras alone provide very informative features for 2D detection, we demonstrate via extensive experiments on the nuScenes and Seeing Through Fog datasets that our model effectively leverages complementary features from additional modalities, substantially improving upon camera-only performance and consistently outperforming state-of-the-art fusion methods for 2D detection both in normal and adverse conditions. The source code will be made publicly available.
LiDAR-as-Camera for End-to-End Driving
Authors: Ardi Tampuu, Romet Aidla, Jan Are van Gent, Tambet Matiisen
Abstract
The core task of any autonomous driving system is to transform sensory inputs into driving commands. In end-to-end driving, this is achieved via a neural network, with one or multiple cameras as the most commonly used input and low-level driving commands, e.g. steering angle, as output. However, depth sensing has been shown in simulation to make the end-to-end driving task easier. On a real car, combining depth and visual information can be challenging due to the difficulty of obtaining good spatial and temporal alignment of the sensors. To alleviate alignment problems, Ouster LiDARs can output surround-view LiDAR images with depth, intensity, and ambient radiation channels. These measurements originate from the same sensor, rendering them perfectly aligned in time and space. We demonstrate that such LiDAR images are sufficient for the real-car road-following task and perform at least on par with camera-based models in the tested conditions, with the difference growing when generalization to new weather conditions is required. In a second line of study, we show that the temporal smoothness of off-policy prediction sequences correlates with actual on-policy driving ability as well as the commonly used mean absolute error does.
Keyword: loop detection
There is no result
Keyword: nerf
NeRF, meet differential geometry!
Authors: Thibaud Ehret, Roger Marí, Gabriele Facciolo
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Neural radiance fields, or NeRF, represent a breakthrough in the field of novel view synthesis and 3D modeling of complex scenes from multi-view image collections. Numerous recent works have been focusing on making the models more robust, by means of regularization, so as to be able to train with possibly inconsistent and/or very sparse data. In this work, we scratch the surface of how differential geometry can provide regularization tools for robustly training NeRF-like models, which are modified so as to represent continuous and infinitely differentiable functions. In particular, we show how these tools yield a direct mathematical formalism of previously proposed NeRF variants aimed at improving the performance in challenging conditions (i.e. RegNeRF). Based on this, we show how the same formalism can be used to natively encourage the regularity of surfaces (by means of Gaussian and Mean Curvatures) making it possible, for example, to learn surfaces from a very limited number of views.
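As a loose illustration of how surface regularity can be encouraged on an implicit field with autograd, the sketch below penalizes a mean-curvature proxy (the divergence of the unit normal field); it is a generic construction under assumed conventions, not the regularizers proposed in the paper.

```python
# Generic sketch (not the paper's formulation): penalize a mean-curvature proxy of
# an implicit field f by differentiating the unit normal field with autograd.
import torch

def mean_curvature_penalty(f, x):
    """f: callable mapping (N, 3) points to (N, 1) field values;
    x: (N, 3) sample points created with requires_grad=True."""
    grad = torch.autograd.grad(f(x).sum(), x, create_graph=True)[0]   # (N, 3)
    n = grad / (grad.norm(dim=-1, keepdim=True) + 1e-8)               # unit normals
    div = torch.zeros(x.shape[0], device=x.device)
    for i in range(3):  # divergence of n = sum_i d n_i / d x_i
        div = div + torch.autograd.grad(n[:, i].sum(), x, create_graph=True)[0][:, i]
    return (div ** 2).mean()  # small divergence of normals = low mean curvature
```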
Neural Rendering for Stereo 3D Reconstruction of Deformable Tissues in Robotic Surgery
Authors: Yuehao Wang, Yonghao Long, Siu Hin Fan, Qi Dou
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Reconstruction of soft tissues from endoscopic stereo videos in robotic surgery is important for many applications, such as intra-operative navigation and image-guided robotic surgery automation. Previous works on this task mainly rely on SLAM-based approaches, which struggle to handle complex surgical scenes. Inspired by recent progress in neural rendering, we present a novel framework for deformable tissue reconstruction from binocular captures in robotic surgery under the single-viewpoint setting. Our framework adopts dynamic neural radiance fields to represent deformable surgical scenes in MLPs and to optimize shapes and deformations in a learning-based manner. In addition to non-rigid deformations, tool occlusion and poor 3D cues from a single viewpoint pose particular challenges for soft tissue reconstruction. To overcome these difficulties, we present a series of strategies: tool mask-guided ray casting, stereo depth-cueing ray marching, and stereo depth-supervised optimization. In experiments on DaVinci robotic surgery videos, our method significantly outperforms the current state-of-the-art reconstruction method in handling various complex non-rigid deformations. To the best of our knowledge, this is the first work leveraging neural rendering for surgical scene 3D reconstruction, and it demonstrates remarkable potential. Code is available at: https://github.com/med-air/EndoNeRF.
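The sketch below gives a flavor of how tool-mask-guided ray sampling and stereo depth supervision could be combined in a NeRF-style training step; the batch keys, shapes, and loss weight are illustrative assumptions, not the authors' implementation (see the released code for the real thing).

```python
# Hedged sketch (not the authors' implementation): tool-mask-guided ray sampling
# plus stereo depth supervision in a NeRF-style training step.
import torch

def sample_rays_outside_tools(rays_o, rays_d, rgb, depth, tool_mask, n_rays=1024):
    """Pick random rays only at pixels not covered by the surgical-tool mask
    (tool_mask: boolean image, True on tool pixels)."""
    valid = (~tool_mask.reshape(-1)).nonzero(as_tuple=True)[0]
    pick = valid[torch.randint(0, valid.numel(), (n_rays,))]
    return (rays_o.reshape(-1, 3)[pick], rays_d.reshape(-1, 3)[pick],
            rgb.reshape(-1, 3)[pick], depth.reshape(-1)[pick])

def training_loss(render_fn, batch, lambda_depth=0.1):
    """render_fn: (rays_o, rays_d) -> (pred_rgb, pred_depth) from volume rendering."""
    ro, rd, gt_rgb, gt_depth = sample_rays_outside_tools(
        batch["rays_o"], batch["rays_d"], batch["rgb"],
        batch["stereo_depth"], batch["tool_mask"])
    pred_rgb, pred_depth = render_fn(ro, rd)
    loss_rgb = ((pred_rgb - gt_rgb) ** 2).mean()
    loss_depth = (pred_depth - gt_depth).abs().mean()  # supervise with stereo depth
    return loss_rgb + lambda_depth * loss_depth
```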
Keyword: mapping
Automated Wheat Disease Detection using a ROS-based Autonomous Guided UAV
Abstract
With the increase in world population, food resources have to be made more productive, resistant, and reliable. Wheat is one of the most important food resources in the world, mainly because of the variety of wheat-based products. Wheat crops are threatened by three main types of diseases, which cause substantial annual losses in crop yield. These diseases can be eliminated by using pesticides at the right time. While the task of manually spraying pesticides is burdensome and expensive, agricultural robotics can aid farmers by increasing speed and reducing the amount of chemicals used. In this work, a smart autonomous system has been implemented on an unmanned aerial vehicle to automate the task of monitoring wheat fields. First, an image-based deep learning approach is used to detect and classify disease-infected wheat plants; several approaches were studied to find the best-performing one. Because of the lack of a public wheat-disease dataset, a custom dataset has been created and labeled. Second, an efficient mapping and navigation system is presented using a simulation in the Robot Operating System and Gazebo environments. A 2D simultaneous localization and mapping algorithm maps the workspace autonomously with the help of a frontier-based exploration method.
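As a small illustration of the frontier-based exploration component, the sketch below marks free occupancy-grid cells that border unknown space; the grid encoding (-1 unknown, 0 free, 1 occupied) is an assumption for illustration, not the authors' implementation.

```python
# Illustrative sketch of frontier detection on a 2D occupancy grid
# (the -1/0/1 encoding is an assumption, not the authors' code).
import numpy as np

UNKNOWN, FREE, OCCUPIED = -1, 0, 1

def frontier_cells(grid):
    """Return (row, col) indices of free cells adjacent to unknown cells."""
    free = grid == FREE
    unknown = grid == UNKNOWN
    near_unknown = np.zeros_like(unknown)
    near_unknown[1:, :]  |= unknown[:-1, :]   # neighbour above is unknown
    near_unknown[:-1, :] |= unknown[1:, :]    # neighbour below is unknown
    near_unknown[:, 1:]  |= unknown[:, :-1]   # neighbour to the left is unknown
    near_unknown[:, :-1] |= unknown[:, 1:]    # neighbour to the right is unknown
    return np.argwhere(free & near_unknown)
```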
ZeroC: A Neuro-Symbolic Model for Zero-shot Concept Recognition and Acquisition at Inference Time
Authors: Tailin Wu, Megan Tjandrasuwita, Zhengxuan Wu, Xuelin Yang, Kevin Liu, Rok Sosič, Jure Leskovec
Abstract
Humans have the remarkable ability to recognize and acquire novel visual concepts in a zero-shot manner. Given a high-level, symbolic description of a novel concept in terms of previously learned visual concepts and their relations, humans can recognize novel concepts without seeing any examples. Moreover, they can acquire new concepts by parsing and communicating symbolic structures using learned visual concepts and relations. Endowing machines with these capabilities is pivotal for improving their generalization at inference time. In this work, we introduce Zero-shot Concept Recognition and Acquisition (ZeroC), a neuro-symbolic architecture that can recognize and acquire novel concepts in a zero-shot way. ZeroC represents concepts as graphs of constituent concept models (as nodes) and their relations (as edges). To allow inference-time composition, we employ energy-based models (EBMs) to model concepts and relations. We design the ZeroC architecture so that it allows a one-to-one mapping between the symbolic graph structure of a concept and its corresponding EBM, which, for the first time, allows acquiring new concepts, communicating their graph structures, and applying them to classification and detection tasks (even across domains) at inference time. We introduce algorithms for learning and inference with ZeroC. We evaluate ZeroC on a challenging grid-world dataset designed to probe zero-shot concept recognition and acquisition, and demonstrate its capability.
Colonoscopy Navigation using End-to-End Deep Visuomotor Control: A User Study
Authors: Ameya Pore, Martina Finocchiaro, Diego Dall'Alba, Albert Hernansanz, Gastone Ciuti, Alberto Arezzo, Arianna Menciassi, Alicia Casals, Paolo Fiorini
Abstract
Flexible endoscopes for colonoscopy present several limitations due to their inherent complexity, resulting in patient discomfort and a lack of intuitiveness for clinicians. Robotic devices together with autonomous control represent a viable solution to reduce the workload of endoscopists and the training time while improving the overall procedure outcome. Prior works on autonomous endoscope control use heuristic policies that limit their generalisation to the unstructured and highly deformable colon environment and require frequent human intervention. This work proposes an image-based control of the endoscope using Deep Reinforcement Learning, called Deep Visuomotor Control (DVC), to exhibit adaptive behaviour in convoluted sections of the colon tract. DVC learns a mapping between the endoscopic images and the control signal of the endoscope. A first user study with 20 expert gastrointestinal endoscopists was carried out to compare their navigation performance with DVC policies using a realistic virtual simulator. The results indicate that DVC shows equivalent performance on several assessment parameters while being safer. A second user study with 20 novice participants was performed to demonstrate easier human supervision compared to a state-of-the-art heuristic control policy. Seamless supervision of colonoscopy procedures would enable interventionists to focus on the medical decision rather than on the control problem of the endoscope.
Learning-Aided Beam Prediction in mmWave MU-MIMO Systems for High-Speed Railway
Authors: Fan Meng, Shengheng Liu, Yongming Huang, Zhaohua Lu
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Abstract
The problem of beam alignment and tracking in high-mobility scenarios such as high-speed railway (HSR) becomes extremely challenging, since large overhead costs and significant time delays are incurred in estimating the fast time-varying channel. To tackle this challenge, we propose a learning-aided beam prediction scheme for HSR networks, which predicts the beam directions and channel amplitudes within a period of future time at fine time granularity, using a group of observations. Concretely, we transform the problem of high-dimensional beam prediction into a two-stage task, i.e., a low-dimensional parameter estimation and a cascaded hybrid beamforming operation. In the first stage, the location and speed of a given terminal are estimated by the maximum likelihood criterion, and a data-driven data fusion module is designed to improve the final estimation accuracy and robustness. Then, the probable future beam directions and channel amplitudes are predicted based on HSR scenario priors, including the deterministic trajectory, motion model, and channel model. Furthermore, we incorporate a learnable non-linear mapping module into the overall beam prediction to allow for non-linear tracks. Both of the proposed learnable modules are model-based and have good interpretability. Compared to the existing beam management scheme, the proposed beam prediction has (near) zero overhead cost and time delay. Simulation results verify the effectiveness of the proposed scheme.
Revisiting Competitive Coding Approach for Palmprint Recognition: A Linear Discriminant Analysis Perspective
Authors: Lingfei Song, Hua Huang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract
The Competitive Coding approach (CompCode) is one of the most promising methods for palmprint recognition. Due to its high performance and simple formulation, it has been continuously studied for many years. However, although numerous variations of CompCode have been proposed, a detailed analysis of the method is still absent. In this paper, we provide a detailed analysis of CompCode from the perspective of linear discriminant analysis (LDA) for the first time. A non-trivial sufficient condition under which CompCode is optimal in the sense of Fisher's criterion is presented. Based on our analysis, we examine the statistics of palmprints and conclude that CompCode deviates from the optimal condition. To mitigate this deviation, we propose a new method called Class-Specific CompCode, which improves CompCode by excluding non-palm-line areas from matching. A nonlinear mapping of the competitive code is also applied in this method to further enhance accuracy. Experiments on two public databases demonstrate the effectiveness of the proposed method.
Bounding and computing obstacle numbers of graphs
Authors: Martin Balko, Steven Chaplick, Robert Ganian, Siddharth Gupta, Michael Hoffmann, Pavel Valtr, Alexander Wolff
Abstract
An obstacle representation of a graph $G$ consists of a set of pairwise disjoint simply-connected closed regions and a one-to-one mapping of the vertices of $G$ to points such that two vertices are adjacent in $G$ if and only if the line segment connecting the two corresponding points does not intersect any obstacle. The obstacle number of a graph is the smallest number of obstacles in an obstacle representation of the graph in the plane such that all obstacles are simple polygons. It is known that the obstacle number of each $n$-vertex graph is $O(n \log n)$ [Balko, Cibulka, and Valtr, 2018] and that there are $n$-vertex graphs whose obstacle number is $\Omega(n/(\log\log n)^2)$ [Dujmovi\'c and Morin, 2015]. We improve this lower bound to $\Omega(n/\log\log n)$ for simple polygons and to $\Omega(n)$ for convex polygons. To obtain these stronger bounds, we improve known estimates on the number of $n$-vertex graphs with bounded obstacle number, solving a conjecture by Dujmovi\'c and Morin. We also show that if the drawing of some $n$-vertex graph is given as part of the input, then for some drawings $\Omega(n^2)$ obstacles are required to turn them into an obstacle representation of the graph. Our bounds are asymptotically tight in several instances. We complement these combinatorial bounds by two complexity results. First, we show that computing the obstacle number of a graph $G$ is fixed-parameter tractable in the vertex cover number of $G$. Second, we show that, given a graph $G$ and a simple polygon $P$, it is NP-hard to decide whether $G$ admits an obstacle representation using $P$ as the only obstacle.
Keyword: localization
Exploring Temporally Dynamic Data Augmentation for Video Recognition
Authors: Taeoh Kim, Jinhyung Kim, Minho Shim, Sangdoo Yun, Myunggu Kang, Dongyoon Wee, Sangyoun Lee
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Data augmentation has recently emerged as an essential component of modern training recipes for visual recognition tasks. However, data augmentation for video recognition has been rarely explored despite its effectiveness. The few existing augmentation recipes for video recognition naively extend image augmentation methods by applying the same operations to all video frames. Our main idea is that the magnitude of augmentation operations for each frame needs to change over time to capture the temporal variations of real-world videos. These variations should be generated as diversely as possible while using few additional hyper-parameters during training. With this motivation, we propose a simple yet effective video data augmentation framework, DynaAugment. The magnitude of augmentation operations on each frame is varied by an effective mechanism, Fourier Sampling, that parameterizes diverse, smooth, and realistic temporal variations. DynaAugment also includes an extended search space suitable for video in automatic data augmentation methods. DynaAugment experimentally demonstrates that there is additional room for improvement over static augmentations on diverse video models. Specifically, we show the effectiveness of DynaAugment on various video datasets and tasks: large-scale video recognition (Kinetics-400 and Something-Something-v2), small-scale video recognition (UCF-101 and HMDB-51), fine-grained video recognition (Diving-48 and FineGym), video action segmentation on Breakfast, video action localization on THUMOS'14, and video object detection on MOT17Det. DynaAugment also enables video models to learn more generalized representations, improving model robustness on corrupted videos.
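To give a flavor of how smoothly varying per-frame magnitudes can be generated, the sketch below builds a magnitude curve from a few random sinusoids; it is a generic stand-in under assumed conventions, not the paper's Fourier Sampling procedure.

```python
# Generic stand-in for Fourier-style magnitude sampling (not the paper's exact
# procedure): a smooth per-frame augmentation-strength curve from random sinusoids.
import numpy as np

def smooth_magnitudes(num_frames, max_magnitude, num_terms=3, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    t = np.linspace(0.0, 1.0, num_frames)
    curve = np.zeros(num_frames)
    for k in range(1, num_terms + 1):
        amp, phase = rng.uniform(0, 1), rng.uniform(0, 2 * np.pi)
        curve += amp * np.sin(2 * np.pi * k * t + phase)
    curve = (curve - curve.min()) / (curve.max() - curve.min() + 1e-8)  # to [0, 1]
    return curve * max_magnitude

# e.g. a different brightness-jitter strength for each of 16 frames
per_frame_strength = smooth_magnitudes(num_frames=16, max_magnitude=0.5)
```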
Automated Wheat Disease Detection using a ROS-based Autonomous Guided UAV
Abstract
With the increase in world population, food resources have to be made more productive, resistant, and reliable. Wheat is one of the most important food resources in the world, mainly because of the variety of wheat-based products. Wheat crops are threatened by three main types of diseases, which cause substantial annual losses in crop yield. These diseases can be eliminated by using pesticides at the right time. While the task of manually spraying pesticides is burdensome and expensive, agricultural robotics can aid farmers by increasing speed and reducing the amount of chemicals used. In this work, a smart autonomous system has been implemented on an unmanned aerial vehicle to automate the task of monitoring wheat fields. First, an image-based deep learning approach is used to detect and classify disease-infected wheat plants; several approaches were studied to find the best-performing one. Because of the lack of a public wheat-disease dataset, a custom dataset has been created and labeled. Second, an efficient mapping and navigation system is presented using a simulation in the Robot Operating System and Gazebo environments. A 2D simultaneous localization and mapping algorithm maps the workspace autonomously with the help of a frontier-based exploration method.
Secure Heterogeneous Multi-Robot Collaboration and Docking with Hyperledger Fabric Blockchain
Abstract
In recent years, multi-robot systems have received increasing attention from both industry and academia. Besides the need for accurate and robust relative localization, security and trust in the system are essential to enable wider adoption. In this paper, we propose a framework using Hyperledger Fabric for multi-robot collaboration in industrial applications. We rely on blockchain identities for the interaction of ground and aerial robots, and use smart contracts for collaborative decision making. The use of ultra-wideband (UWB) localization for both autonomous navigation and robot collaboration extends our previous work on Fabric-based fleet management. We focus on an inventory management application in which a ground robot and an aerial robot inspect a warehouse-like environment and store information about the found objects in the blockchain. We measure the impact of adding the blockchain layer, analyze the transaction commit latency, and compare the resource utilization of blockchain-related processes to the already running data processing modules.
A Distributed Massive MIMO Channel Sounder for "Big CSI Data"-driven Machine Learning
Authors: Florian Euchner, Marc Gauger, Sebastian Dörner, Stephan ten Brink
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Abstract
A distributed massive MIMO channel sounder for acquiring large CSI datasets, dubbed DICHASUS, is presented. The measured data has potential applications in the study of various machine learning algorithms for user localization, JCAS, channel charting, enabling massive MIMO in FDD operation, and many others. The proposed channel sounder architecture is distinct from similar previous designs in that each individual single-antenna receiver is completely autonomous, enabling arbitrary, spatially distributed antenna deployments, and offering virtually unlimited scalability in the number of antennas. Optionally, extracted channel coefficient vectors can be tagged with ground truth position data, obtained either through a GNSS receiver (for outdoor operation) or through various indoor positioning techniques.
Keyword: transformer
Causality for Inherently Explainable Transformers: CAT-XPLAIN
Abstract
Several post-hoc explanation approaches have been developed to explain pre-trained black-box neural networks. However, there is still a gap in research efforts toward designing neural networks that are inherently explainable. In this paper, we utilize a recently proposed instance-wise post-hoc causal explanation method to make an existing transformer architecture inherently explainable. Once trained, our model provides an explanation in the form of the top-$k$ regions in the input space of the given instance that contribute to its decision. We evaluate our method on binary classification tasks using three image datasets: MNIST, FMNIST, and CIFAR. Our results demonstrate that, compared to the causality-based post-hoc explainer model, our inherently explainable model achieves better explainability while eliminating the need to train a separate explainer model. Our code is available at https://github.com/mvrl/CAT-XPLAIN.
Spatial Transformer Network with Transfer Learning for Small-scale Fine-grained Skeleton-based Tai Chi Action Recognition
Authors: Lin Yuan, Zhen He, Qiang Wang, Leiyang Xu, Xiang Ma
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Human action recognition is a heavily investigated area in which most notable action recognition networks use large-scale, coarse-grained datasets of daily human actions to demonstrate their superiority. We aim to recognize our small-scale, fine-grained Tai Chi action dataset using neural networks and propose a transfer-learning method that uses the NTU RGB+D dataset to pre-train our network. More specifically, the proposed method first uses the large-scale NTU RGB+D dataset to pre-train a Transformer-based action recognition network and extract features common to human motion. We then freeze the network weights except for the fully connected (FC) layer and use our Tai Chi actions as inputs to train only the re-initialized FC weights. Experimental results show that our general model pipeline reaches high accuracy on small-scale, fine-grained Tai Chi action recognition with only a few inputs, and demonstrate that our method achieves state-of-the-art performance compared with previous Tai Chi action recognition methods.
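The "freeze everything except the fully connected layer" recipe is simple to express in code; the sketch below uses a torchvision backbone as a stand-in for the Transformer-based network in the paper, and the class count is illustrative.

```python
# Sketch of freezing a pre-trained backbone and training only a new FC head
# (the torchvision ResNet and the class count are illustrative stand-ins).
import torch
import torch.nn as nn
from torchvision.models import resnet18

model = resnet18(weights="IMAGENET1K_V1")       # pre-trained stand-in backbone
for param in model.parameters():
    param.requires_grad = False                 # freeze all pre-trained weights
model.fc = nn.Linear(model.fc.in_features, 10)  # new trainable head (10 classes, illustrative)

# Only the new head's parameters are handed to the optimizer.
optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-3, momentum=0.9)
```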
Compressing Pre-trained Transformers via Low-Bit NxM Sparsity for Natural Language Understanding
Authors: Connor Holmes, Minjia Zhang, Yuxiong He, Bo Wu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Abstract
In recent years, large pre-trained Transformer networks have demonstrated dramatic improvements in many natural language understanding tasks. However, the huge size of these models brings significant challenges to their fine-tuning and online deployment due to latency and cost constraints. New hardware supporting both N:M semi-structured sparsity and low-precision integer computation is a promising solution to boost DNN model serving efficiency. However, there have been very few studies that systematically investigate to what extent pre-trained Transformer networks benefit from the combination of these techniques, as well as how to best compress each component of the Transformer. We propose a flexible compression framework, NxMiFormer, that performs simultaneous sparsification and quantization using ADMM and STE-based QAT. Furthermore, we present an inexpensive, heuristic-driven search algorithm that identifies promising heterogeneous compression configurations meeting a compression ratio constraint. When evaluated across the GLUE suite of NLU benchmarks, our approach can achieve up to 93% compression of the encoders of a BERT model while retaining 98.2% of the original model accuracy and taking full advantage of the hardware's capabilities. Heterogeneous configurations found by the search heuristic maintain 99.5% of the baseline accuracy while still compressing the model by 87.5%.
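For readers unfamiliar with N:M semi-structured sparsity, here is a small, generic sketch that keeps the N largest-magnitude weights in every group of M consecutive weights (e.g. the 2:4 pattern supported by recent hardware); it illustrates the sparsity pattern only, not the NxMiFormer ADMM/QAT procedure.

```python
# Generic sketch of an N:M (e.g. 2:4) semi-structured sparsity mask
# (not the NxMiFormer compression procedure itself).
import torch

def nm_sparsity_mask(weight, n=2, m=4):
    """Keep the n largest-magnitude entries in each group of m consecutive
    weights along the input dimension of a (out_features, in_features) matrix."""
    out_f, in_f = weight.shape
    assert in_f % m == 0, "in_features must be divisible by m"
    groups = weight.abs().reshape(out_f, in_f // m, m)
    keep = groups.topk(n, dim=-1).indices
    mask = torch.zeros_like(groups)
    mask.scatter_(-1, keep, 1.0)
    return mask.reshape(out_f, in_f)

w = torch.randn(8, 16)
w_24 = w * nm_sparsity_mask(w, n=2, m=4)   # 50% of weights zeroed in a 2:4 pattern
```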
Language Model-Based Emotion Prediction Methods for Emotional Speech Synthesis Systems
Abstract
This paper proposes an effective emotional text-to-speech (TTS) system with a pre-trained language model (LM)-based emotion prediction method. Unlike conventional systems that require auxiliary inputs such as manually defined emotion classes, our system directly estimates emotion-related attributes from the input text. Specifically, we utilize the generative pre-trained transformer GPT-3 to jointly predict both an emotion class and its strength, representing the coarse and fine properties of emotions, respectively. These attributes are then combined in the emotional embedding space and used as conditional features of the TTS model for generating output speech signals. Consequently, the proposed system can produce emotional speech from text alone, without any auxiliary inputs. Furthermore, because GPT-3 can capture emotional context across consecutive sentences, the proposed method can effectively handle paragraph-level generation of emotional speech.
esCorpius: A Massive Spanish Crawling Corpus
Authors: Asier Gutiérrez-Fandiño, David Pérez-Fernández, Jordi Armengol-Estapé, David Griol, Zoraida Callejas
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Abstract
In recent years, transformer-based models have led to significant advances in language modelling for natural language processing. However, they require vast amounts of data to be (pre-)trained, and there is a lack of corpora in languages other than English. Recently, several initiatives have presented multilingual datasets obtained from automatic web crawling. However, the results in Spanish present important shortcomings, as they are either too small in comparison with other languages, or of low quality due to sub-optimal cleaning and deduplication. In this paper, we introduce \textsc{esCorpius}, a Spanish crawling corpus obtained from nearly 1 PB of Common Crawl data. It is the most extensive corpus in Spanish with this level of quality in the extraction, purification and deduplication of web textual content. Our data curation process involves a novel, highly parallel cleaning pipeline and encompasses a series of deduplication mechanisms that together ensure the integrity of both document and paragraph boundaries. Additionally, we maintain both the source web page URL and the WARC shard origin URL in order to comply with EU regulations. \textsc{esCorpius} has been released under a CC BY-NC-ND 4.0 license and is available on HuggingFace.
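As a toy illustration of paragraph-level deduplication that preserves document and paragraph boundaries, the sketch below keeps only the first occurrence of each normalized paragraph; the real esCorpius pipeline is far more elaborate.

```python
# Toy paragraph-level deduplication sketch (the real pipeline is far more elaborate).
import hashlib
import re

def dedup_paragraphs(documents):
    """Keep the first occurrence of each normalized paragraph across the corpus."""
    seen, cleaned = set(), []
    for doc in documents:
        kept = []
        for para in doc.split("\n\n"):
            norm = re.sub(r"\s+", " ", para).strip().lower()
            key = hashlib.sha1(norm.encode("utf-8")).hexdigest()
            if norm and key not in seen:
                seen.add(key)
                kept.append(para)
        cleaned.append("\n\n".join(kept))
    return cleaned
```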
The Topological BERT: Transforming Attention into Topology for Natural Language Processing
Authors: Ilan Perez, Raphael Reinauer
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Algebraic Topology (math.AT)
Abstract
In recent years, the introduction of Transformer models sparked a revolution in natural language processing (NLP). BERT was one of the first text encoders using only the attention mechanism, without any recurrent parts, to achieve state-of-the-art results on many NLP tasks. This paper introduces a text classifier using topological data analysis. We use BERT's attention maps, transformed into attention graphs, as the only input to that classifier. The model can solve tasks such as distinguishing spam from ham messages, recognizing whether a sentence is grammatically correct, or evaluating a movie review as negative or positive. It performs comparably to the BERT baseline and outperforms it on some tasks. Additionally, we propose a new method to reduce the number of BERT's attention heads considered by the topological classifier, which allows us to prune the number of heads from 144 down to as few as ten with no reduction in performance. Our work also shows that the topological model displays higher robustness against adversarial attacks than the original BERT model, and that this robustness is maintained during pruning. To the best of our knowledge, this work is the first to confront topology-based models with adversarial attacks in the context of NLP.
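As a rough sketch of how an attention map can be turned into a graph for downstream topological analysis, the snippet below symmetrizes one head's attention matrix and thresholds it into a weighted token graph; the threshold and symmetrization are illustrative choices, not the paper's construction.

```python
# Rough sketch: one attention head's map -> weighted token graph
# (threshold and symmetrization are illustrative choices, not the paper's).
import numpy as np
import networkx as nx

def attention_to_graph(attn, threshold=0.1):
    """attn: (seq_len, seq_len) row-stochastic attention matrix."""
    g = nx.Graph()
    seq_len = attn.shape[0]
    g.add_nodes_from(range(seq_len))
    for i in range(seq_len):
        for j in range(i + 1, seq_len):
            w = float(max(attn[i, j], attn[j, i]))   # symmetrize the two directions
            if w >= threshold:
                g.add_edge(i, j, weight=w)
    return g

g = attention_to_graph(np.random.dirichlet(np.ones(8), size=8))
```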
ListBERT: Learning to Rank E-commerce products with Listwise BERT
Abstract
Efficient search is a critical component of an e-commerce platform with a vast number of products. Every day, millions of users search for products pertaining to their needs. Thus, showing the relevant products at the top enhances the user experience. In this work, we propose a novel approach of fusing a transformer-based model with various listwise loss functions for ranking e-commerce products, given a user query. We pre-train a RoBERTa model over a fashion e-commerce corpus and fine-tune it using different listwise loss functions. Our experiments indicate that the RoBERTa model fine-tuned with an NDCG-based surrogate loss function (approxNDCG) achieves an NDCG improvement of 13.9% compared to other popular listwise loss functions like ListNET and ListMLE. The approxNDCG-based RoBERTa model also achieves an NDCG improvement of 20.6% compared to the pairwise RankNet-based RoBERTa model. We refer to this methodology of directly optimizing the RoBERTa model end-to-end with a listwise surrogate loss function as ListBERT. Since real-time search has a low-latency requirement, we also show how these models can be easily adopted by using a knowledge distillation technique to learn a representation-focused student model that is easy to deploy and yields roughly 10 times lower ranking latency.
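To make the approxNDCG surrogate concrete, here is a minimal single-query sketch in which each document's rank is smoothed with a sigmoid; the temperature and gain definition are common conventions assumed here, not necessarily those used in the paper.

```python
# Minimal single-query approxNDCG surrogate sketch (temperature and gain
# definition are common conventions assumed here, not necessarily the paper's).
import torch

def approx_ndcg_loss(scores, relevance, temperature=0.1):
    """scores: (n_docs,) model scores; relevance: (n_docs,) graded labels."""
    diff = scores.unsqueeze(1) - scores.unsqueeze(0)        # diff[i, j] = s_i - s_j
    # Smoothed rank of doc j: 0.5 + sum_i sigmoid((s_i - s_j) / T)
    approx_rank = 0.5 + torch.sigmoid(diff / temperature).sum(dim=0)
    gains = 2.0 ** relevance - 1.0
    dcg = (gains / torch.log2(1.0 + approx_rank)).sum()
    ideal_gains, _ = torch.sort(gains, descending=True)
    ideal_ranks = torch.arange(1, scores.numel() + 1, dtype=scores.dtype)
    idcg = (ideal_gains / torch.log2(1.0 + ideal_ranks)).sum()
    return 1.0 - dcg / idcg.clamp(min=1e-8)  # minimizing this maximizes NDCG
```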
CTrGAN: Cycle Transformers GAN for Gait Transfer
Authors: Shahar Mahpod, Noam Gaash, G. Ben-Artzi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
We attempt for the first time to address the problem of gait transfer. In contrast to motion transfer, the objective here is not to imitate the source's normal motions, but rather to transform the source's motion into a typical gait pattern for the target. Using gait recognition models, we demonstrate that existing techniques yield a discrepancy that can be easily detected. We introduce a novel model, Cycle Transformers GAN (CTrGAN), that can successfully generate the target's natural gait. CTrGAN's generators consist of a decoder and encoder, both Transformers, where the attention is on the temporal domain between complete images rather than the spatial domain between patches. While recent Transformer studies in computer vision mainly focused on discriminative tasks, we introduce an architecture that can be applied to synthesis tasks. Using a widely-used gait recognition dataset, we demonstrate that our approach is capable of producing over an order of magnitude more realistic personalized gaits than existing methods, even when used with sources that were not available during training.
Deep Reinforcement Learning with Swin Transformer
Authors: Li Meng, Morten Goodwin, Anis Yazidi, Paal Engelstad
Abstract
Transformers are neural network models that utilize multiple layers of self-attention heads. Attention is implemented in transformers as the contextual embeddings of the 'key' and 'query'. Transformers allow the re-combination of attention information from different layers and the processing of all inputs at once, which is more convenient than recurrent neural networks when dealing with large amounts of data. Transformers have exhibited great performance on natural language processing tasks in recent years. Meanwhile, there have been tremendous efforts to adapt transformers to other fields of machine learning, such as Swin Transformer and Decision Transformer. Swin Transformer is a promising neural network architecture that splits image pixels into small patches and applies local self-attention operations inside (shifted) windows of fixed sizes. Decision Transformer has successfully applied transformers to off-line reinforcement learning and showed that random-walk samples from Atari games are sufficient to let an agent learn optimized behaviors. However, it is considerably more challenging to combine online reinforcement learning with transformers. In this article, we further explore the possibility of not modifying the reinforcement learning policy, but only replacing the convolutional neural network architecture with the self-attention architecture from Swin Transformer. Namely, we target changing how an agent views the world, not how it plans about the world. We conduct our experiments on 49 games in the Arcade Learning Environment. The results show that using Swin Transformer in reinforcement learning achieves significantly higher evaluation scores across the majority of games in the Arcade Learning Environment. Thus, we conclude that online reinforcement learning can benefit from exploiting self-attention with spatial token embeddings.
FL-Tuning: Layer Tuning for Feed-Forward Network in Transformer
Abstract
Prompt tuning is an emerging way of adapting pre-trained language models to downstream tasks. However, existing studies mainly add prompts to the input sequence. This may not work as expected because of the intermediate multi-head self-attention and feed-forward network computations, which make model optimization less smooth. Hence, we propose a novel tuning method called layer tuning, which adds learnable parameters inside Transformer layers. Specifically, we focus on layer tuning for the feed-forward network in the Transformer, namely FL-tuning. It introduces additional units into the hidden layer of each feed-forward network. We conduct extensive experiments on the public CLUE benchmark. The results show that: 1) our FL-tuning outperforms prompt tuning methods under both full-data and few-shot settings in almost all cases; in particular, it improves accuracy by 17.93% (full-data setting) on WSC 1.0 and F1 by 16.142% (few-shot setting) on CLUENER over P-tuning v2; 2) our FL-tuning is more stable and converges about 1.17 times faster than P-tuning v2; 3) with only about 3% of the Transformer's parameters to be trained, FL-tuning is comparable with fine-tuning on most datasets and significantly outperforms fine-tuning (e.g., accuracy improved by 12.9% on WSC 1.1) on several datasets. The source code is available at https://github.com/genggui001/FL-Tuning.
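One plausible reading of "introducing additional units into the hidden layer of each feed-forward network" is a small trainable branch added next to the frozen pre-trained FFN, as sketched below; the dimensions and the parallel-branch form are assumptions for illustration, not the authors' exact formulation.

```python
# Sketch of FFN layer tuning: a small trainable branch of extra hidden units is
# added to a frozen pre-trained feed-forward block (sizes and the parallel-branch
# form are illustrative assumptions, not the authors' exact formulation).
import torch.nn as nn

class FFNWithExtraUnits(nn.Module):
    def __init__(self, d_model=768, d_ff=3072, d_extra=64):
        super().__init__()
        self.up, self.down = nn.Linear(d_model, d_ff), nn.Linear(d_ff, d_model)
        for p in list(self.up.parameters()) + list(self.down.parameters()):
            p.requires_grad = False                     # pre-trained path stays frozen
        self.up_extra = nn.Linear(d_model, d_extra)     # new trainable hidden units
        self.down_extra = nn.Linear(d_extra, d_model)
        self.act = nn.GELU()

    def forward(self, x):
        frozen = self.down(self.act(self.up(x)))
        extra = self.down_extra(self.act(self.up_extra(x)))
        return frozen + extra   # extra units contribute alongside the frozen FFN
```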
PolarFormer: Multi-camera 3D Object Detection with Polar Transformer
Authors: Yanqin Jiang, Li Zhang, Zhenwei Miao, Xiatian Zhu, Jin Gao, Weiming Hu, Yu-Gang Jiang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract
3D object detection in autonomous driving aims to reason about "what" and "where" the objects of interest are in a 3D world. Following the conventional wisdom of 2D object detection, existing methods often adopt the canonical Cartesian coordinate system with perpendicular axes. However, we conjecture that this does not fit the nature of the ego car's perspective, as each onboard camera perceives the world in the shape of a wedge intrinsic to the imaging geometry, with a radial (non-perpendicular) axis. Hence, in this paper we advocate the exploitation of the Polar coordinate system and propose a new Polar Transformer (PolarFormer) for more accurate 3D object detection in the bird's-eye view (BEV), taking only multi-camera 2D images as input. Specifically, we design a cross-attention-based Polar detection head without restrictions on the shape of the input structure to deal with irregular Polar grids. To tackle the unconstrained object scale variations along Polar's distance dimension, we further introduce a multi-scale Polar representation learning strategy. As a result, our model can make the best use of the Polar representation rasterized by attending to the corresponding image observations in a sequence-to-sequence fashion, subject to geometric constraints. Thorough experiments on the nuScenes dataset demonstrate that our PolarFormer significantly outperforms state-of-the-art 3D object detection alternatives, while also yielding competitive performance on the BEV semantic segmentation task.
Keyword: autonomous driving
Cross-domain Federated Object Detection
Authors: Shangchao Su, Bin Li, Chengzhi Zhang, Mingzhao Yang, Xiangyang Xue
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Detection models trained by one party (server) may face severe performance degradation when distributed to other users (clients). For example, in autonomous driving scenarios, different driving environments may bring obvious domain shifts, which lead to biases in model predictions. Federated learning, which has emerged in recent years, can enable multi-party collaborative training without leaking client data. In this paper, we focus on a special cross-domain scenario where the server contains large-scale data and multiple clients only contain a small amount of data; meanwhile, there exist differences in data distributions among the clients. In this case, traditional federated learning techniques cannot account for both the global knowledge of all participants and the personalized knowledge of a specific client. To make up for this limitation, we propose a cross-domain federated object detection framework, named FedOD. In order to learn both the global knowledge and the personalized knowledge in different domains, the proposed framework first performs federated training to obtain a public global aggregated model through multi-teacher distillation, and sends the aggregated model back to each client to fine-tune its personalized local model. After only a few rounds of communication, each client can perform weighted ensemble inference on the public global model and the personalized local model. With the ensemble, the generalization performance of the client-side model can outperform a single model with the same parameter scale. We establish a federated object detection dataset with significant background and instance differences based on multiple public autonomous driving datasets, and then conduct extensive experiments on it. The experimental results validate the effectiveness of the proposed method.
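The client-side ensembling step can be as simple as a weighted combination of the two models' outputs; the sketch below assumes both models return per-image class score tensors of the same shape and uses an illustrative mixing weight, which is a simplification of detection-style outputs.

```python
# Sketch of client-side weighted ensemble inference (alpha and the assumption that
# both models return same-shaped score tensors are illustrative simplifications).
import torch

@torch.no_grad()
def ensemble_scores(global_model, local_model, images, alpha=0.5):
    # Blend predictions of the public global model and the personalized local model.
    return alpha * global_model(images) + (1.0 - alpha) * local_model(images)
```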
LiDAR-as-Camera for End-to-End Driving
Authors: Ardi Tampuu, Romet Aidla, Jan Are van Gent, Tambet Matiisen
Abstract
The core task of any autonomous driving system is to transform sensory inputs into driving commands. In end-to-end driving, this is achieved via a neural network, with one or multiple cameras as the most commonly used input and low-level driving commands, e.g. steering angle, as output. However, depth sensing has been shown in simulation to make the end-to-end driving task easier. On a real car, combining depth and visual information can be challenging due to the difficulty of obtaining good spatial and temporal alignment of the sensors. To alleviate alignment problems, Ouster LiDARs can output surround-view LiDAR images with depth, intensity, and ambient radiation channels. These measurements originate from the same sensor, rendering them perfectly aligned in time and space. We demonstrate that such LiDAR images are sufficient for the real-car road-following task and perform at least on par with camera-based models in the tested conditions, with the difference growing when generalization to new weather conditions is required. In a second line of study, we show that the temporal smoothness of off-policy prediction sequences correlates with actual on-policy driving ability as well as the commonly used mean absolute error does.
PolarFormer: Multi-camera 3D Object Detection with Polar Transformer
Authors: Yanqin Jiang, Li Zhang, Zhenwei Miao, Xiatian Zhu, Jin Gao, Weiming Hu, Yu-Gang Jiang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract
3D object detection in autonomous driving aims to reason about "what" and "where" the objects of interest are in a 3D world. Following the conventional wisdom of 2D object detection, existing methods often adopt the canonical Cartesian coordinate system with perpendicular axes. However, we conjecture that this does not fit the nature of the ego car's perspective, as each onboard camera perceives the world in the shape of a wedge intrinsic to the imaging geometry, with a radial (non-perpendicular) axis. Hence, in this paper we advocate the exploitation of the Polar coordinate system and propose a new Polar Transformer (PolarFormer) for more accurate 3D object detection in the bird's-eye view (BEV), taking only multi-camera 2D images as input. Specifically, we design a cross-attention-based Polar detection head without restrictions on the shape of the input structure to deal with irregular Polar grids. To tackle the unconstrained object scale variations along Polar's distance dimension, we further introduce a multi-scale Polar representation learning strategy. As a result, our model can make the best use of the Polar representation rasterized by attending to the corresponding image observations in a sequence-to-sequence fashion, subject to geometric constraints. Thorough experiments on the nuScenes dataset demonstrate that our PolarFormer significantly outperforms state-of-the-art 3D object detection alternatives, while also yielding competitive performance on the BEV semantic segmentation task.
Shifts 2.0: Extending The Dataset of Real Distributional Shifts
Authors: Andrey Malinin, Andreas Athanasopoulos, Muhamed Barakovic, Meritxell Bach Cuadra, Mark J. F. Gales, Cristina Granziera, Mara Graziani, Nikolay Kartashev, Konstantinos Kyriakopoulos, Po-Jui Lu, Nataliia Molchanova, Antonis Nikitakis, Vatsal Raina, Francesco La Rosa, Eli Sivena, Vasileios Tsarsitalidis, Efi Tsompopoulou, Elena Volf
Abstract
Distributional shift, or the mismatch between training and deployment data, is a significant obstacle to the usage of machine learning in high-stakes industrial applications, such as autonomous driving and medicine. This creates a need to be able to assess how robustly ML models generalize as well as the quality of their uncertainty estimates. Standard ML baseline datasets do not allow these properties to be assessed, as the training, validation and test data are often identically distributed. Recently, a range of dedicated benchmarks have appeared, featuring both distributionally matched and shifted data. Among these benchmarks, the Shifts dataset stands out in terms of the diversity of tasks as well as the data modalities it features. While most of the benchmarks are heavily dominated by 2D image classification tasks, Shifts contains tabular weather forecasting, machine translation, and vehicle motion prediction tasks. This enables the robustness properties of models to be assessed on a diverse set of industrial-scale tasks and either universal or directly applicable task-specific conclusions to be reached. In this paper, we extend the Shifts Dataset with two datasets sourced from industrial, high-risk applications of high societal importance. Specifically, we consider the tasks of segmentation of white matter Multiple Sclerosis lesions in 3D magnetic resonance brain images and the estimation of power consumption in marine cargo vessels. Both tasks feature ubiquitous distributional shifts and a strict safety requirement due to the high cost of errors. These new datasets will allow researchers to further explore robust generalization and uncertainty estimation in new situations. In this work, we provide a description of the dataset and baseline results for both tasks.
New submissions for Fri, 1 Jul 22
Keyword: SLAM
Neural Rendering for Stereo 3D Reconstruction of Deformable Tissues in Robotic Surgery
Keyword: odometry
There is no result
Keyword: livox
There is no result
Keyword: loam
There is no result
Keyword: lidar
Boosting 3D Object Detection by Simulating Multimodality on Point Clouds
BoxGraph: Semantic Place Recognition and Pose Estimation from 3D LiDAR
HRFuser: A Multi-resolution Sensor Fusion Architecture for 2D Object Detection
LiDAR-as-Camera for End-to-End Driving
Keyword: loop detection
There is no result
Keyword: nerf
NeRF, meet differential geometry!
Neural Rendering for Stereo 3D Reconstruction of Deformable Tissues in Robotic Surgery
Keyword: mapping
Automated Wheat Disease Detection using a ROS-based Autonomous Guided UAV
ZeroC: A Neuro-Symbolic Model for Zero-shot Concept Recognition and Acquisition at Inference Time
Colonoscopy Navigation using End-to-End Deep Visuomotor Control: A User Study
Learning-Aided Beam Prediction in mmWave MU-MIMO Systems for High-Speed Railway
Revisiting Competitive Coding Approach for Palmprint Recognition: A Linear Discriminant Analysis Perspective
Bounding and computing obstacle numbers of graphs
Keyword: localization
Exploring Temporally Dynamic Data Augmentation for Video Recognition
Automated Wheat Disease Detection using a ROS-based Autonomous Guided UAV
Secure Heterogeneous Multi-Robot Collaboration and Docking with Hyperledger Fabric Blockchain
A Distributed Massive MIMO Channel Sounder for "Big CSI Data"-driven Machine Learning
Keyword: transformer
Causality for Inherently Explainable Transformers: CAT-XPLAIN
Spatial Transformer Network with Transfer Learning for Small-scale Fine-grained Skeleton-based Tai Chi Action Recognition
Compressing Pre-trained Transformers via Low-Bit NxM Sparsity for Natural Language Understanding
Language Model-Based Emotion Prediction Methods for Emotional Speech Synthesis Systems
esCorpius: A Massive Spanish Crawling Corpus
The Topological BERT: Transforming Attention into Topology for Natural Language Processing
ListBERT: Learning to Rank E-commerce products with Listwise BERT
CTrGAN: Cycle Transformers GAN for Gait Transfer
Deep Reinforcement Learning with Swin Transformer
FL-Tuning: Layer Tuning for Feed-Forward Network in Transformer
PolarFormer: Multi-camera 3D Object Detection with Polar Transformer
Keyword: autonomous driving
Cross-domain Federated Object Detection
LiDAR-as-Camera for End-to-End Driving
PolarFormer: Multi-camera 3D Object Detection with Polar Transformer
Shifts 2.0: Extending The Dataset of Real Distributional Shifts