New submissions for Wed, 15 Jun 22
Keyword: SLAM
ICP Algorithm: Theory, Practice And Its SLAM-oriented Taxonomy
Authors: Hao Bai
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
The Iterative Closest Point (ICP) algorithm is one of the most important algorithms for the geometric alignment of three-dimensional surfaces and is frequently used in computer vision tasks, including Simultaneous Localization And Mapping (SLAM). In this paper, we illustrate the theoretical principles of the ICP algorithm, show how it can be used in surface registration tasks, and review the traditional taxonomy of ICP variants. As SLAM is becoming a popular topic, we also introduce a SLAM-oriented taxonomy of the ICP algorithm, based on the characteristics of each type of SLAM task, including whether the task is online and whether landmarks are present as features. For each type of SLAM task, we provide a synthesis by comparing several up-to-date research papers and analyzing their implementation details.
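As a concrete illustration of the alignment the abstract refers to, here is a minimal point-to-point ICP sketch in NumPy; the brute-force nearest-neighbour search, SVD-based pose estimate, and convergence test are generic textbook choices, not the specific variants surveyed in the paper.

```python
import numpy as np

def best_rigid_transform(src, dst):
    """Least-squares rotation R and translation t mapping src onto dst (Kabsch/SVD)."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:          # avoid reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, mu_d - R @ mu_s

def icp(source, target, iters=50, tol=1e-6):
    """Align `source` (Nx3) to `target` (Mx3) with point-to-point ICP."""
    src = source.copy()
    prev_err = np.inf
    for _ in range(iters):
        # 1. correspondences: brute-force nearest neighbour in the target cloud
        d = np.linalg.norm(src[:, None, :] - target[None, :, :], axis=2)
        nn = target[d.argmin(axis=1)]
        # 2. best rigid transform for the current correspondences
        R, t = best_rigid_transform(src, nn)
        src = src @ R.T + t
        # 3. stop once the mean residual no longer improves
        err = np.linalg.norm(src - nn, axis=1).mean()
        if abs(prev_err - err) < tol:
            break
        prev_err = err
    return src
```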
Keyword: odometry
There is no result
Keyword: livox
There is no result
Keyword: loam
There is no result
Keyword: lidar
CVR-LSE: Compact Vectorization Representation of Local Static Environments for Unmanned Ground Vehicles
Abstract
To meet the requirements of general static obstacle detection, this paper proposes a compact vectorization representation approach for the local static environments of unmanned ground vehicles. First, high-frequency pose information is obtained by fusing LiDAR and IMU data. Then, two-dimensional (2D) obstacle points are generated and maintained in a fixed-size grid map. Finally, the local static environment is described via multiple convex polygons, which is realized through double-threshold-based boundary simplification and convex polygon segmentation. The proposed approach has been applied in a practical driverless project in a park, and qualitative experimental results on typical scenes verify its effectiveness and robustness. In addition, quantitative evaluation shows that it represents the local static environment with about 60% fewer points than traditional grid-map-based methods. Furthermore, the running time (15 ms) shows that the proposed approach can be used for real-time local static environment perception. The corresponding code can be accessed at https://github.com/ghm0819/cvr_lse.
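To make the vectorization idea concrete, the following is a small hypothetical sketch of the final stage: connected blobs of occupied cells in a 2D grid are each replaced by their convex hull, so the environment is stored as a few polygons instead of many cells. The grid resolution, the connected-component clustering, and the use of SciPy's ConvexHull are illustrative assumptions; the paper's double-threshold boundary simplification and polygon segmentation are more elaborate.

```python
import numpy as np
from scipy.ndimage import label
from scipy.spatial import ConvexHull

def grid_to_polygons(occupancy, resolution=0.1):
    """Replace each connected blob of occupied cells by a convex polygon (world coordinates)."""
    labels, n = label(occupancy)             # connected components of occupied cells
    polygons = []
    for k in range(1, n + 1):
        rows, cols = np.nonzero(labels == k)
        pts = np.column_stack([cols, rows]).astype(float) * resolution
        if len(pts) < 3:                      # a hull needs at least 3 points
            continue
        hull = ConvexHull(pts)
        polygons.append(pts[hull.vertices])   # ordered hull vertices of the blob
    return polygons

# toy occupancy grid with two obstacles
grid = np.zeros((40, 40), dtype=bool)
grid[5:10, 5:12] = True
grid[25:32, 20:23] = True
print([p.shape for p in grid_to_polygons(grid)])
```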
Monitoring Urban Forests from Auto-Generated Segmentation Maps
Authors: Conrad M Albrecht, Chenying Liu, Yi Wang, Levente Klein, Xiao Xiang Zhu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Abstract
We present and evaluate a weakly supervised methodology to quantify the spatio-temporal distribution of urban forests from remotely sensed data with close-to-zero human interaction. Successfully training machine learning models for semantic segmentation typically depends on the availability of high-quality labels. We evaluate the benefit of high-resolution, three-dimensional point cloud data (LiDAR) as a source of noisy labels for training models that localize trees in orthophotos. As a proof of concept, we assess Hurricane Sandy's impact on urban forests in Coney Island, New York City (NYC) and reference it to less impacted urban space in Brooklyn, NYC.
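A minimal sketch of how LiDAR could supply such noisy labels, under the simplifying assumption that a canopy height model (CHM) raster aligned with the orthophoto is available: pixels above a height threshold become positive "tree" labels for training a segmentation model. The threshold and the CHM input are hypothetical choices, not the paper's exact pipeline.

```python
import numpy as np

def noisy_tree_mask(chm, height_threshold_m=3.0):
    """Binary 'tree' labels from a canopy height model; tall pixels count as trees."""
    return (chm >= height_threshold_m).astype(np.uint8)

# toy CHM: mostly ground with one 8 m canopy patch
chm = np.zeros((64, 64), dtype=np.float32)
chm[20:30, 30:45] = 8.0
mask = noisy_tree_mask(chm)
print(mask.sum(), "pixels labelled as tree")
```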
Keyword: loop detection
There is no result
Keyword: nerf
RigNeRF: Fully Controllable Neural 3D Portraits
Authors: ShahRukh Athar, Zexiang Xu, Kalyan Sunkavalli, Eli Shechtman, Zhixin Shu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Volumetric neural rendering methods, such as neural radiance fields (NeRFs), have enabled photo-realistic novel view synthesis. However, in their standard form, NeRFs do not support the editing of objects, such as a human head, within a scene. In this work, we propose RigNeRF, a system that goes beyond just novel view synthesis and enables full control of head pose and facial expressions learned from a single portrait video. We model changes in head pose and facial expressions using a deformation field that is guided by a 3D morphable face model (3DMM). The 3DMM effectively acts as a prior for RigNeRF that learns to predict only residuals to the 3DMM deformations and allows us to render novel (rigid) poses and (non-rigid) expressions that were not present in the input sequence. Using only a smartphone-captured short video of a subject for training, we demonstrate the effectiveness of our method on free view synthesis of a portrait scene with explicit head pose and expression controls. The project page can be found here: this http URL
Keyword: mapping
ICP Algorithm: Theory, Practice And Its SLAM-oriented Taxonomy
Authors: Hao Bai
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
The Iterative Closest Point (ICP) algorithm is one of the most important algorithms for geometric alignment of three-dimensional surface registration, which is frequently used in computer vision tasks, including the Simultaneous Localization And Mapping (SLAM) tasks. In this paper, we illustrate the theoretical principles of the ICP algorithm, how it can be used in surface registration tasks, and the traditional taxonomy of the variants of the ICP algorithm. As SLAM is becoming a popular topic, we also introduce a SLAM-oriented taxonomy of the ICP algorithm, based on the characteristics of each type of SLAM task, including whether the SLAM task is online or not and whether the landmarks are present as features in the SLAM task. We make a synthesis of each type of SLAM task by comparing several up-to-date research papers and analyzing their implementation details.
Permutation Search of Tensor Network Structures via Local Sampling
Abstract
Recent works put much effort into tensor network structure search (TN-SS), aiming to select suitable tensor network (TN) structures, involving the TN-ranks, formats, and so on, for the decomposition or learning tasks. In this paper, we consider a practical variant of TN-SS, dubbed TN permutation search (TN-PS), in which we search for good mappings from tensor modes onto TN vertices (core tensors) for compact TN representations. We conduct a theoretical investigation of TN-PS and propose a practically-efficient algorithm to resolve the problem. Theoretically, we prove the counting and metric properties of search spaces of TN-PS, analyzing for the first time the impact of TN structures on these unique properties. Numerically, we propose a novel meta-heuristic algorithm, in which the searching is done by randomly sampling in a neighborhood established in our theory, and then recurrently updating the neighborhood until convergence. Numerical results demonstrate that the new algorithm can reduce the required model size of TNs in extensive benchmarks, implying the improvement in the expressive power of TNs. Furthermore, the computational cost for the new algorithm is significantly less than that in~\cite{li2020evolutionary}.
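The sampling-based search described above can be pictured with a generic neighborhood-sampling routine over permutations; the sketch below minimizes an arbitrary user-supplied score over mode-to-vertex assignments by repeatedly sampling swaps around the current best permutation. The swap neighborhood and the toy score are placeholders, not the theoretically derived neighborhood used in the paper.

```python
import random

def local_permutation_search(score, n, samples_per_round=20, rounds=50, seed=0):
    """Greedy neighborhood sampling over permutations of range(n), minimizing `score`."""
    rng = random.Random(seed)
    best = list(range(n))
    best_val = score(best)
    for _ in range(rounds):
        improved = False
        for _ in range(samples_per_round):
            cand = best[:]
            i, j = rng.sample(range(n), 2)   # sample a neighbor by swapping two modes
            cand[i], cand[j] = cand[j], cand[i]
            val = score(cand)
            if val < best_val:
                best, best_val, improved = cand, val, True
        if not improved:                     # stop once no sampled neighbor improves
            break
    return best, best_val

# toy score: Hamming distance to a fixed hypothetical mode-to-vertex assignment
target = [3, 0, 4, 1, 5, 2]
print(local_permutation_search(lambda p: sum(a != b for a, b in zip(p, target)), 6))
```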
Deep Isolation Forest for Anomaly Detection
Authors: Hongzuo Xu, Guansong Pang, Yijie Wang, Yongjun Wang
Abstract
Isolation forest (iForest) has been emerging as arguably the most popular anomaly detector in recent years. It iteratively performs axis-parallel data space partition in a tree structure to isolate deviated data objects from the other data, with the isolation difficulty of the objects defined as anomaly scores. iForest shows effective performance across popular dataset benchmarks, but its axis-parallel linear data partition is ineffective in handling hard anomalies in high-dimensional/non-linearly-separable data space, and, even worse, it leads to a notorious algorithmic bias that assigns unexpectedly large anomaly scores to artefact regions. There have been several extensions of iForest, but they still focus on linear data partition and fail to effectively isolate those hard anomalies. This paper introduces a novel extension of iForest, the deep isolation forest. Our method offers a comprehensive isolation scheme that can partition the data in any random direction and at any angle on subspaces of any size, effectively avoiding the algorithmic bias of the linear partition. Further, it requires only randomly initialised neural networks (i.e., no optimisation is required in our method) to ensure the freedom of the partition. In doing so, the desired randomness and diversity in both random network-based representations and random partition-based isolation can be fully leveraged to significantly enhance isolation ensemble-based anomaly detection. Our approach also offers a data-type-agnostic anomaly detection solution: it can detect anomalies in different types of data simply by plugging corresponding randomly initialised neural networks into the feature mapping. Extensive empirical results on a large collection of real-world datasets show that our model achieves substantial improvement over state-of-the-art isolation-based and non-isolation-based anomaly detection models.
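As a rough illustration of the two ingredients named above, randomly initialised (untrained) neural representations combined with isolation-based scoring, here is a hedged sketch that maps the data through a few fixed random projections and scores the result with scikit-learn's IsolationForest; the actual method's partitioning and ensemble construction differ.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def random_representations(X, n_views=5, dim=8, seed=0):
    """Map X through several randomly initialised one-layer 'networks' (no training)."""
    rng = np.random.default_rng(seed)
    views = []
    for _ in range(n_views):
        W = rng.standard_normal((X.shape[1], dim))
        views.append(np.tanh(X @ W))          # fixed random non-linear projection
    return np.hstack(views)

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 16))
X[:5] += 6.0                                  # inject a few obvious anomalies

Z = random_representations(X)
clf = IsolationForest(random_state=0).fit(Z)
scores = clf.score_samples(Z)                 # lower score = more anomalous
print("most anomalous indices:", np.argsort(scores)[:5])
```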
An analysis of retracted papers in Computer Science
Authors: Martin Shepperd, Leila Yousefi
Subjects: Digital Libraries (cs.DL); Information Retrieval (cs.IR)
Abstract
Context: The retraction of research papers, for whatever reason, is a growing phenomenon. However, although retracted-paper information is publicly available via publishers, it is somewhat distributed and inconsistent. Objective: The aim is to assess: (i) the extent and nature of retracted research in Computer Science (CS), (ii) the post-retraction citation behaviour of retracted works, and (iii) the potential impact on systematic reviews and mapping studies. Method: We analyse the Retraction Watch database and take citation information from the Web of Science and Google Scholar. Results: We find that of the 33,955 entries in the Retraction Watch database (16 May 2022), 2,816 are classified as CS, i.e., approximately 8.3%. For CS, 56% of retracted papers provide little or no information as to the reasons. This contrasts with 26% for other disciplines. There is also a remarkable disparity between different publishers, a tendency for multiple versions of a retracted paper over and above the Version of Record (VoR), and for new citations long after a paper is officially retracted. Conclusions: Unfortunately, retraction seems to be a sufficiently common outcome for a scientific paper that we as a research community need to take it more seriously, e.g., by standardising procedures and taxonomies across publishers and providing appropriate research tools. Finally, we recommend particular caution when undertaking secondary analyses and meta-analyses, which are at risk of becoming contaminated by these problem primary studies.
A Neural Network-Based Energy Management System for PV-Battery Based Microgrids
Abstract
A neural network-based energy management system (NN-EMS) is proposed in this paper for islanded ac microgrids fed by multiple PV-battery-based distributed generators (DGs). Stochastic and unequal irradiation results in unequal PV output, which causes an unequal state-of-charge (SoC) among the batteries of the DGs. This effect may cause the difference in the SoCs to increase considerably over time, leading to some batteries reaching their SoC limits. These batteries would no longer be able to control the dc-link of the hybrid grid-forming DG. The proposed NN-EMS ensures SoC balancing by learning an optimal state-action mapping from the outputs of an optimal power flow (OPF). The training dataset is generated by executing a mixed-integer linear programming-based OPF for droop-based islanded microgrids, considering a practical generation-load profile. The resultant NN-EMS controller inherits the information of the optimal states and the network behaviour. Compared to traditional time-ahead centralized methods, the proposed strategy does not require accurate generation-load forecasting. Further, it can also respond to variations in PV power in near-real-time without resorting to solving an OPF. The proposed NN-EMS controller is validated by case studies on a CIGRE LV microgrid containing PV-battery hybrid DGs. The proposed concept can also be extended to synthesize decentralized controllers that cooperate to achieve a global objective without communication.
Learning Dense Features for Point Cloud Registration Using Graph Attention Network
Authors: Lai Dang Quoc Vinh, Sarvar Hussain Nengroo, Hojun Jin
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Point cloud registration is a fundamental task in many applications such as localization, mapping, tracking, and reconstruction. Successful registration relies on extracting robust and discriminative geometric features. Existing learning-based methods require high computing capacity to process a large number of raw points at the same time. Although these approaches achieve convincing results, they are difficult to apply in real-world situations due to high computational costs. In this paper, we introduce a framework that efficiently and economically extracts dense features using a graph attention network for point cloud matching and registration (DFGAT). The detector of DFGAT is responsible for finding highly reliable key points in large raw data sets. The descriptor of DFGAT takes these key points combined with their neighbors to extract invariant density features in preparation for matching. The graph attention network uses an attention mechanism that enriches the relationships between point clouds. Finally, we treat matching as an optimal transport problem and use the Sinkhorn algorithm to find positive and negative matches. We perform thorough tests on the KITTI dataset to evaluate the effectiveness of this approach. The results show that this method, with its efficiently compact keypoint selection and description, achieves the best matching metrics and the highest registration success ratio of 99.88% in comparison with other state-of-the-art approaches.
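The optimal-transport matching step mentioned above can be illustrated with a bare-bones Sinkhorn iteration over a similarity matrix; the learned scores, dustbin handling, and match thresholds used in DFGAT are omitted, so this is a generic sketch rather than the paper's exact procedure.

```python
import numpy as np

def sinkhorn(scores, n_iters=100, eps=0.1):
    """Turn an NxM similarity matrix into a (soft) doubly-stochastic assignment."""
    K = np.exp(scores / eps)                  # entropic kernel
    u = np.ones(K.shape[0])
    v = np.ones(K.shape[1])
    for _ in range(n_iters):                  # alternate row/column normalisation
        u = 1.0 / (K @ v)
        v = 1.0 / (K.T @ u)
    return (u[:, None] * K) * v[None, :]      # transport plan P = diag(u) K diag(v)

rng = np.random.default_rng(0)
sim = rng.normal(size=(4, 4))
P = sinkhorn(sim)
print(P.round(3))
print(P.sum(axis=0), P.sum(axis=1))           # rows and columns each sum to ~1
```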
LPCSE: Neural Speech Enhancement through Linear Predictive Coding
Authors: Yang Liu, Na Tang, Xiaoli Chu, Yang Yang, Jun Wang
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Abstract
The increasingly stringent requirements on quality-of-experience in 5G/B5G communication systems have led to emerging neural speech enhancement techniques. These, however, have been developed in isolation from existing expert-rule-based models of speech pronunciation and distortion, such as the classic Linear Predictive Coding (LPC) speech model, because it is difficult to integrate such models with auto-differentiable machine learning frameworks. In this paper, to improve the efficiency of neural speech enhancement, we introduce an LPC-based speech enhancement (LPCSE) architecture, which leverages the strong inductive biases of the LPC speech model in conjunction with the expressive power of neural networks. Differentiable end-to-end learning is achieved in LPCSE via two novel blocks: a block that utilizes expert rules to reduce the computational overhead of integrating the LPC speech model into neural networks, and a block that ensures the stability of the model and avoids exploding gradients in end-to-end training by mapping the linear prediction coefficients to the filter poles. Experimental results show that LPCSE successfully restores the formants of speech distorted by transmission loss, and outperforms two existing neural speech enhancement methods of comparable neural network size in terms of Perceptual Evaluation of Speech Quality (PESQ) and Short-Time Objective Intelligibility (STOI) on the LJ Speech corpus.
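The stability idea behind the second block can be sketched as follows: an all-pole LPC synthesis filter 1/A(z) is stable only if every root (pole) of A(z) lies inside the unit circle, so one can clamp pole magnitudes and rebuild the coefficients. The clamping radius and the use of NumPy root finding are illustrative assumptions, not the paper's parameterisation.

```python
import numpy as np

def stabilize_lpc(a, max_radius=0.98):
    """Clamp the poles of A(z) = 1 + a1 z^-1 + ... + ap z^-p to lie inside the unit circle."""
    poles = np.roots(a)                              # a = [1, a1, ..., ap], decreasing powers of z
    mags = np.abs(poles)
    shrink = np.minimum(1.0, max_radius / np.maximum(mags, 1e-12))
    poles = poles * shrink                           # pull unstable poles back inside |z| < 1
    return np.poly(poles).real                       # rebuild real, still-monic coefficients

a_unstable = np.array([1.0, -2.1, 1.2])              # conjugate poles with |z| ~ 1.10 -> unstable
a_stable = stabilize_lpc(a_unstable)
print(np.abs(np.roots(a_stable)))                    # all magnitudes now <= 0.98
```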
The Maximum Linear Arrangement for trees under projectivity and planarity
Authors: Lluís Alemany-Puig, Juan Luis Esteban, Ramon Ferrer-i-Cancho
Subjects: Data Structures and Algorithms (cs.DS); Computation and Language (cs.CL); Discrete Mathematics (cs.DM)
Abstract
The Maximum Linear Arrangement problem (MaxLA) consists of finding a mapping $\pi$ from the $n$ vertices of a graph $G$ to distinct consecutive integers that maximizes $D_{\pi}(G)=\sum_{uv\in E(G)}|\pi(u) - \pi(v)|$. In this setting, vertices are considered to lie on a horizontal line and edges are drawn as semicircles above the line. There exist variants of MaxLA in which the arrangements are constrained. In the planar variant, edge crossings are forbidden. In the projective variant for rooted trees, arrangements are planar and the root cannot be covered by any edge. Here we present $O(n)$-time and $O(n)$-space algorithms that solve Planar and Projective MaxLA for trees. We also prove several properties of maximum projective and planar arrangements.
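To make the objective concrete, the tiny sketch below evaluates $D_{\pi}(G)$ for a given arrangement and brute-forces the maximum over all arrangements of a small tree; the paper's algorithms achieve this in $O(n)$ time under the planarity/projectivity constraints, so the exhaustive search here is purely illustrative.

```python
from itertools import permutations

def D(edges, pi):
    """Arrangement cost: sum of |pi(u) - pi(v)| over all edges, pi[v] = position of vertex v."""
    return sum(abs(pi[u] - pi[v]) for u, v in edges)

# a small star tree: vertex 0 joined to vertices 1, 2, 3
edges = [(0, 1), (0, 2), (0, 3)]
best = max(permutations(range(4)), key=lambda p: D(edges, p))
print(best, D(edges, best))   # an unconstrained maximum linear arrangement of the star
```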
Scaling ResNets in the Large-depth Regime
Authors: Pierre Marion, Adeline Fermanian, Gérard Biau, Jean-Philippe Vert
Abstract
Deep ResNets are recognized for achieving state-of-the-art results in complex machine learning tasks. However, the remarkable performance of these architectures relies on a training procedure that needs to be carefully crafted to avoid vanishing or exploding gradients, particularly as the depth $L$ increases. No consensus has been reached on how to mitigate this issue, although a widely discussed strategy consists in scaling the output of each layer by a factor $\alpha_L$. We show in a probabilistic setting that with standard i.i.d. initializations, the only non-trivial dynamics occurs for $\alpha_L = 1/\sqrt{L}$ (other choices lead either to explosion or to an identity mapping). In the continuous-time limit, this scaling factor corresponds to a neural stochastic differential equation, contrary to a widespread interpretation that deep ResNets are discretizations of neural ordinary differential equations. By contrast, in the latter regime, stability is obtained with specific correlated initializations and $\alpha_L = 1/L$. Our analysis suggests a strong interplay between scaling and the regularity of the weights as a function of the layer index. Finally, in a series of experiments, we exhibit a continuous range of regimes driven by these two parameters, which jointly impact performance before and after training.
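A minimal sketch of the scaling rule under discussion: each residual update is multiplied by $\alpha_L$, and with i.i.d. Gaussian weights the hidden-state norm explodes for $\alpha_L = 1$, stays non-trivial for $\alpha_L = 1/\sqrt{L}$, and collapses toward the identity map for $\alpha_L = 1/L$. The toy network below uses an arbitrary width, depth, and ReLU activation and is not the paper's exact setting.

```python
import numpy as np

def resnet_forward(x, L=1000, width=64, alpha=None, seed=0):
    """h_{k+1} = h_k + alpha * V_k relu(W_k h_k) with i.i.d. Gaussian weights."""
    rng = np.random.default_rng(seed)
    alpha = 1.0 / np.sqrt(L) if alpha is None else alpha
    h = x
    for _ in range(L):
        W = rng.standard_normal((width, width)) / np.sqrt(width)
        V = rng.standard_normal((width, width)) / np.sqrt(width)
        h = h + alpha * V @ np.maximum(W @ h, 0.0)
    return h

x = np.random.default_rng(1).standard_normal(64)
for a, name in [(1.0, "alpha=1"), (None, "alpha=1/sqrt(L)"), (1.0 / 1000, "alpha=1/L")]:
    print(name, np.linalg.norm(resnet_forward(x, alpha=a)))   # explosion / non-trivial / ~identity
```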
ProcTHOR: Large-Scale Embodied AI Using Procedural Generation
Authors: Matt Deitke, Eli VanderBilt, Alvaro Herrasti, Luca Weihs, Jordi Salvador, Kiana Ehsani, Winson Han, Eric Kolve, Ali Farhadi, Aniruddha Kembhavi, Roozbeh Mottaghi
Abstract
Massive datasets and high-capacity models have driven many recent advancements in computer vision and natural language understanding. This work presents a platform to enable similar success stories in Embodied AI. We propose ProcTHOR, a framework for the procedural generation of Embodied AI environments. ProcTHOR enables us to sample arbitrarily large datasets of diverse, interactive, customizable, and performant virtual environments to train and evaluate embodied agents across navigation, interaction, and manipulation tasks. We demonstrate the power and potential of ProcTHOR via a sample of 10,000 generated houses and a simple neural model. Models trained on ProcTHOR using only RGB images, with no explicit mapping and no human task supervision, produce state-of-the-art results across 6 embodied AI benchmarks for navigation, rearrangement, and arm manipulation, including the presently running Habitat 2022, AI2-THOR Rearrangement 2022, and RoboTHOR challenges. We also demonstrate strong zero-shot results on these benchmarks via pre-training on ProcTHOR with no fine-tuning on the downstream benchmark, often beating previous state-of-the-art systems that access the downstream training data.
Constellation Design for Deep Joint Source-Channel Coding
Authors: Mengyang Wang, Jiahui Li, Mengyao Ma, Xiaopeng Fan
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Abstract
Deep learning-based joint source-channel coding (JSCC) has shown excellent performance in image and feature transmission. However, the output values of the JSCC encoder are continuous, which makes the modulation constellation complex and dense. It is hard and expensive to design radio frequency chains for transmitting such full-resolution constellation points. In this paper, two methods of mapping the full-resolution constellation to a finite constellation are proposed for real system implementation. The constellation mapping results of the proposed methods correspond to a regular constellation and an irregular constellation, respectively. We apply the methods to existing deep JSCC models and evaluate them on AWGN channels with different signal-to-noise ratios (SNRs). Experimental results show that the proposed methods outperform the traditional uniform quadrature amplitude modulation (QAM) constellation mapping method while adding only a few additional parameters.
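The regular-constellation case described above can be sketched as nearest-point quantisation of the continuous encoder outputs onto a QAM grid; the constellation order and the normalisation of the encoder outputs below are illustrative assumptions rather than the paper's trained mapping.

```python
import numpy as np

def qam_constellation(order=16):
    """Unit-average-power square QAM constellation (e.g. 16-QAM) as complex points."""
    m = int(np.sqrt(order))
    levels = np.arange(m) * 2 - (m - 1)              # e.g. [-3, -1, 1, 3] for 16-QAM
    points = (levels[:, None] + 1j * levels[None, :]).ravel()
    return points / np.sqrt(np.mean(np.abs(points) ** 2))

def map_to_constellation(symbols, constellation):
    """Snap each continuous complex symbol to its nearest constellation point."""
    idx = np.abs(symbols[:, None] - constellation[None, :]).argmin(axis=1)
    return constellation[idx]

rng = np.random.default_rng(0)
soft = rng.normal(size=100) + 1j * rng.normal(size=100)     # stand-in for continuous JSCC encoder output
hard = map_to_constellation(soft / np.abs(soft).max(), qam_constellation(16))
print(len(np.unique(hard)), "distinct constellation points used")
```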
Keyword: localization
ICP Algorithm: Theory, Practice And Its SLAM-oriented Taxonomy
Authors: Hao Bai
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
The Iterative Closest Point (ICP) algorithm is one of the most important algorithms for the geometric alignment of three-dimensional surfaces and is frequently used in computer vision tasks, including Simultaneous Localization And Mapping (SLAM). In this paper, we illustrate the theoretical principles of the ICP algorithm, show how it can be used in surface registration tasks, and review the traditional taxonomy of ICP variants. As SLAM is becoming a popular topic, we also introduce a SLAM-oriented taxonomy of the ICP algorithm, based on the characteristics of each type of SLAM task, including whether the task is online and whether landmarks are present as features. For each type of SLAM task, we provide a synthesis by comparing several up-to-date research papers and analyzing their implementation details.
Spiking Neural Networks for Frame-based and Event-based Single Object Localization
Authors: Sami Barchid, José Mennesson, Jason Eshraghian, Chaabane Djéraba, Mohammed Bennamoun
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Spiking neural networks have shown much promise as an energy-efficient alternative to artificial neural networks. However, understanding the impacts of sensor noises and input encodings on the network activity and performance remains difficult with common neuromorphic vision baselines like classification. Therefore, we propose a spiking neural network approach for single object localization trained using surrogate gradient descent, for frame- and event-based sensors. We compare our method with similar artificial neural networks and show that our model has competitive/better performance in accuracy, robustness against various corruptions, and has lower energy consumption. Moreover, we study the impact of neural coding schemes for static images in accuracy, robustness, and energy efficiency. Our observations differ importantly from previous studies on bio-plausible learning rules, which helps in the design of surrogate gradient trained architectures, and offers insight to design priorities in future neuromorphic technologies in terms of noise characteristics and data encoding methods.
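The surrogate gradient descent mentioned above replaces the non-differentiable spike threshold with a smooth derivative during the backward pass only; below is a minimal PyTorch sketch of such a spike function. The fast-sigmoid surrogate and its slope are common choices in the SNN literature, not necessarily the ones used in this paper.

```python
import torch

class SpikeFn(torch.autograd.Function):
    """Heaviside spike in the forward pass, fast-sigmoid surrogate gradient in the backward pass."""
    SLOPE = 25.0

    @staticmethod
    def forward(ctx, membrane):
        ctx.save_for_backward(membrane)
        return (membrane > 0).float()                       # spike where the membrane crosses threshold

    @staticmethod
    def backward(ctx, grad_output):
        (membrane,) = ctx.saved_tensors
        surrogate = 1.0 / (SpikeFn.SLOPE * membrane.abs() + 1.0) ** 2
        return grad_output * surrogate                      # smooth stand-in for the true (zero a.e.) derivative

v = torch.randn(5, requires_grad=True)
SpikeFn.apply(v).sum().backward()
print(v.grad)                                               # non-zero gradients despite the step function
```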
Learning Dense Features for Point Cloud Registration Using Graph Attention Network
Authors: Lai Dang Quoc Vinh, Sarvar Hussain Nengroo, Hojun Jin
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Point cloud registration is a fundamental task in many applications such as localization, mapping, tracking, and reconstruction. Successful registration relies on extracting robust and discriminative geometric features. Existing learning-based methods require high computing capacity to process a large number of raw points at the same time. Although these approaches achieve convincing results, they are difficult to apply in real-world situations due to high computational costs. In this paper, we introduce a framework that efficiently and economically extracts dense features using a graph attention network for point cloud matching and registration (DFGAT). The detector of DFGAT is responsible for finding highly reliable key points in large raw data sets. The descriptor of DFGAT takes these key points combined with their neighbors to extract invariant density features in preparation for matching. The graph attention network uses an attention mechanism that enriches the relationships between point clouds. Finally, we treat matching as an optimal transport problem and use the Sinkhorn algorithm to find positive and negative matches. We perform thorough tests on the KITTI dataset to evaluate the effectiveness of this approach. The results show that this method, with its efficiently compact keypoint selection and description, achieves the best matching metrics and the highest registration success ratio of 99.88% in comparison with other state-of-the-art approaches.
Monitoring Urban Forests from Auto-Generated Segmentation Maps
Authors: Conrad M Albrecht, Chenying Liu, Yi Wang, Levente Klein, Xiao Xiang Zhu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Abstract
We present and evaluate a weakly supervised methodology to quantify the spatio-temporal distribution of urban forests from remotely sensed data with close-to-zero human interaction. Successfully training machine learning models for semantic segmentation typically depends on the availability of high-quality labels. We evaluate the benefit of high-resolution, three-dimensional point cloud data (LiDAR) as a source of noisy labels for training models that localize trees in orthophotos. As a proof of concept, we assess Hurricane Sandy's impact on urban forests in Coney Island, New York City (NYC) and reference it to less impacted urban space in Brooklyn, NYC.
Polarization Diversity-enabled LOS/NLOS Identification via Carrier Phase Measurements
Authors: Onel L. A. López, Dileep Kumar, Antti Tölli
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Abstract
Provision of accurate localization is an increasingly important feature of wireless networks. To this end, reliable distinction between line-of-sight (LOS) and non-LOS (NLOS) radio links is necessary to avoid degenerative localization estimation biases. Interestingly, LOS and NLOS transmissions affect the polarization of received signals differently. In this work, we leverage this phenomenon to propose a threshold-based LOS/NLOS classifier exploiting weighted differential carrier phase measurements over a single link with different polarization configurations. Operation is possible in both full and limited polarization diversity systems. We develop a framework for assessing the performance of the proposed classifier, and show through simulations the performance impact of the reflecting materials in NLOS scenarios. For instance, the classifier is far more efficient in NLOS scenarios with wooden reflectors than in those with metallic reflectors. Numerical results evince the potential performance gains from exploiting full polarization diversity, properly weighting the differential carrier phase measurements, and using multi-signal/tone transmissions. Finally, we show that the optimum decision threshold is inversely proportional to the path power gain in dB, while it does not depend significantly on the material of potential NLOS reflectors.
Keyword: transformer
Compositional Mixture Representations for Vision and Text
Authors: Stephan Alaniz, Marco Federici, Zeynep Akata
Abstract
Learning a common representation space between vision and language allows deep networks to relate objects in the image to the corresponding semantic meaning. We present a model that learns a shared Gaussian mixture representation, imposing the compositionality of the text onto the visual domain without explicit location supervision. By combining the spatial transformer with a representation learning approach, we learn to split images into separately encoded patches and associate visual and textual representations in an interpretable manner. On variations of MNIST and CIFAR10, our model is able to perform weakly supervised object detection and demonstrates its ability to extrapolate to unseen combinations of objects.
Multimodal Learning with Transformers: A Survey
Authors: Peng Xu, Xiatian Zhu, David A. Clifton
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract
Transformer is a promising neural network learner, and has achieved great success in various machine learning tasks. Thanks to the recent prevalence of multimodal applications and big data, Transformer-based multimodal learning has become a hot topic in AI research. This paper presents a comprehensive survey of Transformer techniques oriented at multimodal data. The main contents of this survey include: (1) a background of multimodal learning, Transformer ecosystem, and the multimodal big data era, (2) a theoretical review of Vanilla Transformer, Vision Transformer, and multimodal Transformers, from a geometrically topological perspective, (3) a review of multimodal Transformer applications, via two important paradigms, i.e., for multimodal pretraining and for specific multimodal tasks, (4) a summary of the common challenges and designs shared by the multimodal Transformer models and applications, and (5) a discussion of open problems and potential research directions for the community.
Exploring evolution-based & -free protein language models as protein function predictors
Authors: Mingyang Hu, Fajie Yuan, Kevin K. Yang, Fusong Ju, Jin Su, Hui Wang, Fei Yang, Qiuyang Ding
Abstract
Large-scale Protein Language Models (PLMs) have improved performance in protein prediction tasks, ranging from 3D structure prediction to various function predictions. In particular, AlphaFold, a ground-breaking AI system, could potentially reshape structural biology. However, the utility of the PLM module in AlphaFold, Evoformer, has not been explored beyond structure prediction. In this paper, we investigate the representation ability of three popular PLMs: ESM-1b (single sequence), MSA-Transformer (multiple sequence alignment) and Evoformer (structural), with a special focus on Evoformer. Specifically, we aim to answer the following key questions: (i) Does the Evoformer trained as part of AlphaFold produce representations amenable to predicting protein function? (ii) If yes, can Evoformer replace ESM-1b and MSA-Transformer? (iii) How much do these PLMs rely on evolution-related protein data? In this regard, are they complementary to each other? We compare these models by empirical study along with new insights and conclusions. Finally, we release code and datasets for reproducibility.
Transformers are Meta-Reinforcement Learners
Abstract
The transformer architecture and variants presented remarkable success across many machine learning tasks in recent years. This success is intrinsically related to the capability of handling long sequences and the presence of context-dependent weights from the attention mechanism. We argue that these capabilities suit the central role of a Meta-Reinforcement Learning algorithm. Indeed, a meta-RL agent needs to infer the task from a sequence of trajectories. Furthermore, it requires a fast adaptation strategy to adapt its policy for a new task -- which can be achieved using the self-attention mechanism. In this work, we present TrMRL (Transformers for Meta-Reinforcement Learning), a meta-RL agent that mimics the memory reinstatement mechanism using the transformer architecture. It associates the recent past of working memories to build an episodic memory recursively through the transformer layers. We show that the self-attention computes a consensus representation that minimizes the Bayes Risk at each layer and provides meaningful features to compute the best actions. We conducted experiments in high-dimensional continuous control environments for locomotion and dexterous manipulation. Results show that TrMRL presents comparable or superior asymptotic performance, sample efficiency, and out-of-distribution generalization compared to the baselines in these environments.
TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer
Abstract
In this work, we explore neat yet effective Transformer-based frameworks for visual grounding. Previous methods generally address the core problem of visual grounding, i.e., multi-modal fusion and reasoning, with manually designed mechanisms. Such heuristic designs are not only complicated but also make models easily overfit specific data distributions. To avoid this, we first propose TransVG, which establishes multi-modal correspondences by Transformers and localizes referred regions by directly regressing box coordinates. We empirically show that complicated fusion modules can be replaced by a simple stack of Transformer encoder layers with higher performance. However, the core fusion Transformer in TransVG is stand-alone against uni-modal encoders, and thus must be trained from scratch on limited visual grounding data, which makes it hard to optimize and leads to sub-optimal performance. To this end, we further introduce TransVG++ to make two-fold improvements. For one thing, we upgrade our framework to a purely Transformer-based one by leveraging Vision Transformer (ViT) for vision feature encoding. For another, we devise a Language Conditioned Vision Transformer that removes external fusion modules and reuses the uni-modal ViT for vision-language fusion at the intermediate layers. We conduct extensive experiments on five prevalent datasets, and report a series of state-of-the-art records.
Recurrent Transformer Variational Autoencoders for Multi-Action Motion Synthesis
Authors: Rania Briq, Chuhang Zou, Leonid Pishchulin, Chris Broaddus, Juergen Gall
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
We consider the problem of synthesizing multi-action human motion sequences of arbitrary lengths. Existing approaches have mastered motion sequence generation in single-action scenarios but fail to generalize to multi-action and arbitrary-length sequences. We fill this gap by proposing a novel efficient approach that leverages the expressiveness of Recurrent Transformers and the generative richness of conditional Variational Autoencoders. The proposed iterative approach is able to generate smooth and realistic human motion sequences with an arbitrary number of actions and frames while doing so in linear space and time. We train and evaluate the proposed approach on the PROX dataset, which we augment with ground-truth action labels. Experimental evaluation shows significant improvements in FID score and semantic consistency metrics compared to the state of the art.
Exploring Adversarial Attacks and Defenses in Vision Transformers trained with DINO
Authors: Javier Rando, Nasib Naimi, Thomas Baumann, Max Mathys
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract
This work conducts the first analysis on the robustness against adversarial attacks on self-supervised Vision Transformers trained using DINO. First, we evaluate whether features learned through self-supervision are more robust to adversarial attacks than those emerging from supervised learning. Then, we present properties arising for attacks in the latent space. Finally, we evaluate whether three well-known defense strategies can increase adversarial robustness in downstream tasks by only fine-tuning the classification head to provide robustness even in view of limited compute resources. These defense strategies are: Adversarial Training, Ensemble Adversarial Training and Ensemble of Specialized Networks.
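For context on the attack setup, the sketch below shows the canonical single-step FGSM perturbation that robustness evaluations of this kind typically start from; the toy linear "classification head" merely stands in for the fine-tuned head mentioned above and is purely illustrative.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=8 / 255):
    """One-step FGSM: perturb x in the sign of the loss gradient, within an eps ball."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    return (x_adv + eps * x_adv.grad.sign()).clamp(0, 1).detach()

# toy stand-in for a frozen backbone followed by a linear classification head
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x = torch.rand(4, 3, 32, 32)                     # images in [0, 1]
y = torch.randint(0, 10, (4,))
x_adv = fgsm(model, x, y)
print((x_adv - x).abs().max())                   # perturbation bounded by eps
```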
DeepEmotex: Classifying Emotion in Text Messages using Deep Transfer Learning
Authors: Maryam Hasan, Elke Rundensteiner, Emmanuel Agu
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Abstract
Transfer learning has been widely used in natural language processing through deep pretrained language models, such as Bidirectional Encoder Representations from Transformers (BERT) and the Universal Sentence Encoder. Despite this great success, language models become overfitted when applied to small datasets and are prone to forgetting when fine-tuned with a classifier. To remedy this problem of forgetting when transferring deep pretrained language models from one domain to another, existing efforts explore fine-tuning methods that forget less. We propose DeepEmotex, an effective sequential transfer learning method to detect emotion in text. To avoid the forgetting problem, the fine-tuning step is instrumented with a large amount of emotion-labeled data collected from Twitter. We conduct an experimental study using both curated Twitter data sets and benchmark data sets. DeepEmotex models achieve over 91% accuracy for multi-class emotion classification on the test dataset. We evaluate the performance of the fine-tuned DeepEmotex models in classifying emotion in the EmoInt and Stimulus benchmark datasets. The models correctly classify emotion in 73% of the instances in the benchmark datasets. The proposed DeepEmotex-BERT model outperforms the Bi-LSTM result on the benchmark datasets by 23%. We also study the effect of the size of the fine-tuning dataset on the accuracy of our models. Our evaluation results show that fine-tuning with a large set of emotion-labeled data improves both the robustness and the effectiveness of the resulting target task model.
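A minimal sketch of the fine-tuning step that sequential transfer learning of this kind relies on, using the Hugging Face transformers API; the model name, the four emotion classes, and the toy batch are assumptions for illustration, not the paper's configuration.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# hypothetical 4-class emotion setup (e.g. joy, sadness, anger, fear); labels are illustrative
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=4)

texts = ["I finally got the job!", "I can't stop crying today."]
labels = torch.tensor([0, 1])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
out = model(**batch, labels=labels)      # classification head returns loss and logits
out.loss.backward()                      # one fine-tuning step on the emotion-labeled batch
optimizer.step()
print(out.logits.shape)                  # (batch, num_labels)
```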
Recommender Transformers with Behavior Pathways
Authors: Zhiyu Yao, Xinyang Chen, Sinan Wang, Qinyan Dai, Yumeng Li, Tanchao Zhu, Mingsheng Long
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Abstract
Sequential recommendation requires the recommender to capture the evolving behavior characteristics from logged user behavior data for accurate recommendations. However, user behavior sequences are viewed as a script with multiple ongoing threads intertwined. We find that only a small set of pivotal behaviors can be evolved into the user's future action. As a result, the future behavior of the user is hard to predict. We conclude this characteristic for sequential behaviors of each user as the Behavior Pathway. Different users have their unique behavior pathways. Among existing sequential models, transformers have shown great capacity in capturing global-dependent characteristics. However, these models mainly provide a dense distribution over all previous behaviors using the self-attention mechanism, making the final predictions overwhelmed by the trivial behaviors not adjusted to each user. In this paper, we build the Recommender Transformer (RETR) with a novel Pathway Attention mechanism. RETR can dynamically plan the behavior pathway specified for each user, and sparingly activate the network through this behavior pathway to effectively capture evolving patterns useful for recommendation. The key design is a learned binary route to prevent the behavior pathway from being overwhelmed by trivial behaviors. We empirically verify the effectiveness of RETR on seven real-world datasets and RETR yields state-of-the-art performance.
Efficient Decoder-free Object Detection with Transformers
Abstract
Vision transformers (ViTs) are changing the landscape of object detection approaches. A natural usage of ViTs in detection is to replace the CNN-based backbone with a transformer-based backbone, which is straightforward and effective, at the price of bringing a considerable computation burden for inference. A more subtle usage is the DETR family, which eliminates the need for many hand-designed components in object detection but introduces a decoder demanding an extra-long time to converge. As a result, transformer-based object detection cannot yet prevail in large-scale applications. To overcome these issues, we propose a novel decoder-free fully transformer-based (DFFT) object detector, achieving high efficiency in both training and inference stages, for the first time. We simplify object detection into an encoder-only single-level anchor-based dense prediction problem by centering on two entry points: 1) eliminate the training-inefficient decoder and leverage two strong encoders to preserve the accuracy of single-level feature map prediction; 2) explore low-level semantic features for the detection task with limited computational resources. In particular, we design a novel lightweight detection-oriented transformer backbone that efficiently captures low-level features with rich semantics based on a well-conceived ablation study. Extensive experiments on the MS COCO benchmark demonstrate that DFFT_SMALL outperforms DETR by 2.5% AP with 28% computation cost reduction and more than $10\times$ fewer training epochs. Compared with the cutting-edge anchor-based detector RetinaNet, DFFT_SMALL obtains over 5.5% AP gain while cutting down 70% of the computation cost.
Object Scene Representation Transformer
Authors: Mehdi S. M. Sajjadi, Daniel Duckworth, Aravindh Mahendran, Sjoerd van Steenkiste, Filip Pavetić, Mario Lučić, Leonidas J. Guibas, Klaus Greff, Thomas Kipf
Abstract
A compositional understanding of the world in terms of objects and their geometry in 3D space is considered a cornerstone of human cognition. Facilitating the learning of such a representation in neural networks holds promise for substantially improving labeled data efficiency. As a key step in this direction, we make progress on the problem of learning 3D-consistent decompositions of complex scenes into individual objects in an unsupervised fashion. We introduce Object Scene Representation Transformer (OSRT), a 3D-centric model in which individual object representations naturally emerge through novel view synthesis. OSRT scales to significantly more complex scenes with larger diversity of objects and backgrounds than existing methods. At the same time, it is multiple orders of magnitude faster at compositional rendering thanks to its light field parametrization and the novel Slot Mixer decoder. We believe this work will not only accelerate future architecture exploration and scaling efforts, but it will also serve as a useful tool for both object-centric as well as neural scene representation learning communities.
Comprehending and Ordering Semantics for Image Captioning
Authors: Yehao Li, Yingwei Pan, Ting Yao, Tao Mei
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
Abstract
Comprehending the rich semantics in an image and ordering them in linguistic order are essential to compose a visually-grounded and linguistically coherent description for image captioning. Modern techniques commonly capitalize on a pre-trained object detector/classifier to mine the semantics in an image, while leaving the inherent linguistic ordering of semantics under-exploited. In this paper, we propose a new recipe of Transformer-style structure, namely Comprehending and Ordering Semantics Networks (COS-Net), that novelly unifies an enriched semantic comprehending process and a learnable semantic ordering process into a single architecture. Technically, we initially utilize a cross-modal retrieval model to search the relevant sentences of each image, and all words in the searched sentences are taken as primary semantic cues. Next, a novel semantic comprehender is devised to filter out the irrelevant semantic words in the primary semantic cues, and meanwhile infer the missing relevant semantic words visually grounded in the image. After that, we feed all the screened and enriched semantic words into a semantic ranker, which learns to allocate all semantic words in linguistic order as humans do. Such a sequence of ordered semantic words is further integrated with the visual tokens of images to trigger sentence generation. Empirical evidence shows that COS-Net clearly surpasses the state-of-the-art approaches on COCO and achieves the best CIDEr score to date of 141.1% on the Karpathy test split. Source code is available at \url{https://github.com/YehLi/xmodaler/tree/master/configs/image_caption/cosnet}.
Stand-Alone Inter-Frame Attention in Video Models
Authors: Fuchen Long, Zhaofan Qiu, Yingwei Pan, Ting Yao, Jiebo Luo, Tao Mei
Abstract
Motion, as the uniqueness of a video, has been critical to the development of video understanding models. Modern deep learning models leverage motion by either executing spatio-temporal 3D convolutions, factorizing 3D convolutions into separate spatial and temporal convolutions, or computing self-attention along the temporal dimension. The implicit assumption behind such successes is that the feature maps across consecutive frames can be nicely aggregated. Nevertheless, the assumption may not always hold, especially for regions with large deformation. In this paper, we present a new recipe of inter-frame attention block, namely Stand-alone Inter-Frame Attention (SIFA), that novelly delves into the deformation across frames to estimate local self-attention on each spatial location. Technically, SIFA remoulds the deformable design by re-scaling the offset predictions with the difference between two frames. Taking each spatial location in the current frame as the query, the locally deformable neighbors in the next frame are regarded as the keys/values. Then, SIFA measures the similarity between the query and keys as stand-alone attention and computes a weighted average of the values for temporal aggregation. We further plug the SIFA block into ConvNets and Vision Transformer, respectively, to devise SIFA-Net and SIFA-Transformer. Extensive experiments conducted on four video datasets demonstrate the superiority of SIFA-Net and SIFA-Transformer as stronger backbones. More remarkably, SIFA-Transformer achieves an accuracy of 83.1% on the Kinetics-400 dataset. Source code is available at \url{https://github.com/FuchenUSTC/SIFA}.
Keyword: autonomous driving
Asymmetric Dual-Decoder U-Net for Joint Rain and Haze Removal
Abstract
This work studies the joint rain and haze removal problem. In real-life scenarios, rain and haze, two often co-occurring common weather phenomena, can greatly degrade the clarity and quality of scene images, leading to a performance drop in visual applications such as autonomous driving. However, jointly removing rain and haze from scene images is ill-posed and challenging, since the presence of haze and rain and the change of atmospheric light can both degrade the scene information. Current methods focus on the contamination removal part, thus ignoring the restoration of scene information affected by the change of atmospheric light. We propose a novel deep neural network, named Asymmetric Dual-decoder U-Net (ADU-Net), to address this challenge. ADU-Net produces both a contamination residual and a scene residual to efficiently remove the rain and haze while preserving the fidelity of the scene information. Extensive experiments show our work outperforms the existing state-of-the-art methods by a considerable margin on both synthetic and real-world benchmarks, including RainCityscapes, BID Rain, and SPA-Data. For instance, we improve the state-of-the-art PSNR value by 2.26/4.57 on RainCityscapes/SPA-Data, respectively. Code will be made freely available to the research community.