Abstract
Camera relocalization is a key component of simultaneous localization and mapping (SLAM) systems. This paper proposes a learning-based approach, named Sparse Spatial Scene Embedding with Graph Neural Networks (S3E-GNN), as an end-to-end framework for efficient and robust camera relocalization. S3E-GNN consists of two modules. In the encoding module, a trained S3E network encodes RGB images into embedding codes that implicitly represent spatial and semantic information. With the embedding codes and the associated poses obtained from a SLAM system, each image is represented as a node in a pose graph. In the GNN query module, the pose graph is transformed into an embedding-aggregated reference graph for camera relocalization. We collect various scene datasets in challenging environments to perform experiments. Our results demonstrate that the S3E-GNN method outperforms the traditional Bag-of-Words (BoW) approach for camera relocalization, owing to its learning-based embedding and GNN-powered scene matching mechanism.
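The abstract only sketches the pipeline, so here is a minimal, hypothetical illustration of the underlying idea: per-image embedding codes attached to pose-graph nodes and aggregated by a single message-passing layer, with a query matched against the aggregated reference graph. All names, dimensions, and the mean-aggregation layer are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): pose-graph nodes carry per-image
# embedding codes; one mean-aggregation GNN layer mixes neighbor embeddings
# so a query embedding can be matched against the aggregated reference graph.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MeanAggregationLayer(nn.Module):
    """One message-passing step: node <- W([node ; mean(neighbors)])."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor, edges: torch.Tensor) -> torch.Tensor:
        # x: (N, dim) node embeddings, edges: (2, E) directed edge index
        src, dst = edges
        agg = torch.zeros_like(x).index_add_(0, dst, x[src])
        deg = torch.zeros(x.size(0), 1).index_add_(0, dst, torch.ones(src.size(0), 1))
        return F.relu(self.proj(torch.cat([x, agg / deg.clamp(min=1)], dim=-1)))

def relocalize(query_emb, node_embs, edges, gnn):
    """Return the index of the reference node most similar to the query embedding."""
    ref = gnn(node_embs, edges)
    sims = F.cosine_similarity(query_emb.unsqueeze(0), ref, dim=-1)
    return int(sims.argmax())

if __name__ == "__main__":
    torch.manual_seed(0)
    dim, n = 64, 5
    embs = torch.randn(n, dim)                                 # embedding codes from the encoder
    edges = torch.tensor([[0, 1, 1, 2, 3], [1, 0, 2, 1, 4]])   # pose-graph connectivity
    gnn = MeanAggregationLayer(dim)
    print("best match:", relocalize(torch.randn(dim), embs, edges, gnn))
```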
Dynamic Dense RGB-D SLAM using Learning-based Visual Odometry
Authors: Shihao Shen, Yilin Cai, Jiayi Qiu, Guangzhao Li
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Abstract
We propose a dense dynamic RGB-D SLAM pipeline built on a learning-based visual odometry method, TartanVO. Like other direct (rather than feature-based) methods, TartanVO estimates camera pose through dense optical flow, which applies only to static scenes and disregards dynamic objects. Due to the color constancy assumption, optical flow alone cannot differentiate between dynamic and static pixels. Therefore, to reconstruct a static map with such direct methods, our pipeline performs dynamic/static segmentation by leveraging the optical flow output and fuses only static points into the map. Moreover, we re-render the input frames so that dynamic pixels are removed and iteratively pass them back into the visual odometry to refine the pose estimate.
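A hedged sketch of one way such optical-flow-based dynamic/static segmentation could look: pixels whose observed flow deviates from the rigid flow predicted by depth and the estimated camera motion are flagged as dynamic. The threshold, camera model, and the `fuse_points` call in the usage comment are illustrative assumptions, not the paper's exact procedure.

```python
# Illustrative sketch: flag pixels as dynamic when the observed optical flow
# deviates from the rigid flow predicted by depth and the estimated camera
# motion, then keep only static points for map fusion.
import numpy as np

def rigid_flow(depth, K, T_rel):
    """Flow induced purely by camera motion T_rel (4x4) for a depth map."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T   # 3 x N
    pts = np.linalg.inv(K) @ pix * depth.reshape(1, -1)                 # back-project
    pts_h = np.vstack([pts, np.ones((1, pts.shape[1]))])
    proj = K @ (T_rel @ pts_h)[:3]                                      # re-project
    proj = proj[:2] / np.clip(proj[2:], 1e-6, None)
    return (proj - pix[:2]).T.reshape(h, w, 2)

def dynamic_mask(flow_observed, depth, K, T_rel, thresh_px=3.0):
    """True where the flow residual suggests independent (dynamic) motion."""
    residual = np.linalg.norm(flow_observed - rigid_flow(depth, K, T_rel), axis=-1)
    return residual > thresh_px

# Usage (hypothetical fusion call): keep only static points for map fusion.
# static = ~dynamic_mask(flow, depth, K, T_rel)
# fuse_points(points_3d[static.reshape(-1)], colors[static.reshape(-1)])
```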
Keyword: odometry
Dynamic Dense RGB-D SLAM using Learning-based Visual Odometry
Authors: Shihao Shen, Yilin Cai, Jiayi Qiu, Guangzhao Li
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Abstract
We propose a dense dynamic RGB-D SLAM pipeline built on a learning-based visual odometry method, TartanVO. Like other direct (rather than feature-based) methods, TartanVO estimates camera pose through dense optical flow, which applies only to static scenes and disregards dynamic objects. Due to the color constancy assumption, optical flow alone cannot differentiate between dynamic and static pixels. Therefore, to reconstruct a static map with such direct methods, our pipeline performs dynamic/static segmentation by leveraging the optical flow output and fuses only static points into the map. Moreover, we re-render the input frames so that dynamic pixels are removed and iteratively pass them back into the visual odometry to refine the pose estimate.
Efficient Deep Visual and Inertial Odometry with Adaptive Visual Modality Selection
Authors: Mingyu Yang, Yu Chen, Hun-Seok Kim
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
In recent years, deep learning-based approaches for visual-inertial odometry (VIO) have shown remarkable performance, outperforming traditional geometric methods. Yet all existing methods use both the visual and inertial measurements for every pose estimation, incurring potential computational redundancy. Although visual data processing is much more expensive than processing inertial measurement unit (IMU) data, it does not always contribute to improving the pose estimation accuracy. In this paper, we propose an adaptive deep-learning-based VIO method that reduces computational redundancy by opportunistically disabling the visual modality. Specifically, we train a policy network that learns to deactivate the visual feature extractor on the fly based on the current motion state and IMU readings. The Gumbel-Softmax trick is adopted to train the policy network, making the decision process differentiable for end-to-end system training. The learned strategy is interpretable, and it shows scenario-dependent decision patterns for adaptive complexity reduction. Experimental results show that our method achieves similar or even better performance than the full-modality baseline with up to 78.8% computational complexity reduction on the KITTI dataset. Our code will be shared at https://github.com/mingyuyng/Visual-Selective-VIO
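To make the modality-selection mechanism concrete, the following is an assumed, simplified sketch (not the authors' architecture): a tiny policy network emits use/skip logits from IMU features, and `torch.nn.functional.gumbel_softmax` makes the discrete choice differentiable so the gate can be trained end-to-end. The encoder sizes and pose head are placeholders.

```python
# Sketch (assumed architecture): a policy network gates the expensive visual
# encoder via the Gumbel-Softmax trick. For clarity the encoder is still called
# here even when skipped; a real system would skip the call at inference to
# actually save computation when the gate is 0.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveVIOCell(nn.Module):
    def __init__(self, imu_dim=64, vis_dim=256):
        super().__init__()
        self.policy = nn.Sequential(nn.Linear(imu_dim, 32), nn.ReLU(), nn.Linear(32, 2))
        self.visual_encoder = nn.Sequential(nn.Linear(3 * 224 * 224, vis_dim), nn.ReLU())
        self.pose_head = nn.Linear(imu_dim + vis_dim, 6)   # 6-DoF relative pose

    def forward(self, imu_feat, image, tau=1.0):
        logits = self.policy(imu_feat)
        gate = F.gumbel_softmax(logits, tau=tau, hard=True)[:, :1]   # 1 = use vision
        vis = self.visual_encoder(image.flatten(1)) * gate           # zeroed when skipped
        return self.pose_head(torch.cat([imu_feat, vis], dim=-1)), gate

if __name__ == "__main__":
    cell = SelectiveVIOCell()
    pose, gate = cell(torch.randn(4, 64), torch.randn(4, 3, 224, 224))
    print(pose.shape, gate.squeeze(-1))
```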
Keyword: livox
There is no result
Keyword: loam
There is no result
Keyword: lidar
CorAl: Introspection for Robust Radar and Lidar Perception in Diverse Environments Using Differential Entropy
Authors: Daniel Adolfsson, Manuel Castellano-Quero, Martin Magnusson, Achim J. Lilienthal, Henrik Andreasson
Abstract
Robust perception is an essential component for enabling long-term operation of mobile robots. It depends on failure resilience through reliable sensor data and preprocessing, as well as failure awareness through introspection, for example, the ability to self-assess localization performance. This paper presents CorAl: a principled, intuitive, and generalizable method to measure the quality of alignment between pairs of point clouds, which learns to detect alignment errors in a self-supervised manner. CorAl compares the differential entropy in the point clouds separately with the entropy in their union to account for entropy inherent to the scene. By making use of dual entropy measurements, we obtain a quality metric that is highly sensitive to small alignment errors and still generalizes well to unseen environments. In this work, we extend our previous work on lidar-only CorAl to radar data by proposing a two-stage filtering technique that produces high-quality point clouds from noisy radar scans. Thus we target robust perception in two ways: by introducing a method that introspectively assesses alignment quality, and by applying it to an inherently robust sensor modality. We show that our filtering technique combined with CorAl can be applied to the problem of alignment classification, and that it detects small alignment errors in urban settings with up to 98% accuracy, and up to 96% if trained only in a different environment. Our lidar and radar experiments demonstrate that CorAl outperforms previous methods both on the ETH lidar benchmark, which includes several indoor and outdoor environments, and on the large-scale Oxford and MulRan radar data sets for urban traffic scenarios. The results also demonstrate that CorAl generalizes very well across substantially different environments without the need for retraining.
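As a rough, self-contained illustration of the dual-entropy idea (per-point Gaussian differential entropy estimated from local neighborhoods, compared between the separate clouds and their union), the sketch below uses neighborhood radius, regularization, and scoring choices that are assumptions, not the paper's parameters.

```python
# Rough illustration of the dual-entropy idea: estimate per-point Gaussian
# differential entropy from local neighborhoods and compare the separate
# clouds against their union; a large increase in the joint entropy hints
# at misalignment. Parameters here are illustrative, not from the paper.
import numpy as np
from scipy.spatial import cKDTree

def mean_differential_entropy(points, radius=0.5, min_neighbors=5):
    """Average entropy of Gaussians fitted to each point's neighborhood."""
    tree = cKDTree(points)
    entropies = []
    for p in points:
        idx = tree.query_ball_point(p, r=radius)
        if len(idx) < min_neighbors:
            continue
        cov = np.cov(points[idx].T) + 1e-9 * np.eye(3)   # regularized 3x3 covariance
        # h = 0.5 * ln((2*pi*e)^3 * det(cov)) for a 3D Gaussian
        entropies.append(0.5 * np.log(((2 * np.pi * np.e) ** 3) * np.linalg.det(cov)))
    return float(np.mean(entropies)) if entropies else np.nan

def coral_score(cloud_a, cloud_b, radius=0.5):
    """Joint-minus-separate entropy; larger values suggest worse alignment."""
    separate = 0.5 * (mean_differential_entropy(cloud_a, radius)
                      + mean_differential_entropy(cloud_b, radius))
    joint = mean_differential_entropy(np.vstack([cloud_a, cloud_b]), radius)
    return joint - separate

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Points on a noisy planar patch (thin along z), a lidar-like surface.
    plane = np.column_stack([rng.uniform(-2, 2, (800, 2)), rng.normal(0, 0.01, 800)])
    print("aligned   :", coral_score(plane, plane + rng.normal(0, 0.01, plane.shape)))
    print("misaligned:", coral_score(plane, plane + np.array([0.0, 0.0, 0.2])))
```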
CURL: Continuous, Ultra-compact Representation for LiDAR
Authors: Kaicheng Zhang, Ziyang Hong, Shida Xu, Sen Wang
Abstract
Increasing the density of 3D LiDAR point clouds is appealing for many applications in robotics. However, high-density LiDAR sensors are usually costly and still limited in coverage per scan (e.g., 128 channels). Meanwhile, denser point cloud scans and maps mean larger volumes to store and longer times to transmit. Existing works focus on either improving point cloud density or compressing its size. This paper aims to design a novel 3D point cloud representation that can continuously increase point cloud density while reducing its storage and transmission size. The pipeline of the proposed Continuous, Ultra-compact Representation of LiDAR (CURL) includes four main steps: meshing, upsampling, encoding, and continuous reconstruction. It is capable of transforming a 3D LiDAR scan or map into a compact spherical harmonics representation which can be used or transmitted with low latency to continuously reconstruct a much denser 3D point cloud. Extensive experiments on four public datasets, covering college gardens, city streets, and indoor rooms, demonstrate that much denser 3D point clouds can be accurately reconstructed using the proposed CURL representation while achieving up to 80% storage space savings. We open-source the CURL code for the community.
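A toy illustration of the core encoding idea, not of CURL itself: fit the range of a scan as a truncated real spherical-harmonics expansion r(theta, phi) by least squares, then evaluate the compact coefficient vector on a denser direction grid ("continuous reconstruction"). The degree, grid sizes, and synthetic range function are arbitrary assumptions for the demo.

```python
# Toy sketch: encode a range function over directions with spherical harmonics
# and reconstruct it continuously at arbitrary (denser) query directions.
import numpy as np
from scipy.special import sph_harm   # scipy convention: sph_harm(m, n, azimuth, polar)

def real_sh_basis(theta, phi, L):
    """Real spherical-harmonics basis, shape (len(theta), (L+1)**2)."""
    cols = []
    for n in range(L + 1):
        for m in range(-n, n + 1):
            y = sph_harm(abs(m), n, theta, phi)
            if m < 0:
                cols.append(np.sqrt(2) * y.imag)
            elif m == 0:
                cols.append(y.real)
            else:
                cols.append(np.sqrt(2) * y.real)
    return np.stack(cols, axis=-1)

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 2000)                 # sparse scan directions
phi = rng.uniform(0.1, np.pi - 0.1, 2000)
ranges = 5.0 + np.sin(3 * theta) * np.cos(phi)          # synthetic range function

L = 8
coeffs, *_ = np.linalg.lstsq(real_sh_basis(theta, phi, L), ranges, rcond=None)
print("compact code size:", coeffs.size)                # (L+1)^2 coefficients

# "Continuous reconstruction": query any denser set of directions.
t_dense = rng.uniform(0, 2 * np.pi, 20000)
p_dense = rng.uniform(0.1, np.pi - 0.1, 20000)
dense_ranges = real_sh_basis(t_dense, p_dense, L) @ coeffs
print("reconstructed points:", dense_ranges.shape[0])
```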
Keyword: loop detection
There is no result
Keyword: autonomous driving
MPPNet: Multi-Frame Feature Intertwining with Proxy Points for 3D Temporal Object Detection
Authors: Xuesong Chen, Shaoshuai Shi, Benjin Zhu, Ka Chun Cheung, Hang Xu, Hongsheng Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Accurate and reliable 3D detection is vital for many applications, including autonomous driving vehicles and service robots. In this paper, we present a flexible and high-performance 3D detection framework, named MPPNet, for 3D temporal object detection with point cloud sequences. We propose a novel three-hierarchy framework with proxy points for multi-frame feature encoding and interaction to achieve better detection. The three hierarchies conduct per-frame feature encoding, short-clip feature fusion, and whole-sequence feature aggregation, respectively. To enable processing long point cloud sequences with reasonable computational resources, intra-group feature mixing and inter-group feature attention are proposed to form the second and third feature encoding hierarchies, which are recurrently applied for aggregating multi-frame trajectory features. The proxy points not only act as consistent object representations for each frame, but also serve as couriers to facilitate feature interaction between frames. Experiments on the large-scale Waymo Open Dataset show that our approach outperforms state-of-the-art methods by large margins when applied to both short (e.g., 4-frame) and long (e.g., 16-frame) point cloud sequences. Specifically, MPPNet achieves 74.21%, 74.62% and 73.31% for the vehicle, pedestrian and cyclist classes on the LEVEL 2 mAPH metric with 16-frame input.
Keyword: mapping
Tutorial: Analog Matrix Computing (AMC) with Crosspoint Resistive Memory Arrays
Authors: Zhong Sun, Daniele Ielmini
Subjects: Emerging Technologies (cs.ET); Signal Processing (eess.SP)
Abstract
Matrix computation is ubiquitous in modern scientific and engineering fields. Due to its high computational complexity on conventional digital computers, matrix computation represents a heavy workload in many data-intensive applications, e.g., machine learning, scientific computing, and wireless communications. For fast, efficient matrix computation, analog computing with resistive memory arrays has proven to be a promising solution. In this Tutorial, we present analog matrix computing (AMC) circuits based on crosspoint resistive memory arrays. AMC circuits are able to carry out basic matrix computations, including matrix multiplication, matrix inversion, pseudoinverse, and eigenvector computation, each with a single operation. We describe the main design principles of the AMC circuits, such as local/global or negative/positive feedback configurations, with or without external inputs. Mapping strategies for matrices containing negative values will be presented. The underlying requirements for circuit stability will be described via transfer-function analysis, which also defines the time complexity of the circuits toward steady-state results. Lastly, typical applications, challenges, and future trends of AMC circuits will be discussed.
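One mapping strategy mentioned above can be shown with a worked toy example: a matrix with negative entries is split into two non-negative conductance arrays so that A = G_plus - G_minus, each array performs a crosspoint multiply-accumulate I = G @ V, and the column currents are subtracted. The numbers below are arbitrary, and the sketch models only the ideal mathematics, not circuit non-idealities.

```python
# Differential-conductance mapping for a matrix with negative entries.
import numpy as np

A = np.array([[ 1.0, -2.0],
              [-0.5,  3.0]])          # target matrix (arbitrary units)
V = np.array([0.2, 0.1])              # input voltage vector

G_plus = np.clip(A, 0, None)          # positive parts -> one crossbar
G_minus = np.clip(-A, 0, None)        # magnitudes of negative parts -> second crossbar

I = G_plus @ V - G_minus @ V          # differential column currents
assert np.allclose(I, A @ V)          # equals the intended matrix-vector product
print(I)
```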
S3E-GNN: Sparse Spatial Scene Embedding with Graph Neural Networks for Camera Relocalization
Authors: Ran Cheng, Xinyu Jiang, Yuan Chen, Lige Liu, Tao Sun
Abstract
Camera relocalization is a key component of simultaneous localization and mapping (SLAM) systems. This paper proposes a learning-based approach, named Sparse Spatial Scene Embedding with Graph Neural Networks (S3E-GNN), as an end-to-end framework for efficient and robust camera relocalization. S3E-GNN consists of two modules. In the encoding module, a trained S3E network encodes RGB images into embedding codes that implicitly represent spatial and semantic information. With the embedding codes and the associated poses obtained from a SLAM system, each image is represented as a node in a pose graph. In the GNN query module, the pose graph is transformed into an embedding-aggregated reference graph for camera relocalization. We collect various scene datasets in challenging environments to perform experiments. Our results demonstrate that the S3E-GNN method outperforms the traditional Bag-of-Words (BoW) approach for camera relocalization, owing to its learning-based embedding and GNN-powered scene matching mechanism.
SeGraM: A Universal Hardware Accelerator for Genomic Sequence-to-Graph and Sequence-to-Sequence Mapping
Authors: Damla Senol Cali, Konstantinos Kanellopoulos, Joel Lindegger, Zülal Bingöl, Gurpreet S. Kalsi, Ziyi Zuo, Can Firtina, Meryem Banu Cavlak, Jeremie Kim, Nika Mansouri Ghiasi, Gagandeep Singh, Juan Gómez-Luna, Nour Almadhoun Alserr, Mohammed Alser, Sreenivas Subramoney, Can Alkan, Saugata Ghose, Onur Mutlu
Abstract
A critical step of genome sequence analysis is the mapping of sequenced DNA fragments (i.e., reads) collected from an individual to a known linear reference genome sequence (i.e., sequence-to-sequence mapping). Recent works replace the linear reference sequence with a graph-based representation of the reference genome, which captures the genetic variations and diversity across many individuals in a population. Mapping reads to the graph-based reference genome (i.e., sequence-to-graph mapping) results in notable quality improvements in genome analysis. Unfortunately, while sequence-to-sequence mapping is well studied with many available tools and accelerators, sequence-to-graph mapping is a more difficult computational problem, with a much smaller number of practical software tools currently available. We analyze two state-of-the-art sequence-to-graph mapping tools and reveal four key issues. We find that there is a pressing need to have a specialized, high-performance, scalable, and low-cost algorithm/hardware co-design that alleviates bottlenecks in both the seeding and alignment steps of sequence-to-graph mapping. To this end, we propose SeGraM, a universal algorithm/hardware co-designed genomic mapping accelerator that can effectively and efficiently support both sequence-to-graph mapping and sequence-to-sequence mapping, for both short and long reads. To our knowledge, SeGraM is the first algorithm/hardware co-design for accelerating sequence-to-graph mapping. SeGraM consists of two main components: (1) MinSeed, the first minimizer-based seeding accelerator; and (2) BitAlign, the first bitvector-based sequence-to-graph alignment accelerator. We demonstrate that SeGraM provides significant improvements for multiple steps of the sequence-to-graph and sequence-to-sequence mapping pipelines.
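As a small software-level sketch of the minimizer idea behind a seeding stage such as MinSeed (generic minimizer selection, not the SeGraM hardware): within every window of w consecutive k-mers, only the k-mer with the smallest hash is kept as a seed, which drastically reduces the number of index lookups. The hash function and parameters here are stand-ins.

```python
# Generic minimizer selection: one seed per window of w consecutive k-mers.
def kmer_hash(kmer: str) -> int:
    # Simple stand-in hash; real tools use randomized/invertible hashes.
    return hash(kmer) & 0xFFFFFFFF

def minimizers(seq: str, k: int = 5, w: int = 4):
    """Return (position, k-mer) minimizers of seq for window size w."""
    kmers = [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]
    seeds = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        seeds.add(min(window, key=lambda x: kmer_hash(x[1])))
    return sorted(seeds)

if __name__ == "__main__":
    read = "ACGTACGTTGCAACGTAGGCT"
    for pos, kmer in minimizers(read):
        print(pos, kmer)
```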
Embodied vision for learning object representations
Authors: Arthur Aubret, Céline Teulière, Jochen Triesch
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Abstract
Recent time-contrastive learning approaches manage to learn invariant object representations without supervision. This is achieved by mapping successive views of an object onto close-by internal representations. When considering this learning approach as a model of the development of human object recognition, it is important to consider what visual input a toddler would typically observe while interacting with objects. First, human vision is highly foveated, with high resolution only available in the central region of the field of view. Second, objects may be seen against a blurry background due to infants' limited depth of field. Third, during object manipulation a toddler mostly observes close objects filling a large part of the field of view due to their rather short arms. Here, we study how these effects impact the quality of visual representations learnt through time-contrastive learning. To this end, we let a visually embodied agent "play" with objects in different locations of a near photo-realistic flat. During each play session the agent views an object in multiple orientations before turning its body to view another object. The resulting sequence of views feeds a time-contrastive learning algorithm. Our results show that visual statistics mimicking those of a toddler improve object recognition accuracy in both familiar and novel environments. We argue that this effect is caused by the reduction of features extracted in the background, a neural network bias for large features in the image and a greater similarity between novel and familiar background regions. We conclude that the embodied nature of visual learning may be crucial for understanding the development of human object perception.
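For readers unfamiliar with the objective referred to above, here is a minimal sketch of a time-contrastive (InfoNCE-style) loss: temporally adjacent views of the same play sequence are pulled together while other samples in the batch are pushed apart. The temperature and the use of random embeddings are placeholder assumptions, not the paper's model.

```python
# Minimal time-contrastive (InfoNCE-style) loss over adjacent-view embeddings.
import torch
import torch.nn.functional as F

def time_contrastive_loss(z_t, z_tp1, temperature=0.1):
    """z_t, z_tp1: (B, D) embeddings of views at time t and t+1."""
    z_t = F.normalize(z_t, dim=-1)
    z_tp1 = F.normalize(z_tp1, dim=-1)
    logits = z_t @ z_tp1.t() / temperature       # (B, B) similarity matrix
    targets = torch.arange(z_t.size(0))          # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

if __name__ == "__main__":
    torch.manual_seed(0)
    z_t, z_tp1 = torch.randn(8, 128), torch.randn(8, 128)
    print(float(time_contrastive_loss(z_t, z_tp1)))
```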
Keyword: localization
Entity-aware and Motion-aware Transformers for Language-driven Action Localization in Videos
Authors: Shuo Yang, Xinxiao Wu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Abstract
Language-driven action localization in videos is a challenging task that involves not only visual-linguistic matching but also action boundary prediction. Recent progress has been achieved by aligning the language query to video segments, but estimating precise boundaries is still under-explored. In this paper, we propose entity-aware and motion-aware Transformers that progressively localize actions in videos by first coarsely locating clips with entity queries and then finely predicting exact boundaries in a shrunken temporal region with motion queries. The entity-aware Transformer incorporates the textual entities into visual representation learning via cross-modal and cross-frame attention to facilitate attending to action-related video clips. The motion-aware Transformer captures fine-grained motion changes at multiple temporal scales by integrating long short-term memory into the self-attention module to further improve the precision of action boundary prediction. Extensive experiments on the Charades-STA and TACoS datasets demonstrate that our method achieves better performance than existing methods.
S3E-GNN: Sparse Spatial Scene Embedding with Graph Neural Networks for Camera Relocalization
Authors: Ran Cheng, Xinyu Jiang, Yuan Chen, Lige Liu, Tao Sun
Abstract
Camera relocalization is a key component of simultaneous localization and mapping (SLAM) systems. This paper proposes a learning-based approach, named Sparse Spatial Scene Embedding with Graph Neural Networks (S3E-GNN), as an end-to-end framework for efficient and robust camera relocalization. S3E-GNN consists of two modules. In the encoding module, a trained S3E network encodes RGB images into embedding codes that implicitly represent spatial and semantic information. With the embedding codes and the associated poses obtained from a SLAM system, each image is represented as a node in a pose graph. In the GNN query module, the pose graph is transformed into an embedding-aggregated reference graph for camera relocalization. We collect various scene datasets in challenging environments to perform experiments. Our results demonstrate that the S3E-GNN method outperforms the traditional Bag-of-Words (BoW) approach for camera relocalization, owing to its learning-based embedding and GNN-powered scene matching mechanism.
CorAl: Introspection for Robust Radar and Lidar Perception in Diverse Environments Using Differential Entropy
Authors: Daniel Adolfsson, Manuel Castellano-Quero, Martin Magnusson, Achim J. Lilienthal, Henrik Andreasson
Abstract
Robust perception is an essential component for enabling long-term operation of mobile robots. It depends on failure resilience through reliable sensor data and preprocessing, as well as failure awareness through introspection, for example, the ability to self-assess localization performance. This paper presents CorAl: a principled, intuitive, and generalizable method to measure the quality of alignment between pairs of point clouds, which learns to detect alignment errors in a self-supervised manner. CorAl compares the differential entropy in the point clouds separately with the entropy in their union to account for entropy inherent to the scene. By making use of dual entropy measurements, we obtain a quality metric that is highly sensitive to small alignment errors and still generalizes well to unseen environments. In this work, we extend our previous work on lidar-only CorAl to radar data by proposing a two-stage filtering technique that produces high-quality point clouds from noisy radar scans. Thus we target robust perception in two ways: by introducing a method that introspectively assesses alignment quality, and by applying it to an inherently robust sensor modality. We show that our filtering technique combined with CorAl can be applied to the problem of alignment classification, and that it detects small alignment errors in urban settings with up to 98% accuracy, and up to 96% if trained only in a different environment. Our lidar and radar experiments demonstrate that CorAl outperforms previous methods both on the ETH lidar benchmark, which includes several indoor and outdoor environments, and on the large-scale Oxford and MulRan radar data sets for urban traffic scenarios. The results also demonstrate that CorAl generalizes very well across substantially different environments without the need for retraining.
Keyword: transformer
AppTek's Submission to the IWSLT 2022 Isometric Spoken Language Translation Task
Abstract
To participate in the Isometric Spoken Language Translation Task of the IWSLT 2022 evaluation, constrained condition, AppTek developed neural Transformer-based systems for English-to-German with various mechanisms of length control, ranging from source-side and target-side pseudo-tokens to encoding of remaining length in characters that replaces positional encoding. We further increased translation length compliance by sentence-level selection of length-compliant hypotheses from different system variants, as well as rescoring of N-best candidates from a single system. Length-compliant back-translated and forward-translated synthetic data, as well as other parallel data variants derived from the original MuST-C training corpus were important for a good quality/desired length trade-off. Our experimental results show that length compliance levels above 90% can be reached while minimizing losses in MT quality as measured in BERT and BLEU scores.
Supplementary Material: Implementation and Experiments for GAU-based Model
Abstract
In February this year, Google proposed a new Transformer variant called FLASH, which has faster speed, a lower VRAM footprint, and better performance. This is achieved by designing a performant layer named GAU (Gated Attention Unit), which combines the attention layer and the FFN. In this paper, some implementation details are re-analyzed both theoretically and practically. We then propose a novel GAU-based model and pre-train it on a Chinese corpus. Results on the CLUE benchmark show that our model achieves a dev average score of 75.02, 1% higher than RoFormerV1 while being 45% faster, which is also competitive with RoFormerV2.
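The following is a hedged sketch of a Gated Attention Unit in the spirit of FLASH (Hua et al., 2022): a small shared projection produces queries and keys via per-dimension scale and offset, attention uses a squared ReLU, and the attention output is gated by a parallel branch before projecting back. The dimensions and omitted details (rotary embeddings, chunked linear attention) are my simplifications, not the exact layer from the paper above.

```python
# Simplified Gated Attention Unit sketch (attention and FFN folded into one layer).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GAU(nn.Module):
    def __init__(self, d_model=512, expansion=2, s=128):
        super().__init__()
        e = d_model * expansion
        self.to_uv = nn.Linear(d_model, 2 * e)       # gate branch u and value branch v
        self.to_z = nn.Linear(d_model, s)            # shared low-dim projection
        self.qk_scale = nn.Parameter(torch.ones(2, s))
        self.qk_offset = nn.Parameter(torch.zeros(2, s))
        self.out = nn.Linear(e, d_model)
        self.s = s

    def forward(self, x):                            # x: (B, T, d_model)
        u, v = F.silu(self.to_uv(x)).chunk(2, dim=-1)
        z = F.silu(self.to_z(x))                     # (B, T, s)
        q = z * self.qk_scale[0] + self.qk_offset[0]
        k = z * self.qk_scale[1] + self.qk_offset[1]
        attn = F.relu(q @ k.transpose(-2, -1) / self.s ** 0.5) ** 2
        attn = attn / x.size(1)                      # simple length normalization
        return x + self.out(u * (attn @ v))          # gated output with residual add

if __name__ == "__main__":
    y = GAU()(torch.randn(2, 16, 512))
    print(y.shape)                                   # torch.Size([2, 16, 512])
```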
Entity-aware and Motion-aware Transformers for Language-driven Action Localization in Videos
Authors: Shuo Yang, Xinxiao Wu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Abstract
Language-driven action localization in videos is a challenging task that involves not only visual-linguistic matching but also action boundary prediction. Recent progress has been achieved through aligning language query to video segments, but estimating precise boundaries is still under-explored. In this paper, we propose entity-aware and motion-aware Transformers that progressively localizes actions in videos by first coarsely locating clips with entity queries and then finely predicting exact boundaries in a shrunken temporal region with motion queries. The entity-aware Transformer incorporates the textual entities into visual representation learning via cross-modal and cross-frame attentions to facilitate attending action-related video clips. The motion-aware Transformer captures fine-grained motion changes at multiple temporal scales via integrating long short-term memory into the self-attention module to further improve the precision of action boundary prediction. Extensive experiments on the Charades-STA and TACoS datasets demonstrate that our method achieves better performance than existing methods.
SimCPSR: Simple Contrastive Learning for Paper Submission Recommendation System
Authors: Duc H. Le, Tram T. Doan, Son T. Huynh, Binh T. Nguyen
Subjects: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
Abstract
The recommendation system plays a vital role in many areas, especially academic fields, by supporting researchers in submitting their work and increasing its acceptance through the conference or journal selection process. This study proposes a transformer-based model using transfer learning as an efficient approach for the paper submission recommendation system. By combining essential information (such as the title, the abstract, and the list of keywords) with the aims and scopes of journals, the model can recommend the Top K journals that maximize the acceptance of the paper. Our model was developed in two stages: (i) fine-tuning the pre-trained language model (LM) with a simple contrastive learning framework, where we utilized a simple supervised contrastive objective to fine-tune all parameters and encourage the LM to learn the document representation effectively; and (ii) training the fine-tuned LM on different combinations of the features for the downstream task. This study offers a more advanced method for enhancing the efficiency of the paper submission recommendation system compared to previous approaches: we achieve 0.5173, 0.8097, 0.8862, and 0.9496 for Top 1, 3, 5, and 10 accuracies, respectively, on the test set when combining the title, abstract, and keywords as input features. Incorporating the journals' aims and scopes, our model shows an encouraging result, reaching 0.5194, 0.8112, 0.8866, and 0.9496 for Top 1, 3, 5, and 10, respectively.
Exploiting Inductive Bias in Transformers for Unsupervised Disentanglement of Syntax and Semantics with VAEs
Authors: Ghazi Felhi, Joseph Le Roux, Djamé Seddah
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract
We propose a generative model for text generation, which exhibits disentangled latent representations of syntax and semantics. Contrary to previous work, this model does not need syntactic information such as constituency parses, or semantic information such as paraphrase pairs. Our model relies solely on the inductive bias found in attention-based architectures such as Transformers. In the attention of Transformers, keys handle information selection while values specify what information is conveyed. Our model, dubbed QKVAE, uses Attention in its decoder to read latent variables where one latent variable infers keys while another infers values. We run experiments on latent representations and experiments on syntax/semantics transfer which show that QKVAE displays clear signs of disentangled syntax and semantics. We also show that our model displays competitive syntax transfer capabilities when compared to supervised models and that comparable supervised models need a fairly large amount of data (more than 50K samples) to outperform it on both syntactic and semantic transfer. The code for our experiments is publicly available.
Robot Cooking with Stir-fry: Bimanual Non-prehensile Manipulation of Semi-fluid Objects
Abstract
This letter describes an approach for performing the well-known Chinese cooking art of stir-fry on a bimanual robot system. Stir-fry requires a sequence of highly dynamic coordinated movements, which is usually difficult to learn even for a chef, let alone to transfer to robots. In this letter, we define a canonical stir-fry movement and then propose a decoupled framework for learning this deformable-object manipulation from human demonstration. First, the dual arms of the robot are decoupled into different roles (a leader and a follower) and learned separately with classical and neural-network-based methods, transforming the bimanual task into a coordination problem. Second, to obtain general bimanual coordination, we propose a Graph- and Transformer-based model, Structured-Transformer, to capture the spatio-temporal relationship between dual-arm movements. Finally, by adding visual feedback of content deformation, our framework can adjust the movements automatically to achieve the desired stir-fry effect. We verify the framework in a simulator and deploy it on a real bimanual Panda robot system. The experimental results validate that our framework can realize the bimanual robot stir-fry motion and has the potential to extend to other deformable objects requiring bimanual coordination.
Social Distancing Alert with Smartwatches
Authors: Xin Wang, Xilei Wu, Huina Meng, Yuhan Fan, Jingang Shi, Han Ding, Fei Wang
Abstract
Social distancing is an efficient public health practice during the COVID-19 pandemic. However, people may unconsciously violate social distancing when they engage in social activities such as handshaking, hugging, or kissing on the face or forehead. In this paper, we present SoDA, a social-distancing violation alert system based on smartwatches, for preventing COVID-19 virus transmission. SoDA utilizes accelerometer and gyroscope recordings to recognize activities that may violate social distancing with simple yet effective Vision Transformer models. Extensive experiments over 10 volunteers and 1800+ samples demonstrate that SoDA achieves social activity recognition with an accuracy of 94.7%, with 1.8% negative alerts and 2.2% missing alerts.
Zero-shot Code-Mixed Offensive Span Identification through Rationale Extraction
Authors: Manikandan Ravikiran, Bharathi Raja Chakravarthi
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract
This paper investigates the effectiveness of sentence-level transformers for zero-shot offensive span identification on a code-mixed Tamil dataset. More specifically, we evaluate the rationale extraction methods Local Interpretable Model-Agnostic Explanations (LIME) (Ribeiro et al., 2016) and Integrated Gradients (IG) (Sundararajan et al., 2017) for adapting transformer-based offensive language classification models to zero-shot offensive span identification. We find that LIME and IG show baseline F1 of 26.35% and 44.83%, respectively. Besides, we study the effect of dataset size and training process on the overall accuracy of span identification. As a result, we find that both LIME and IG show significant improvement with Masked Data Augmentation and Multilabel Training, with F1 of 50.23% and 47.38%, respectively. Disclaimer: This paper contains examples that may be considered profane, vulgar, or offensive. The examples do not represent the views of the authors or their employers/graduate schools towards any person(s), group(s), practice(s), or entity/entities. Instead they are used to emphasize only the linguistic research challenges.
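For readers new to the second attribution method, here is a hedged sketch of Integrated Gradients (Sundararajan et al., 2017) applied over token embeddings, the kind of per-token attribution that can be thresholded into spans; the toy model, zero baseline, and number of interpolation steps are assumptions, not the paper's setup.

```python
# Integrated Gradients over token embeddings: average the gradient along a
# straight path from a baseline to the input, scaled by (input - baseline).
import torch
import torch.nn as nn

def integrated_gradients(model, embeddings, baseline=None, steps=50):
    """Attribution per token: (emb - baseline) * average gradient along the path."""
    if baseline is None:
        baseline = torch.zeros_like(embeddings)
    total_grad = torch.zeros_like(embeddings)
    for alpha in torch.linspace(0, 1, steps):
        point = (baseline + alpha * (embeddings - baseline)).requires_grad_(True)
        score = model(point)                       # scalar "offensive" score
        grad, = torch.autograd.grad(score, point)
        total_grad += grad
    attributions = (embeddings - baseline) * total_grad / steps
    return attributions.sum(dim=-1)                # one score per token

if __name__ == "__main__":
    torch.manual_seed(0)
    toy = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 1))
    model = lambda e: toy(e).mean()                # pool token logits to a scalar
    token_embs = torch.randn(10, 32)               # 10 tokens, 32-dim embeddings
    print(integrated_gradients(model, token_embs))
```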
Multi Task Learning For Zero Shot Performance Prediction of Multilingual Models
Abstract
Massively Multilingual Transformer based Language Models have been observed to be surprisingly effective on zero-shot transfer across languages, though the performance varies from language to language depending on the pivot language(s) used for fine-tuning. In this work, we build upon some of the existing techniques for predicting the zero-shot performance on a task, by modeling it as a multi-task learning problem. We jointly train predictive models for different tasks which helps us build more accurate predictors for tasks where we have test data in very few languages to measure the actual performance of the model. Our approach also lends us the ability to perform a much more robust feature selection and identify a common set of features that influence zero-shot performance across a variety of tasks.
Multimodal Indoor Localisation for Measuring Mobility in Parkinson's Disease using Transformers
Authors: Ferdian Jovan, Ryan McConville, Catherine Morgan, Emma Tonkin, Alan Whone, Ian Craddock
Abstract
Parkinson's disease (PD) is a slowly progressive, debilitating neurodegenerative disease which is prominently characterised by motor symptoms. Indoor localisation, including the number and speed of room-to-room transitions, provides a proxy outcome which represents mobility and could be used as a digital biomarker to quantify how mobility changes as this disease progresses. We use data collected from 10 people with Parkinson's and 10 controls, each of whom lived for five days in a smart home with various sensors. In order to more effectively localise them indoors, we propose a transformer-based approach utilizing two data modalities, Received Signal Strength Indicator (RSSI) and accelerometer data from wearable devices, which provide complementary views of movement. Our approach makes asymmetric and dynamic correlations by a) learning temporal correlations at different scales and levels, and b) utilizing various gating mechanisms to select relevant features within a modality and suppress unnecessary modalities. On a dataset with real patients, we demonstrate that our proposed method gives an average accuracy of 89.9%, outperforming competitors. We also show that our model is able to better predict in-home mobility for people with Parkinson's, with an average offset of 1.13 seconds from ground truth.
Predicting Human Psychometric Properties Using Computational Language Models
Authors: Antonio Laverghetta Jr., Animesh Nighojkar, Jamshidbek Mirzakhalov, John Licato
Abstract
Transformer-based language models (LMs) continue to achieve state-of-the-art performance on natural language processing (NLP) benchmarks, including tasks designed to mimic human-inspired "commonsense" competencies. To better understand the degree to which LMs can be said to have certain linguistic reasoning skills, researchers are beginning to adapt the tools and concepts from psychometrics. But to what extent can benefits flow in the other direction? In other words, can LMs be of use in predicting the psychometric properties of test items, when those items are given to human participants? If so, the benefit for psychometric practitioners is enormous, as it can reduce the need for multiple rounds of empirical testing. We gather responses from numerous human participants and LMs (transformer- and non-transformer-based) on a broad diagnostic test of linguistic competencies. We then use the human responses to calculate standard psychometric properties of the items in the diagnostic test, using the human responses and the LM responses separately. We then determine how well these two sets of predictions correlate. We find that transformer-based LMs predict the human psychometric data consistently well across most categories, suggesting that they can be used to gather human-like psychometric data without the need for extensive human trials.
Simple Open-Vocabulary Object Detection with Vision Transformers
Abstract
Combining simple architectures with large-scale pre-training has led to massive improvements in image classification. For object detection, pre-training and scaling approaches are less well established, especially in the long-tailed and open-vocabulary setting, where training data is relatively scarce. In this paper, we propose a strong recipe for transferring image-text models to open-vocabulary object detection. We use a standard Vision Transformer architecture with minimal modifications, contrastive image-text pre-training, and end-to-end detection fine-tuning. Our analysis of the scaling properties of this setup shows that increasing image-level pre-training and model size yield consistent improvements on the downstream detection task. We provide the adaptation strategies and regularizations needed to attain very strong performance on zero-shot text-conditioned and one-shot image-conditioned object detection. Code and models are available on GitHub.
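As a generic sketch of zero-shot text-conditioned detection scoring (not the paper's actual model): per-region image embeddings from a detector head are compared against text-query embeddings by cosine similarity, and regions above a threshold are kept for each query. The encoders are replaced by random placeholders, and the threshold is an arbitrary assumption.

```python
# Generic text-conditioned detection scoring: cosine similarity between
# per-box image embeddings and text-query embeddings in a shared space.
import torch
import torch.nn.functional as F

def score_regions(region_embs, text_embs, threshold=0.2):
    """region_embs: (R, D) per-box embeddings; text_embs: (Q, D) query embeddings."""
    region_embs = F.normalize(region_embs, dim=-1)
    text_embs = F.normalize(text_embs, dim=-1)
    sims = region_embs @ text_embs.t()             # (R, Q) similarity matrix
    keep = sims > threshold                        # which boxes match which query
    return sims, keep

if __name__ == "__main__":
    torch.manual_seed(0)
    sims, keep = score_regions(torch.randn(5, 512), torch.randn(3, 512))
    for q in range(keep.size(1)):
        print(f"query {q}: boxes {keep[:, q].nonzero().flatten().tolist()}")
```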
Keyword: SLAM
S3E-GNN: Sparse Spatial Scene Embedding with Graph Neural Networks for Camera Relocalization
Dynamic Dense RGB-D SLAM using Learning-based Visual Odometry
Keyword: odometry
Dynamic Dense RGB-D SLAM using Learning-based Visual Odometry
Efficient Deep Visual and Inertial Odometry with Adaptive Visual Modality Selection
Keyword: livox
There is no result
Keyword: loam
There is no result
Keyword: lidar
CorAl: Introspection for Robust Radar and Lidar Perception in Diverse Environments Using Differential Entropy
CURL: Continuous, Ultra-compact Representation for LiDAR
Keyword: loop detection
There is no result
Keyword: autonomous driving
MPPNet: Multi-Frame Feature Intertwining with Proxy Points for 3D Temporal Object Detection
Keyword: mapping
Tutorial: Analog Matrix Computing (AMC) with Crosspoint Resistive Memory Arrays
S3E-GNN: Sparse Spatial Scene Embedding with Graph Neural Networks for Camera Relocalization
SeGraM: A Universal Hardware Accelerator for Genomic Sequence-to-Graph and Sequence-to-Sequence Mapping
Embodied vision for learning object representations
Keyword: localization
Entity-aware and Motion-aware Transformers for Language-driven Action Localization in Videos
S3E-GNN: Sparse Spatial Scene Embedding with Graph Neural Networks for Camera Relocalization
CorAl: Introspection for Robust Radar and Lidar Perception in Diverse Environments Using Differential Entropy
Keyword: transformer
AppTek's Submission to the IWSLT 2022 Isometric Spoken Language Translation Task
Supplementary Material: Implementation and Experiments for GAU-based Model
Entity-aware and Motion-aware Transformers for Language-driven Action Localization in Videos
SimCPSR: Simple Contrastive Learning for Paper Submission Recommendation System
Exploiting Inductive Bias in Transformers for Unsupervised Disentanglement of Syntax and Semantics with VAEs
Robot Cooking with Stir-fry: Bimanual Non-prehensile Manipulation of Semi-fluid Objects
Social Distancing Alert with Smartwatches
Zero-shot Code-Mixed Offensive Span Identification through Rationale Extraction
Multi Task Learning For Zero Shot Performance Prediction of Multilingual Models
Multimodal Indoor Localisation for Measuring Mobility in Parkinson's Disease using Transformers
Predicting Human Psychometric Properties Using Computational Language Models
Simple Open-Vocabulary Object Detection with Vision Transformers