Abstract
Growing evidence suggests that YouTube's recommendation algorithm plays a role in online radicalization via surfacing extreme content. Radical Islamist groups, in particular, have been profiting from the global appeal of YouTube to disseminate hate and jihadist propaganda. In this quantitative, data-driven study, we investigate the prevalence of religiously intolerant Arabic YouTube videos, the tendency of the platform to recommend such videos, and how these recommendations are affected by demographics and watch history. Based on our deep learning classifier developed to detect hateful videos and a large-scale dataset of over 350K videos, we find that Arabic videos targeting religious minorities are particularly prevalent in search results (30%) and first-level recommendations (21%), and that 15% of overall captured recommendations point to hateful videos. Our personalized audit experiments suggest that gender and religious identity can substantially affect the extent of exposure to hateful content. Our results contribute vital insights into the phenomenon of online radicalization and facilitate curbing online harmful content.
Keeping Less is More: Point Sparsification for Visual SLAM
Authors: Yeonsoo Park, Soohyun Bae
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Abstract
When adapting Simultaneous Localization and Mapping (SLAM) to real-world applications, such as autonomous vehicles, drones, and augmented reality devices, its memory footprint and computing cost are the two main factors limiting the performance and the range of applications. In sparse feature-based SLAM algorithms, one efficient way to address this problem is to limit the number of map points by selecting the points potentially useful for local and global bundle adjustment (BA). This study proposes an efficient graph optimization for sparsifying map points in SLAM systems. Specifically, we formulate a maximum pose-visibility and maximum spatial diversity problem as a minimum-cost maximum-flow graph optimization problem. The proposed method works as an additional step in existing SLAM systems, so it can be used in both conventional and learning-based SLAM systems. Through extensive experimental evaluations, we demonstrate that the proposed method achieves even more accurate camera poses with approximately 1/3 of the map points and 1/2 of the computation.
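To make the minimum-cost maximum-flow idea concrete, here is a minimal sketch in Python. It is not the authors' formulation: the graph layout (source to frames, frames to observed points, points to sink), the per-frame quota, and the integer point costs are all hypothetical choices made for illustration, with a lower cost standing in for a point assumed to be more useful. The solver is a plain successive-shortest-path implementation.

```python
from collections import defaultdict


class MinCostMaxFlow:
    """Successive-shortest-path min-cost max-flow (Bellman-Ford based)."""

    def __init__(self):
        self.graph = defaultdict(list)  # node -> indices into self.edges
        self.edges = []                 # each entry: [head, residual capacity, cost]

    def add_edge(self, u, v, capacity, cost):
        # The forward edge sits at an even index, its reverse edge at index ^ 1.
        index = len(self.edges)
        self.graph[u].append(index)
        self.edges.append([v, capacity, cost])
        self.graph[v].append(index + 1)
        self.edges.append([u, 0, -cost])
        return index

    def solve(self, source, sink):
        flow, cost = 0, 0
        while True:
            # Bellman-Ford over residual edges (reverse edges have negative cost).
            dist, via = {source: 0}, {}
            changed = True
            while changed:
                changed = False
                for u in list(dist):
                    for index in self.graph[u]:
                        v, capacity, edge_cost = self.edges[index]
                        if capacity > 0 and dist[u] + edge_cost < dist.get(v, float("inf")):
                            dist[v] = dist[u] + edge_cost
                            via[v] = index
                            changed = True
            if sink not in dist:
                return flow, cost
            # Walk back from the sink to collect the augmenting path.
            path, node = [], sink
            while node != source:
                path.append(via[node])
                node = self.edges[via[node] ^ 1][0]
            push = min(self.edges[index][1] for index in path)
            for index in path:
                self.edges[index][1] -= push
                self.edges[index ^ 1][1] += push
            flow += push
            cost += push * dist[sink]


# Hypothetical toy data: which frames observe which map points, and a made-up
# cost per point (lower = assumed more valuable to keep).
observations = {"F0": ["P0", "P1", "P2"], "F1": ["P2", "P3"]}
point_cost = {"P0": 3, "P1": 1, "P2": 2, "P3": 5}
points_per_frame = 1  # hypothetical quota of kept points per frame

solver = MinCostMaxFlow()
sink_edge = {p: solver.add_edge(p, "sink", 1, c) for p, c in point_cost.items()}
for frame, visible in observations.items():
    solver.add_edge("source", frame, points_per_frame, 0)
    for p in visible:
        solver.add_edge(frame, p, 1, 0)

flow, total_cost = solver.solve("source", "sink")
# A point is kept when its edge to the sink is saturated.
kept = sorted(p for p, index in sink_edge.items() if solver.edges[index][1] == 0)
print(flow, total_cost, kept)
```

In this toy instance each frame keeps one point, and the cheapest feasible pair is P1 and P2. A real system would derive capacities and costs from visibility and spatial-diversity terms, which the abstract does not specify.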
A Survey on Active Simultaneous Localization and Mapping: State of the Art and New Frontiers
Authors: Julio A. Placed, Jared Strader, Henry Carrillo, Nikolay Atanasov, Vadim Indelman, Luca Carlone, José A. Castellanos
Abstract
Active Simultaneous Localization and Mapping (SLAM) is the problem of planning and controlling the motion of a robot to build the most accurate and complete model of the surrounding environment. Since the first foundational work in active perception appeared, more than three decades ago, this field has received increasing attention across different scientific communities. This has brought about many different approaches and formulations, and makes a review of the current trends necessary and extremely valuable for both new and experienced researchers. In this work, we survey the state-of-the-art in active SLAM and take an in-depth look at the open challenges that still require attention to meet the needs of modern applications. After providing a historical perspective, we present a unified problem formulation and review the classical solution scheme, which decouples the problem into three stages that identify, select, and execute potential navigation actions. We then analyze alternative approaches, including belief-space planning and modern techniques based on deep reinforcement learning, and review related work on multi-robot coordination. The manuscript concludes with a discussion of new research directions, addressing reproducible research, active spatial perception, and practical applications, among other topics.
Keyword: odometry
There is no result
Keyword: livox
There is no result
Keyword: loam
There is no result
Keyword: lidar
LaserMix for Semi-Supervised LiDAR Semantic Segmentation
Authors: Lingdong Kong, Jiawei Ren, Liang Pan, Ziwei Liu
Abstract
Densely annotating LiDAR point clouds is costly, which restrains the scalability of fully-supervised learning methods. In this work, we study the underexplored semi-supervised learning (SSL) in LiDAR segmentation. Our core idea is to leverage the strong spatial cues of LiDAR point clouds to better exploit unlabeled data. We propose LaserMix to mix laser beams from different LiDAR scans, and then encourage the model to make consistent and confident predictions before and after mixing. Our framework has three appealing properties: 1) Generic: LaserMix is agnostic to LiDAR representations (e.g., range view and voxel), and hence our SSL framework can be universally applied. 2) Statistically grounded: We provide a detailed analysis to theoretically explain the applicability of the proposed framework. 3) Effective: Comprehensive experimental analysis on popular LiDAR segmentation datasets (nuScenes, SemanticKITTI, and ScribbleKITTI) demonstrates our effectiveness and superiority. Notably, we achieve competitive results over fully-supervised counterparts with 2x to 5x fewer labels and improve the supervised-only baseline significantly by 10.8% on average. We hope this concise yet high-performing framework could facilitate future research in semi-supervised LiDAR segmentation. Code will be publicly available.
MMFN: Multi-Modal-Fusion-Net for End-to-End Driving
Abstract
Inspired by the fact that humans use diverse sensory organs to perceive the world, sensors with different modalities are deployed in end-to-end driving to obtain the global context of the 3D scene. In previous works, camera and LiDAR inputs are fused through transformers for better driving performance. These inputs are normally further interpreted as high-level map information to assist navigation tasks. Nevertheless, extracting useful information from the complex map input is challenging, for redundant information may mislead the agent and negatively affect driving performance. We propose a novel approach to efficiently extract features from vectorized High-Definition (HD) maps and utilize them in the end-to-end driving tasks. In addition, we design a new expert to further enhance the model performance by considering multi-road rules. Experimental results prove that both of the proposed improvements enable our agent to achieve superior performance compared with other methods.
Keyword: loop detection
There is no result
Keyword: nerf
There is no result
Keyword: mapping
Facilitating Conversational Interaction in Natural Language Interfaces for Visualization
Authors: Rishab Mitra, Arpit Narechania, Alex Endert, John Stasko
Abstract
Natural language (NL) toolkits enable visualization developers, who may not have a background in natural language processing (NLP), to create natural language interfaces (NLIs) for end-users to flexibly specify and interact with visualizations. However, these toolkits currently only support one-off utterances, with minimal capability to facilitate a multi-turn dialog between the user and the system. Developing NLIs with such conversational interaction capabilities remains a challenging task, requiring implementations of low-level NLP techniques to process a new query as an intent to follow-up on an older query. We extend an existing Python-based toolkit, NL4DV, that processes an NL query about a tabular dataset and returns an analytic specification containing data attributes, analytic tasks, and relevant visualizations, modeled as a JSON object. Specifically, NL4DV now enables developers to facilitate multiple simultaneous conversations about a dataset and resolve associated ambiguities, augmenting new conversational information into the output JSON object. We demonstrate these capabilities through three examples: (1) an NLI to learn aspects of the Vega-Lite grammar, (2) a mind mapping application to create free-flowing conversations, and (3) a chatbot to answer questions and resolve ambiguities.
Keeping Less is More: Point Sparsification for Visual SLAM
Authors: Yeonsoo Park, Soohyun Bae
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Abstract
When adapting Simultaneous Localization and Mapping (SLAM) to real-world applications, such as autonomous vehicles, drones, and augmented reality devices, its memory footprint and computing cost are the two main factors limiting the performance and the range of applications. In sparse feature-based SLAM algorithms, one efficient way to address this problem is to limit the number of map points by selecting the points potentially useful for local and global bundle adjustment (BA). This study proposes an efficient graph optimization for sparsifying map points in SLAM systems. Specifically, we formulate a maximum pose-visibility and maximum spatial diversity problem as a minimum-cost maximum-flow graph optimization problem. The proposed method works as an additional step in existing SLAM systems, so it can be used in both conventional and learning-based SLAM systems. Through extensive experimental evaluations, we demonstrate that the proposed method achieves even more accurate camera poses with approximately 1/3 of the map points and 1/2 of the computation.
A Survey on Active Simultaneous Localization and Mapping: State of the Art and New Frontiers
Authors: Julio A. Placed, Jared Strader, Henry Carrillo, Nikolay Atanasov, Vadim Indelman, Luca Carlone, José A. Castellanos
Abstract
Active Simultaneous Localization and Mapping (SLAM) is the problem of planning and controlling the motion of a robot to build the most accurate and complete model of the surrounding environment. Since the first foundational work in active perception appeared, more than three decades ago, this field has received increasing attention across different scientific communities. This has brought about many different approaches and formulations, and makes a review of the current trends necessary and extremely valuable for both new and experienced researchers. In this work, we survey the state-of-the-art in active SLAM and take an in-depth look at the open challenges that still require attention to meet the needs of modern applications. After providing a historical perspective, we present a unified problem formulation and review the classical solution scheme, which decouples the problem into three stages that identify, select, and execute potential navigation actions. We then analyze alternative approaches, including belief-space planning and modern techniques based on deep reinforcement learning, and review related work on multi-robot coordination. The manuscript concludes with a discussion of new research directions, addressing reproducible research, active spatial perception, and practical applications, among other topics.
SDRTV-to-HDRTV via Hierarchical Dynamic Context Feature Mapping
Authors: Gang He, Kepeng Xu, Li Xu, Chang Wu, Ming Sun, Xing Wen, Yu-Wing Tai
Abstract
In this work, we address the task of converting SDR videos to HDR videos (SDRTV-to-HDRTV). Previous approaches use global feature modulation for SDRTV-to-HDRTV. Feature modulation scales and shifts the features in the original feature space, which has limited mapping capability. In addition, the global image mapping cannot restore detail in HDR frames due to the luminance differences in different regions of SDR frames. To resolve these issues, we propose a two-stage solution. The first stage is a hierarchical dynamic context feature mapping (HDCFM) model. HDCFM learns the SDR frame to HDR frame mapping function via a hierarchical feature modulation (HME and HM) module and a dynamic context feature transformation (DCT) module. The HME estimates the feature modulation vector, while the HM performs hierarchical feature modulation, consisting of global feature modulation in series with local feature modulation, and is capable of adaptive mapping of local image features. The DCT module constructs a feature transformation module in conjunction with the context, which is capable of adaptively generating a feature transformation matrix for feature mapping. Compared with simple feature scaling and shifting, the DCT module can map features into a new feature space and thus has a stronger feature mapping capability. In the second stage, we introduce a patch discriminator-based context generation model (PDCG) to obtain subjective quality enhancement of over-exposed regions. PDCG addresses the problem that the model is challenging to train due to the proportion of overexposed regions of the image. The proposed method can achieve state-of-the-art objective and subjective quality results. Specifically, HDCFM achieves a PSNR gain of 0.81 dB with about 100K parameters. The number of parameters is 1/14th of the previous state-of-the-art methods. The test code will be released soon.
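The contrast between scale-and-shift modulation and a matrix-based feature transformation can be illustrated with a tiny Python sketch. This is only a toy on a two-channel feature vector, not the HDCFM, HM, or DCT modules themselves; the function names and the numbers are hypothetical.

```python
def modulate(features, scale, shift):
    """Scale-and-shift modulation: each channel is adjusted independently."""
    return [s * f + b for f, s, b in zip(features, scale, shift)]


def transform(features, matrix):
    """Matrix transformation: each output channel can mix all input channels."""
    return [sum(w * f for w, f in zip(row, features)) for row in matrix]


features = [1.0, 2.0]

# Modulation can only stretch and offset each channel within its own axis.
modulated = modulate(features, scale=[2.0, 0.5], shift=[0.1, -0.1])

# A full matrix can also mix channels, e.g. swap them, which no
# per-channel scale and shift can express for arbitrary inputs.
swapped = transform(features, [[0.0, 1.0], [1.0, 0.0]])

print(modulated, swapped)
```

Per-channel modulation corresponds to a diagonal matrix plus a bias, so it is a special case of the matrix form; this is one way to read the abstract's claim that the matrix-based mapping has more capacity.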
Fine-grained Correlation Loss for Regression
Authors: Chaoyu Chen, Xin Yang, Ruobing Huang, Xindi Hu, Yankai Huang, Xiduo Lu, Xinrui Zhou, Mingyuan Luo, Yinyu Ye, Xue Shuang, Juzheng Miao, Yi Xiong, Dong Ni
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Regression learning is classic and fundamental for medical image analysis. It provides the continuous mapping for many critical applications, such as attribute estimation, object detection, segmentation, and non-rigid registration. However, previous studies mainly took case-wise criteria, such as the mean squared error, as the optimization objectives. They ignored the very important population-wise correlation criterion, which is exactly the final evaluation metric in many tasks. In this work, we propose to revisit the classic regression tasks with novel investigations on directly optimizing the fine-grained correlation losses. We mainly explore two complementary correlation indexes as learnable losses: Pearson linear correlation (PLC) and Spearman rank correlation (SRC). The contributions of this paper are twofold. First, for the PLC at the global level, we propose a strategy to make it robust against outliers and regularize the key distribution factors. These efforts significantly stabilize the learning and magnify the efficacy of PLC. Second, for the SRC at the local level, we propose a coarse-to-fine scheme to ease the learning of the exact ranking order among samples. Specifically, we convert the learning for the ranking of samples into the learning of similarity relationships among samples. We extensively validate our method on two typical ultrasound image regression tasks, including image quality assessment and biometric measurement. Experiments prove that, with the fine-grained guidance in directly optimizing the correlation, the regression performance is significantly improved. Our proposed correlation losses are general and can be extended to more important applications.
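For readers unfamiliar with the two correlation indexes, the following Python sketch computes the textbook Pearson and Spearman coefficients and a simple "one minus Pearson" loss. It is only the standard definition, not the paper's robust PLC loss or its coarse-to-fine SRC scheme; in particular, the hard ranking used here is not differentiable, which is presumably part of why the authors need a dedicated scheme. The function names and sample values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            result[order[k]] = average_rank
        start = end + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def plc_loss(prediction, target):
    """A plain correlation loss: zero when prediction and target are perfectly linearly related."""
    return 1.0 - pearson(prediction, target)


target = [1.0, 2.0, 3.0, 4.0, 5.0]
prediction = [1.1, 1.9, 3.2, 3.9, 10.0]  # correct order, one large outlier

print(pearson(prediction, target), spearman(prediction, target), plc_loss(prediction, target))
```

The outlier lowers the Pearson coefficient while the Spearman coefficient stays at 1 because the ordering is intact, which shows why the two indexes are described as complementary.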
Artificial Intelligence Techniques for Next-Generation Mega Satellite Networks
Authors: Bassel Al Homssi, Kosta Dakic, Ke Wang, Tansu Alpcan, Ben Allen, Sithamparanathan Kandeepan, Akram Al-Hourani, Walid Saad
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Abstract
Space communications, particularly mega satellite networks, re-emerged as an appealing candidate for next generation networks due to major advances in space launching, electronics, processing power, and miniaturization. However, mega satellite networks rely on numerous underlying and intertwined processes that cannot be truly captured using conventionally used models, due to their dynamic and unique features such as orbital speed, inter-satellite links, short time pass, and satellite footprint, among others. Hence, new approaches are needed to enable the network to proactively adjust to the rapidly varying conditions associated with the link. Artificial intelligence (AI) provides a pathway to capture these processes, analyze their behavior, and model their effect on the network. This article introduces the application of AI techniques for integrated terrestrial satellite networks, particularly mega satellite network communications. It details the unique features of mega satellite networks, and the overarching challenges concomitant with their integration into the current communication infrastructure. Moreover, the article provides insights into state-of-the-art AI techniques across various layers of the communication link. This entails applying AI for forecasting the highly dynamic radio channel, spectrum sensing and classification, signal detection and demodulation, inter-satellite link and satellite access network optimization, and network security. Moreover, future paradigms and the mapping of these mechanisms onto practical networks are outlined.
Keyword: localization
Data generation using simulation technology to improve perception mechanism of autonomous vehicles
Authors: Minh Cao, Ramin Ramezani
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Recent advancements in computer graphics technology allow more realistic rendering of car driving environments. They have enabled self-driving car simulators such as DeepGTA-V and CARLA (Car Learning to Act) to generate large amounts of synthetic data that can complement the existing real-world dataset in training autonomous car perception. Furthermore, since self-driving car simulators allow full control of the environment, they can generate dangerous driving scenarios that the real-world dataset lacks, such as bad weather and accident scenarios. In this paper, we will demonstrate the effectiveness of combining data gathered from the real world with data generated in the simulated world to train perception systems on object detection and localization tasks. We will also propose a multi-level deep learning perception framework that aims to emulate a human learning experience in which a series of tasks from the simple to more difficult ones are learned in a certain domain. The autonomous car perceptron can learn from easy-to-drive scenarios to more challenging ones customized by simulation software.
Keeping Less is More: Point Sparsification for Visual SLAM
Authors: Yeonsoo Park, Soohyun Bae
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Abstract
When adapting Simultaneous Localization and Mapping (SLAM) to real-world applications, such as autonomous vehicles, drones, and augmented reality devices, its memory footprint and computing cost are the two main factors limiting the performance and the range of applications. In sparse feature-based SLAM algorithms, one efficient way to address this problem is to limit the number of map points by selecting the points potentially useful for local and global bundle adjustment (BA). This study proposes an efficient graph optimization for sparsifying map points in SLAM systems. Specifically, we formulate a maximum pose-visibility and maximum spatial diversity problem as a minimum-cost maximum-flow graph optimization problem. The proposed method works as an additional step in existing SLAM systems, so it can be used in both conventional and learning-based SLAM systems. Through extensive experimental evaluations, we demonstrate that the proposed method achieves even more accurate camera poses with approximately 1/3 of the map points and 1/2 of the computation.
Point Cloud Change Detection With Stereo V-SLAM: Dataset, Metrics and Baseline
Authors: Zihan Lin, Jincheng Yu, Lipu Zhou, Xudong Zhang, Jian Wang, Yu Wang
Abstract
Localization and navigation are basic robotic tasks that require an accurate and up-to-date map, and using crowdsourced data to detect map changes is an appealing solution. Collecting and processing crowdsourced data requires low-cost sensors and algorithms, but existing methods rely on expensive sensors or computationally expensive algorithms. Additionally, there is no existing dataset to evaluate point cloud change detection. Thus, this paper proposes a novel framework using low-cost sensors like stereo cameras and IMU to detect changes in a point cloud map. Moreover, we create a dataset and the corresponding metrics to evaluate point cloud change detection with the help of the high-fidelity simulator Unreal Engine 4. Experiments show that our visual-based framework can effectively detect the changes in our dataset.
Computer-aided Tuberculosis Diagnosis with Attribute Reasoning Assistance
Abstract
Although deep learning algorithms have been intensively developed for computer-aided tuberculosis diagnosis (CTD), they mainly depend on carefully annotated datasets, leading to much time and resource consumption. Weakly supervised learning (WSL), which leverages coarse-grained labels to accomplish fine-grained tasks, has the potential to solve this problem. In this paper, we first propose a new large-scale tuberculosis (TB) chest X-ray dataset, namely the tuberculosis chest X-ray attribute dataset (TBX-Att), and then establish an attribute-assisted weakly-supervised framework to classify and localize TB by leveraging the attribute information to overcome the insufficiency of supervision in WSL scenarios. Specifically, first, the TBX-Att dataset contains 2000 X-ray images with seven kinds of attributes for TB relational reasoning, which are annotated by experienced radiologists. It also includes the public TBX11K dataset with 11200 X-ray images to facilitate weakly supervised detection. Second, we exploit a multi-scale feature interaction model for TB area classification and detection with attribute relational reasoning. The proposed model is evaluated on the TBX-Att dataset and will serve as a solid baseline for future research. The code and data will be available at https://github.com/GangmingZhao/tb-attribute-weak-localization.
A Survey on Active Simultaneous Localization and Mapping: State of the Art and New Frontiers
Authors: Julio A. Placed, Jared Strader, Henry Carrillo, Nikolay Atanasov, Vadim Indelman, Luca Carlone, José A. Castellanos
Abstract
Active Simultaneous Localization and Mapping (SLAM) is the problem of planning and controlling the motion of a robot to build the most accurate and complete model of the surrounding environment. Since the first foundational work in active perception appeared, more than three decades ago, this field has received increasing attention across different scientific communities. This has brought about many different approaches and formulations, and makes a review of the current trends necessary and extremely valuable for both new and experienced researchers. In this work, we survey the state-of-the-art in active SLAM and take an in-depth look at the open challenges that still require attention to meet the needs of modern applications. After providing a historical perspective, we present a unified problem formulation and review the classical solution scheme, which decouples the problem into three stages that identify, select, and execute potential navigation actions. We then analyze alternative approaches, including belief-space planning and modern techniques based on deep reinforcement learning, and review related work on multi-robot coordination. The manuscript concludes with a discussion of new research directions, addressing reproducible research, active spatial perception, and practical applications, among other topics.
Can we learn from developer mistakes? Learning to localize and repair real bugs from real bug fixes
Abstract
Real bug fixes found in open source repositories seem to be the perfect source for learning to localize and repair real bugs. However, the absence of large scale bug fix collections has made it difficult to effectively exploit real bug fixes in the training of larger neural models in the past. In contrast, artificial bugs -- produced by mutating existing source code -- can be easily obtained at a sufficient scale and are therefore often preferred in the training of existing approaches. Still, localization and repair models that are trained on artificial bugs usually underperform when faced with real bugs. This raises the question whether bug localization and repair models trained on real bug fixes are more effective in localizing and repairing real bugs. We address this question by introducing RealiT, a pre-train-and-fine-tune approach for effectively learning to localize and repair real bugs from real bug fixes. RealiT is first pre-trained on a large number of artificial bugs produced by traditional mutation operators and then fine-tuned on a smaller set of real bug fixes. Fine-tuning does not require any modifications of the learning algorithm and hence can be easily adopted in various training scenarios for bug localization or repair (even when real training data is scarce). In addition, we found that training on real bug fixes with RealiT is empirically powerful by nearly doubling the localization performance of an existing model on real bugs while maintaining or even improving the repair performance.
Agent with Tangent-based Formulation and Anatomical Perception for Standard Plane Localization in 3D Ultrasound
Authors: Yuxin Zou, Haoran Dou, Yuhao Huang, Xin Yang, Jikuan Qian, Chaojiong Zhen, Xiaodan Ji, Nishant Ravikumar, Guoqiang Chen, Weijun Huang, Alejandro F. Frangi, Dong Ni
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Abstract
Standard plane (SP) localization is essential in routine clinical ultrasound (US) diagnosis. Compared to 2D US, 3D US can acquire multiple view planes in one scan and provide complete anatomy with the addition of the coronal plane. However, manually navigating SPs in 3D US is laborious and biased due to the orientation variability and huge search space. In this study, we introduce a novel reinforcement learning (RL) framework for automatic SP localization in 3D US. Our contribution is three-fold. First, we formulate SP localization in 3D US as a tangent-point-based problem in RL to restructure the action space and significantly reduce the search space. Second, we design an auxiliary task learning strategy to enhance the model's ability to recognize subtle differences between non-SPs and SPs in plane search. Finally, we propose a spatial-anatomical reward to effectively guide learning trajectories by exploiting spatial and anatomical information simultaneously. We explore the efficacy of our approach on localizing four SPs on uterus and fetal brain datasets. The experiments indicate that our approach achieves a high localization accuracy as well as robust performance.
Keyword: transformer
DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale
Authors: Reza Yazdani Aminabadi, Samyam Rajbhandari, Minjia Zhang, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Jeff Rasley, Shaden Smith, Olatunji Ruwase, Yuxiong He
Abstract
The past several years have witnessed the success of transformer-based models, and their scale and application scenarios continue to grow aggressively. The current landscape of transformer models is increasingly diverse: the model size varies drastically with the largest being of hundred-billion parameters; the model characteristics differ due to the sparsity introduced by the Mixture-of-Experts; the target application scenarios can be latency-critical or throughput-oriented; the deployment hardware could be single- or multi-GPU systems with different types of memory and storage, etc. With such increasing diversity and the fast-evolving pace of transformer models, designing a highly performant and efficient inference system is extremely challenging. In this paper, we present DeepSpeed Inference, a comprehensive system solution for transformer model inference to address the above-mentioned challenges. DeepSpeed Inference consists of (1) a multi-GPU inference solution to minimize latency while maximizing the throughput of both dense and sparse transformer models when they fit in aggregate GPU memory, and (2) a heterogeneous inference solution that leverages CPU and NVMe memory in addition to the GPU memory and compute to enable high inference throughput with large models which do not fit in aggregate GPU memory. DeepSpeed Inference reduces latency by up to 7.3x over the state-of-the-art for latency-oriented scenarios and increases throughput by over 1.5x for throughput-oriented scenarios. Moreover, it enables trillion parameter scale inference under real-time latency constraints by leveraging hundreds of GPUs, an unprecedented scale for inference. It can inference 25x larger models than with GPU-only solutions, while delivering a high throughput of 84 TFLOPS (over 50% of A6000 peak).
GaitForeMer: Self-Supervised Pre-Training of Transformers via Human Motion Forecasting for Few-Shot Gait Impairment Severity Estimation
Authors: Mark Endo, Kathleen L. Poston, Edith V. Sullivan, Li Fei-Fei, Kilian M. Pohl, Ehsan Adeli
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Abstract
Parkinson's disease (PD) is a neurological disorder that has a variety of observable motor-related symptoms such as slow movement, tremor, muscular rigidity, and impaired posture. PD is typically diagnosed by evaluating the severity of motor impairments according to scoring systems such as the Movement Disorder Society Unified Parkinson's Disease Rating Scale (MDS-UPDRS). Automated severity prediction using video recordings of individuals provides a promising route for non-intrusive monitoring of motor impairments. However, the limited size of PD gait data hinders model ability and clinical potential. Because of this clinical data scarcity and inspired by the recent advances in self-supervised large-scale language models like GPT-3, we use human motion forecasting as an effective self-supervised pre-training task for the estimation of motor impairment severity. We introduce GaitForeMer, Gait Forecasting and impairment estimation transforMer, which is first pre-trained on public datasets to forecast gait movements and then applied to clinical data to predict MDS-UPDRS gait impairment severity. Our method outperforms previous approaches that rely solely on clinical data by a large margin, achieving an F1 score of 0.76, precision of 0.79, and recall of 0.75. Using GaitForeMer, we show how public human movement data repositories can assist clinical use cases through learning universal motion representations. The code is available at https://github.com/markendo/GaitForeMer .
Language model compression with weighted low-rank factorization
Abstract
Factorizing a large matrix into small matrices is a popular strategy for model compression. Singular value decomposition (SVD) plays a vital role in this compression strategy, approximating a learned matrix with fewer parameters. However, SVD minimizes the squared error toward reconstructing the original matrix without gauging the importance of the parameters, potentially giving a larger reconstruction error for those that affect the task accuracy more. In other words, the optimization objective of SVD is not aligned with the trained model's task accuracy. We analyze this previously unexplored problem, make observations, and address it by introducing Fisher information to weigh the importance of parameters affecting the model prediction. This idea leads to our method: Fisher-Weighted SVD (FWSVD). Although the factorized matrices from our approach do not result in smaller reconstruction errors, we find that our resulting task accuracy is much closer to the original model's performance. We perform analysis with the transformer-based language models, showing our weighted SVD largely alleviates the mismatched optimization objectives and can maintain model performance with a higher compression rate. Our method can directly compress a task-specific model while achieving better performance than other compact model strategies requiring expensive model pre-training. Moreover, the evaluation of compressing an already compact model shows our method can further reduce the parameter count by 9% to 30% with an insignificant impact on task accuracy.
Rethinking Surgical Captioning: End-to-End Window-Based MLP Transformer Using Patches
Abstract
Surgical captioning plays an important role in surgical instruction prediction and report generation. However, the majority of captioning models still rely on a computationally heavy object detector or feature extractor to extract regional features. In addition, the detection model requires additional bounding box annotation, which is costly and needs skilled annotators. These factors lead to inference delay and limit the deployment of captioning models in real-time robotic surgery. To address this, we design an end-to-end detector and feature extractor-free captioning model by utilizing the patch-based shifted window technique. We propose the Shifted Window-Based Multi-Layer Perceptrons Transformer Captioning model (SwinMLP-TranCAP) with faster inference speed and less computation. SwinMLP-TranCAP replaces the multi-head attention module with window-based multi-head MLP. Such deployments primarily focus on image understanding tasks, and very few works investigate the caption generation task. SwinMLP-TranCAP is also extended into a video version for video captioning tasks using 3D patches and windows. Compared with previous detector-based or feature extractor-based models, our models greatly simplify the architecture design while maintaining performance on two surgical datasets. The code is publicly available at https://github.com/XuMengyaAmy/SwinMLP_TranCAP.
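The shifted window idea mentioned in this abstract can be sketched on a toy grid of patch indices. This is a minimal illustration of window partitioning with a cyclic shift in the general Swin style, not the authors' implementation; the helper names, the 4x4 grid, and the window size are assumptions, and the MLP that would mix patches inside each window is omitted.

```python
def cyclic_shift(grid, shift):
    """Roll a 2D grid of patches up and left by `shift` positions."""
    h, w = len(grid), len(grid[0])
    return [[grid[(r + shift) % h][(c + shift) % w] for c in range(w)]
            for r in range(h)]


def window_partition(grid, window):
    """Split a 2D grid into non-overlapping window x window blocks."""
    h, w = len(grid), len(grid[0])
    assert h % window == 0 and w % window == 0
    windows = []
    for r0 in range(0, h, window):
        for c0 in range(0, w, window):
            windows.append([grid[r][c]
                            for r in range(r0, r0 + window)
                            for c in range(c0, c0 + window)])
    return windows


# A 4x4 grid of patch indices 0..15, partitioned into 2x2 windows.
grid = [[4 * r + c for c in range(4)] for r in range(4)]
regular = window_partition(grid, window=2)
shifted = window_partition(cyclic_shift(grid, shift=1), window=2)

print("regular windows:", regular)
print("shifted windows:", shifted)
```

In the shifted layout, the first window gathers patches 5, 6, 9, and 10, which belong to four different regular windows, so alternating the two layouts lets information cross window boundaries.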
TENET: Transformer Encoding Network for Effective Temporal Flow on Motion Prediction
Abstract
This technical report presents an effective method for motion prediction in autonomous driving. We develop a Transformer-based method for input encoding and trajectory prediction. In addition, we propose the Temporal Flow Header to enhance the trajectory encoding. Finally, an efficient K-means ensemble method is used. Using our Transformer network and ensemble method, we won first place in the Argoverse 2 Motion Forecasting Challenge with a state-of-the-art brier-minFDE score of 1.90.
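For readers unfamiliar with the brier-minFDE metric quoted above, the sketch below assumes the usual benchmark definition: the final displacement error of the forecast closest to the ground-truth endpoint, plus a Brier-style penalty (1 - p)^2 on the probability p assigned to that forecast. The function name and the toy trajectories are hypothetical.

```python
import math


def brier_min_fde(forecasts, probabilities, ground_truth_endpoint):
    """Return minFDE plus the Brier penalty (1 - p)^2 of the best forecast.

    forecasts: list of trajectories, each a list of (x, y) points.
    probabilities: one confidence per trajectory, assumed to sum to 1.
    """
    errors = [math.dist(traj[-1], ground_truth_endpoint) for traj in forecasts]
    best = min(range(len(errors)), key=errors.__getitem__)
    return errors[best] + (1.0 - probabilities[best]) ** 2


# Two toy forecasts; the first ends 0.5 m from the true endpoint.
forecasts = [
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],
    [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)],
]
score = brier_min_fde(forecasts, [0.7, 0.3], (2.0, 0.5))
print(round(score, 2))  # 0.5 + (1 - 0.7)^2 = 0.59
```

Lower is better: a forecast set is rewarded both for having one accurate endpoint and for assigning that forecast a high probability.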
MMFN: Multi-Modal-Fusion-Net for End-to-End Driving
Abstract
Inspired by the fact that humans use diverse sensory organs to perceive the world, sensors with different modalities are deployed in end-to-end driving to obtain the global context of the 3D scene. In previous works, camera and LiDAR inputs are fused through transformers for better driving performance. These inputs are normally further interpreted as high-level map information to assist navigation tasks. Nevertheless, extracting useful information from the complex map input is challenging, for redundant information may mislead the agent and negatively affect driving performance. We propose a novel approach to efficiently extract features from vectorized High-Definition (HD) maps and utilize them in the end-to-end driving tasks. In addition, we design a new expert to further enhance the model performance by considering multi-road rules. Experimental results prove that both of the proposed improvements enable our agent to achieve superior performance compared with other methods.
Rethinking Query-Key Pairwise Interactions in Vision Transformers
Authors: Cheng Li, Yangxin Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Vision Transformers have achieved state-of-the-art performance in many visual tasks. Due to the quadratic computational and memory complexities of self-attention, recent works either apply attention only to low-resolution inputs or restrict the receptive field to a small local region. To overcome these limitations, we propose key-only attention, which excludes query-key pairwise interactions and uses a compute-efficient saliency-gate to obtain attention weights, modeling local-global interactions in all stages. Key-only attention has linear computational and memory complexities w.r.t. input size. We use an alternating layout to hybridize convolution and attention layers instead of the grafting suggested by previous works, so that all stages can benefit from both spatial attention and convolution. We leverage these improvements to develop a new self-attention model family, LinGlos, which reaches state-of-the-art accuracy in the parameter-limited setting of the ImageNet classification benchmark and significantly outperforms baselines in downstream tasks, e.g., COCO object detection and ADE20K semantic segmentation.
Visual Transformer Meets CutMix for Improved Accuracy, Communication Efficiency, and Data Privacy in Split Learning
Authors: Sihun Baek, Jihong Park, Praneeth Vepakomma, Ramesh Raskar, Mehdi Bennis, Seong-Lyun Kim
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract
This article seeks a distributed learning solution for visual transformer (ViT) architectures. Compared to convolutional neural network (CNN) architectures, ViTs often have larger model sizes and are computationally expensive, making federated learning (FL) ill-suited. Split learning (SL) can circumvent this problem by splitting a model and communicating the hidden representations at the split-layer, also known as smashed data. However, the smashed data of ViT are as large as, and similar to, the input data, negating the communication efficiency of SL while violating data privacy. To resolve these issues, we propose a new form of CutSmashed data by randomly punching and compressing the original smashed data. Leveraging this, we develop a novel SL framework for ViT, coined CutMixSL, communicating CutSmashed data. CutMixSL not only reduces communication costs and privacy leakage, but also inherently involves the CutMix data augmentation, improving accuracy and scalability. Simulations corroborate that CutMixSL outperforms baselines such as parallelized SL and SplitFed, which integrates FL with SL.
DALG: Deep Attentive Local and Global Modeling for Image Retrieval
Authors: Yuxin Song, Ruolin Zhu, Min Yang, Dongliang He
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Deeply learned representations have achieved superior image retrieval performance in a retrieve-then-rerank manner. A recent state-of-the-art single-stage model, which heuristically fuses local and global features, achieves a promising trade-off between efficiency and effectiveness. However, we notice that the efficiency of existing solutions is still restricted because of their multi-scale inference paradigm. In this paper, we follow the single-stage approach and obtain a better complexity-effectiveness balance by getting rid of multi-scale testing. To achieve this goal, we abandon the widely used convolutional network, given its limitations in exploring diverse visual patterns, and resort to a fully attention-based framework for robust representation learning, motivated by the success of the Transformer. Besides applying a Transformer for global feature extraction, we devise a local branch composed of window-based multi-head attention and spatial attention to fully exploit local image patterns. Furthermore, we propose to combine the hierarchical local and global features via a cross-attention module, instead of using heuristic fusion as previous work does. With our Deep Attentive Local and Global modeling framework (DALG), extensive experimental results show that efficiency can be significantly improved while maintaining results competitive with the state of the art.
TopicFM: Robust and Interpretable Feature Matching with Topic-assisted
Authors: Khang Truong Giang, Soohwan Song, Sungho Jo
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Finding correspondences across images is an important task in many visual applications. Recent state-of-the-art methods focus on end-to-end learning-based architectures designed in a coarse-to-fine manner. They use a very deep CNN or multi-block Transformer to learn robust representations, which requires high computation power. Moreover, these methods learn features without reasoning about the objects and shapes inside images, and thus lack interpretability. In this paper, we propose an architecture for image matching which is efficient, robust, and interpretable. More specifically, we introduce a novel feature matching module called TopicFM, which can roughly organize the same spatial structure across images into a topic and then augment the features inside each topic for accurate matching. To infer topics, we first learn a global embedding of topics and then use a latent-variable model to detect and then assign the image structures to topics. Our method can perform matching only in co-visibility regions to reduce computation. Extensive experiments on both outdoor and indoor datasets show that our method outperforms recent methods in terms of matching performance and computational efficiency. The code is available at https://github.com/TruongKhang/TopicFM.
ReLER@ZJU-Alibaba Submission to the Ego4D Natural Language Queries Challenge 2022
Abstract
In this report, we present the ReLER@ZJU-Alibaba submission to the Ego4D Natural Language Queries (NLQ) Challenge in CVPR 2022. Given a video clip and a text query, the goal of this challenge is to locate a temporal moment of the video clip where the answer to the query can be obtained. To tackle this task, we propose a multi-scale cross-modal transformer and a video frame-level contrastive loss to fully uncover the correlation between language queries and video clips. Besides, we propose two data augmentation strategies to increase the diversity of training samples. The experimental results demonstrate the effectiveness of our method. The final submission ranked first on the leaderboard.
Conditional Generation with a Question-Answering Blueprint
Abstract
The ability to convey relevant and faithful information is critical for many tasks in conditional generation and yet remains elusive for neural seq-to-seq models whose outputs often reveal hallucinations and fail to correctly cover important details. In this work, we advocate planning as a useful intermediate representation for rendering conditional generation less opaque and more grounded. Our work proposes a new conceptualization of text plans as a sequence of question-answer (QA) pairs. We enhance existing datasets (e.g., for summarization) with a QA blueprint operating as a proxy for both content selection (i.e., what to say) and planning (i.e., in what order). We obtain blueprints automatically by exploiting state-of-the-art question generation technology and convert input-output pairs into input-blueprint-output tuples. We develop Transformer-based models, each varying in how they incorporate the blueprint in the generated output (e.g., as a global plan or iteratively). Evaluation across metrics and datasets demonstrates that blueprint models are more factual than alternatives which do not resort to planning and allow tighter control of the generation output.
Multi-task Driver Steering Behaviour Modeling Using Time-Series Transformer
Abstract
Human intention prediction provides an augmented solution for the design of assistants and collaboration between the human driver and intelligent vehicles. In this study, a multi-task sequential learning framework is developed to predict future steering torques and steering postures based on upper limb neuromuscular electromyography (EMG) signals. A single-right-hand driving mode is studied in particular. For this driving mode, three different driving postures are also evaluated. Then, a multi-task time-series transformer network (MTS-Trans) is developed to predict the steering torques and driving postures. To evaluate the multi-task learning performance, four different frameworks are assessed. Twenty-one participants are involved in the driving simulator-based experiment. The proposed model achieved accurate results on future steering torque prediction and driving posture recognition for single-hand driving modes. The proposed system can contribute to the development of advanced driver steering assistant systems and ensure mutual understanding between human drivers and intelligent vehicles.
Masked Autoencoders for Self-Supervised Learning on Automotive Point Clouds
Authors: Georg Hess, Johan Jaxing, Elias Svensson, David Hagerman, Christoffer Petersson, Lennart Svensson
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract
Masked autoencoding has become a successful pre-training paradigm for Transformer models for text, images, and recently, point clouds. Raw automotive datasets are a suitable candidate for self-supervised pre-training as they generally are cheap to collect compared to annotations for tasks like 3D object detection (OD). However, development of masked autoencoders for point clouds has focused solely on synthetic and indoor data. Consequently, existing methods have tailored their representations and models toward point clouds which are small, dense and have homogeneous point density. In this work, we study masked autoencoding for point clouds in an automotive setting, which are sparse and for which the point density can vary drastically among objects in the same scene. To this end, we propose Voxel-MAE, a simple masked autoencoding pre-training scheme designed for voxel representations. We pre-train the backbone of a Transformer-based 3D object detector to reconstruct masked voxels and to distinguish between empty and non-empty voxels. Our method improves the 3D OD performance by 1.75 mAP points and 1.05 NDS on the challenging nuScenes dataset. Compared to existing self-supervised methods for automotive data, Voxel-MAE displays up to $2\times$ performance increase. Further, we show that by pre-training with Voxel-MAE, we require only 40% of the annotated data to outperform a randomly initialized equivalent. Code will be released.
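The pre-training setup described in this abstract can be pictured with a small sketch: points are grouped into voxels, a random subset of the non-empty voxels is hidden, and the hidden voxels become reconstruction targets alongside an empty versus non-empty label. The helper names, the voxel size, and the 70% mask ratio are illustrative assumptions and are not taken from the paper.

```python
import random


def voxelize(points, voxel_size):
    """Group 3D points by the integer index of the voxel containing them."""
    voxels = {}
    for x, y, z in points:
        key = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        voxels.setdefault(key, []).append((x, y, z))
    return voxels


def mask_voxels(voxels, mask_ratio, rng):
    """Hide a random fraction of the non-empty voxels."""
    keys = sorted(voxels)
    n_masked = int(round(mask_ratio * len(keys)))
    masked = set(rng.sample(keys, n_masked))
    visible = {k: v for k, v in voxels.items() if k not in masked}
    return visible, masked


def occupancy_label(voxels, key):
    """Binary target: 1 if the voxel holds at least one point, else 0."""
    return 1 if key in voxels else 0


rng = random.Random(0)
points = [(rng.uniform(0, 10), rng.uniform(0, 10), rng.uniform(0, 2))
          for _ in range(200)]
voxels = voxelize(points, voxel_size=1.0)
visible, masked = mask_voxels(voxels, mask_ratio=0.7, rng=rng)

# Reconstruction targets: the points inside each hidden voxel.
targets = {key: voxels[key] for key in masked}
print(len(voxels), "non-empty voxels,", len(masked), "masked")
```

A model trained on this data would see only `visible` and be asked to recover `targets` and the occupancy labels; the network itself is outside the scope of this sketch.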
Transforming Image Generation from Scene Graphs
Abstract
Generating images from semantic visual knowledge is a challenging task that can be useful to condition the synthesis process in complex, subtle, and unambiguous ways, compared to alternatives such as class labels or text descriptions. Although generative methods conditioned by semantic representations exist, they do not provide a way to control the generation process aside from the specification of constraints between objects. As an example, the possibility to iteratively generate or modify images by manually adding specific items is a desired property that, to our knowledge, has not been fully investigated in the literature. In this work we propose a transformer-based approach conditioned by scene graphs that, in contrast to recent transformer-based methods, also employs a decoder to autoregressively compose images, making the synthesis process more effective and controllable. The proposed architecture is composed of three modules: 1) a graph convolutional network, to encode the relationships of the input graph; 2) an encoder-decoder transformer, which autoregressively composes the output image; 3) an auto-encoder, employed to generate representations used as input/output of each generation step by the transformer. Results obtained on CIFAR10 and MNIST images show that our model is able to satisfy semantic constraints defined by a scene graph and to model relations between visual objects in the scene by taking into account a user-provided partial rendering of the desired target.
Is neural language acquisition similar to natural? A chronological probing study
Abstract
The probing methodology allows one to obtain a partial representation of linguistic phenomena stored in the inner layers of the neural network, using external classifiers and statistical analysis. Pre-trained transformer-based language models are widely used for both natural language understanding (NLU) and natural language generation (NLG) tasks, making them the most commonly used models for downstream applications. However, little analysis has been carried out on whether the models were pre-trained enough or contain knowledge correlated with linguistic theory. We present a chronological probing study of English transformer models such as MultiBERT and T5. We sequentially compare the information about the language learned by the models in the process of training on corpora. The results show that 1) linguistic information is acquired in the early stages of training, and 2) both language models demonstrate capabilities to capture various features from various levels of language, including morphology, syntax, and even discourse, while they also can inconsistently fail on tasks that are perceived as easy. We also introduce an open-source framework for chronological probing research, compatible with other transformer-based models: https://github.com/EkaterinaVoloshina/chronological_probing
Keyword: autonomous driving
TENET: Transformer Encoding Network for Effective Temporal Flow on Motion Prediction
Abstract
This technical report presents an effective method for motion prediction in autonomous driving. We develop a Transformer-based method for input encoding and trajectory prediction. In addition, we propose the Temporal Flow Header to enhance the trajectory encoding. Finally, an efficient K-means ensemble method is used. Using our Transformer network and ensemble method, we won first place in the Argoverse 2 Motion Forecasting Challenge with a state-of-the-art brier-minFDE score of 1.90.
Comprehensive Reactive Safety: No Need For A Trajectory If You Have A Strategy
Authors: Fang Da
Subjects: Robotics (cs.RO); Multiagent Systems (cs.MA)
Abstract
Safety guarantees in motion planning for autonomous driving typically involve certifying the trajectory to be collision-free under any motion of the uncontrollable participants in the environment, such as the human-driven vehicles on the road. As a result they usually employ a conservative bound on the behavior of such participants, such as reachability analysis. We point out that planning trajectories to rigorously avoid the entirety of the reachable regions is unnecessary and too restrictive, because observing the environment in the future will allow us to prune away most of them; disregarding this ability to react to future updates could prohibit solutions to scenarios that are easily navigated by human drivers. We propose to account for the autonomous vehicle's reactions to future environment changes by a novel safety framework, Comprehensive Reactive Safety. Validated in simulations in several urban driving scenarios such as unprotected left turns and lane merging, the resulting planning algorithm called Reactive ILQR demonstrates strong negotiation capabilities and better safety at the same time.
Safe Decision-making for Lane-change of Autonomous Vehicles via Human Demonstration-aided Reinforcement Learning
Abstract
Decision-making is critical for lane change in autonomous driving. Reinforcement learning (RL) algorithms aim to identify the values of behaviors in various situations, and thus they are a promising pathway to address the decision-making problem. However, poor runtime safety hinders the application of RL-based decision-making strategies to complex driving tasks in practice. To address this problem, human demonstrations are incorporated into the RL-based decision-making strategy in this paper. Decisions made by human subjects in a driving simulator are treated as safe demonstrations, which are stored in the replay buffer and then utilized to enhance the training process of RL. A complex lane change task in an off-ramp scenario is established to examine the performance of the developed strategy. Simulation results suggest that human demonstrations can effectively improve the safety of RL decisions. The proposed strategy also surpasses other existing learning-based decision-making strategies with respect to multiple driving performance measures.
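The abstract states only that human demonstrations are stored in the replay buffer and used during RL training. One possible way to arrange this is sketched below; the class name, the choice never to evict demonstrations, and the fixed demonstration fraction per batch are assumptions made for illustration rather than details from the paper.

```python
import random
from collections import deque


class DemoReplayBuffer:
    """Replay buffer that keeps human demonstrations alongside agent data."""

    def __init__(self, capacity, seed=None):
        self.demos = []                      # never evicted (an assumption)
        self.agent = deque(maxlen=capacity)  # oldest transitions dropped first
        self.rng = random.Random(seed)

    def add_demo(self, transition):
        self.demos.append(transition)

    def add_agent(self, transition):
        self.agent.append(transition)

    def sample(self, batch_size, demo_fraction=0.25):
        """Draw a batch with roughly `demo_fraction` demonstration samples."""
        n_demo = min(len(self.demos), int(batch_size * demo_fraction))
        n_agent = min(len(self.agent), batch_size - n_demo)
        batch = self.rng.sample(self.demos, n_demo)
        batch += self.rng.sample(list(self.agent), n_agent)
        self.rng.shuffle(batch)
        return batch


# Transitions are stand-in tuples; a real one would hold state, action, etc.
buffer = DemoReplayBuffer(capacity=50, seed=0)
for i in range(10):
    buffer.add_demo(("demo", i))
for i in range(100):
    buffer.add_agent(("agent", i))

batch = buffer.sample(batch_size=8)
print(batch)
```

Keeping demonstrations in a separate list means that safe human decisions stay available throughout training even as older agent transitions are overwritten.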
New submissions for Mon, 4 Jul 22
Keyword: SLAM
Deradicalizing YouTube: Characterization, Detection, and Personalization of Religiously Intolerant Arabic Videos
Keeping Less is More: Point Sparsification for Visual SLAM
A Survey on Active Simultaneous Localization and Mapping: State of the Art and New Frontiers
Keyword: odometry
There is no result
Keyword: livox
There is no result
Keyword: loam
There is no result
Keyword: lidar
LaserMix for Semi-Supervised LiDAR Semantic Segmentation
MMFN: Multi-Modal-Fusion-Net for End-to-End Driving
Keyword: loop detection
There is no result
Keyword: nerf
There is no result
Keyword: mapping
Facilitating Conversational Interaction in Natural Language Interfaces for Visualization
Keeping Less is More: Point Sparsification for Visual SLAM
A Survey on Active Simultaneous Localization and Mapping: State of the Art and New Frontiers
SDRTV-to-HDRTV via Hierarchical Dynamic Context Feature Mapping
Fine-grained Correlation Loss for Regression
Artificial Intelligence Techniques for Next-Generation Mega Satellite Networks
Keyword: localization
Data generation using simulation technology to improve perception mechanism of autonomous vehicles
Keeping Less is More: Point Sparsification for Visual SLAM
Point Cloud Change Detection With Stereo V-SLAM:Dataset, Metrics and Baseline
Computer-aided Tuberculosis Diagnosis with Attribute Reasoning Assistance
A Survey on Active Simultaneous Localization and Mapping: State of the Art and New Frontiers
Can we learn from developer mistakes? Learning to localize and repair real bugs from real bug fixes
Agent with Tangent-based Formulation and Anatomical Perception for Standard Plane Localization in 3D Ultrasound
Keyword: transformer
DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale
GaitForeMer: Self-Supervised Pre-Training of Transformers via Human Motion Forecasting for Few-Shot Gait Impairment Severity Estimation
Language model compression with weighted low-rank factorization
Rethinking Surgical Captioning: End-to-End Window-Based MLP Transformer Using Patches
TENET: Transformer Encoding Network for Effective Temporal Flow on Motion Prediction
MMFN: Multi-Modal-Fusion-Net for End-to-End Driving
Rethinking Query-Key Pairwise Interactions in Vision Transformers
Visual Transformer Meets CutMix for Improved Accuracy, Communication Efficiency, and Data Privacy in Split Learning
DALG: Deep Attentive Local and Global Modeling for Image Retrieval
TopicFM: Robust and Interpretable Feature Matching with Topic-assisted
ReLER@ZJU-Alibaba Submission to the Ego4D Natural Language Queries Challenge 2022
Conditional Generation with a Question-Answering Blueprint
Multi-task Driver Steering Behaviour Modeling Using Time-Series Transformer
Masked Autoencoders for Self-Supervised Learning on Automotive Point Clouds
Transforming Image Generation from Scene Graphs
Is neural language acquisition similar to natural? A chronological probing study
Keyword: autonomous driving
TENET: Transformer Encoding Network for Effective Temporal Flow on Motion Prediction
Comprehensive Reactive Safety: No Need For A Trajectory If You Have A Strategy
Safe Decision-making for Lane-change of Autonomous Vehicles via Human Demonstration-aided Reinforcement Learning