Abstract
We present mSLAM, a multilingual Speech and LAnguage Model that learns cross-lingual cross-modal representations of speech and text by pre-training jointly on large amounts of unlabeled speech and text in multiple languages. mSLAM combines w2v-BERT pre-training on speech with SpanBERT pre-training on character-level text, along with Connectionist Temporal Classification (CTC) losses on paired speech and transcript data, to learn a single model capable of learning from and representing both speech and text signals in a shared representation space. We evaluate mSLAM on several downstream speech understanding tasks and find that joint pre-training with text improves quality on speech translation, speech intent classification and speech language-ID while being competitive on multilingual ASR, when compared against speech-only pre-training. Our speech translation model demonstrates zero-shot text translation without seeing any text translation data, providing evidence for cross-modal alignment of representations. mSLAM also benefits from multi-modal fine-tuning, further improving the quality of speech translation by directly leveraging text translation data during the fine-tuning process. Our empirical analysis highlights several opportunities and challenges arising from large-scale multimodal pre-training, suggesting directions for future research.
Keyword: Visual inertial
There is no result
Keyword: livox
There is no result
Keyword: loam
There is no result
Keyword: Visual inertial odometry
There is no result
Keyword: lidar
There is no result
Keyword: loop detection
There is no result
Keyword: autonomous driving
PanoDepth: A Two-Stage Approach for Monocular Omnidirectional Depth Estimation
Authors: Yuyan Li, Zhixin Yan, Ye Duan, Liu Ren
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Omnidirectional 3D information is essential for a wide range of applications such as Virtual Reality, Autonomous Driving, Robotics, etc. In this paper, we propose a novel, model-agnostic, two-stage pipeline for omnidirectional monocular depth estimation. Our proposed framework PanoDepth takes one 360 image as input, produces one or more synthesized views in the first stage, and feeds the original image and the synthesized images into the subsequent stereo matching stage. In the second stage, we propose a differentiable Spherical Warping Layer to handle omnidirectional stereo geometry efficiently and effectively. By utilizing the explicit stereo-based geometric constraints in the stereo matching stage, PanoDepth can generate dense high-quality depth. We conducted extensive experiments and ablation studies to evaluate PanoDepth with both the full pipeline as well as the individual modules in each stage. Our results show that PanoDepth outperforms the state-of-the-art approaches by a large margin for 360 monocular depth estimation.
AI-as-a-Service Toolkit for Human-Centered Intelligence in Autonomous Driving
Authors: Valerio De Caro, Saira Bano, Achilles Machumilane, Alberto Gotta, Pietro Cassará, Antonio Carta, Christos Sardianos, Christos Chronis, Iraklis Varlamis, Konstantinos Tserpes, Vincenzo Lomonaco, Claudio Gallicchio, Davide Bacciu
Abstract
This paper presents a proof-of-concept implementation of the AI-as-a-service toolkit developed within the H2020 TEACHING project and designed to implement an autonomous driving personalization system according to the output of an automatic driver's stress recognition algorithm, both of them realizing a Cyber-Physical System of Systems. In addition, we implemented a data-gathering subsystem to collect data from different sensors, i.e., wearables and cameras, to automatize stress recognition. The system was attached for testing to a driving emulation software, CARLA, which allows testing the approach's feasibility with minimum cost and without putting at risk drivers and passengers. At the core of the relative subsystems, different learning algorithms were implemented using Deep Neural Networks, Recurrent Neural Networks, and Reinforcement Learning.
Keyword: mapping
VNE Solution for Network Differentiated QoS and Security Requirements: From the Perspective of Deep Reinforcement Learning
Abstract
The rapid development and deployment of network services has brought a series of challenges to researchers. On the one hand, the needs of Internet end users/applications reflect the characteristics of travel alienation, and they pursue different perspectives of service quality. On the other hand, with the explosive growth of information in the era of big data, a lot of private information is stored in the network. End users/applications naturally start to pay attention to network security. In order to solve the requirements of differentiated quality of service (QoS) and security, this paper proposes a virtual network embedding (VNE) algorithm based on deep reinforcement learning (DRL), aiming at the CPU, bandwidth, delay and security attributes of substrate network. DRL agent is trained in the network environment constructed by the above attributes. The purpose is to deduce the mapping probability of each substrate node and map the virtual node according to this probability. Finally, the breadth first strategy (BFS) is used to map the virtual links. In the experimental stage, the algorithm based on DRL is compared with other representative algorithms in three aspects: long term average revenue, long term revenue consumption ratio and acceptance rate. The results show that the algorithm proposed in this paper has achieved good experimental results, which proves that the algorithm can be effectively applied to solve the end user/application differentiated QoS and security requirements.
A multi-domain virtual network embedding algorithm with delay prediction
Abstract
Virtual network embedding (VNE) is an crucial part of network virtualization (NV), which aims to map the virtual networks (VNs) to a shared substrate network (SN). With the emergence of various delay-sensitive applications, how to improve the delay performance of the system has become a hot topic in academic circles. Based on extensive research, we proposed a multi-domain virtual network embedding algorithm based on delay prediction (DP-VNE). Firstly, the candidate physical nodes are selected by estimating the delay of virtual requests, then particle swarm optimization (PSO) algorithm is used to optimize the mapping process, so as to reduce the delay of the system. The simulation results show that compared with the other three advanced algorithms, the proposed algorithm can significantly reduce the system delay while keeping other indicators unaffected.
Keyword: localization
Training Semantic Descriptors for Image-Based Localization
Authors: Ibrahim Cinaroglu, Yalin Bastanlar
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Abstract
Vision based solutions for the localization of vehicles have become popular recently. We employ an image retrieval based visual localization approach. The database images are kept with GPS coordinates and the location of the retrieved database image serves as an approximate position of the query image. We show that localization can be performed via descriptors solely extracted from semantically segmented images. It is reliable especially when the environment is subjected to severe illumination and seasonal changes. Our experiments reveal that the localization performance of a semantic descriptor can increase up to the level of state-of-the-art RGB image based methods.
A semi-discrete numerical scheme for nonlocally regularized KdV-type equations
Abstract
A general class of KdV-type wave equations regularized with a convolution-type nonlocality in space is considered. The class differs from the class of the nonlinear nonlocal unidirectional wave equations previously studied by the addition of a linear convolution term involving third-order derivative. To solve the Cauchy problem we propose a semi-discrete numerical method based on a uniform spatial discretization, that is an extension of a previously published work of the present authors. We prove uniform convergence of the numerical method as the mesh size goes to zero. We also prove that the localization error resulting from localization to a finite domain is significantly less than a given threshold if the finite domain is large enough. To illustrate the theoretical results, some numerical experiments are carried out for the Rosenau-KdV equation, the Rosenau-BBM-KdV equation and a convolution-type integro-differential equation. The experiments conducted for three particular choices of the kernel function confirm the error estimates that we provide.
Feasibility of Interactive 3D Map for Remote Sighted Assistance
Authors: Jingyi Xie, Rui Yu, Sooyeon Lee, Yao Lyu, Syed Masum Billah, John M. Carroll
Abstract
Remote sighted assistance (RSA) has emerged as a conversational assistive technology, where remote sighted workers, i.e., agents, provide real-time assistance to users with vision impairments via video-chat-like communication. Researchers found that agents' lack of environmental knowledge, the difficulty of orienting users in their surroundings, and the inability to estimate distances from users' camera feeds are key challenges to sighted agents. To address these challenges, researchers have suggested assisting agents with computer vision technologies, especially 3D reconstruction. This paper presents a high-fidelity prototype of such an RSA, where agents use interactive 3D maps with localization capability. We conducted a walkthrough study with thirteen agents and one user with simulated vision impairment using this prototype. The study revealed that, compared to baseline RSA, the agents were significantly faster in providing navigational assistance to users, and their mental workload was significantly reduced -- all indicate the feasibility and prospect of 3D maps in RSA.
Keyword: SLAM
mSLAM: Massively multilingual joint pre-training for speech and text
Keyword: Visual inertial
There is no result
Keyword: livox
There is no result
Keyword: loam
There is no result
Keyword: Visual inertial odometry
There is no result
Keyword: lidar
There is no result
Keyword: loop detection
There is no result
Keyword: autonomous driving
PanoDepth: A Two-Stage Approach for Monocular Omnidirectional Depth Estimation
AI-as-a-Service Toolkit for Human-Centered Intelligence in Autonomous Driving
Keyword: mapping
VNE Solution for Network Differentiated QoS and Security Requirements: From the Perspective of Deep Reinforcement Learning
A multi-domain virtual network embedding algorithm with delay prediction
Keyword: localization
Training Semantic Descriptors for Image-Based Localization
A semi-discrete numerical scheme for nonlocally regularized KdV-type equations
Feasibility of Interactive 3D Map for Remote Sighted Assistance