Keyword: livox
There is no result
Keyword: loam
There is no result
Keyword: lidar
How Does Data Freshness Affect Real-time Supervised Learning?
Authors: Md Kamran Chowdhury Shisher, Yin Sun
Subjects: Networking and Internet Architecture (cs.NI); Information Theory (cs.IT); Machine Learning (cs.LG)
Abstract
In this paper, we analyze the impact of data freshness on real-time supervised learning, where a neural network is trained to infer a time-varying target (e.g., the position of the vehicle in front) based on features (e.g., video frames) observed at a sensing node (e.g., camera or lidar). One might expect that the performance of real-time supervised learning degrades monotonically as the feature becomes stale. Using an information-theoretic analysis, we show that this is true if the feature and target data sequence can be closely approximated as a Markov chain; it is not true if the data sequence is far from Markovian. Hence, the prediction error of real-time supervised learning is a function of the Age of Information (AoI), where the function could be non-monotonic. Several experiments are conducted to illustrate the monotonic and non-monotonic behaviors of the prediction error. To minimize the inference error in real time, we propose a new "selection-from-buffer" model for sending the features, which is more general than the "generate-at-will" model used in earlier studies. By using Gittins and Whittle indices, low-complexity scheduling strategies are developed to minimize the inference error, where a new connection between Gittins index theory and AoI minimization is discovered. These scheduling results hold (i) for minimizing general AoI functions (monotonic or non-monotonic) and (ii) for general feature transmission time distributions. Data-driven evaluations are presented to illustrate the benefits of the proposed scheduling algorithms.
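For reference, the age of information quantity the abstract refers to has the standard definition below; the error function $f$ is our shorthand for the paper's claim, not its notation.

```latex
\Delta(t) = t - U(t), \qquad \text{inference error at time } t \approx f\big(\Delta(t)\big),
```

where $U(t)$ is the generation time of the freshest feature delivered by time $t$, and $f$ is guaranteed to be non-decreasing only when the feature-target sequence is (approximately) Markov.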
Look Both Ways: Bidirectional Visual Sensing for Automatic Multi-Camera Registration
Abstract
This work describes the automatic registration of a large network (approximately 40) of fixed, ceiling-mounted environment cameras spread over a large area (approximately 800 square meters) using a mobile calibration robot equipped with a single upward-facing fisheye camera and a backlit ArUco marker for easy detection. The fisheye camera is used to perform visual odometry (VO), and the ArUco marker facilitates easy detection of the calibration robot in the environment cameras. In addition, the fisheye camera can also detect the environment cameras. This bidirectional detection constrains the poses of the environment cameras, which are then solved for in an optimization problem. Such an approach can be used to automatically register a large-scale multi-camera system used for surveillance, automated parking, or robotic applications. This VO-based multi-camera registration method is extensively validated using real-world experiments and is also compared against a similar approach that uses a LiDAR, an expensive, heavier, and power-hungry sensor.
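The geometry behind the bidirectional constraint can be sketched with homogeneous transforms: VO gives the robot pose in the world, the environment camera observes the marker rigidly mounted on the robot, and chaining the two implies the camera pose. The snippet below is a minimal illustration under assumed frame conventions (all function and variable names are hypothetical), not the paper's pipeline or optimizer.

```python
import numpy as np

def camera_pose_from_detection(T_world_robot, T_robot_marker, T_cam_marker):
    """One bidirectional-detection constraint.

    T_world_robot : 4x4 robot pose in the world frame (from fisheye VO)
    T_robot_marker: 4x4 marker pose in the robot frame (fixed by construction)
    T_cam_marker  : 4x4 marker pose in the environment-camera frame (ArUco detection)

    Returns the implied 4x4 environment-camera pose in the world frame.
    """
    return T_world_robot @ T_robot_marker @ np.linalg.inv(T_cam_marker)

def fuse_estimates(estimates):
    """Crude fusion of several per-detection estimates (translation only);
    a real system would instead run a joint nonlinear least-squares over all cameras."""
    return np.mean([T[:3, 3] for T in estimates], axis=0)
```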
Online Pole Segmentation on Range Images for Long-term LiDAR Localization in Urban Environments
Authors: Hao Dong, Xieyuanli Chen, Simo Särkkä, Cyrill Stachniss
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Abstract
Robust and accurate localization is a basic requirement for mobile autonomous systems. Pole-like objects, such as traffic signs, poles, and lamps are frequently used landmarks for localization in urban environments due to their local distinctiveness and long-term stability. In this paper, we present a novel, accurate, and fast pole extraction approach based on geometric features that runs online and has low computational demands. Our method performs all computations directly on range images generated from 3D LiDAR scans, which avoids processing 3D point clouds explicitly and enables fast pole extraction for each scan. We further use the extracted poles as pseudo labels to train a deep neural network for online range image-based pole segmentation. We test both our geometric and learning-based pole extraction methods for localization on different datasets with different LiDAR scanners, routes, and seasonal changes. The experimental results show that our methods outperform other state-of-the-art approaches. Moreover, boosted with pseudo pole labels extracted from multiple datasets, our learning-based method can run across different datasets and achieve even better localization results compared to our geometry-based method. We released our pole datasets to the public for evaluating the performance of pole extractors, as well as the implementation of our approach.
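Working on range images rather than raw point clouds rests on a standard spherical projection; a generic numpy version is sketched below (the field-of-view values are placeholders for a 64-beam spinning LiDAR, not the paper's settings).

```python
import numpy as np

def pointcloud_to_range_image(points, h=64, w=1024, fov_up_deg=2.0, fov_down_deg=-24.8):
    """Project an (N, 3) LiDAR point cloud into an (h, w) range image."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1) + 1e-9
    yaw = np.arctan2(y, x)                    # azimuth in [-pi, pi]
    pitch = np.arcsin(z / r)                  # elevation angle

    fov_up, fov_down = np.deg2rad(fov_up_deg), np.deg2rad(fov_down_deg)
    u = 0.5 * (1.0 - yaw / np.pi) * w                         # column from azimuth
    v = (1.0 - (pitch - fov_down) / (fov_up - fov_down)) * h  # row from elevation
    u = np.clip(np.floor(u), 0, w - 1).astype(int)
    v = np.clip(np.floor(v), 0, h - 1).astype(int)

    img = np.full((h, w), -1.0, dtype=np.float32)
    img[v, u] = r                             # last write wins per pixel
    return img
```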
Keyword: loop detection
There is no result
Keyword: nerf
UPST-NeRF: Universal Photorealistic Style Transfer of Neural Radiance Fields for 3D Scene
Authors: Yaosen Chen, Qi Yuan, Zhiqiang Li, Yuegen Liu, Wei Wang, Chaoping Xie, Xuming Wen, Qien Yu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Photorealistic stylization of 3D scenes aims to generate photorealistic images from arbitrary novel views according to a given style image, while ensuring consistency when rendering from different viewpoints. Some existing stylization methods with neural radiance fields can effectively predict stylized scenes by combining the features of the style image with multi-view images to train 3D scenes. However, these methods generate novel-view images that contain objectionable artifacts. Besides, they cannot achieve universal photorealistic stylization for a 3D scene: each new style image requires retraining the 3D scene representation network based on a neural radiance field. We propose a novel 3D scene photorealistic style transfer framework to address these issues. It can realize photorealistic 3D scene style transfer with a 2D style image. We first pre-train a 2D photorealistic style transfer network, which can handle photorealistic style transfer between any given content image and style image. Then, we use voxel features to optimize a 3D scene and obtain the geometric representation of the scene. Finally, we jointly optimize a hyper network to realize photorealistic style transfer of the scene for arbitrary style images. In the transfer stage, we use the pre-trained 2D photorealistic network to constrain the photorealistic style of different views and different style images in the 3D scene. The experimental results show that our method not only realizes 3D photorealistic style transfer for arbitrary style images but also outperforms existing methods in terms of visual quality and consistency. Project page: https://semchan.github.io/UPST_NeRF.
DM-NeRF: 3D Scene Geometry Decomposition and Manipulation from 2D Images
Abstract
In this paper, we study the problem of 3D scene geometry decomposition and manipulation from 2D views. By leveraging the recent implicit neural representation techniques, particularly the appealing neural radiance fields, we introduce an object field component to learn unique codes for all individual objects in 3D space only from 2D supervision. The key to this component is a series of carefully designed loss functions to enable every 3D point, especially in non-occupied space, to be effectively optimized even without 3D labels. In addition, we introduce an inverse query algorithm to freely manipulate any specified 3D object shape in the learned scene representation. Notably, our manipulation algorithm can explicitly tackle key issues such as object collisions and visual occlusions. Our method, called DM-NeRF, is among the first to simultaneously reconstruct, decompose, manipulate and render complex 3D scenes in a single pipeline. Extensive experiments on three datasets clearly show that our method can accurately decompose all 3D objects from 2D views, allowing any interested object to be freely manipulated in 3D space such as translation, rotation, size adjustment, and deformation.
Keyword: mapping
Revisiting Adversarial Attacks on Graph Neural Networks for Graph Classification
Abstract
Graph neural networks (GNNs) have achieved tremendous success in the task of graph classification and diverse downstream real-world applications. Despite their success, existing approaches are either limited to structure attacks or restricted to local information. This calls for a more general attack framework on graph classification, which faces significant challenges due to the complexity of generating local-node-level adversarial examples using the global-graph-level information. To address this "global-to-local" problem, we present a general framework CAMA to generate adversarial examples by manipulating graph structure and node features in a hierarchical style. Specifically, we make use of Graph Class Activation Mapping and its variant to produce node-level importance corresponding to the graph classification task. Then through a heuristic design of algorithms, we can perform both feature and structure attacks under unnoticeable perturbation budgets with the help of both node-level and subgraph-level importance. Experiments towards attacking four state-of-the-art graph classification models on six real-world benchmarks verify the flexibility and effectiveness of our framework.
IPvSeeYou: Exploiting Leaked Identifiers in IPv6 for Street-Level Geolocation
Authors: Erik Rye, Robert Beverly
Subjects: Networking and Internet Architecture (cs.NI); Cryptography and Security (cs.CR)
Abstract
We present IPvSeeYou, a privacy attack that permits a remote and unprivileged adversary to physically geolocate many residential IPv6 hosts and networks with street-level precision. The crux of our method involves: 1) remotely discovering wide area network (WAN) hardware MAC addresses from home routers; 2) correlating these MAC addresses with their WiFi BSSID counterparts of known location; and 3) extending coverage by associating devices connected to a common penultimate provider router. We first obtain a large corpus of MACs embedded in IPv6 addresses via high-speed network probing. These MAC addresses are effectively leaked up the protocol stack and largely represent WAN interfaces of residential routers, many of which are all-in-one devices that also provide WiFi. We develop a technique to statistically infer the mapping between a router's WAN and WiFi MAC addresses across manufacturers and devices, and mount a large-scale data fusion attack that correlates WAN MACs with WiFi BSSIDs available in wardriving (geolocation) databases. Using these correlations, we geolocate the IPv6 prefixes of $>$12M routers in the wild across 146 countries and territories. Selected validation confirms a median geolocation error of 39 meters. We then exploit technology and deployment constraints to extend the attack to a larger set of IPv6 residential routers by clustering and associating devices with a common penultimate provider router. While we responsibly disclosed our results to several manufacturers and providers, the ossified ecosystem of deployed residential cable and DSL routers suggests that our attack will remain a privacy threat into the foreseeable future.
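The first step exploits the EUI-64 encoding, in which a 48-bit MAC is embedded in the lower 64 bits of an IPv6 address with ff:fe inserted in the middle and the universal/local bit flipped. A minimal decoder (independent of the authors' tooling) is:

```python
import ipaddress

def mac_from_eui64(ipv6_str):
    """Recover a MAC address from an EUI-64-derived IPv6 interface identifier,
    or return None if the address does not look EUI-64-derived."""
    iid = ipaddress.IPv6Address(ipv6_str).packed[8:]   # lower 64 bits
    if iid[3] != 0xFF or iid[4] != 0xFE:               # EUI-64 marker bytes
        return None
    mac = bytearray(iid[0:3] + iid[5:8])
    mac[0] ^= 0x02                                     # undo the universal/local bit flip
    return ":".join(f"{b:02x}" for b in mac)

# Example with a documentation-prefix address; the IID maps back to 00:11:22:33:44:55.
print(mac_from_eui64("2001:db8::211:22ff:fe33:4455"))
```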
HDR-Plenoxels: Self-Calibrating High Dynamic Range Radiance Fields
Authors: Kim Jun-Seong, Kim Yu-Ji, Moon Ye-Bin, Tae-Hyun Oh
Abstract
We propose high dynamic range (HDR) radiance fields, HDR-Plenoxels, that learn a plenoptic function of 3D HDR radiance fields, geometry information, and varying camera settings inherent in 2D low dynamic range (LDR) images. Our voxel-based volume rendering pipeline reconstructs HDR radiance fields with only multi-view LDR images taken from varying camera settings in an end-to-end manner and has a fast convergence speed. To deal with various cameras in real-world scenarios, we introduce a tone mapping module that models the digital in-camera imaging pipeline (ISP) and disentangles radiometric settings. Our tone mapping module allows us to render by controlling the radiometric settings of each novel view. Finally, we build a multi-view dataset with varying camera conditions, which fits our problem setting. Our experiments show that HDR-Plenoxels can express detail and high-quality HDR novel views from only LDR images with various cameras.
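For intuition, a (non-learned) stand-in for such a tone-mapping module is an exposure-plus-gamma curve applied to linear radiance; HDR-Plenoxels instead learns a differentiable, per-view radiometric mapping in place of this fixed curve.

```python
import numpy as np

def toy_tone_map(hdr, exposure=1.0, gamma=2.2):
    """Map nonnegative linear HDR radiance to LDR pixels with exposure and gamma;
    a toy illustration only, not the paper's learned module."""
    return np.clip((hdr * exposure) ** (1.0 / gamma), 0.0, 1.0)
```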
Block Orthogonal Sparse Superposition Codes for Ultra-Reliable Low-Latency Communications
Authors: Donghwa Han, Jeonghun Park, Youngjoo Lee, H. Vincent Poor, Namyoon Lee
Abstract
Low-rate and short-packet transmissions are important for ultra-reliable low-latency communications (URLLC). In this paper, we put forth a new family of sparse superposition codes for URLLC, called block orthogonal sparse superposition (BOSS) codes. We first present a code construction method for the efficient encoding of BOSS codes. The key idea is to construct codewords by the superposition of the orthogonal columns of a dictionary matrix with a sequential bit mapping strategy. We also propose an approximate maximum a posteriori probability (MAP) decoder with two stages. The approximate MAP decoder reduces the decoding latency significantly via a parallel decoding structure while maintaining a comparable decoding complexity to the successive cancellation list (SCL) decoder of polar codes. Furthermore, to gauge the code performance in the finite-blocklength regime, we derive an exact analytical expression for block-error rates (BLERs) for single-layered BOSS codes in terms of relevant code parameters. Lastly, we present a cyclic redundancy check aided-BOSS (CA-BOSS) code with simple list decoding to boost the code performance. Our experiments verify that CA-BOSS with the simple list decoder outperforms CA-polar codes with SCL decoding in the low-rate and finite-blocklength regimes while achieving the finite-blocklength capacity upper bound within one dB of signal-to-noise ratio.
Global Priors Guided Modulation Network for Joint Super-Resolution and Inverse Tone-Mapping
Authors: Gang He, Shaoyi Long, Li Xu, Chang Wu, Jinjia Zhou, Ming Sun, Xing Wen, Yurong Dai
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Abstract
Joint super-resolution and inverse tone-mapping (SR-ITM) aims to enhance the visual quality of videos that have quality deficiencies in resolution and dynamic range. This problem arises when using 4K high dynamic range (HDR) TVs to watch a low-resolution standard dynamic range (LR SDR) video. Previous methods that rely on learning local information typically cannot do well in preserving color conformity and long-range structural similarity, resulting in unnatural color transition and texture artifacts. In order to tackle these challenges, we propose a global priors guided modulation network (GPGMNet) for joint SR-ITM. In particular, we design a global priors extraction module (GPEM) to extract color conformity prior and structural similarity prior that are beneficial for ITM and SR tasks, respectively. To further exploit the global priors and preserve spatial information, we devise multiple global priors guided spatial-wise modulation blocks (GSMBs) with a few parameters for intermediate feature modulation, in which the modulation parameters are generated by the shared global priors and the spatial features map from the spatial pyramid convolution block (SPCB). With these elaborate designs, the GPGMNet can achieve higher visual quality with lower computational complexity. Extensive experiments demonstrate that our proposed GPGMNet is superior to the state-of-the-art methods. Specifically, our proposed model exceeds the state-of-the-art by 0.64 dB in PSNR, with 69$\%$ fewer parameters and 3.1$\times$ speedup. The code will be released soon.
Visual Localization via Few-Shot Scene Region Classification
Authors: Siyan Dong, Shuzhe Wang, Yixin Zhuang, Juho Kannala, Marc Pollefeys, Baoquan Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Visual (re)localization addresses the problem of estimating the 6-DoF (Degree of Freedom) camera pose of a query image captured in a known scene, which is a key building block of many computer vision and robotics applications. Recent advances in structure-based localization solve this problem by memorizing the mapping from image pixels to scene coordinates with neural networks to build 2D-3D correspondences for camera pose optimization. However, such memorization requires training on large amounts of posed images in each scene, which is costly and inefficient. In contrast, few-shot images are usually sufficient to cover the main regions of a scene for a human operator to perform visual localization. In this paper, we propose a scene region classification approach to achieve fast and effective scene memorization with few-shot images. Our insight is to leverage a) a pre-learned feature extractor, b) a scene region classifier, and c) a meta-learning strategy to accelerate training while mitigating overfitting. We evaluate our method on both indoor and outdoor benchmarks. The experiments validate the effectiveness of our method in the few-shot setting, and the training time is significantly reduced to only a few minutes. Code available at: \url{https://github.com/siyandong/SRC}
MACE: Multi-Agent Autonomous Collaborative Exploration of Unknown Environments
Abstract
In this paper, we propose a new framework for multi-agent collaborative exploration of unknown environments. The proposed method combines state-of-the-art algorithms in mapping, safe corridor generation, and multi-agent planning. It first takes a volume that we want to explore, then proceeds to give the multiple agents different goals in order to explore a voxel grid of that volume. The exploration ends when all voxels are discovered as free or occupied, or when no path is found to the remaining undiscovered voxels. The state-of-the-art planning algorithm uses time-aware Safe Corridors to guarantee intra-agent collision safety as well as safety from static obstacles. The presented approach is tested in a state-of-the-art simulator for up to 4 agents.
Automatic Landmark Detection and Registration of Brain Cortical Surfaces via Quasi-Conformal Geometry and Convolutional Neural Networks
Authors: Yuchen Guo, Qiguang Chen, Gary P. T. Choi, Lok Ming Lui
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Abstract
In medical imaging, surface registration is extensively used for performing systematic comparisons between anatomical structures, with a prime example being the highly convoluted brain cortical surfaces. To obtain a meaningful registration, a common approach is to identify prominent features on the surfaces and establish a low-distortion mapping between them with the feature correspondence encoded as landmark constraints. Prior registration works have primarily focused on using manually labeled landmarks and solving highly nonlinear optimization problems, which are time-consuming and hence hinder practical applications. In this work, we propose a novel framework for the automatic landmark detection and registration of brain cortical surfaces using quasi-conformal geometry and convolutional neural networks. We first develop a landmark detection network (LD-Net) that allows for the automatic extraction of landmark curves given two prescribed starting and ending points based on the surface geometry. We then utilize the detected landmarks and quasi-conformal theory for achieving the surface registration. Specifically, we develop a coefficient prediction network (CP-Net) for predicting the Beltrami coefficients associated with the desired landmark-based registration and a mapping network called the disk Beltrami solver network (DBS-Net) for generating quasi-conformal mappings from the predicted Beltrami coefficients, with the bijectivity guaranteed by quasi-conformal theory. Experimental results are presented to demonstrate the effectiveness of our proposed framework. Altogether, our work paves a new way for surface-based morphometry and medical shape analysis.
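For readers unfamiliar with the quasi-conformal machinery: a mapping $f$ is quasi-conformal when it satisfies the Beltrami equation below, and keeping the coefficient's supremum norm below one is what guarantees the bijectivity mentioned above.

```latex
\frac{\partial f}{\partial \bar{z}} = \mu(z)\,\frac{\partial f}{\partial z},
\qquad \lVert \mu \rVert_{\infty} < 1 .
```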
Abstract
Heuristic search has traditionally relied on hand-crafted or programmatically derived heuristics. Neural networks (NNs) are newer, powerful tools that can be used to learn complex mappings from states to cost-to-go heuristics. However, their slow single-inference time is a large overhead that can substantially slow down planning time in optimized heuristic search implementations. Several recent works have described ways to take advantage of NNs' batch computations to decrease overhead in planning, while retaining bounds on (sub)optimality. However, all these methods have used the NN heuristic in a "blocking" manner while building up their batches, and have ignored possible fast-to-compute admissible heuristics (e.g. existing classically derived heuristics) that are usually available to use. We introduce Non-Blocking Batch A* (NBBA*), a bounded suboptimal method which lazily computes the NN heuristic in batches while allowing expansions informed by a non-NN heuristic. We show how this subtle but important change can lead to substantial reductions in expansions compared to the current blocking alternative, and see that the performance is related to the information difference between the batch-computed NN heuristic and the fast non-NN heuristic.
LCSM: A Lightweight Complex Spectral Mapping Framework for Stereophonic Acoustic Echo Cancellation
Abstract
Traditional adaptive algorithms face the non-uniqueness problem when dealing with stereophonic acoustic echo cancellation (SAEC). In this paper, we first propose an efficient multi-input and multi-output (MIMO) scheme based on deep learning to filter out echoes from all microphone signals at once. Then, we employ a lightweight complex spectral mapping framework (LCSM) for end-to-end SAEC without decorrelation preprocessing of the loudspeaker signals. Inplace convolution and channel-wise spatial modeling are utilized to ensure that the near-end signal information is preserved. Finally, a cross-domain loss function is designed for better generalization capability. Experiments are evaluated on a variety of untrained conditions and results demonstrate that the LCSM significantly outperforms previous methods. Moreover, the proposed causal framework has only 0.55 million parameters, far fewer than similar deep learning-based methods, which is important for resource-limited devices.
Keyword: localization
Finding Point with Image: An End-to-End Benchmark for Vision-based UAV Localization
Abstract
In the past, image retrieval was the mainstream solution for cross-view geolocation and UAV visual localization tasks. In a nutshell, image retrieval obtains the final required information, such as GPS coordinates, through a transitional perspective. However, image retrieval is not completely end-to-end, and it involves redundant operations, such as preparing the feature library in advance and choosing the sampling interval for gallery construction, which make large-scale applications difficult to implement. In this article we propose an end-to-end positioning scheme, Finding Point with Image (FPI), which aims to directly find the corresponding location in an image from source B (satellite view) given an image from source A (drone view). To verify the feasibility of our framework, we construct a new dataset (UL14) designed for the UAV visual self-localization task. We also build a transformer-based baseline to achieve end-to-end training. In addition, previous evaluation methods are no longer applicable under the FPI framework; thus, Metre-level Accuracy (MA) and Relative Distance Score (RDS) are proposed to evaluate the accuracy of UAV localization. A preliminary comparison between FPI and image retrieval methods shows that FPI achieves better performance in both speed and efficiency. Nevertheless, the FPI task remains highly challenging due to the large differences between views and the drastic spatial scale transformation.
Visual Localization via Few-Shot Scene Region Classification
Authors: Siyan Dong, Shuzhe Wang, Yixin Zhuang, Juho Kannala, Marc Pollefeys, Baoquan Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Visual (re)localization addresses the problem of estimating the 6-DoF (Degree of Freedom) camera pose of a query image captured in a known scene, which is a key building block of many computer vision and robotics applications. Recent advances in structure-based localization solve this problem by memorizing the mapping from image pixels to scene coordinates with neural networks to build 2D-3D correspondences for camera pose optimization. However, such memorization requires training on large amounts of posed images in each scene, which is costly and inefficient. In contrast, few-shot images are usually sufficient to cover the main regions of a scene for a human operator to perform visual localization. In this paper, we propose a scene region classification approach to achieve fast and effective scene memorization with few-shot images. Our insight is to leverage a) a pre-learned feature extractor, b) a scene region classifier, and c) a meta-learning strategy to accelerate training while mitigating overfitting. We evaluate our method on both indoor and outdoor benchmarks. The experiments validate the effectiveness of our method in the few-shot setting, and the training time is significantly reduced to only a few minutes. Code available at: \url{https://github.com/siyandong/SRC}
Terrain-based vehicle localization using an active suspension system
Authors: Yu Jiang, John Eisenmann, William Graves, Vijayaraghavan Sridhar, Zackary Anderson
Abstract
This paper, for the first time, presents a terrain-based localization approach using sensor data from an active suspension system. The contribution is four-fold. First, it is shown that a location dependent road height profile can be created from sensor data of the active suspension system. Second, an algorithm is developed to extract a pitch profile from the road height profile data. The ideal pitch profile is vehicle-independent and only depends on the road. This pitch profile generated from an on-board computer is matched with a known terrain map to achieve real-time positioning. Third, a crowd-sourced map creation algorithm is developed to create and improve the terrain map that contains pitch profile. Fourth, experiments have been conducted to validate the accuracy and robustness of the proposed localization approach.
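The map-matching step can be pictured as a 1D template search of the measured pitch profile along the stored map profile; the correlation-based sketch below is an illustration, not the authors' algorithm.

```python
import numpy as np

def locate_on_map(map_pitch, measured_pitch):
    """Slide a measured pitch profile along a map pitch profile and return the
    sample offset with the highest normalized correlation, plus its score."""
    m = (measured_pitch - measured_pitch.mean()) / (measured_pitch.std() + 1e-9)
    best_offset, best_score = 0, -np.inf
    for off in range(len(map_pitch) - len(m) + 1):
        w = map_pitch[off:off + len(m)]
        w = (w - w.mean()) / (w.std() + 1e-9)
        score = float(np.dot(m, w)) / len(m)
        if score > best_score:
            best_offset, best_score = off, score
    return best_offset, best_score
```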
Online Pole Segmentation on Range Images for Long-term LiDAR Localization in Urban Environments
Authors: Hao Dong, Xieyuanli Chen, Simo Särkkä, Cyrill Stachniss
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Abstract
Robust and accurate localization is a basic requirement for mobile autonomous systems. Pole-like objects, such as traffic signs, poles, and lamps are frequently used landmarks for localization in urban environments due to their local distinctiveness and long-term stability. In this paper, we present a novel, accurate, and fast pole extraction approach based on geometric features that runs online and has low computational demands. Our method performs all computations directly on range images generated from 3D LiDAR scans, which avoids processing 3D point clouds explicitly and enables fast pole extraction for each scan. We further use the extracted poles as pseudo labels to train a deep neural network for online range image-based pole segmentation. We test both our geometric and learning-based pole extraction methods for localization on different datasets with different LiDAR scanners, routes, and seasonal changes. The experimental results show that our methods outperform other state-of-the-art approaches. Moreover, boosted with pseudo pole labels extracted from multiple datasets, our learning-based method can run across different datasets and achieve even better localization results compared to our geometry-based method. We released our pole datasets to the public for evaluating the performance of pole extractors, as well as the implementation of our approach.
Keyword: transformer
LM-CORE: Language Models with Contextually Relevant External Knowledge
Abstract
Large transformer-based pre-trained language models have achieved impressive performance on a variety of knowledge-intensive tasks and can capture factual knowledge in their parameters. We argue that storing large amounts of knowledge in the model parameters is sub-optimal given the ever-growing amounts of knowledge and resource requirements. We posit that a more efficient alternative is to provide explicit access to contextually relevant structured knowledge to the model and train it to use that knowledge. We present LM-CORE -- a general framework to achieve this -- that allows \textit{decoupling} of the language model training from the external knowledge source and allows the latter to be updated without affecting the already trained model. Experimental results show that LM-CORE, having access to external knowledge, achieves significant and robust outperformance over state-of-the-art knowledge-enhanced language models on knowledge probing tasks; can effectively handle knowledge updates; and performs well on two downstream tasks. We also present a thorough error analysis highlighting the successes and failures of LM-CORE.
Finding Point with Image: An End-to-End Benchmark for Vision-based UAV Localization
Abstract
In the past, image retrieval was the mainstream solution for cross-view geolocation and UAV visual localization tasks. In a nutshell, image retrieval obtains the final required information, such as GPS coordinates, through a transitional perspective. However, image retrieval is not completely end-to-end, and it involves redundant operations, such as preparing the feature library in advance and choosing the sampling interval for gallery construction, which make large-scale applications difficult to implement. In this article we propose an end-to-end positioning scheme, Finding Point with Image (FPI), which aims to directly find the corresponding location in an image from source B (satellite view) given an image from source A (drone view). To verify the feasibility of our framework, we construct a new dataset (UL14) designed for the UAV visual self-localization task. We also build a transformer-based baseline to achieve end-to-end training. In addition, previous evaluation methods are no longer applicable under the FPI framework; thus, Metre-level Accuracy (MA) and Relative Distance Score (RDS) are proposed to evaluate the accuracy of UAV localization. A preliminary comparison between FPI and image retrieval methods shows that FPI achieves better performance in both speed and efficiency. Nevertheless, the FPI task remains highly challenging due to the large differences between views and the drastic spatial scale transformation.
GEDI: A Graph-based End-to-end Data Imputation Framework
Authors: Katrina Chen, Xiuqin Liang, Zhibin Zhang, Zheng Ma
Abstract
Data imputation is an effective way to handle missing data, which is common in practical applications. In this study, we propose and test a novel data imputation process that achieves two important goals: (1) preserve the row-wise similarities among observations and the column-wise contextual relationships among features in the feature matrix, and (2) tailor the imputation process to the specific downstream label prediction task. The proposed imputation process uses a Transformer network and graph structure learning to iteratively refine the contextual relationships among features and similarities among observations. Moreover, it uses a meta-learning framework to select features that are influential to the downstream prediction task of interest. We conduct experiments on large real-world datasets, and show that the proposed imputation process consistently improves imputation and label prediction performance over a variety of benchmark methods.
Enhanced Vehicle Re-identification for ITS: A Feature Fusion approach using Deep Learning
Authors: Ashutosh Holla B, Manohara Pai M.M, Ujjwal Verma, Radhika M. Pai
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
In recent years, the development of robust intelligent transportation systems (ITS) has been pursued across the globe to provide better traffic efficiency by reducing frequent traffic problems. As an application of ITS, vehicle re-identification has gained ample interest in the domains of computer vision and robotics. Convolutional neural network (CNN) based methods have been developed to perform vehicle re-identification and to address key challenges such as occlusion, illumination change, and scale. The advancement of transformers in computer vision has opened an opportunity to explore the re-identification process further to enhance performance. In this paper, a framework is developed to perform the re-identification of vehicles across CCTV cameras. To perform re-identification, the proposed framework fuses the vehicle representations learned using a CNN and a transformer model. The framework is tested on a dataset that contains 81 unique vehicle identities observed across 20 CCTV cameras. From the experiments, the fused vehicle re-identification framework yields an mAP of 61.73%, which is significantly better than that of the standalone CNN or transformer model.
Interpreting BERT-based Text Similarity via Activation and Saliency Maps
Authors: Itzik Malkiel, Dvir Ginzburg, Oren Barkan, Avi Caciularu, Jonathan Weill, Noam Koenigstein
Abstract
Recently, there has been growing interest in the ability of Transformer-based models to produce meaningful embeddings of text with several applications, such as text similarity. Despite significant progress in the field, the explanations for similarity predictions remain challenging, especially in unsupervised settings. In this work, we present an unsupervised technique for explaining paragraph similarities inferred by pre-trained BERT models. By looking at a pair of paragraphs, our technique identifies important words that dictate each paragraph's semantics, matches between the words in both paragraphs, and retrieves the most important pairs that explain the similarity between the two. The method, which has been assessed by extensive human evaluations and demonstrated on datasets comprising long and complex paragraphs, has shown great promise, providing accurate interpretations that correlate better with human perceptions.
Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models
Authors: Xingyu Xie, Pan Zhou, Huan Li, Zhouchen Lin, Shuicheng Yan
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Abstract
Adaptive gradient algorithms borrow the moving average idea of heavy ball acceleration to estimate accurate first- and second-order moments of gradient for accelerating convergence. However, Nesterov acceleration which converges faster than heavy ball acceleration in theory and also in many empirical cases is much less investigated under the adaptive gradient setting. In this work, we propose the ADAptive Nesterov momentum algorithm, Adan for short, to effectively speed up the training of deep neural networks. Adan first reformulates the vanilla Nesterov acceleration to develop a new Nesterov momentum estimation (NME) method, which avoids the extra computation and memory overhead of computing gradient at the extrapolation point. Then Adan adopts NME to estimate the first- and second-order moments of the gradient in adaptive gradient algorithms for convergence acceleration. Besides, we prove that Adan finds an $\epsilon$-approximate first-order stationary point within $O(\epsilon^{-3.5})$ stochastic gradient complexity on the nonconvex stochastic problems (e.g., deep learning problems), matching the best-known lower bound. Extensive experimental results show that Adan surpasses the corresponding SoTA optimizers on both vision transformers (ViTs) and CNNs, and sets new SoTAs for many popular networks, e.g., ResNet, ConvNext, ViT, Swin, MAE, LSTM, Transformer-XL, and BERT. More surprisingly, Adan can use half of the training cost (epochs) of SoTA optimizers to achieve higher or comparable performance on ViT, ResNet, etc., and also shows great tolerance to a large range of minibatch sizes, e.g., from 1k to 32k. We hope Adan can contribute to the development of deep learning by reducing training cost and relieving the engineering burden of trying different optimizers on various architectures. Code will be released at https://github.com/sail-sg/Adan.
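Schematically, the update combines a first moment, a gradient-difference moment, and an adaptive second moment. The numpy sketch below follows our reading of the update rules and is an illustration only; consult the paper and the official repository for the exact, tested formulation.

```python
import numpy as np

def adan_like_step(theta, g, g_prev, m, v, n, lr=1e-3,
                   b1=0.02, b2=0.08, b3=0.01, eps=1e-8, wd=0.0):
    """One Adan-style update (schematic approximation, not the reference code).
    m: first moment, v: gradient-difference moment, n: second moment."""
    d = g - g_prev
    m = (1 - b1) * m + b1 * g
    v = (1 - b2) * v + b2 * d
    n = (1 - b3) * n + b3 * (g + (1 - b2) * d) ** 2
    step = lr * (m + (1 - b2) * v) / (np.sqrt(n) + eps)
    theta = (theta - step) / (1 + lr * wd)      # decoupled weight decay
    return theta, m, v, n
```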
BinBert: Binary Code Understanding with a Fine-tunable and Execution-aware Transformer
Authors: Fiorella Artuso, Marco Mormando, Giuseppe A. Di Luna, Leonardo Querzoni
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Abstract
A recent trend in binary code analysis promotes the use of neural solutions based on instruction embedding models. An instruction embedding model is a neural network that transforms sequences of assembly instructions into embedding vectors. If the embedding network is trained such that the translation from code to vectors partially preserves the semantics, the network effectively represents an assembly code model. In this paper we present BinBert, a novel assembly code model. BinBert is built on a transformer pre-trained on a huge dataset of both assembly instruction sequences and symbolic execution information. BinBert can be applied to assembly instruction sequences and it is fine-tunable, i.e. it can be re-trained as part of a neural architecture on task-specific data. Through fine-tuning, BinBert learns how to apply the general knowledge acquired with pre-training to the specific task. We evaluated BinBert on a multi-task benchmark that we specifically designed to test the understanding of assembly code. The benchmark is composed of several tasks, some taken from the literature, and a few novel tasks that we designed, with a mix of intrinsic and downstream tasks. Our results show that BinBert outperforms state-of-the-art models for binary instruction embedding, raising the bar for binary code understanding.
Flow-Guided Transformer for Video Inpainting
Authors: Kaidong Zhang, Jingjing Fu, Dong Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
We propose a flow-guided transformer, which innovatively leverages the motion discrepancy exposed by optical flows to instruct the attention retrieval in transformers for high-fidelity video inpainting. More specifically, we design a novel flow completion network to complete the corrupted flows by exploiting the relevant flow features in a local temporal window. With the completed flows, we propagate the content across video frames, and adopt the flow-guided transformer to synthesize the remaining corrupted regions. We decouple transformers along the temporal and spatial dimensions, so that we can easily integrate the locally relevant completed flows to instruct spatial attention only. Furthermore, we design a flow-reweight module to precisely control the impact of completed flows on each spatial transformer. For the sake of efficiency, we introduce a window partition strategy to both spatial and temporal transformers. Especially in the spatial transformer, we design a dual-perspective spatial MHSA, which integrates the global tokens into the window-based attention. Extensive experiments demonstrate the effectiveness of the proposed method qualitatively and quantitatively. Codes are available at https://github.com/hitachinsk/FGT.
TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency
Authors: Medhini Narasimhan, Arsha Nagrani, Chen Sun, Michael Rubinstein, Trevor Darrell, Anna Rohrbach, Cordelia Schmid
Subjects: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multimedia (cs.MM)
Abstract
YouTube users looking for instructions for a specific task may spend a long time browsing content trying to find the right video that matches their needs. Creating a visual summary (abridged version of a video) provides viewers with a quick overview and massively reduces search time. In this work, we focus on summarizing instructional videos, an under-explored area of video summarization. In comparison to generic videos, instructional videos can be parsed into semantically meaningful segments that correspond to important steps of the demonstrated task. Existing video summarization datasets rely on manual frame-level annotations, making them subjective and limited in size. To overcome this, we first automatically generate pseudo summaries for a corpus of instructional videos by exploiting two key assumptions: (i) relevant steps are likely to appear in multiple videos of the same task (Task Relevance), and (ii) they are more likely to be described by the demonstrator verbally (Cross-Modal Saliency). We propose an instructional video summarization network that combines a context-aware temporal video encoder and a segment scoring transformer. Using pseudo summaries as weak supervision, our network constructs a visual summary for an instructional video given only video and transcribed speech. To evaluate our model, we collect a high-quality test set, WikiHow Summaries, by scraping WikiHow articles that contain video demonstrations and visual depictions of steps allowing us to obtain the ground-truth summaries. We outperform several baselines and a state-of-the-art video summarization model on this new benchmark.
Hybrid explicit-implicit learning for multiscale problems with time dependent source
Authors: Yalchin Efendiev (Texas A&M University), Wing Tat Leung (City University of Hong Kong), Wenyuan Li (Texas A&M University), Zecheng Zhang (Carnegie Mellon University)
Abstract
The splitting method is a powerful method for solving partial differential equations. Various splitting methods have been designed to separate different physics, nonlinearities, and so on. Recently, a new splitting approach has been proposed where some degrees of freedom are handled implicitly while other degrees of freedom are handled explicitly. As a result, the scheme contains two equations, one implicit and the other explicit. The stability of this approach has been studied. It was shown that the time step scales as the coarse spatial mesh size, which can provide a significant computational advantage. However, the implicit solution part can still be expensive, especially for nonlinear problems. In this paper, we introduce modified partial machine learning algorithms to replace the implicit solution part of the algorithm. These algorithms were first introduced in arXiv:2109.02147, where a homogeneous source term is considered along with the Transformer, which is a neural network that can predict future dynamics. In this paper, we consider time-dependent source terms, which are a generalization of the previous work. Moreover, we use the whole history of the solution to train the network. As the implicit part of the equations is more complicated to solve, we design a neural network to predict it based on training. Furthermore, we compute the explicit part of the solution using our splitting strategy. In addition, we use Proper Orthogonal Decomposition based model reduction in machine learning. The machine learning algorithms provide computational savings without sacrificing accuracy. We present three numerical examples which show that our machine learning scheme is stable and accurate.
HighlightNet: Highlighting Low-Light Potential Features for Real-Time UAV Tracking
Abstract
Low-light environments have posed a formidable challenge for robust unmanned aerial vehicle (UAV) tracking even with state-of-the-art (SOTA) trackers, since the potential image features are hard to extract under adverse light conditions. Besides, due to the low visibility, accurate online selection of the object also becomes extremely difficult for human monitors to initialize UAV tracking in ground control stations. To solve these problems, this work proposes a novel enhancer, i.e., HighlightNet, to light up potential objects for both human operators and UAV trackers. By employing a Transformer, HighlightNet can adjust enhancement parameters according to global features and is thus adaptive to illumination variation. A pixel-level range mask is introduced to make HighlightNet more focused on the enhancement of the tracking object and regions without light sources. Furthermore, a soft truncation mechanism is built to prevent background noise from being mistaken for crucial features. Evaluations on image enhancement benchmarks demonstrate that HighlightNet has advantages in facilitating human perception. Experiments on the public UAVDark135 benchmark show that HighlightNet is more suitable for UAV tracking tasks than other SOTA low-light enhancers. In addition, real-world tests on a typical UAV platform verify HighlightNet's practicability and efficiency in nighttime aerial tracking-related applications. The code and demo videos are available at https://github.com/vision4robotics/HighlightNet.
Underwater Ranker: Learn Which Is Better and How to Be Better
Abstract
In this paper, we present a ranking-based underwater image quality assessment (UIQA) method, abbreviated as URanker. The URanker is built on the efficient conv-attentional image Transformer. In terms of underwater images, we specially devise (1) the histogram prior that embeds the color distribution of an underwater image as histogram token to attend global degradation and (2) the dynamic cross-scale correspondence to model local degradation. The final prediction depends on the class tokens from different scales, which comprehensively considers multi-scale dependencies. With the margin ranking loss, our URanker can accurately rank the order of underwater images of the same scene enhanced by different underwater image enhancement (UIE) algorithms according to their visual quality. To achieve that, we also contribute a dataset, URankerSet, containing sufficient results enhanced by different UIE algorithms and the corresponding perceptual rankings, to train our URanker. Apart from the good performance of URanker, we found that a simple U-shape UIE network can obtain promising performance when it is coupled with our pre-trained URanker as additional supervision. In addition, we also propose a normalization tail that can significantly improve the performance of UIE networks. Extensive experiments demonstrate the state-of-the-art performance of our method. The key designs of our method are discussed. We will release our dataset and code.
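The margin ranking loss mentioned here is the standard pairwise hinge on score differences; for a pair where the first image is the higher-quality one (the margin value below is arbitrary):

```python
import numpy as np

def margin_ranking_loss(score_better, score_worse, margin=0.5):
    """Zero once the higher-quality image outscores the lower-quality one by the margin."""
    return np.maximum(0.0, margin - (score_better - score_worse))
```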
Fast Vocabulary Projection Method via Clustering for Multilingual Machine Translation on GPU
Authors: Hossam Amer, Young Jin Kim, Mohamed Afify, Hitokazu Matsushita, Hany Hassan Awadallah
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract
Multilingual Neural Machine Translation has been showing great success using transformer models. Deploying these models is challenging because they usually require large vocabulary (vocab) sizes for various languages. This limits the speed of predicting the output tokens in the last vocab projection layer. To alleviate these challenges, this paper proposes a fast vocabulary projection method via clustering which can be used for multilingual transformers on GPUs. First, we offline split the vocab search space into disjoint clusters given the hidden context vector of the decoder output, which results in much smaller vocab columns for vocab projection. Second, at inference time, the proposed method predicts the clusters and candidate active tokens for hidden context vectors at the vocab projection. This paper also includes analysis of different ways of building these clusters in multilingual settings. Our results show end-to-end speed gains in float16 GPU inference up to 25% while maintaining the BLEU score and slightly increasing memory cost. The proposed method speeds up the vocab projection step itself by up to 2.6x. We also conduct an extensive human evaluation to verify the proposed method preserves the quality of the translations from the original model.
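The core trick, clustering the output-embedding rows offline and scoring only the tokens in the cluster nearest to the decoder's hidden state, can be sketched as below; the single-cluster shortcut and shapes are simplifications of the paper's method.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_clusters(vocab_emb, n_clusters=64, seed=0):
    """Offline: partition the (V, d) output-embedding matrix into clusters."""
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit(vocab_emb)
    members = [np.where(km.labels_ == c)[0] for c in range(n_clusters)]
    return km.cluster_centers_, members

def fast_project(hidden, vocab_emb, centers, members):
    """Online: compute logits only over the tokens of the nearest cluster
    instead of projecting onto the full vocabulary."""
    c = int(np.argmin(np.linalg.norm(centers - hidden, axis=1)))
    cand = members[c]
    logits = vocab_emb[cand] @ hidden           # partial vocab projection
    return cand[int(np.argmax(logits))]         # best token id among the candidates
```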
AVisT: A Benchmark for Visual Object Tracking in Adverse Visibility
Authors: Mubashir Noman, Wafa Al Ghallabi, Daniya Najiha, Christoph Mayer, Akshay Dudhane, Martin Danelljan, Hisham Cholakkal, Salman Khan, Luc Van Gool, Fahad Shahbaz Khan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
One of the key factors behind the recent success in visual tracking is the availability of dedicated benchmarks. While greatly benefiting tracking research, existing benchmarks do not pose the same difficulty as before, with recent trackers achieving higher performance mainly due to (i) the introduction of more sophisticated transformer-based methods and (ii) the lack of diverse scenarios with adverse visibility such as severe weather conditions, camouflage and imaging effects. We introduce AVisT, a dedicated benchmark for visual tracking in diverse scenarios with adverse visibility. AVisT comprises 120 challenging sequences with 80k annotated frames, spanning 18 diverse scenarios broadly grouped into five attributes with 42 object categories. The key contribution of AVisT is diverse and challenging scenarios covering severe weather conditions such as dense fog, heavy rain and sandstorm; obstruction effects including fire, sun glare and splashing water; adverse imaging effects such as low-light; and target effects including small targets and distractor objects along with camouflage. We further benchmark 17 popular and recent trackers on AVisT with detailed analysis of their tracking performance across attributes, demonstrating considerable room for improvement in performance. We believe that AVisT can greatly benefit the tracking community by complementing the existing benchmarks, in developing new creative tracking solutions in order to continue pushing the boundaries of the state-of-the-art. Our dataset along with the complete tracking performance evaluation is available at: https://github.com/visionml/pytracking
Continuous Active Learning Using Pretrained Transformers
Authors: Nima Sadri, Gordon V. Cormack
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract
Pre-trained and fine-tuned transformer models like BERT and T5 have improved the state of the art in ad-hoc retrieval and question-answering, but not as yet in high-recall information retrieval, where the objective is to retrieve substantially all relevant documents. We investigate whether the use of transformer-based models for reranking and/or featurization can improve the Baseline Model Implementation of the TREC Total Recall Track, which represents the current state of the art for high-recall information retrieval. We also introduce CALBERT, a model that can be used to continuously fine-tune a BERT-based model based on relevance feedback.
Evaluating Dense Passage Retrieval using Transformers
Authors: Nima Sadri
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract
Although representational retrieval models based on Transformers have been able to make major advances in the past few years, and despite the widely accepted conventions and best-practices for testing such models, a $\textit{standardized}$ evaluation framework for testing them has not been developed. In this work, we formalize the best practices and conventions followed by researchers in the literature, paving the path for more standardized evaluations - and therefore more fair comparisons between the models. Our framework (1) embeds the documents and queries; (2) for each query-document pair, computes the relevance score based on the dot product of the document and query embedding; (3) uses the $\texttt{dev}$ set of the MSMARCO dataset to evaluate the models; (4) uses the $\texttt{trec_eval}$ script to calculate MRR@100, which is the primary metric used to evaluate the models. Most importantly, we showcase the use of this framework by experimenting on some of the most well-known dense retrieval models.
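Steps (1)-(4) reduce to a dot-product scorer plus MRR@k; a self-contained toy version over in-memory embeddings is shown below (real runs would use the MSMARCO dev set and the trec_eval script).

```python
import numpy as np

def mrr_at_k(query_emb, doc_emb, relevant_doc, k=100):
    """Score every document by dot product with the query embedding and return
    the reciprocal rank of the relevant document if it appears in the top k."""
    scores = doc_emb @ query_emb                  # (num_docs,)
    ranking = np.argsort(-scores)[:k]
    hits = np.where(ranking == relevant_doc)[0]
    return 1.0 / (hits[0] + 1) if hits.size else 0.0

# Toy usage: three 4-d document embeddings, query closest to document 1.
docs = np.array([[1.0, 0, 0, 0], [0, 1.0, 0, 0], [0, 0, 1.0, 0]])
print(mrr_at_k(np.array([0.1, 0.9, 0.0, 0.0]), docs, relevant_doc=1))  # -> 1.0
```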
DuETA: Traffic Congestion Propagation Pattern Modeling via Efficient Graph Learning for ETA Prediction at Baidu Maps
Abstract
Estimated time of arrival (ETA) prediction, also known as travel time estimation, is a fundamental task for a wide range of intelligent transportation applications, such as navigation, route planning, and ride-hailing services. To accurately predict the travel time of a route, it is essential to take into account both contextual and predictive factors, such as spatial-temporal interaction, driving behavior, and traffic congestion propagation inference. The ETA prediction models previously deployed at Baidu Maps have addressed the factors of spatial-temporal interaction (ConSTGAT) and driving behavior (SSML). In this work, we focus on modeling traffic congestion propagation patterns to improve ETA performance. Traffic congestion propagation pattern modeling is challenging, and it requires accounting for impact regions over time and cumulative effect of delay variations over time caused by traffic events on the road network. In this paper, we present a practical industrial-grade ETA prediction framework named DuETA. Specifically, we construct a congestion-sensitive graph based on the correlations of traffic patterns, and we develop a route-aware graph transformer to directly learn the long-distance correlations of the road segments. This design enables DuETA to capture the interactions between the road segment pairs that are spatially distant but highly correlated with traffic conditions. Extensive experiments are conducted on large-scale, real-world datasets collected from Baidu Maps. Experimental results show that ETA prediction can significantly benefit from the learned traffic congestion propagation patterns. In addition, DuETA has already been deployed in production at Baidu Maps, serving billions of requests every day. This demonstrates that DuETA is an industrial-grade and robust solution for large-scale ETA prediction services.
Towards Interpretable Sleep Stage Classification Using Cross-Modal Transformers
Authors: Jathurshan Pradeepkumar, Mithunjha Anandakumar, Vinith Kugathasan, Dhinesh Suntharalingham, Simon L. Kappel, Anjula C. De Silva, Chamira U. S. Edussooriya
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
Abstract
Accurate sleep stage classification is significant for sleep health assessment. In recent years, several deep learning and machine learning based sleep staging algorithms have been developed and have achieved performance on par with human annotation. Despite the improved performance, a limitation of most deep-learning based algorithms is their black-box behavior, which has limited their use in clinical settings. Here, we propose Cross-Modal Transformers, a transformer-based method for sleep stage classification. Our models achieve competitive performance with the state-of-the-art approaches and eliminate the black-box behavior of deep-learning models by utilizing the interpretability aspect of the attention modules. The proposed cross-modal transformers consist of a novel cross-modal transformer encoder architecture along with a multi-scale 1-dimensional convolutional neural network for automatic representation learning. Our sleep stage classifier based on this design was able to achieve sleep stage classification performance on par with or better than the state-of-the-art approaches, along with interpretability, a fourfold reduction in the number of parameters and a reduced training time compared to the current state-of-the-art. Our code is available at https://github.com/Jathurshan0330/Cross-Modal-Transformer.
Self-Supervised Vision Transformers for Malware Detection
Abstract
Malware detection plays a crucial role in cyber-security with the increase in malware growth and advancements in cyber-attacks. Previously unseen malware that has not been identified by security vendors is often used in these attacks, and it is becoming imperative to find a solution that can self-learn from unlabeled sample data. This paper presents SHERLOCK, a self-supervision based deep learning model to detect malware based on the Vision Transformer (ViT) architecture. SHERLOCK is a novel malware detection method which learns unique features to differentiate malware from benign programs using an image-based binary representation. Experimental results using 1.2 million Android applications across a hierarchy of 47 types and 696 families show that self-supervised learning can achieve an accuracy of 97% for the binary classification of malware, which is higher than existing state-of-the-art techniques. Our proposed model is also able to outperform state-of-the-art techniques for multi-class malware classification of types and families, with macro-F1 scores of 0.497 and 0.491, respectively.
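A minimal sketch of the image-based binary representation mentioned in the abstract, assuming a simple bytes-to-grayscale reshaping; the image width, zero padding, and resizing below are illustrative choices, not SHERLOCK's exact preprocessing.

# Minimal sketch: reshape the raw bytes of a program into a fixed-size
# grayscale image that a ViT-style model can consume.
import numpy as np
from PIL import Image

def binary_to_image(path, width=256, size=(224, 224)):
    data = np.frombuffer(open(path, "rb").read(), dtype=np.uint8)
    height = int(np.ceil(len(data) / width))
    padded = np.zeros(height * width, dtype=np.uint8)
    padded[:len(data)] = data                      # zero-pad the last row
    img = Image.fromarray(padded.reshape(height, width), mode="L")
    return img.resize(size)                        # fixed input size for the backbone

# img = binary_to_image("sample.apk")  # hypothetical file path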
A Vision Transformer-Based Approach to Bearing Fault Classification via Vibration Signals
Abstract
Rolling bearings are the most crucial components of rotating machinery. Identifying defective bearings in a timely manner may prevent the malfunction of an entire machinery system. The mechanical condition monitoring field has entered the big data phase as a result of the rapid advancement of machinery. When working with large amounts of data, the manual feature extraction approach has the drawback of being inefficient and inaccurate. Data-driven methods like deep learning have been successfully used in recent years for mechanical intelligent fault detection. Convolutional neural networks (CNNs) were mostly used in earlier research to detect and identify bearing faults. The CNN model, however, has trouble capturing fault-time information, which degrades classification results. In this study, bearing defects have been classified using a state-of-the-art Vision Transformer (ViT). Bearing defects were classified using Case Western Reserve University (CWRU) bearing failure laboratory experimental data. The research took into account 13 distinct kinds of defects under zero-load conditions in addition to normal bearing conditions. Using the short-time Fourier transform (STFT), the vibration signals were converted into 2D time-frequency images. The 2D time-frequency images are used as input to the ViT. The model achieved an overall accuracy of 98.8%.
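The STFT preprocessing step can be sketched as follows; the sampling rate, window length, and synthetic stand-in signal are placeholders rather than the CWRU settings.

# Minimal sketch: turn a 1D vibration signal into a 2D time-frequency
# image suitable for a ViT classifier.
import numpy as np
from scipy.signal import stft

fs = 12_000                                   # assumed sampling rate (Hz)
t = np.arange(fs) / fs
signal = np.sin(2 * np.pi * 1_600 * t) + 0.1 * np.random.randn(fs)  # stand-in signal

f, tt, Zxx = stft(signal, fs=fs, nperseg=256)
spectrogram = np.abs(Zxx)                     # 2D magnitude image, shape (freq, time)
print(spectrogram.shape)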
Z-BERT-A: a zero-shot Pipeline for Unknown Intent detection
Authors: Daniele Comi, Dimitrios Christofidellis, Pier Francesco Piazza, Matteo Manica
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract
Intent discovery is a fundamental task in NLP, and it is increasingly relevant for a variety of industrial applications (Quarteroni 2018). The main challenge resides in the need to identify novel unseen intents from input utterances. Herein, we propose Z-BERT-A, a two-stage method for intent discovery relying on a Transformer architecture (Vaswani et al. 2017; Devlin et al. 2018), fine-tuned with Adapters (Pfeiffer et al. 2020), initially trained for Natural Language Inference (NLI), and later applied for unknown intent classification in a zero-shot setting. In our evaluation, we first analyze the quality of the model after adaptive fine-tuning on known classes. Second, we evaluate its performance casting intent classification as an NLI task. Lastly, we test the zero-shot performance of the model on unseen classes, showing how Z-BERT-A can effectively perform intent discovery by generating intents that are semantically similar, if not equal, to the ground truth ones. Our experiments show that Z-BERT-A outperforms a wide variety of baselines in two zero-shot settings: known intent classification and unseen intent discovery. The proposed pipeline holds the potential to be widely applied in a variety of customer-care applications. It enables automated dynamic triage using a lightweight model that, unlike large language models, can be easily deployed and scaled in a wide variety of business scenarios, especially in settings with limited hardware availability where on-premise or low-resource cloud deployments are imperative. Z-BERT-A, predicting novel intents from a single utterance, represents an innovative approach for intent discovery, enabling online generation of novel intents. The pipeline is available as an installable python package at the following link: https://github.com/GT4SD/zberta.
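A generic illustration of NLI-based zero-shot intent classification (the idea behind the stage described above) using an off-the-shelf Hugging Face NLI model; this is not the released zberta package, and the model name, utterance, and candidate intents are illustrative.

# Minimal sketch: zero-shot intent classification by casting the task as NLI.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

utterance = "I need to change the shipping address on my order."
candidate_intents = ["update delivery address", "cancel order", "track package"]

result = classifier(utterance, candidate_labels=candidate_intents)
print(result["labels"][0], result["scores"][0])  # top-ranked intent and its score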
Class-attention Video Transformer for Engagement Intensity Prediction
Authors: Xusheng Ai, Victor S. Sheng, Chunhua Li
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract
In order to deal with variable-length long videos, prior works extract multi-modal features and fuse them to predict students' engagement intensity. In this paper, we present a new end-to-end method, Class Attention in Video Transformer (CavT), which uses a single class-embedding vector to uniformly perform end-to-end learning on variable-length long videos and fixed-length short videos. Furthermore, to address the lack of sufficient samples, we propose a binary-order representatives sampling method (BorS) that adds multiple video sequences of each video to augment the training set. BorS+CavT not only achieves the state-of-the-art MSE (0.0495) on the EmotiW-EP dataset, but also obtains the state-of-the-art MSE (0.0377) on the DAiSEE dataset. The code and models will be made publicly available at https://github.com/mountainai/cavt.
PatchDropout: Economizing Vision Transformers Using Patch Dropout
Authors: Yue Liu, Christos Matsoukas, Fredrik Strand, Hossein Azizpour, Kevin Smith
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract
Vision transformers have demonstrated the potential to outperform CNNs in a variety of vision tasks. But the computational and memory requirements of these models prohibit their use in many applications, especially those that depend on high-resolution images, such as medical image classification. Efforts to train ViTs more efficiently are overly complicated, necessitating architectural changes or intricate training schemes. In this work, we show that standard ViT models can be efficiently trained at high resolution by randomly dropping input image patches. This simple approach, PatchDropout, reduces FLOPs and memory by at least 50% in standard natural image datasets such as ImageNet, and those savings only increase with image size. On CSAW, a high-resolution medical dataset, we observe a 5 times savings in computation and memory using PatchDropout, along with a boost in performance. For practitioners with a fixed computational or memory budget, PatchDropout makes it possible to choose image resolution, hyperparameters, or model size to get the most performance out of their model.
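The core operation is simple enough to sketch directly: randomly keep a subset of patch tokens (retaining the class token) before the transformer blocks. The keep ratio and tensor shapes below are illustrative, not the paper's settings.

# Minimal sketch of patch dropout for a ViT-style token sequence.
import torch

def patch_dropout(tokens, keep_ratio=0.5):
    # tokens: (batch, 1 + n_patches, dim) with the CLS token at index 0
    cls_tok, patches = tokens[:, :1], tokens[:, 1:]
    b, n, _ = patches.shape
    n_keep = max(1, int(n * keep_ratio))
    idx = torch.rand(b, n, device=tokens.device).argsort(dim=1)[:, :n_keep]
    kept = torch.gather(patches, 1, idx.unsqueeze(-1).expand(-1, -1, patches.size(-1)))
    return torch.cat([cls_tok, kept], dim=1)   # fewer tokens -> fewer FLOPs and less memory

out = patch_dropout(torch.randn(4, 1 + 196, 768))
print(out.shape)  # torch.Size([4, 99, 768])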
Multi-modal Transformer Path Prediction for Autonomous Vehicle
Authors: Chia Hong Tseng, Jie Zhang, Min-Te Sun, Kazuya Sakai, Wei-Shinn Ku
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Abstract
Reasoning about vehicle path prediction is an essential and challenging problem for the safe operation of autonomous driving systems. There exist many research works for path prediction; however, most of them do not use lane information and are not based on the Transformer architecture. By utilizing different types of data collected from sensors equipped on self-driving vehicles, we propose a path prediction system named Multi-modal Transformer Path Prediction (MTPP) that aims to predict the long-term future trajectory of target agents. To achieve more accurate path prediction, the Transformer architecture is adopted in our model. To better utilize the lane information, lanes running in the opposite direction to the target agent, which are unlikely to be taken, are filtered out. In addition, consecutive lane chunks are combined to ensure that the lane input is long enough for path prediction. An extensive evaluation is conducted to show the efficacy of the proposed system using nuScenes, a real-world trajectory forecasting dataset.
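The lane pre-filtering heuristic can be sketched as a direction check between each lane and the target agent's heading; the data layout and threshold below are assumptions, not the MTPP implementation.

# Minimal sketch: drop lanes whose direction opposes the agent's heading.
import numpy as np

def filter_opposing_lanes(lane_dirs, agent_heading, cos_threshold=0.0):
    # lane_dirs: (n_lanes, 2) unit direction vectors; agent_heading: (2,) unit vector
    alignment = lane_dirs @ agent_heading           # cosine of angle to the heading
    return np.where(alignment > cos_threshold)[0]   # keep roughly same-direction lanes

lanes = np.array([[1.0, 0.0], [-1.0, 0.0], [0.7, 0.7]])
keep = filter_opposing_lanes(lanes, np.array([1.0, 0.0]))
print(keep)  # [0 2]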
Transformer-based Value Function Decomposition for Cooperative Multi-agent Reinforcement Learning in StarCraft
Authors: Muhammad Junaid Khan, Syed Hammad Ahmed, Gita Sukthankar
Subjects: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
Abstract
The StarCraft II Multi-Agent Challenge (SMAC) was created to be a challenging benchmark problem for cooperative multi-agent reinforcement learning (MARL). SMAC focuses exclusively on the problem of StarCraft micromanagement and assumes that each unit is controlled individually by a learning agent that acts independently and only possesses local information; centralized training is assumed to occur with decentralized execution (CTDE). To perform well in SMAC, MARL algorithms must handle the dual problems of multi-agent credit assignment and joint action evaluation. This paper introduces a new architecture, TransMix, a transformer-based joint action-value mixing network which we show to be efficient and scalable compared to other state-of-the-art cooperative MARL solutions. TransMix leverages the ability of transformers to learn a richer mixing function for combining the agents' individual value functions. It achieves comparable performance to previous work on easy SMAC scenarios and outperforms other techniques on hard scenarios, as well as scenarios that are corrupted with Gaussian noise to simulate fog of war.
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
Authors: Tim Dettmers, Mike Lewis, Younes Belkada, Luke Zettlemoyer
Abstract
Large language models have been widely adopted but require significant GPU memory for inference. We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers, which cuts the memory needed for inference in half while retaining full precision performance. With our method, a 175B parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately without performance degradation. This is made possible by understanding and working around properties of highly systematic emergent features in transformer language models that dominate attention and transformer predictive performance. To cope with these features, we develop a two-part quantization procedure, LLM.int8(). We first use vector-wise quantization with separate normalization constants for each inner product in the matrix multiplication to quantize most of the features. For the emergent outliers, however, we include a new mixed-precision decomposition scheme, which isolates the outlier feature dimensions into a 16-bit matrix multiplication while more than 99.9% of values are still multiplied in 8-bit. Using LLM.int8(), we show empirically that it is possible to perform inference in LLMs with up to 175B parameters without any performance degradation. This result makes such models much more accessible, for example making it possible to use OPT-175B/BLOOM on a single server with consumer GPUs.
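The two ingredients described above, vector-wise quantization with per-row/per-column scales and a mixed-precision path for outlier feature dimensions, can be sketched in NumPy as follows. The outlier threshold and shapes are illustrative; this is a conceptual sketch, not the bitsandbytes implementation.

# Minimal sketch: int8 matmul with per-vector scales plus an fp16 path
# for outlier feature columns.
import numpy as np

def int8_matmul_with_outliers(X, W, outlier_threshold=6.0):
    # X: (tokens, features), W: (features, out); find outlier feature columns
    outlier_cols = np.where(np.abs(X).max(axis=0) > outlier_threshold)[0]
    regular_cols = np.setdiff1d(np.arange(X.shape[1]), outlier_cols)

    # vector-wise scales: one per row of X and one per column of W
    Xr, Wr = X[:, regular_cols], W[regular_cols]
    sx = np.abs(Xr).max(axis=1, keepdims=True) / 127.0
    sw = np.abs(Wr).max(axis=0, keepdims=True) / 127.0
    Xq = np.round(Xr / sx).astype(np.int8)
    Wq = np.round(Wr / sw).astype(np.int8)

    regular = (Xq.astype(np.int32) @ Wq.astype(np.int32)) * (sx * sw)   # dequantize
    outlier = X[:, outlier_cols].astype(np.float16) @ W[outlier_cols].astype(np.float16)
    return regular + outlier

Y = int8_matmul_with_outliers(np.random.randn(4, 64), np.random.randn(64, 32))
print(Y.shape)  # (4, 32)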
Keyword: autonomous driving
Surrogate-assisted Multi-objective Neural Architecture Search for Real-time Semantic Segmentation
Authors: Zhichao Lu, Ran Cheng, Shihua Huang, Haoming Zhang, Changxiao Qiu, Fan Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
Abstract
The architectural advancements in deep neural networks have led to remarkable leaps forward across a broad array of computer vision tasks. Instead of relying on human expertise, neural architecture search (NAS) has emerged as a promising avenue toward automating the design of architectures. While recent achievements in image classification have suggested opportunities, the promises of NAS have yet to be thoroughly assessed on the more challenging task of semantic segmentation. The main challenges of applying NAS to semantic segmentation arise from two aspects: (i) high-resolution images to be processed; (ii) the additional requirement of real-time inference speed (i.e., real-time semantic segmentation) for applications such as autonomous driving. To meet such challenges, we propose a surrogate-assisted multi-objective method in this paper. Through a series of customized prediction models, our method effectively transforms the original NAS task into an ordinary multi-objective optimization problem. Combined with a hierarchical pre-screening criterion for in-fill selection, our method progressively achieves a set of efficient architectures trading off between segmentation accuracy and inference speed. Empirical evaluations on three benchmark datasets, together with an application using the Huawei Atlas 200 DK, suggest that our method can identify architectures significantly outperforming existing state-of-the-art architectures designed both manually by human experts and automatically by other NAS methods.
An Empirical Study of Pseudo-Labeling for Image-based 3D Object Detection
Authors: Xinzhu Ma, Yuan Meng, Yinmin Zhang, Lei Bai, Jun Hou, Shuai Yi, Wanli Ouyang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Image-based 3D detection is an indispensable component of the perception system for autonomous driving. However, it still suffers from unsatisfactory performance, one of the main reasons being the limited training data. Unfortunately, annotating objects in 3D space is extremely time- and resource-consuming, which makes it hard to extend the training set arbitrarily. In this work, we focus on the semi-supervised setting and explore the feasibility of a cheaper alternative, i.e. pseudo-labeling, to leverage the unlabeled data. For this purpose, we conduct extensive experiments to investigate whether pseudo-labels can provide effective supervision for the baseline models under varying settings. The experimental results not only demonstrate the effectiveness of the pseudo-labeling mechanism for image-based 3D detection (e.g. under the monocular setting, we achieve 20.23 AP for the moderate level on the KITTI-3D testing set without bells and whistles, improving the baseline model by 6.03 AP), but also show several interesting and noteworthy findings (e.g. the models trained with pseudo-labels perform better than those trained with ground-truth annotations on the same training data). We hope this work can provide insights for the image-based 3D detection community under a semi-supervised setting. The codes, pseudo-labels, and pre-trained models will be publicly available.
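At its simplest, pseudo-labeling reduces to a confidence filter over a trained detector's predictions on unlabeled images; the detector interface and score threshold below are hypothetical placeholders, not the paper's pipeline.

# Minimal sketch: keep high-confidence predictions as pseudo ground truth.
def generate_pseudo_labels(detector, unlabeled_images, score_threshold=0.7):
    pseudo_labels = []
    for image in unlabeled_images:
        detections = detector(image)  # assumed to return [(box_3d, label, score), ...]
        kept = [(box, label) for box, label, score in detections if score >= score_threshold]
        pseudo_labels.append(kept)
    return pseudo_labels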
Man-in-the-Middle Attack against Object Detection Systems
Authors: Han Wu, Sareh Rowlands, Johan Wahlstrom
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Abstract
Is deep learning secure for robots? As embedded systems gain access to more powerful CPUs and GPUs, deep-learning-enabled object detection systems are becoming pervasive in robotic applications. Meanwhile, prior research has revealed that deep learning models are vulnerable to adversarial attacks. Does this put real-world robots at risk? Our research borrows the idea of the Man-in-the-Middle attack from cryptography to attack an object detection system. Our experimental results show that we can generate a strong Universal Adversarial Perturbation (UAP) within one minute and then use the perturbation to attack a detection system via the Man-in-the-Middle attack. Our findings raise a serious concern over the applications of deep learning models in safety-critical systems such as autonomous driving.
Virtual Reality Assisted Human Perception in ADAS Development: a Munich 3D Model Study
Authors: Felix Bognar, Oster Markus, Herman Van der Auweraer, Tong Duy Son
Abstract
As the development of autonomous driving (AD) and advanced driver assistance systems (ADAS) progresses, the comfort of users is gaining increasing interest. It is becoming important to test and validate perceived comfort performance from the early phases of system development, before driving on roads. Most present ADAS test procedures are not efficient at performing such comfort evaluation. One of the main challenges is to integrate high-quality, realistic, and predictable virtual traffic scenarios into an ADAS testing framework that has physics-based sensors capable of sensing the virtual environment. In this paper, we present our development of a virtual reality based ADAS testing framework that enhances human perception evaluation. The main contribution consists of three aspects. First, we introduce our development of a large and high-quality (in realism, structure, and texture) 3D traffic model of the city of Munich in Germany. Second, we optimize the 3D model for virtual reality use and make it real-time capable for human-in-the-loop ADAS testing. Finally, the model is integrated into an ADAS framework for testing and validating ADAS functionalities and perceived comfort performance. The developed framework components are presented with illustrative examples.
Multi-modal Transformer Path Prediction for Autonomous Vehicle
Authors: Chia Hong Tseng, Jie Zhang, Min-Te Sun, Kazuya Sakai, Wei-Shinn Ku
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Abstract
Reasoning about vehicle path prediction is an essential and challenging problem for the safe operation of autonomous driving systems. There exist many research works for path prediction; however, most of them do not use lane information and are not based on the Transformer architecture. By utilizing different types of data collected from sensors equipped on self-driving vehicles, we propose a path prediction system named Multi-modal Transformer Path Prediction (MTPP) that aims to predict the long-term future trajectory of target agents. To achieve more accurate path prediction, the Transformer architecture is adopted in our model. To better utilize the lane information, lanes running in the opposite direction to the target agent, which are unlikely to be taken, are filtered out. In addition, consecutive lane chunks are combined to ensure that the lane input is long enough for path prediction. An extensive evaluation is conducted to show the efficacy of the proposed system using nuScenes, a real-world trajectory forecasting dataset.
A Simulation Study of Passing Drivers' Responses to the Automated Truck-Mounted Attenuator System in Road Maintenance
Authors: Yu Li, Bill Wang, William Li, Ruwen Qin
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Abstract
The Autonomous Truck-Mounted Attenuator (ATMA) system is a lead-follower vehicle system based on autonomous driving and connected vehicle technologies. The lead truck performs maintenance tasks on the road, and the unmanned follower truck is designed to improve the visibility of the moving work zone to passing vehicles and to protect workers and equipment. While the ATMA has been under testing by transportation maintenance and operations agencies in recent years, a simulator-based testing capability is a useful supplement, especially when human subjects are involved. This paper aims to discover how passing drivers perceive, understand, and react to the ATMA system in road maintenance. A driving simulator for ATMA studies is developed for collecting the driving data. Driving simulation experiments were then performed, in which a screen-based eye tracker collected sixteen subjects' gaze points and pupil diameters. Data analysis evidenced changes in the visual attention pattern of subjects as they passed the ATMA. On average, the ATMA starts to attract subjects' attention from 500 ft behind the follower truck. Most (87.50%) understood the follower truck's protection purpose, and the majority (86.67%) reasoned about the association between the two trucks. Still, many (43.75%) did not recognize that the ATMA is a connected autonomous vehicle system. While all subjects safely changed lanes and attempted to pass the slow-moving ATMA, their inadequate understanding of the ATMA is a potential risk, such as cutting in between the ATMA trucks. The results imply that transportation maintenance and operations agencies should take this into consideration when establishing deployment guidance.
Vehicle-road Cooperative Simulation and 3D Visualization System
Abstract
The safety of single-vehicle autonomous driving technology is limited by the perception capability of on-board sensors. In contrast, vehicle-road collaboration technology can overcome those limits and improve traffic safety and efficiency by expanding the sensing range, improving the perception accuracy, and reducing the response time. However, such technology is still under development; it requires rigorous testing and verification methods to ensure the reliability and trustworthiness of the technology. In this thesis, we focus on three major tasks: (1) analyze the functional characteristics of vehicle-road cooperation scenarios, highlighting the differences between vehicle-road cooperative systems and traditional single-vehicle autonomous driving systems; (2) refine and classify the functional characteristics of vehicle-road cooperative systems; (3) design and develop a simulation system, and provide a visual interface to facilitate development and analysis. The efficiency and effectiveness of the proposed method are verified by experiments.
Towards Robust On-Ramp Merging via Augmented Multimodal Reinforcement Learning
Authors: Gaurav Bagwe, Jian Li, Xiaoyong Yuan, Lan Zhang
Abstract
Despite the success of AI-enabled onboard perception, on-ramp merging has been one of the main challenges for autonomous driving. Due to the limited sensing range of onboard sensors, a merging vehicle can hardly observe main road conditions and merge properly. By leveraging wireless communications between connected and automated vehicles (CAVs), a merging CAV has the potential to proactively obtain the intentions of nearby vehicles. However, CAVs can be prone to inaccurate observations, such as noisy basic safety messages (BSM) and poor-quality surveillance images. In this paper, we present a novel approach for robust on-ramp merging of CAVs via Augmented and Multi-modal Reinforcement Learning, named RAMRL. Specifically, we formulate the on-ramp merging problem as a Markov decision process (MDP) by taking driving safety, comfortable driving behavior, and traffic efficiency into account. To provide reliable merging maneuvers, we simultaneously leverage BSM and surveillance images for multi-modal observation, which is used to learn a policy model through proximal policy optimization (PPO). Moreover, to improve data efficiency and provide better generalization performance, we train the policy model with augmented data (e.g., noisy BSM and noisy surveillance images). Extensive experiments are conducted with the Simulation of Urban MObility (SUMO) platform under two typical merging scenarios. Experimental results demonstrate the effectiveness and efficiency of our robust on-ramp merging design.
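The data-augmentation component can be sketched as adding noise to both observation modalities before they reach the policy; the noise scales and the BSM field layout below are assumptions rather than the RAMRL configuration.

# Minimal sketch: perturb the BSM vector and the surveillance image.
import numpy as np

def augment_observation(bsm, image, bsm_sigma=0.05, pixel_sigma=5.0, rng=np.random):
    noisy_bsm = bsm + rng.normal(0.0, bsm_sigma, size=bsm.shape)
    noisy_image = np.clip(image.astype(np.float32) + rng.normal(0.0, pixel_sigma, size=image.shape), 0, 255)
    return noisy_bsm, noisy_image.astype(image.dtype)

bsm = np.array([25.0, 0.0, 3.5])  # illustrative BSM fields: speed, heading, lane offset
image = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
aug_bsm, aug_image = augment_observation(bsm, image)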
New submissions for Tue, 16 Aug 22
Keyword: SLAM
There is no result
Keyword: odometry
Look Both Ways: Bidirectional Visual Sensing for Automatic Multi-Camera Registration
Keyword: livox
There is no result
Keyword: loam
There is no result
Keyword: lidar
How Does Data Freshness Affect Real-time Supervised Learning?
Look Both Ways: Bidirectional Visual Sensing for Automatic Multi-Camera Registration
Online Pole Segmentation on Range Images for Long-term LiDAR Localization in Urban Environments
Keyword: loop detection
There is no result
Keyword: nerf
UPST-NeRF: Universal Photorealistic Style Transfer of Neural Radiance Fields for 3D Scene
DM-NeRF: 3D Scene Geometry Decomposition and Manipulation from 2D Images
Keyword: mapping
Revisiting Adversarial Attacks on Graph Neural Networks for Graph Classification
IPvSeeYou: Exploiting Leaked Identifiers in IPv6 for Street-Level Geolocation
HDR-Plenoxels: Self-Calibrating High Dynamic Range Radiance Fields
Block Orthogonal Sparse Superposition Codes for Ultra-Reliable Low-Latency Communications
Global Priors Guided Modulation Network for Joint Super-Resolution and Inverse Tone-Mapping
Visual Localization via Few-Shot Scene Region Classification
MACE: Multi-Agent Autonomous Collaborative Exploration of Unknown Environments
Automatic Landmark Detection and Registration of Brain Cortical Surfaces via Quasi-Conformal Geometry and Convolutional Neural Networks
Non-Blocking Batch A* (Technical Report)
LCSM: A Lightweight Complex Spectral Mapping Framework for Stereophonic Acoustic Echo Cancellation
Keyword: localization
Finding Point with Image: An End-to-End Benchmark for Vision-based UAV Localization
Visual Localization via Few-Shot Scene Region Classification
Terrain-based vehicle localization using an active suspension system
Online Pole Segmentation on Range Images for Long-term LiDAR Localization in Urban Environments
Keyword: transformer
LM-CORE: Language Models with Contextually Relevant External Knowledge
Finding Point with Image: An End-to-End Benchmark for Vision-based UAV Localization
GEDI: A Graph-based End-to-end Data Imputation Framework
Enhanced Vehicle Re-identification for ITS: A Feature Fusion approach using Deep Learning
Interpreting BERT-based Text Similarity via Activation and Saliency Maps
Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models
BinBert: Binary Code Understanding with a Fine-tunable and Execution-aware Transformer
Flow-Guided Transformer for Video Inpainting
TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency
Hybrid explicit-implicit learning for multiscale problems with time dependent source
HighlightNet: Highlighting Low-Light Potential Features for Real-Time UAV Tracking
Underwater Ranker: Learn Which Is Better and How to Be Better
Fast Vocabulary Projection Method via Clustering for Multilingual Machine Translation on GPU
AVisT: A Benchmark for Visual Object Tracking in Adverse Visibility
Continuous Active Learning Using Pretrained Transformers
Evaluating Dense Passage Retrieval using Transformers
DuETA: Traffic Congestion Propagation Pattern Modeling via Efficient Graph Learning for ETA Prediction at Baidu Maps
Towards Interpretable Sleep Stage Classification Using Cross-Modal Transformers
Self-Supervised Vision Transformers for Malware Detection
A Vision Transformer-Based Approach to Bearing Fault Classification via Vibration Signals
Z-BERT-A: a zero-shot Pipeline for Unknown Intent detection
Class-attention Video Transformer for Engagement Intensity Prediction
PatchDropout: Economizing Vision Transformers Using Patch Dropout
Multi-modal Transformer Path Prediction for Autonomous Vehicle
Transformer-based Value Function Decomposition for Cooperative Multi-agent Reinforcement Learning in StarCraft
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
Keyword: autonomous driving
Surrogate-assisted Multi-objective Neural Architecture Search for Real-time Semantic Segmentation
An Empirical Study of Pseudo-Labeling for Image-based 3D Object Detection
Man-in-the-Middle Attack against Object Detection Systems
Virtual Reality Assisted Human Perception in ADAS Development: a Munich 3D Model Study
Multi-modal Transformer Path Prediction for Autonomous Vehicle
A Simulation Study of Passing Drivers' Responses to the Automated Truck-Mounted Attenuator System in Road Maintenance
Vehicle-road Cooperative Simulation and 3D Visualization System
Towards Robust On-Ramp Merging via Augmented Multimodal Reinforcement Learning