Keyword: lidar
SELMA: SEmantic Large-scale Multimodal Acquisitions in Variable Weather, Daytime and Viewpoints
Authors: Paolo Testolina, Francesco Barbato, Umberto Michieli, Marco Giordani, Pietro Zanuttigh, Michele Zorzi
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract
Accurate scene understanding from multiple sensors mounted on cars is a key requirement for autonomous driving systems. Nowadays, this task is mainly performed through data-hungry deep learning techniques that require very large amounts of training data. Due to the high cost of segmentation labeling, many synthetic datasets have been proposed. However, most of them lack the multi-sensor nature of the data, and do not capture the significant changes introduced by the variation of daytime and weather conditions. To fill these gaps, we introduce SELMA, a novel synthetic dataset for semantic segmentation that contains more than 30K unique waypoints acquired from 24 different sensors, including RGB, depth, and semantic cameras and LiDARs, in 27 different atmospheric and daytime conditions, for a total of more than 20M samples. SELMA is based on CARLA, an open-source simulator for generating synthetic data in autonomous driving scenarios, which we modified to increase the variability and diversity of the scenes and class sets, and to align it with other benchmark datasets. As shown by the experimental evaluation, SELMA allows the efficient training of standard and multi-modal deep learning architectures, and achieves remarkable results on real-world data. SELMA is free and publicly available, thus supporting open science and research.
Multimodal Gaussian Mixture Model for Realtime Roadside LiDAR Object Detection
Authors: Tianya Zhang, Peter J. Jin, Yi Ge
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Background modeling is widely used in intelligent surveillance systems to detect moving targets by subtracting the static background components. Most roadside LiDAR object detection methods filter out foreground points by comparing new points to pre-trained background references built from descriptive statistics over many frames (e.g., voxel density, slopes, maximum distance). These solutions are not efficient under heavy traffic, and their parameter values are hard to transfer from one scenario to another. In early studies, video-based background modeling methods were considered unsuitable for roadside LiDAR surveillance systems due to the sparse and unstructured point cloud data. In this paper, the raw LiDAR data are transformed into a multi-dimensional tensor structure based on the elevation and azimuth value of each LiDAR point. With this high-order data representation, we break the barrier that prevented applying the efficient Gaussian Mixture Model (GMM) method to roadside LiDAR background modeling. The probabilistic GMM is built with superior agility and real-time capability. The proposed method was compared against two state-of-the-art roadside LiDAR background models and evaluated at the point, object, and path levels, demonstrating better robustness under heavy traffic and challenging weather. This multimodal GMM method can handle dynamic backgrounds with noisy measurements and substantially enhances infrastructure-based LiDAR object detection, from which various 3D models for smart city applications could be created.
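A minimal sketch of the general idea follows: tensorize the sweep into an elevation-azimuth range image, then run a per-cell online Gaussian mixture in the style of Stauffer-Grimson video background models. The 32-beam geometry, angular spans, and update rule are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def to_range_image(points, n_rings=32, n_az=1800, elev_min=-30.0, elev_max=10.0):
    """Project raw (x, y, z) points onto an elevation x azimuth range tensor.
    The 32-beam geometry and angular spans here are hypothetical."""
    r = np.linalg.norm(points, axis=1)
    az = np.arctan2(points[:, 1], points[:, 0])
    elev = np.degrees(np.arcsin(points[:, 2] / np.maximum(r, 1e-6)))
    col = ((az + np.pi) / (2 * np.pi) * n_az).astype(int) % n_az
    row = np.clip(((elev - elev_min) / (elev_max - elev_min) * n_rings).astype(int),
                  0, n_rings - 1)
    img = np.zeros((n_rings, n_az))
    img[row, col] = r                # one range value per cell (last return wins)
    return img

class RangeGMM:
    """Per-cell K-mode Gaussian mixture over range, updated online.
    Cells with no return (range 0) should be masked in practice."""
    def __init__(self, shape, k=3, lr=0.02, var0=4.0, thresh=2.5):
        self.mu = np.random.uniform(1.0, 80.0, shape + (k,))
        self.var = np.full(shape + (k,), var0)
        self.w = np.full(shape + (k,), 1.0 / k)
        self.lr, self.thresh = lr, thresh

    def step(self, img):
        """Update the mixture with one frame; return the foreground mask."""
        d2 = (img[..., None] - self.mu) ** 2
        matched = d2 < self.thresh ** 2 * self.var
        best = np.argmax(matched * self.w, axis=-1)[..., None]
        hit = np.take_along_axis(matched, best, -1)    # any mode explains the cell?
        onehot = np.zeros_like(self.w)
        np.put_along_axis(onehot, best, 1.0, -1)
        m = onehot * hit                               # matched-mode indicator
        self.w = (1 - self.lr) * self.w + self.lr * m  # matched mode's weight grows
        rho = self.lr * m
        self.mu += rho * (img[..., None] - self.mu)    # update matched mean
        self.var += rho * (d2 - self.var)              # update matched variance
        # cells with no matching mode: re-seed their weakest mode on the observation
        worst = np.argmin(self.w, axis=-1)[..., None]
        old = np.take_along_axis(self.mu, worst, -1)
        np.put_along_axis(self.mu, worst, np.where(hit, old, img[..., None]), -1)
        return ~hit[..., 0]                            # unexplained -> foreground
```

Foreground points are then recovered by mapping flagged cells back to the raw points that fell into them.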
CPGNet: Cascade Point-Grid Fusion Network for Real-Time LiDAR Semantic Segmentation
Authors: Xiaoyan Li, Gang Zhang, Hongyu Pan, Zhenhua Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
LiDAR semantic segmentation, essential for advanced autonomous driving, is required to be accurate, fast, and easy to deploy on mobile platforms. Previous point-based or sparse voxel-based methods are far from real-time applications, since they employ time-consuming neighbor searching or sparse 3D convolutions. Recent 2D projection-based methods, including range-view and multi-view fusion, can run in real time, but suffer from lower accuracy due to information loss during the 2D projection. Besides, to improve performance, previous methods usually adopt test-time augmentation (TTA), which further slows down the inference process. To achieve a better speed-accuracy trade-off, we propose the Cascade Point-Grid Fusion Network (CPGNet), which ensures both effectiveness and efficiency mainly through two techniques: 1) the novel Point-Grid (PG) fusion block extracts semantic features mainly on the 2D projected grid for efficiency, while summarizing both 2D and 3D features on the 3D points for minimal information loss; 2) the proposed transformation consistency loss narrows the gap between single-pass model inference and TTA. Experiments on the SemanticKITTI and nuScenes benchmarks demonstrate that CPGNet, without ensemble models or TTA, is comparable with the state-of-the-art RPVNet, while running 4.7 times faster.
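A rough sketch of what a point-grid fusion step can look like: scatter per-point features onto a BEV grid, run cheap 2D convolutions there, gather the results back per point, and fuse them with a per-point MLP branch. Grid resolution, point-cloud range, and module choices below are illustrative assumptions, not CPGNet's actual design.

```python
import torch
import torch.nn as nn

class PointGridFusion(nn.Module):
    """Scatter points to a BEV grid, convolve in 2D, gather back, fuse per point."""
    def __init__(self, c_in=64, c_out=64, grid=(512, 512), pc_range=50.0):
        super().__init__()
        self.grid, self.pc_range = grid, pc_range
        self.grid_conv = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(),
            nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU())
        self.point_mlp = nn.Sequential(nn.Linear(c_in, c_out), nn.ReLU())
        self.fuse = nn.Linear(2 * c_out, c_out)

    def forward(self, xyz, feats):            # xyz: (N, 3), feats: (N, C)
        H, W = self.grid
        # quantize x, y into BEV grid cells
        u = ((xyz[:, 0] / self.pc_range + 1) / 2 * (W - 1)).long().clamp(0, W - 1)
        v = ((xyz[:, 1] / self.pc_range + 1) / 2 * (H - 1)).long().clamp(0, H - 1)
        flat = v * W + u
        # max-pool scatter of point features into grid cells
        grid = feats.new_zeros(H * W, feats.shape[1])
        grid.index_reduce_(0, flat, feats, reduce="amax", include_self=False)
        grid = self.grid_conv(grid.t().reshape(1, -1, H, W))
        # gather 2D features back per point and fuse with the 3D point branch
        g = grid.reshape(grid.shape[1], H * W).t()[flat]
        return self.fuse(torch.cat([g, self.point_mlp(feats)], dim=-1))
```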
Understanding the Domain Gap in LiDAR Object Detection Networks
Authors: Jasmine Richter, Florian Faion, Di Feng, Paul Benedikt Becker, Piotr Sielecki, Claudius Glaeser
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract
In order to make autonomous driving a reality, artificial neural networks have to work reliably in the open world. However, the open world is vast and continuously changing, so it is not technically feasible to collect and annotate training datasets which accurately represent this domain. Therefore, there are always domain gaps between training datasets and the open world which must be understood. In this work, we investigate the domain gaps between high-resolution and low-resolution LiDAR sensors in object detection networks. Using a unique dataset, which enables us to study sensor resolution domain gaps independently of other effects, we show two distinct domain gaps: an inference domain gap and a training domain gap. The inference domain gap is characterised by a strong dependence on the number of LiDAR points per object, while the training gap shows no such dependence. These findings show that different approaches are required to close the inference and training domain gaps.
Keyword: loop detection
There is no result
Keyword: autonomous driving
SELMA: SEmantic Large-scale Multimodal Acquisitions in Variable Weather, Daytime and Viewpoints
Authors: Paolo Testolina, Francesco Barbato, Umberto Michieli, Marco Giordani, Pietro Zanuttigh, Michele Zorzi
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract
Accurate scene understanding from multiple sensors mounted on cars is a key requirement for autonomous driving systems. Nowadays, this task is mainly performed through data-hungry deep learning techniques that require very large amounts of training data. Due to the high cost of segmentation labeling, many synthetic datasets have been proposed. However, most of them lack the multi-sensor nature of the data, and do not capture the significant changes introduced by the variation of daytime and weather conditions. To fill these gaps, we introduce SELMA, a novel synthetic dataset for semantic segmentation that contains more than 30K unique waypoints acquired from 24 different sensors, including RGB, depth, and semantic cameras and LiDARs, in 27 different atmospheric and daytime conditions, for a total of more than 20M samples. SELMA is based on CARLA, an open-source simulator for generating synthetic data in autonomous driving scenarios, which we modified to increase the variability and diversity of the scenes and class sets, and to align it with other benchmark datasets. As shown by the experimental evaluation, SELMA allows the efficient training of standard and multi-modal deep learning architectures, and achieves remarkable results on real-world data. SELMA is free and publicly available, thus supporting open science and research.
CPGNet: Cascade Point-Grid Fusion Network for Real-Time LiDAR Semantic Segmentation
Authors: Xiaoyan Li, Gang Zhang, Hongyu Pan, Zhenhua Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
LiDAR semantic segmentation, essential for advanced autonomous driving, is required to be accurate, fast, and easy to deploy on mobile platforms. Previous point-based or sparse voxel-based methods are far from real-time applications, since they employ time-consuming neighbor searching or sparse 3D convolutions. Recent 2D projection-based methods, including range-view and multi-view fusion, can run in real time, but suffer from lower accuracy due to information loss during the 2D projection. Besides, to improve performance, previous methods usually adopt test-time augmentation (TTA), which further slows down the inference process. To achieve a better speed-accuracy trade-off, we propose the Cascade Point-Grid Fusion Network (CPGNet), which ensures both effectiveness and efficiency mainly through two techniques: 1) the novel Point-Grid (PG) fusion block extracts semantic features mainly on the 2D projected grid for efficiency, while summarizing both 2D and 3D features on the 3D points for minimal information loss; 2) the proposed transformation consistency loss narrows the gap between single-pass model inference and TTA. Experiments on the SemanticKITTI and nuScenes benchmarks demonstrate that CPGNet, without ensemble models or TTA, is comparable with the state-of-the-art RPVNet, while running 4.7 times faster.
Understanding the Domain Gap in LiDAR Object Detection Networks
Authors: Jasmine Richter, Florian Faion, Di Feng, Paul Benedikt Becker, Piotr Sielecki, Claudius Glaeser
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract
In order to make autonomous driving a reality, artificial neural networks have to work reliably in the open world. However, the open world is vast and continuously changing, so it is not technically feasible to collect and annotate training datasets which accurately represent this domain. Therefore, there are always domain gaps between training datasets and the open world which must be understood. In this work, we investigate the domain gaps between high-resolution and low-resolution LiDAR sensors in object detection networks. Using a unique dataset, which enables us to study sensor resolution domain gaps independently of other effects, we show two distinct domain gaps: an inference domain gap and a training domain gap. The inference domain gap is characterised by a strong dependence on the number of LiDAR points per object, while the training gap shows no such dependence. These findings show that different approaches are required to close the inference and training domain gaps.
Is Neuron Coverage Needed to Make Person Detection More Robust?
Authors: Svetlana Pavlitskaya, Şiyar Yıkmış, J. Marius Zöllner
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract
The growing use of deep neural networks (DNNs) in safety- and security-critical areas like autonomous driving raises the need for their systematic testing. Coverage-guided testing (CGT) is an approach that applies mutation or fuzzing according to a predefined coverage metric to find inputs that cause misbehavior. With the introduction of a neuron coverage metric, CGT has also recently been applied to DNNs. In this work, we apply CGT to the task of person detection in crowded scenes. The proposed pipeline uses YOLOv3 for person detection and includes finding DNN bugs via sampling and mutation, followed by DNN retraining on the updated training set. For a mutated image to count as a bug, we require it to cause a significant performance drop compared to the clean input. In accordance with CGT, we also consider the additional requirement of increased coverage in the bug definition. In order to explore several types of robustness, our approach includes natural image transformations, corruptions, and adversarial examples generated with the Daedalus attack. The proposed framework has uncovered several thousand cases of incorrect DNN behavior. The relative change in mAP performance of the retrained models reached on average between 26.21% and 64.24% for the different robustness types. However, we have found no evidence that the investigated coverage metrics can be advantageously used to improve robustness.
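In outline, a CGT loop of this kind can be sketched as below. The `detector`, `mutate`, and coverage-tracker names are illustrative placeholders (the paper's pipeline uses YOLOv3 and specific mutation operators); the bug criterion combines a mAP drop with a coverage gain, as described above.

```python
import random

class NeuronCoverage:
    """Cumulative threshold-based neuron coverage over (scaled) activations."""
    def __init__(self, n_neurons, threshold=0.25):
        self.fired = [False] * n_neurons
        self.threshold = threshold

    def update(self, acts):
        """Mark newly fired neurons; return True if coverage increased."""
        gained = False
        for i, a in enumerate(acts):
            if a > self.threshold and not self.fired[i]:
                self.fired[i] = True
                gained = True
        return gained

def cgt_loop(seeds, detector, mutate, coverage, map_drop=0.5, budget=1000):
    """Coverage-guided testing skeleton: `detector(img)` returns
    (mAP_on_img, flat_activation_list); `mutate(img)` applies one random
    transformation, corruption, or adversarial perturbation."""
    bugs, queue = [], list(seeds)
    for _ in range(budget):
        img = random.choice(queue)
        mutant = mutate(img)
        clean_map, _ = detector(img)
        mut_map, acts = detector(mutant)
        gained = coverage.update(acts)
        if mut_map < (1 - map_drop) * clean_map and gained:
            bugs.append(mutant)          # bug = large performance drop + new coverage
        if gained:
            queue.append(mutant)         # keep coverage-increasing inputs as seeds
    return bugs
```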
TorchSparse: Efficient Point Cloud Inference Engine
Authors: Haotian Tang, Zhijian Liu, Xiuyu Li, Yujun Lin, Song Han
Abstract
Deep learning on point clouds has received increased attention thanks to its wide applications in AR/VR and autonomous driving. These applications require low latency and high accuracy to provide a real-time user experience and ensure user safety. Unlike conventional dense workloads, the sparse and irregular nature of point clouds poses severe challenges to running sparse CNNs efficiently on general-purpose hardware. Furthermore, existing sparse acceleration techniques for 2D images do not translate to 3D point clouds. In this paper, we introduce TorchSparse, a high-performance point cloud inference engine that accelerates sparse convolution computation on GPUs. TorchSparse directly optimizes the two bottlenecks of sparse convolution: irregular computation and data movement. It applies adaptive matrix multiplication grouping to trade computation for better regularity, achieving 1.4-1.5x speedup for matrix multiplication. It also optimizes data movement by adopting vectorized, quantized, and fused locality-aware memory access, reducing the memory movement cost by 2.7x. Evaluated on seven representative models across three benchmark datasets, TorchSparse achieves 1.6x and 1.5x measured end-to-end speedup over the state-of-the-art MinkowskiEngine and SpConv, respectively.
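For reference, the computation pattern being optimized can be sketched as follows: sparse convolution executes, for each kernel offset, a gather of matched input features, a dense matrix multiplication with that offset's weights, and a scatter-add into the output rows. This is the naive gather-GEMM-scatter pattern with a hypothetical kernel-map layout, not TorchSparse's optimized kernels.

```python
import torch

def sparse_conv_ggs(feats, weights, kmaps, n_out):
    """feats: (N, C_in) input features; weights: (K, C_in, C_out), one matrix
    per kernel offset; kmaps[k]: (input_idx, output_idx) index pairs for
    offset k; n_out: number of output coordinates."""
    out = feats.new_zeros(n_out, weights.shape[-1])
    for k, (in_idx, out_idx) in enumerate(kmaps):
        if in_idx.numel() == 0:
            continue
        # gather matched inputs, GEMM with this offset's weight, scatter-add
        out.index_add_(0, out_idx, feats[in_idx] @ weights[k])
    return out
```

Because per-offset pair counts vary wildly, these GEMMs are irregular; adaptive grouping pads the kernel maps to a few common sizes so several offsets share one batched GEMM, trading redundant FLOPs for regularity.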
Keyword: mapping
Parametric Level-sets Enhanced To Improve Reconstruction (PaLEnTIR)
Authors: Ege Ozsar, Misha Kilmer, Eric Miller, Eric de Sturler, Arvind Saibaba
Subjects: Numerical Analysis (math.NA); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Abstract
In this paper, we consider the restoration and reconstruction of piecewise constant objects in two and three dimensions using PaLEnTIR, a significantly enhanced parametric level-set (PaLS) model relative to the current state of the art. The primary contribution of this paper is a new PaLS formulation which requires only a single level-set function to recover a scene with piecewise constant objects possessing multiple unknown contrasts. Our model offers distinct advantages over current approaches to the multi-contrast, multi-object problem, all of which require multiple level sets and explicit estimation of the contrast magnitudes. Given upper and lower bounds on the contrast, our approach is able to recover objects with any distribution of contrasts and eliminates the need to know either the number of contrasts in a given scene or their values. We provide an iterative process for finding these space-varying contrast limits. Relative to most PaLS methods, which employ radial basis functions (RBFs), our model makes use of non-isotropic basis functions, thereby expanding the class of shapes that a PaLS model of a given complexity can approximate. Finally, PaLEnTIR improves the conditioning of the Jacobian matrix required as part of the parameter identification process, and consequently accelerates the optimization methods, by controlling the magnitude of the PaLS expansion coefficients, fixing the centers of the basis functions, and exploiting the uniqueness of the parametric-to-image mapping provided by the new parameterization. We demonstrate the performance of the new approach using both 2D and 3D variants of X-ray computed tomography, diffuse optical tomography (DOT), denoising, and deconvolution problems. Applications to experimental sparse CT data and simulated data with different types of noise are performed to further validate the proposed method.
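To fix ideas, a generic parametric level-set field with anisotropic bumps can be written as phi(x) = sum_k alpha_k * psi(||A_k (x - chi_k)||), with the object recovered as {x : phi(x) > c}; the anisotropy enters through the per-basis matrices A_k. The sketch below evaluates such a field with a compactly supported Wendland profile, which is one common choice; the exact PaLEnTIR parameterization differs (see the paper).

```python
import numpy as np

def pals_object(grid_pts, alphas, centers, mats, c=0.5):
    """grid_pts: (N, d) evaluation points; alphas: expansion coefficients;
    centers: (K, d) basis centers chi_k; mats: (K, d, d) anisotropy matrices A_k.
    Returns the indicator of the recovered object {phi > c}."""
    def psi(r):                                  # C^2 Wendland-type compact bump
        return np.maximum(1.0 - r, 0.0) ** 4 * (4.0 * r + 1.0)

    phi = np.zeros(len(grid_pts))
    for a, chi, A in zip(alphas, centers, mats):
        r = np.linalg.norm((grid_pts - chi) @ A.T, axis=1)   # anisotropic radius
        phi += a * psi(r)
    return phi > c
```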
Pixel2Mesh++: 3D Mesh Generation and Refinement from Multi-View Images
Abstract
We study the problem of shape generation in 3D mesh representation from a small number of color images with or without camera poses. While many previous works learn to hallucinate the shape directly from priors, we instead improve the shape quality by leveraging cross-view information with a graph convolutional network. Instead of building a direct mapping function from images to 3D shape, our model learns to predict a series of deformations to improve a coarse shape iteratively. Inspired by traditional multiple-view geometry methods, our network samples the nearby area around the initial mesh's vertex locations and infers an optimal deformation using perceptual feature statistics built from multiple input images. Extensive experiments show that our model produces accurate 3D shapes that are not only visually plausible from the input perspectives, but also well aligned to arbitrary viewpoints. With the help of the physically driven architecture, our model also exhibits generalization capability across different semantic categories and numbers of input images. Model analysis experiments show that our model is robust to the quality of the initial mesh and the error of camera pose, and can be combined with a differentiable renderer for test-time optimization.
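The deformation-reasoning step can be pictured as follows: sample hypothesis positions around each vertex, pool perceptual features from every view at each hypothesis, score the hypotheses, and move the vertex toward their soft-argmax. The projection function, feature maps, and scoring MLP below are simplified placeholders, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformationReasoning(nn.Module):
    """Hypothesis sampling around each vertex, scored with multi-view features.
    `project(pts, i)` must map 3D points to [-1, 1]^2 coords in view i."""
    def __init__(self, c_feat=64, radius=0.02):
        super().__init__()
        g = torch.linspace(-radius, radius, 3)
        offs = torch.stack(torch.meshgrid(g, g, g, indexing="ij"), -1).reshape(-1, 3)
        self.register_buffer("offsets", offs)          # 27 hypothesis offsets
        self.score = nn.Sequential(nn.Linear(c_feat, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, verts, feat_maps, project):      # verts: (V, 3)
        V, K = verts.shape[0], self.offsets.shape[0]
        hyp = (verts[:, None, :] + self.offsets[None]).reshape(-1, 3)   # (V*K, 3)
        pooled = 0.0
        for i, fmap in enumerate(feat_maps):           # mean-pool features over views
            uv = project(hyp, i).view(1, -1, 1, 2)
            f = F.grid_sample(fmap[None], uv, align_corners=True)       # (1, C, V*K, 1)
            pooled = pooled + f.view(fmap.shape[0], -1).t() / len(feat_maps)
        w = torch.softmax(self.score(pooled).view(V, K), dim=-1)        # hypothesis weights
        return verts + (w[..., None] * self.offsets[None]).sum(dim=1)   # deformed vertices
```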
STFT-Domain Neural Speech Enhancement with Very Low Algorithmic Latency
Authors: Zhong-Qiu Wang, Gordon Wichern, Shinji Watanabe, Jonathan Le Roux
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Abstract
Deep learning based speech enhancement in the short-term Fourier transform (STFT) domain typically uses a large window length such as 32 ms. A larger window contains more samples and the frequency resolution can be higher for potentially better enhancement. This however incurs an algorithmic latency of 32 ms in an online setup, because the overlap-add algorithm used in the inverse STFT (iSTFT) is also performed based on the same 32 ms window size. To reduce this inherent latency, we adapt a conventional dual window size approach, where a regular input window size is used for STFT but a shorter output window is used for the overlap-add in the iSTFT, for STFT-domain deep learning based frame-online speech enhancement. Based on this STFT and iSTFT configuration, we employ single- or multi-microphone complex spectral mapping for frame-online enhancement, where a deep neural network (DNN) is trained to predict the real and imaginary (RI) components of target speech from the mixture RI components. In addition, we use the RI components predicted by the DNN to conduct frame-online beamforming, the results of which are then used as extra features for a second DNN to perform frame-online post-filtering. The frequency-domain beamforming in between the two DNNs can be easily integrated with complex spectral mapping and is designed to not incur any algorithmic latency. Additionally, we propose a future-frame prediction technique to further reduce the algorithmic latency. Evaluation results on a noisy-reverberant speech enhancement task demonstrate the effectiveness of the proposed algorithms. Compared with Conv-TasNet, our STFT-domain system can achieve better enhancement performance for a comparable amount of computation, or comparable performance with less computation, maintaining strong performance at an algorithmic latency as low as 2 ms.
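As a toy illustration of the dual-window idea, the sketch below performs overlap-add synthesis in which a short output window is applied to the tail of each inverse-transformed frame, so the algorithmic latency is governed by the output window rather than the 32 ms analysis window. Window design and normalization are simplified relative to the paper.

```python
import numpy as np

def dual_window_istft(spec, win_a, win_s, hop):
    """spec: (n_frames, n_fft//2+1) one-sided spectra computed with analysis
    window `win_a`; each inverse frame contributes only its last len(win_s)
    samples, weighted by the short synthesis window `win_s`."""
    n, ns = len(win_a), len(win_s)
    out = np.zeros(hop * (len(spec) - 1) + n)
    norm = np.zeros_like(out)
    for t, frame_spec in enumerate(spec):
        frame = np.fft.irfft(frame_spec, n)
        start = t * hop + (n - ns)                  # output only the frame's tail
        out[start:start + ns] += frame[n - ns:] * win_s
        norm[start:start + ns] += win_a[n - ns:] * win_s
    return out / np.maximum(norm, 1e-8)             # undo analysis/synthesis windowing

# e.g., 32 ms analysis window, 8 ms output window, 8 ms hop at 16 kHz
fs = 16000
win_a = np.hanning(int(0.032 * fs))
win_s = np.hanning(int(0.008 * fs))
```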
ChildPredictor: A Child Face Prediction Framework with Disentangled Learning
Abstract
The appearances of children are inherited from their parents, which makes it feasible to predict them. Predicting realistic children's faces may help address many social problems, such as age-invariant face recognition, kinship verification, and missing child identification. It can be regarded as an image-to-image translation task. Existing approaches usually assume that domain information in image-to-image translation can be interpreted as "style", i.e., the separation of image content and style. However, such a separation is improper for child face prediction, because the facial contours of children and parents are not the same. To address this issue, we propose a new disentangled learning strategy for children's face prediction. We assume that children's faces are determined by genetic factors (compact family features, e.g., face contour), external factors (facial attributes irrelevant to prediction, such as moustaches and glasses), and variety factors (individual properties of each child). On this basis, we formulate prediction as a mapping from parents' genetic factors to children's genetic factors, and disentangle them from external and variety factors. In order to obtain accurate genetic factors and perform the mapping, we propose the ChildPredictor framework. It transfers human faces to genetic factors by encoders and back by generators. Then, it learns the relationship between the genetic factors of parents and children through a mapping function. To ensure the generated faces are realistic, we collect a large Family Face Database to train ChildPredictor and evaluate it on the FF-Database validation set. Experimental results demonstrate that ChildPredictor is superior to other well-known image-to-image translation methods in predicting realistic and diverse child faces. Implementation codes can be found at https://github.com/zhaoyuzhi/ChildPredictor.
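The core factor-to-factor step can be sketched as a mapping in latent space: encode each parent to a genetic code, map the pair to a child genetic code, and decode it together with sampled external and variety factors. All dimensions and modules below are illustrative placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

class GeneticMapping(nn.Module):
    """Latent-space step: parents' genetic codes -> child genetic code -> face."""
    def __init__(self, d_gen=128, d_ext=16, d_var=16):
        super().__init__()
        self.mapping = nn.Sequential(
            nn.Linear(2 * d_gen, 256), nn.ReLU(), nn.Linear(256, d_gen))
        self.d_ext, self.d_var = d_ext, d_var

    def forward(self, z_father, z_mother, decode):
        """z_* are genetic codes from the face encoders; `decode` stands in
        for the generator mapping factors back to a face image."""
        z_child = self.mapping(torch.cat([z_father, z_mother], dim=-1))
        b = z_child.shape[0]
        ext = torch.randn(b, self.d_ext, device=z_child.device)  # e.g., glasses
        var = torch.randn(b, self.d_var, device=z_child.device)  # per-child variety
        return decode(torch.cat([z_child, ext, var], dim=-1))
```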
NAND-SPIN-Based Processing-in-MRAM Architecture for Convolutional Neural Network Acceleration
Abstract
The performance and efficiency of running large-scale datasets on traditional computing systems exhibit critical bottlenecks due to the existing "power wall" and "memory wall" problems. To resolve those problems, processing-in-memory (PIM) architectures are developed to bring computation logic in or near memory to alleviate the bandwidth limitations during data transmission. NAND-like spintronics memory (NAND-SPIN) is one kind of promising magnetoresistive random-access memory (MRAM) with low write energy and high integration density, and it can be employed to perform efficient in-memory computation operations. In this work, we propose a NAND-SPIN-based PIM architecture for efficient convolutional neural network (CNN) acceleration. A straightforward data mapping scheme is exploited to improve the parallelism while reducing data movements. Benefiting from the excellent characteristics of NAND-SPIN and in-memory processing architecture, experimental results show that the proposed approach can achieve ~2.6x speedup and ~1.4x improvement in energy efficiency over state-of-the-art PIM solutions.
Resilient robot teams: a review integrating decentralised control, change-detection, and learning
Authors: David M. Bossens, Sarvapali Ramchurn, Danesh Tarapore
Abstract
Purpose of review: This paper reviews opportunities and challenges for decentralised control, change-detection, and learning in the context of resilient robot teams. Recent findings: Exogenous fault detection methods can provide a generic detection or a specific diagnosis with a recovery solution. Robot teams can perform active and distributed sensing for detecting changes in the environment, including identifying and tracking dynamic anomalies, as well as collaboratively mapping dynamic environments. Resilient methods for decentralised control have been developed in learning perception-action-communication loops, multi-agent reinforcement learning, embodied evolution, offline evolution with online adaptation, explicit task allocation, and stigmergy in swarm robotics. Summary: Remaining challenges for resilient robot teams are integrating change-detection and trial-and-error learning methods, obtaining reliable performance evaluations under constrained evaluation time, improving the safety of resilient robot teams, theoretical results demonstrating rapid adaptation to given environmental perturbations, and designing realistic and compelling case studies.
OSSO: Obtaining Skeletal Shape from Outside
Authors: Marilyn Keller, Silvia Zuffi, Michael J. Black, Sergi Pujades
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
We address the problem of inferring the anatomic skeleton of a person, in an arbitrary pose, from the 3D surface of the body; i.e. we predict the inside (bones) from the outside (skin). This has many applications in medicine and biomechanics. Existing state-of-the-art biomechanical skeletons are detailed but do not easily generalize to new subjects. Additionally, computer vision and graphics methods that predict skeletons are typically heuristic, not learned from data, do not leverage the full 3D body surface, and are not validated against ground truth. To our knowledge, our system, called OSSO (Obtaining Skeletal Shape from Outside), is the first to learn the mapping from the 3D body surface to the internal skeleton from real data. We do so using 1000 male and 1000 female dual-energy X-ray absorptiometry (DXA) scans. To these, we fit a parametric 3D body shape model (STAR) to capture the body surface and a novel part-based 3D skeleton model to capture the bones. This provides inside/outside training pairs. We model the statistical variation of full skeletons using PCA in a pose-normalized space. We then train a regressor from body shape parameters to skeleton shape parameters and refine the skeleton to satisfy constraints on physical plausibility. Given an arbitrary 3D body shape and pose, OSSO predicts a realistic skeleton inside. In contrast to previous work, we evaluate the accuracy of the skeleton shape quantitatively on held-out DXA scans, outperforming the state-of-the-art. We also show 3D skeleton prediction from varied and challenging 3D bodies. The code to infer a skeleton from a body shape is available for research at https://osso.is.tue.mpg.de/, and the dataset of paired outer surface (skin) and skeleton (bone) meshes is available as a Biobank Returned Dataset. This research has been conducted using the UK Biobank Resource.
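The statistical core of such a pipeline can be sketched in a few lines: PCA over pose-normalized skeleton meshes, then a regressor from body-shape parameters to skeleton-shape parameters. Array names, file names, and sizes below are hypothetical; OSSO's full pipeline additionally fits STAR, a part-based skeleton model, and a plausibility refinement.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge

# hypothetical training arrays: paired body-shape params and skeleton meshes
body_betas = np.load("body_shape_params.npy")    # (n_subjects, n_betas)
skel_verts = np.load("skeleton_vertices.npy")    # (n_subjects, n_verts * 3)

pca = PCA(n_components=40).fit(skel_verts)       # skeleton shape space
skel_coeffs = pca.transform(skel_verts)

reg = Ridge(alpha=1.0).fit(body_betas, skel_coeffs)  # body -> skeleton mapping

def predict_skeleton(betas):
    """Predict pose-normalized skeleton vertices from body shape betas."""
    return pca.inverse_transform(reg.predict(betas[None]))[0].reshape(-1, 3)
```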
Analysis of Ferroelectric Negative Capacitance-Hybrid MEMS Actuator Using Energy-Displacement Landscape
Abstract
We propose an energy-based framework to analyze the statics and dynamics of a ferroelectric negative capacitance-hybrid Microelectromechanical System (MEMS) actuator. A mapping function that relates the charge on the ferroelectric to the displacement of the movable electrode is used to obtain the Hamiltonian of the hybrid actuator in terms of displacement. We then use graphical energy-displacement and phase-portrait plots to analyze the static pull-in, dynamic pull-in, and pull-out phenomena of the hybrid actuator. Using these, we illustrate the low-voltage operation of the hybrid actuator under static and step inputs, as compared to the standalone MEMS actuator. The results obtained are in agreement with analytical predictions and numerical simulations. The proposed framework enables straightforward inclusion of adhesion between the contacting surfaces, modeled using the van der Waals force. We show that the pull-in voltage is not affected, while the pull-out voltage is reduced due to adhesion. The proposed framework provides a physics-based tool to design and analyze negative capacitance-based low-voltage MEMS actuators.
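For intuition, the same energy-displacement reasoning applied to a plain parallel-plate electrostatic actuator (without the negative-capacitance stage) looks like the sketch below: the potential U(x) = (1/2)kx^2 - (1/2)C(x)V^2 with C(x) = eps*A/(g - x) loses its stable minimum at pull-in (x = g/3 in this classic model, near 5.8 V for the toy parameters chosen here). All parameter values are illustrative only.

```python
import numpy as np

eps, A, g, k = 8.85e-12, 1e-8, 1e-6, 10.0   # permittivity, area, gap, stiffness

def energy(x, V):
    """Potential of a voltage-driven parallel-plate actuator at displacement x."""
    return 0.5 * k * x**2 - 0.5 * (eps * A / (g - x)) * V**2

x = np.linspace(0.0, 0.9 * g, 2000)
for V in (2.0, 4.0, 6.0):                   # pull-in is near 5.8 V for these values
    U = energy(x, V)
    interior = (U[1:-1] < U[:-2]) & (U[1:-1] < U[2:])   # discrete local minima
    idx = np.flatnonzero(interior) + 1
    if idx.size:
        print(f"V = {V:.0f} V: stable equilibrium at x/g = {x[idx[0]] / g:.2f}")
    else:
        print(f"V = {V:.0f} V: no local minimum -> pulled in")
```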
Keyword: localization
Referring Expression Comprehension via Cross-Level Multi-Modal Fusion
Authors: Peihan Miao, Wei Su, Lian Wang, Yongjian Fu, Xi Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
As an important and challenging problem in vision-language tasks, referring expression comprehension (REC) aims to localize the target object specified by a given referring expression. Recently, most state-of-the-art REC methods have focused mainly on multi-modal fusion while overlooking the inherent hierarchical information contained in visual and language encoders. Considering that REC requires visual and textual hierarchical information for accurate target localization, and that encoders inherently extract features in a hierarchical fashion, we propose to effectively utilize the rich hierarchical information contained in different layers of visual and language encoders. To this end, we design a Cross-level Multi-modal Fusion (CMF) framework, which gradually integrates multi-layer visual and textual features through intra- and inter-modal fusion. Experimental results on the RefCOCO, RefCOCO+, RefCOCOg, and ReferItGame datasets demonstrate that the proposed framework achieves significant performance improvements over state-of-the-art methods.
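One way to picture such gradual cross-level fusion is sketched below: at each level, the visual features are refined intra-modally, attend to the text inter-modally, and the fused result feeds the next level. Module choices and sizes are illustrative, not the paper's CMF design.

```python
import torch
import torch.nn as nn

class CrossLevelFusion(nn.Module):
    """Gradually fuse multi-layer visual features with textual features."""
    def __init__(self, d=256, n_levels=3):
        super().__init__()
        self.intra = nn.ModuleList(
            nn.MultiheadAttention(d, 8, batch_first=True) for _ in range(n_levels))
        self.inter = nn.ModuleList(
            nn.MultiheadAttention(d, 8, batch_first=True) for _ in range(n_levels))

    def forward(self, vis_levels, txt):   # vis_levels: list of (B, N, d); txt: (B, T, d)
        fused = vis_levels[0]
        for lvl, (intra, inter) in enumerate(zip(self.intra, self.inter)):
            v = fused + vis_levels[lvl] if lvl > 0 else fused  # inject this level
            v = v + intra(v, v, v)[0]                          # intra-modal refinement
            fused = v + inter(v, txt, txt)[0]                  # vision attends to text
        return fused                                           # to localization head
```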
Transformer-Guided Convolutional Neural Network for Cross-View Geolocalization
Authors: Teng Wang, Shujuan Fan, Daikun Liu, Changyin Sun
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Ground-to-aerial geolocalization refers to localizing a ground-level query image by matching it to a reference database of geo-tagged aerial imagery. This is very challenging due to the huge differences in visual appearance and geometric configuration between the two views. In this work, we propose a novel Transformer-guided convolutional neural network (TransGCNN) architecture, which couples CNN-based local features with Transformer-based global representations for enhanced representation learning. Specifically, our TransGCNN consists of a CNN backbone extracting a feature map from an input image and a Transformer head modeling global context from the CNN map. In particular, our Transformer head acts as a spatial-aware importance generator to select salient CNN features as the final feature representation. Such a coupling allows us to leverage a lightweight Transformer network to greatly enhance the discriminative capability of the embedded features. Furthermore, we design a dual-branch Transformer head network to combine image features from multi-scale windows in order to improve the details of the global feature representation. Extensive experiments on popular benchmark datasets demonstrate that our model achieves top-1 accuracy of 94.12% and 84.92% on CVUSA and CVACT_val, respectively, outperforming the second-best baseline with less than 50% of the parameters and an almost 2x higher frame rate, therefore achieving a preferable accuracy-efficiency tradeoff.
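The coupling described above can be sketched as follows: a CNN backbone produces a spatial feature map, a small Transformer encoder models global context over its flattened tokens, and a learned importance map re-weights the features before pooling into the embedding. The single-branch head, ResNet backbone, and omitted positional encodings are simplifications, not the actual TransGCNN.

```python
import torch
import torch.nn as nn
import torchvision

class TransformerGuidedCNN(nn.Module):
    """CNN features re-weighted by a Transformer-predicted spatial importance."""
    def __init__(self, d=512, n_layers=2):
        super().__init__()
        backbone = torchvision.models.resnet34(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])  # (B, 512, h, w)
        enc = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc, n_layers)
        self.importance = nn.Linear(d, 1)

    def forward(self, img):
        fmap = self.cnn(img)                             # (B, 512, h, w)
        tokens = fmap.flatten(2).transpose(1, 2)         # (B, h*w, 512) tokens
        ctx = self.transformer(tokens)                   # global context (no pos. enc. here)
        w = torch.softmax(self.importance(ctx), dim=1)   # spatial importance map
        return (tokens * w).sum(dim=1)                   # (B, 512) image embedding
```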
Time Window Frechet and Metric-Based Edit Distance for Passively Collected Trajectories
Abstract
Advances in modern localization techniques and the widespread use of mobile devices have provided great opportunities to collect and mine human mobility trajectories. In this work, we focus on passively collected trajectories, which are sequences of time-stamped locations that mobile entities visit. A crucial component of analysing such trajectories is a measure of similarity between two trajectories. We propose the time-window Frechet distance, which enforces a maximum temporal separation between points of two trajectories that can be paired in the calculation of the Frechet distance, and the metric-based edit distance, which incorporates the underlying metric in the computation of the insertion and deletion costs. Using these measures, we can cluster trajectories to infer group motion patterns. We look at the k-gather problem, which requires each cluster to have at least k trajectories. We prove that k-gather remains NP-hard under the edit distance, the metric-based edit distance, and the Jaccard distance. Finally, we improve over previous results on the discrete Frechet distance and show that there is no strongly sub-quadratic time algorithm with approximation factor less than 1.61 in the two-dimensional setting unless SETH fails.
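A straightforward reading of the time-window constraint yields the following dynamic program: the standard discrete Frechet recurrence, except that point pairs whose timestamps differ by more than the window w are infeasible. This is an illustrative sketch of that definition, not the authors' algorithm.

```python
from math import dist, inf

def time_window_frechet(P, Q, w):
    """Discrete Frechet distance where point p_i = (t, x, y) of P may only be
    matched to q_j of Q if |t_i - t_j| <= w. O(|P||Q|) dynamic program."""
    n, m = len(P), len(Q)
    D = [[inf] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            if abs(P[i][0] - Q[j][0]) > w:      # outside the time window
                continue
            d = dist(P[i][1:], Q[j][1:])        # spatial distance of the pair
            if i == 0 and j == 0:
                best = 0.0
            else:
                best = min(D[i-1][j] if i else inf,
                           D[i][j-1] if j else inf,
                           D[i-1][j-1] if i and j else inf)
            D[i][j] = max(d, best)
    return D[-1][-1]                            # inf if no feasible matching exists

# toy usage: two parallel trajectories sampled one second apart, 5 s window
P = [(0, 0.0, 0.0), (1, 1.0, 0.0), (2, 2.0, 0.0)]
Q = [(0, 0.0, 1.0), (1, 1.0, 1.0), (2, 2.0, 1.0)]
print(time_window_frechet(P, Q, w=5))           # -> 1.0
```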
Feature anomaly detection system (FADS) for intelligent manufacturing
Authors: Anthony Garland, Kevin Potter, Matt Smith
Abstract
Anomaly detection is important for industrial automation and part quality assurance, and while humans can easily detect anomalies in components given a few examples, designing a generic automated system that can perform at or above human capability remains a challenge. In this work, we present a simple new anomaly detection algorithm called FADS (feature-based anomaly detection system), which leverages pretrained convolutional neural networks (CNNs) to generate a statistical model of nominal inputs by observing the activation of the convolutional filters. During inference, the system compares the convolutional filter activations of a new input to the statistical model and flags activations that are outside the expected range of values, and therefore likely anomalous. By using a pretrained network, FADS achieves performance similar to or better than other machine learning approaches to anomaly detection while requiring no tuning of the CNN weights. We demonstrate FADS's ability by detecting process parameter changes on a custom dataset of additively manufactured lattices. The FADS localization algorithm shows that textural differences visible on the surface can be used to detect process parameter changes. In addition, we test FADS on benchmark datasets, such as the MVTec Anomaly Detection dataset, and report good results.
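A minimal sketch of this style of detector, assuming an off-the-shelf ImageNet ResNet-18 and a simple per-filter z-score test (the paper's statistical model and backbone choice may differ):

```python
import torch
import torchvision

# pretrained conv trunk, used frozen as a feature extractor
model = torchvision.models.resnet18(weights="IMAGENET1K_V1").eval()
feats = torch.nn.Sequential(*list(model.children())[:-2])

@torch.no_grad()
def filter_stats(x):                          # x: (B, 3, H, W)
    a = feats(x)                              # (B, C, h, w) filter activations
    return a.mean(dim=(2, 3))                 # per-filter mean activation (B, C)

@torch.no_grad()
def fit_nominal(loader):
    """Build the nominal statistical model from defect-free training images."""
    s = torch.cat([filter_stats(x) for x, _ in loader])
    return s.mean(0), s.std(0) + 1e-6         # per-filter mean and std

@torch.no_grad()
def anomaly_score(x, mu, sigma, z=3.0):
    """Count filters whose activation deviates more than z sigmas from nominal."""
    dev = (filter_stats(x) - mu).abs() / sigma
    return (dev > z).sum(dim=1)               # higher -> more anomalous
```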
Keyword: SLAM
There is no result
Keyword: Visual inertial
There is no result
Keyword: livox
There is no result
Keyword: loam
There is no result
Keyword: Visual inertial odometry
There is no result
Keyword: lidar
SELMA: SEmantic Large-scale Multimodal Acquisitions in Variable Weather, Daytime and Viewpoints
Multimodal Gaussian Mixture Model for Realtime Roadside LiDAR Object Detection
CPGNet: Cascade Point-Grid Fusion Network for Real-Time LiDAR Semantic Segmentation
Understanding the Domain Gap in LiDAR Object Detection Networks
Keyword: loop detection
There is no result
Keyword: autonomous driving
SELMA: SEmantic Large-scale Multimodal Acquisitions in Variable Weather, Daytime and Viewpoints
CPGNet: Cascade Point-Grid Fusion Network for Real-Time LiDAR Semantic Segmentation
Understanding the Domain Gap in LiDAR Object Detection Networks
Is Neuron Coverage Needed to Make Person Detection More Robust?
TorchSparse: Efficient Point Cloud Inference Engine
Keyword: mapping
Parametric Level-sets Enhanced To Improve Reconstruction (PaLEnTIR)
Pixel2Mesh++: 3D Mesh Generation and Refinement from Multi-View Images
STFT-Domain Neural Speech Enhancement with Very Low Algorithmic Latency
ChildPredictor: A Child Face Prediction Framework with Disentangled Learning
NAND-SPIN-Based Processing-in-MRAM Architecture for Convolutional Neural Network Acceleration
Resilient robot teams: a review integrating decentralised control, change-detection, and learning
OSSO: Obtaining Skeletal Shape from Outside
Analysis of Ferroelectric Negative Capacitance-Hybrid MEMS Actuator Using Energy-Displacement Landscape
Keyword: localization
Referring Expression Comprehension via Cross-Level Multi-Modal Fusion
Transformer-Guided Convolutional Neural Network for Cross-View Geolocalization
Time Window Frechet and Metric-Based Edit Distance for Passively Collected Trajectories
Feature anomaly detection system (FADS) for intelligent manufacturing