Keyword: nerf
KiloNeuS: Implicit Neural Representations with Real-Time Global Illumination
Abstract
The latest trends in inverse rendering techniques for reconstruction use neural networks to learn 3D representations as neural fields. NeRF-based techniques fit multi-layer perceptrons (MLPs) to a set of training images to estimate a radiance field, which can then be rendered from any virtual camera by means of volume rendering algorithms. Major drawbacks of these representations are the lack of well-defined surfaces and non-interactive rendering times, as wide and deep MLPs must be queried millions of times per frame. Each of these limitations has recently been overcome in isolation, but overcoming both simultaneously opens up new use cases. We present KiloNeuS, a new neural object representation that can be rendered in path-traced scenes at interactive frame rates. KiloNeuS enables the simulation of realistic light interactions between neural and classic primitives in shared scenes, and it demonstrably performs in real-time with plenty of room for future optimizations and extensions.
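To make the rendering cost concrete, here is a minimal sketch of the discrete volume rendering quadrature that NeRF-style methods evaluate along each camera ray; the MLP that would produce the densities and colors is replaced by random values, and all names are illustrative.

```python
import numpy as np

def render_ray(sigmas, colors, deltas):
    """Discrete volume rendering quadrature used by NeRF-style methods.

    sigmas: (N,) densities sampled along the ray
    colors: (N, 3) RGB values at the same samples
    deltas: (N,) distances between consecutive samples
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)            # opacity of each segment
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))  # transmittance
    weights = trans * alphas
    return (weights[:, None] * colors).sum(axis=0)     # accumulated radiance

# Example: 64 samples along one ray
rgb = render_ray(np.random.rand(64), np.random.rand(64, 3), np.full(64, 0.05))
```

Interactive methods like KiloNeuS keep this integral but make each density/color query cheap enough to run inside a path tracer.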
Keyword: mapping
SCIM: Simultaneous Clustering, Inference, and Mapping for Open-World Semantic Scene Understanding
Authors: Hermann Blum, Marcus G. Müller, Abel Gawel, Roland Siegwart, Cesar Cadena
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Abstract
In order to operate in human environments, a robot's semantic perception has to overcome open-world challenges such as novel objects and domain gaps. Autonomous deployment to such environments therefore requires robots to update their knowledge and learn without supervision. We investigate how a robot can autonomously discover novel semantic classes and improve accuracy on known classes when exploring an unknown environment. To this end, we develop a general framework for mapping and clustering that we then use to generate a self-supervised learning signal to update a semantic segmentation model. In particular, we show how clustering parameters can be optimized during deployment and that fusion of multiple observation modalities improves novel object discovery compared to prior work.
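As an illustration of optimizing clustering parameters during deployment, the following sketch grid-searches a DBSCAN radius by silhouette score; the paper's actual clustering objective and parameters are not specified here, so this is a generic stand-in.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

def tune_eps(features, eps_grid=(0.2, 0.5, 1.0, 2.0)):
    """Pick the DBSCAN radius that scores best on the current data,
    so the clustering adapts to the deployment environment."""
    best_eps, best_score = None, -1.0
    for eps in eps_grid:
        labels = DBSCAN(eps=eps, min_samples=5).fit_predict(features)
        if len(set(labels)) < 2:          # degenerate result, skip
            continue
        score = silhouette_score(features, labels)
        if score > best_score:
            best_eps, best_score = eps, score
    return best_eps

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
print(tune_eps(X))
```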
SpA-Former: Transformer image shadow detection and removal via spatial attention
Abstract
In this paper, we propose an end-to-end SpA-Former to recover a shadow-free image from a single shaded image. Unlike traditional methods that require two steps for shadow detection and then shadow removal, SpA-Former unifies these steps into one: a one-stage network that directly learns the mapping function between shadow and shadow-free images without a separate shadow detection step. Thus, SpA-Former is adaptable to real image de-shadowing for shadows projected on different semantic regions. SpA-Former consists of a transformer layer, a series of joint Fourier transform residual blocks, and two-wheel joint spatial attention. The network is able to handle the task while achieving very fast processing efficiency. Our code is released at https://github.com/zhangbaijin/Spatial-Transformer-shadow-removal
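For readers unfamiliar with spatial attention, below is a generic channel-pooled spatial attention gate in PyTorch (in the style of CBAM); it is not SpA-Former's exact two-wheel design, just the basic mechanism of predicting a per-pixel re-weighting mask.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Generic spatial attention gate: pool over channels, predict a per-pixel mask."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                      # x: (B, C, H, W)
        avg = x.mean(dim=1, keepdim=True)      # (B, 1, H, W) channel average
        mx, _ = x.max(dim=1, keepdim=True)     # (B, 1, H, W) channel max
        mask = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * mask                        # re-weight (e.g., shadow) regions

feat = torch.randn(1, 64, 32, 32)
out = SpatialAttention()(feat)
```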
A High Resolution Multi-exposure Stereoscopic Image & Video Database of Natural Scenes
Abstract
Immersive displays such as VR headsets, AR glasses, multiview displays, and free-viewpoint televisions have emerged as a new class of display technologies in recent years, offering a better visual experience and viewer engagement compared to conventional displays. With the evolution of 3D video and display technologies, the consumer market for High Dynamic Range (HDR) cameras and displays is quickly growing. The lack of appropriate experimental data is a critical hindrance for the development of primary research efforts in the field of 3D HDR video technology. The unavailability of a sufficient real-world multi-exposure experimental dataset is also a major bottleneck for HDR imaging research, thereby limiting the quality of experience (QoE) for viewers. In this paper, we introduce a diversified stereoscopic multi-exposure dataset captured within the campus of the Indian Institute of Technology Madras, which is home to diverse flora and fauna. The dataset is captured using a ZED stereoscopic camera and provides intricate scenes of outdoor locations such as gardens, roadside views, festival venues, and buildings, and indoor locations such as academic and residential areas. The dataset accommodates a wide depth range, complex depth structure, complicated object movement, illumination variations, rich color dynamics, and texture discrepancy, in addition to significant randomness introduced by camera and background motion. The dataset is made publicly available to the research community. Furthermore, the procedure for capturing, aligning, and calibrating multi-exposure stereo videos and images is described in detail. Finally, we discuss the progress, challenges, potential use cases, and future research opportunities with respect to HDR imaging, depth estimation, consistent tone mapping, and 3D HDR coding.
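As a usage sketch for multi-exposure data like this, the snippet below fuses one exposure bracket with OpenCV's Mertens exposure fusion; the file paths are placeholders.

```python
import cv2

# Paths are placeholders for one exposure bracket of the same scene.
img_list = [cv2.imread(p) for p in ("low.png", "mid.png", "high.png")]

# Mertens exposure fusion merges the bracket into a single well-exposed
# image without requiring camera response calibration or exposure times.
merge_mertens = cv2.createMergeMertens()
fusion = merge_mertens.process(img_list)          # float image in [0, 1]

cv2.imwrite("fusion.png", (fusion * 255).clip(0, 255).astype("uint8"))
```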
40 Years of Designing Code Comprehension Experiments: A Systematic Mapping Study
Authors: Marvin Wyrich, Justus Bogner, Stefan Wagner
Abstract
The relevance of code comprehension in a developer's daily work was recognized more than 40 years ago. Over the years, several studies have gathered evidence that developers do indeed invest a considerable amount of their daily work in code comprehension. Consequently, many studies were conducted to find out how developers could be supported during code comprehension and which code characteristics contribute to better comprehension. Today, such experiments are more common than ever. While this is great for advancing the field, the number of publications makes it difficult to keep an overview. Additionally, designing rigorous experiments with human participants is a challenging task, and the multitude of design decisions and options can make it difficult for researchers to select a suitable design. We therefore conducted a systematic mapping study of 95 source code comprehension experiments published between 1979 and 2019. By systematically structuring the design characteristics of code comprehension studies, we provide a basis for subsequent discussion of the huge diversity of design options in the face of a lack of basic research on their consequences and comparability. We describe what topics have been studied, as well as how these studies have been designed, conducted, and reported. Frequently chosen design options and deficiencies are pointed out. We conclude with five concrete action items that we as a research community should address moving forward to improve publications of code comprehension experiments.
Multi-party Secure Broad Learning System for Privacy Preserving
Authors: Xiao-Kai Cao, Chang-Dong Wang, Jian-Huang Lai, Qiong Huang, C. L. Philip Chen
Abstract
Multi-party learning is an indispensable technique for improving learning performance by integrating data from multiple parties. Unfortunately, directly integrating multi-party data does not meet privacy-preserving requirements. Privacy-Preserving Machine Learning (PPML) has therefore become a key research task in multi-party learning. In this paper, we present a new PPML method based on a secure multi-party interactive protocol, namely the Multi-party Secure Broad Learning System (MSBLS), and derive a security analysis of the method. Existing PPML methods generally cannot simultaneously meet multiple requirements such as security, accuracy, efficiency, and application scope, whereas MSBLS achieves satisfactory results in all of these aspects. It uses an interactive protocol and random mapping to generate mapped features of the data, and then uses efficient broad learning to train a neural network classifier. This is the first privacy-computing method that combines secure multi-party computation and neural networks. Theoretically, the method ensures that model accuracy is not reduced by encryption, and computation remains very fast. We verify this conclusion on three classical datasets.
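To show the non-secure core that MSBLS builds on, here is a minimal plain broad learning system in NumPy: random feature mapping, enhancement nodes, and output weights solved in closed form by ridge regression. The interactive encryption protocol itself is not reproduced, and all sizes are illustrative.

```python
import numpy as np

def bls_train(X, Y, n_map=10, n_enh=50, lam=1e-3, seed=0):
    """Minimal broad learning: random mapped features + enhancement nodes,
    output weights solved in closed form by ridge regression."""
    rng = np.random.default_rng(seed)
    Wm = rng.normal(size=(X.shape[1], n_map))
    Z = np.tanh(X @ Wm)                       # random feature mapping
    We = rng.normal(size=(n_map, n_enh))
    H = np.tanh(Z @ We)                       # enhancement nodes
    A = np.hstack([Z, H])
    W = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ Y)
    return Wm, We, W

X = np.random.rand(200, 8)
Y = np.eye(3)[np.random.randint(0, 3, 200)]   # one-hot labels, 3 classes
Wm, We, W = bls_train(X, Y)
```

The closed-form solve is what makes training fast; MSBLS's contribution is computing the mapped features under a secure two-party protocol so no party sees the other's raw data.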
Attack Techniques and Threat Identification for Vulnerabilities
Authors: Constantin Adam, Muhammed Fatih Bulut, Daby Sow, Steven Ocepek, Chris Bedell, Lilian Ngweta
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Abstract
Modern organizations struggle with an insurmountable number of vulnerabilities discovered and reported by their network and application vulnerability scanners. Prioritization and focus therefore become critical, so that limited time is spent on the highest-risk vulnerabilities. In doing this, it is important for these organizations not only to understand the technical descriptions of the vulnerabilities, but also to gain insight into the attackers' perspective. In this work, we use machine learning and natural language processing techniques, as well as several publicly available data sets, to provide an explainable mapping of vulnerabilities to attack techniques and threat actors. This work provides new security intelligence by predicting which attack techniques are most likely to be used to exploit a given vulnerability and which threat actors are most likely to conduct the exploitation. Lack of labeled data and differing vocabularies make mapping vulnerabilities to attack techniques at scale a challenging problem that cannot easily be addressed using supervised or unsupervised (similarity-search) learning techniques. To solve this problem, we first map vulnerabilities to a standard set of common weaknesses, and then map common weaknesses to attack techniques. This approach yields a Mean Reciprocal Rank (MRR) of 0.95, an accuracy comparable with those reported for state-of-the-art systems. Our solution has been deployed to IBM Security X-Force Red Vulnerability Management Services and has been in production since 2021. It helps security practitioners assist customers in managing and prioritizing their vulnerabilities, providing them with an explainable mapping of vulnerabilities to attack techniques and threat actors.
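For reference, Mean Reciprocal Rank averages the reciprocal rank of the first correct prediction over queries; a minimal implementation, with hypothetical ATT&CK-style technique IDs in the example:

```python
def mean_reciprocal_rank(ranked_lists, relevant):
    """MRR over queries: reciprocal rank of the first correct item, averaged."""
    total = 0.0
    for preds, gold in zip(ranked_lists, relevant):
        rr = 0.0
        for rank, item in enumerate(preds, start=1):
            if item in gold:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(ranked_lists)

# Two vulnerabilities, ranked attack-technique predictions vs. ground truth
print(mean_reciprocal_rank([["T1190", "T1059"], ["T1003", "T1190"]],
                           [{"T1190"}, {"T1190"}]))   # (1/1 + 1/2) / 2 = 0.75
```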
Automated Compliance Blueprint Optimization with Artificial Intelligence
Authors: Abdulhamid Adebayo, Daby Sow, Muhammed Fatih Bulut
Subjects: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Abstract
For highly regulated industries such as banking and healthcare, one of the major hindrances to the adoption of cloud computing is compliance with regulatory standards. This is a complex problem due to many regulatory and technical specification (techspec) documents that the companies need to comply with. The critical problem is to establish the mapping between techspecs and regulation controls so that from day one, companies can comply with regulations with minimal effort. We demonstrate the practicality of an approach to automatically analyze regulatory standards using Artificial Intelligence (AI) techniques. We present early results to identify the mapping between techspecs and regulation controls, and discuss challenges that must be overcome for this solution to be fully practical.
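A simple baseline for such techspec-to-regulation mapping is plain lexical similarity; the sketch below uses TF-IDF and cosine similarity with illustrative, made-up control and techspec strings, not the paper's actual AI pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

controls = ["Enable encryption for data at rest",
            "Rotate credentials every 90 days"]
techspecs = ["Enable disk encryption on all volumes",
             "Set password expiry policy to 90 days"]

vec = TfidfVectorizer().fit(controls + techspecs)
sims = cosine_similarity(vec.transform(techspecs), vec.transform(controls))
best = sims.argmax(axis=1)          # most similar control per techspec
for spec, idx in zip(techspecs, best):
    print(spec, "->", controls[idx])
```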
Towards Robust Blind Face Restoration with Codebook Lookup Transformer
Abstract
Blind face restoration is a highly ill-posed problem that often requires auxiliary guidance to 1) improve the mapping from degraded inputs to desired outputs, or 2) complement high-quality details lost in the inputs. In this paper, we demonstrate that a learned discrete codebook prior in a small proxy space largely reduces the uncertainty and ambiguity of restoration mapping by casting blind face restoration as a code prediction task, while providing rich visual atoms for generating high-quality faces. Under this paradigm, we propose a Transformer-based prediction network, named CodeFormer, to model the global composition and context of low-quality faces for code prediction, enabling the discovery of natural faces that closely approximate the target faces even when the inputs are severely degraded. To enhance adaptiveness to different degradations, we also propose a controllable feature transformation module that allows a flexible trade-off between fidelity and quality. Thanks to the expressive codebook prior and global modeling, CodeFormer outperforms the state of the art in both quality and fidelity, showing superior robustness to degradation. Extensive experimental results on synthetic and real-world datasets verify the effectiveness of our method.
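The discrete codebook lookup at the heart of this paradigm can be sketched as nearest-neighbor vector quantization; note that CodeFormer predicts the code indices with a Transformer rather than by distance matching, which is what makes it robust to degraded inputs.

```python
import numpy as np

def codebook_lookup(z, codebook):
    """Quantize continuous features to their nearest codebook entries.

    z: (N, D) encoder features; codebook: (K, D) learned code vectors.
    Returns code indices and the quantized features.
    """
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)   # (N, K)
    idx = d.argmin(axis=1)
    return idx, codebook[idx]

codes = np.random.rand(1024, 256)                  # K=1024 visual atoms
indices, z_q = codebook_lookup(np.random.rand(16, 256), codes)
```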
Keyword: localization
A Feature Memory Rearrangement Network for Visual Inspection of Textured Surface Defects Toward Edge Intelligent Manufacturing
Authors: Haiming Yao, Wenyong Yu, Xue Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract
Recent advances in the industrial inspection of textured surfaces, in the form of visual inspection, have made such inspections possible for efficient, flexible manufacturing systems. We propose an unsupervised feature memory rearrangement network (FMR-Net) to accurately detect various textural defects simultaneously. Consistent with mainstream methods, we adopt the idea of background reconstruction; however, we innovatively utilize artificial synthetic defects to enable the model to recognize anomalies, while traditional wisdom relies only on defect-free samples. First, we employ an encoding module to obtain multiscale features of the textured surface. Subsequently, a contrastive-learning-based memory feature module (CMFM) is proposed to obtain discriminative representations and construct a normal feature memory bank in the latent space, which can be employed as a substitute for defects and a source of fast anomaly scores at the patch level. Next, a novel global feature rearrangement module (GFRM) is proposed to further suppress the reconstruction of residual defects. Finally, a decoding module utilizes the restored features to reconstruct the normal texture background. In addition, to improve inspection performance, a two-phase training strategy is utilized for accurate defect restoration refinement, and we exploit a multimodal inspection method to achieve noise-robust defect localization. We verify our method through extensive experiments and test its practical deployment in collaborative edge-cloud intelligent manufacturing scenarios by means of a multilevel detection method, demonstrating that FMR-Net exhibits state-of-the-art inspection accuracy and shows great potential for use in edge-computing-enabled smart industries.
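The patch-level anomaly scoring against a normal feature memory bank can be illustrated as nearest-neighbor distance in feature space; a minimal NumPy sketch with random stand-in features (the paper's CMFM learns these representations contrastively):

```python
import numpy as np

def patch_anomaly_scores(patch_feats, memory_bank):
    """Score each patch by its distance to the nearest normal feature
    in the memory bank; large distances indicate likely defects."""
    d = ((patch_feats[:, None, :] - memory_bank[None, :, :]) ** 2).sum(-1)
    return np.sqrt(d.min(axis=1))

bank = np.random.rand(500, 128)          # features from defect-free samples
scores = patch_anomaly_scores(np.random.rand(64, 128), bank)
defect_mask = scores > scores.mean() + 3 * scores.std()   # illustrative threshold
```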
Exploration of User Privacy in 802.11 Probe Requests with MAC Address Randomization Using Temporal Pattern Analysis
Authors: Tomas Bravenec, Joaquín Torres-Sospedra, Michael Gould, Tomas Fryza
Subjects: Cryptography and Security (cs.CR); Signal Processing (eess.SP)
Abstract
Wireless networks have become an integral part of our daily lives, and lately there has been increased concern about privacy and protecting the identity of individual users. In this paper we address the evolution of privacy measures in Wi-Fi probe request frames. We focus on the lack of privacy measures before the implementation of MAC address randomization, and on the way anti-tracking measures have evolved throughout the last decade. We do not try to reverse MAC address randomization to recover the real address of the device; instead, we analyse the possibility of further tracking/localization without needing the real MAC address of specific users. To obtain better analysis results, we introduce a temporal pattern matching approach to the identification of devices using randomized MAC addresses.
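One simple ingredient of temporal pattern analysis is grouping probe requests into bursts by inter-arrival time, so burst timing can be compared across randomized MAC addresses; a minimal sketch with an assumed 2-second gap threshold:

```python
def group_bursts(timestamps, gap=2.0):
    """Split probe-request timestamps (seconds) into bursts whenever the
    inter-arrival gap exceeds a threshold; burst timing patterns can then
    be compared across randomized MAC addresses."""
    bursts, current = [], [timestamps[0]]
    for prev, t in zip(timestamps, timestamps[1:]):
        if t - prev > gap:
            bursts.append(current)
            current = []
        current.append(t)
    bursts.append(current)
    return bursts

print(group_bursts([0.0, 0.1, 0.2, 5.0, 5.1, 12.4]))
# [[0.0, 0.1, 0.2], [5.0, 5.1], [12.4]]
```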
Weakly-supervised Action Localization via Hierarchical Mining
Abstract
Weakly-supervised action localization aims to temporally localize and classify action instances in given videos with only video-level categorical labels. The crucial issue for existing weakly-supervised action localization methods is thus the limited supervision available from the weak annotations for precise predictions. In this work, we propose a hierarchical mining strategy at the video level and the snippet level, i.e., hierarchical supervision and hierarchical consistency mining, to maximize the usage of the given annotations and prediction-wise consistency. To this end, a Hierarchical Mining Network (HiM-Net) is proposed. Concretely, it mines hierarchical supervision for classification at two granularities: one is the video-level existence of ground-truth categories, captured by multiple instance learning; the other is the snippet-level non-existence of each negative-labeled category from the perspective of complementary labels, which is optimized by our proposed complementary label learning. As for hierarchical consistency, HiM-Net explores video-level co-action feature similarity and snippet-level foreground-background opposition for discriminative representation learning and consistent foreground-background separation. Specifically, prediction variance is viewed as uncertainty to select pairs with high consensus for the proposed foreground-background collaborative learning. Comprehensive experimental results show that HiM-Net outperforms existing methods on the THUMOS14 and ActivityNet1.3 datasets by large margins by hierarchically mining the supervision and consistency. Code will be available on GitHub.
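A common form of complementary label learning pushes the predicted probability of known-absent categories toward zero via -log(1 - p); the sketch below assumes this form, which may differ in detail from HiM-Net's formulation.

```python
import torch

def complementary_label_loss(probs, comp_labels):
    """For categories known to be absent (complementary labels), push their
    predicted probabilities toward zero: loss = -log(1 - p).

    probs: (T, C) per-snippet class probabilities
    comp_labels: (C,) binary mask, 1 where the video-level label says "absent"
    """
    p_absent = probs[:, comp_labels.bool()]
    return -torch.log1p(-p_absent.clamp(max=1 - 1e-6)).mean()

probs = torch.softmax(torch.randn(100, 20), dim=-1)   # 100 snippets, 20 classes
mask = torch.zeros(20)
mask[5:] = 1                                          # classes 5..19 absent
loss = complementary_label_loss(probs, mask)
```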
Business Document Information Extraction: Towards Practical Benchmarks
Authors: Matyáš Skalický, Štěpán Šimsa, Michal Uřičář, Milan Šulc
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract
Information extraction from semi-structured documents is crucial for frictionless business-to-business (B2B) communication. While machine learning problems related to Document Information Extraction (IE) have been studied for decades, many common problem definitions and benchmarks do not reflect domain-specific aspects and practical needs for automating B2B document communication. We review the landscape of Document IE problems, datasets and benchmarks. We highlight the practical aspects missing in the common definitions and define the Key Information Localization and Extraction (KILE) and Line Item Recognition (LIR) problems. There is a lack of relevant datasets and benchmarks for Document IE on semi-structured business documents as their content is typically legally protected or sensitive. We discuss potential sources of available documents including synthetic data.
Keyword: transformer
CoCoPIE XGen: A Full-Stack AI-Oriented Optimizing Framework
Authors: Xiaofeng Li, Bin Ren, Xipeng Shen, Yanzhi Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Programming Languages (cs.PL)
Abstract
There is a growing demand for shifting the delivery of AI capability from data centers on the cloud to edge or end devices, exemplified by the fast emerging real-time AI-based apps running on smartphones, AR/VR devices, autonomous vehicles, and various IoT devices. The shift has however been seriously hampered by the large growing gap between DNN computing demands and the computing power on edge or end devices. This article presents the design of XGen, an optimizing framework for DNN designed to bridge the gap. XGen takes cross-cutting co-design as its first-order consideration. Its full-stack AI-oriented optimizations consist of a number of innovative optimizations at every layer of the DNN software stack, all designed in a cooperative manner. The unique technology makes XGen able to optimize various DNNs, including those with an extreme depth (e.g., BERT, GPT, other transformers), and generate code that runs several times faster than those from existing DNN frameworks, while delivering the same level of accuracy.
Generative Pretraining for Black-Box Optimization
Abstract
Many problems in science and engineering involve optimizing an expensive black-box function over a high-dimensional space. For such black-box optimization (BBO) problems, we typically assume a small budget for online function evaluations, but also often have access to a fixed, offline dataset for pretraining. Prior approaches seek to utilize the offline data to approximate the function or its inverse but are not sufficiently accurate far from the data distribution. We propose Black-box Optimization Transformer (BOOMER), a generative framework for pretraining black-box optimizers using offline datasets. In BOOMER, we train an autoregressive model to imitate trajectory runs of implicit black-box function optimizers. Since these trajectories are unavailable by default, we develop a simple randomized heuristic to synthesize trajectories by sorting random points from offline data. We show theoretically that this heuristic induces trajectories that mimic transitions from diverse low-fidelity (exploration) to high-fidelity (exploitation) samples. Further, we introduce mechanisms to control the rate at which a trajectory transitions from exploration to exploitation, and use them to generalize outside the offline data at test time. Empirically, we instantiate BOOMER using a causally masked Transformer and evaluate it on Design-Bench, where we achieve the best average rank, outperforming state-of-the-art baselines.
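The randomized trajectory-synthesis heuristic described above can be sketched directly: sample points from the offline dataset and sort them by function value, so the sequence moves from low- to high-fidelity samples (shown here for maximization):

```python
import numpy as np

def synthesize_trajectory(xs, ys, length=32, seed=0):
    """Sample random offline points and sort them by function value so the
    sequence moves from low-fidelity (exploration) to high-fidelity
    (exploitation) samples, as in the paper's randomized heuristic."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(xs), size=length, replace=False)
    order = idx[np.argsort(ys[idx])]        # ascending: trajectory ends near the best
    return xs[order], ys[order]

xs, ys = np.random.rand(1000, 5), np.random.rand(1000)
traj_x, traj_y = synthesize_trajectory(xs, ys)
```

An autoregressive model trained on many such trajectories learns to continue them, i.e., to propose increasingly promising query points at test time.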
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
Authors: Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, Yonghui Wu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract
We present the Pathways Autoregressive Text-to-Image (Parti) model, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge. Parti treats text-to-image generation as a sequence-to-sequence modeling problem, akin to machine translation, with sequences of image tokens as the target outputs rather than text tokens in another language. This strategy can naturally tap into the rich body of prior work on large language models, which have seen continued advances in capabilities and performance through scaling data and model sizes. Our approach is simple: First, Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens. Second, we achieve consistent quality improvements by scaling the encoder-decoder Transformer model up to 20B parameters, with a new state-of-the-art zero-shot FID score of 7.23 and a finetuned FID score of 3.22 on MS-COCO. Our detailed analysis of Localized Narratives as well as PartiPrompts (P2), a new holistic benchmark of over 1600 English prompts, demonstrates the effectiveness of Parti across a wide variety of categories and difficulty aspects. We also explore and highlight limitations of our models in order to define and exemplify key areas of focus for further improvements. See https://parti.research.google/ for high-resolution images.
Parallel Pre-trained Transformers (PPT) for Synthetic Data-based Instance Segmentation
Authors: Ming Li, Jie Wu, Jinhang Cai, Jie Qin, Yuxi Ren, Xuefeng Xiao, Min Zheng, Rui Wang, Xin Pan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract
Recently, synthetic-data-based instance segmentation has become an exceedingly favorable optimization paradigm, since it leverages simulation rendering and physics to generate high-quality image-annotation pairs. In this paper, we propose a Parallel Pre-trained Transformers (PPT) framework to accomplish the synthetic-data-based instance segmentation task. Specifically, we leverage off-the-shelf pre-trained vision Transformers to alleviate the gap between natural and synthetic data, which helps to provide good generalization in the downstream synthetic-data scene with few samples. A Swin-B-based CBNet V2, a Swin-L-based CBNet V2, and a Swin-L-based Uniformer are employed for parallel feature learning, and the results of these three models are fused by a pixel-level non-maximum suppression (NMS) algorithm to obtain more robust results. The experimental results reveal that PPT ranks first in the CVPR2022 AVA Accessibility Vision and Autonomy Challenge, with 65.155% mAP.
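Pixel-level NMS over instance masks can be sketched as score-ordered suppression by mask IoU; the following is a generic implementation, not necessarily the challenge entry's exact fusion rule.

```python
import numpy as np

def mask_nms(masks, scores, iou_thr=0.5):
    """Fuse instance predictions from several models: keep masks in score
    order and suppress any mask overlapping a kept one above the IoU
    threshold (a pixel-level analogue of box NMS)."""
    order = np.argsort(scores)[::-1]
    kept = []
    for i in order:
        m = masks[i]
        if all(((m & masks[j]).sum() / max((m | masks[j]).sum(), 1)) < iou_thr
               for j in kept):
            kept.append(i)
    return kept

masks = np.random.rand(6, 64, 64) > 0.5        # boolean masks, e.g. from 3 models
keep = mask_nms(masks, np.random.rand(6))
```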
NVIDIA-UNIBZ Submission for EPIC-KITCHENS-100 Action Anticipation Challenge 2022
Authors: Tsung-Ming Tai, Oswald Lanz, Giuseppe Fiameni, Yi-Kwan Wong, Sze-Sen Poon, Cheng-Kuang Lee, Ka-Chun Cheung, Simon See
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
In this report, we describe the technical details of our submission to the EPIC-Kitchens-100 action anticipation challenge. Our models, the higher-order recurrent space-time transformer and the message-passing neural network with edge learning, are both recurrent-based architectures that observe only 2.5 seconds of inference context to form the action anticipation prediction. By averaging the prediction scores from a set of models compiled with our proposed training pipeline, we achieved strong performance on the test set: 19.61% overall mean top-5 recall, recorded as second place on the public leaderboard.
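For context, class-mean top-5 recall averages, over classes, the fraction of instances whose ground truth appears among the top five predictions; a minimal implementation:

```python
import numpy as np

def mean_topk_recall(scores, labels, k=5):
    """Class-mean top-k recall: for each class, the fraction of its instances
    whose ground truth appears in the top-k predictions, averaged over classes."""
    topk = np.argsort(scores, axis=1)[:, -k:]           # (N, k) top-k class indices
    hits = (topk == labels[:, None]).any(axis=1)
    classes = np.unique(labels)
    return np.mean([hits[labels == c].mean() for c in classes])

scores = np.random.rand(100, 97)        # e.g., 97 verb classes
labels = np.random.randint(0, 97, 100)
print(mean_topk_recall(scores, labels))
```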
Feature Re-calibration based MIL for Whole Slide Image Classification
Authors: Philip Chikontwe, Soo Jeong Nam, Heounjeong Go, Meejeong Kim, Hyun Jung Sung, Sang Hyun Park
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Whole slide image (WSI) classification is a fundamental task for the diagnosis and treatment of diseases, but curation of accurate labels is time-consuming and limits the application of fully-supervised methods. To address this, multiple instance learning (MIL) is a popular method that poses classification as a weakly supervised learning task with slide-level labels only. While current MIL methods apply variants of the attention mechanism to re-weight instance features with stronger models, scant attention is paid to the properties of the data distribution. In this work, we propose to re-calibrate the distribution of a WSI bag (instances) using the statistics of the max-instance (critical) feature. We assume that in binary MIL, positive bags have larger feature magnitudes than negatives; we can thus enforce the model to maximize the discrepancy between bags with a metric feature loss that models positive bags as out-of-distribution. To achieve this, unlike existing MIL methods that use single-batch training modes, we propose balanced-batch sampling to effectively use the feature loss, i.e., (+/-) bags are sampled simultaneously. Further, we employ a position encoding module (PEM) to model spatial/morphological information, and perform pooling by multi-head self-attention (PSMA) with a Transformer encoder. Experimental results on existing benchmark datasets show our approach is effective and improves over state-of-the-art MIL methods.
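A hedged reading of the max-instance re-calibration idea: pick the bag's largest-magnitude (critical) instance and normalize the bag by its statistics. The exact re-calibration operator in the paper may differ; this sketch only fixes the shape of the computation.

```python
import torch

def recalibrate_bag(instances, eps=1e-6):
    """Re-calibrate a WSI bag by the statistics of its max-norm (critical)
    instance: an illustrative sketch assuming mean/std normalization."""
    norms = instances.norm(dim=1)                 # (N,) per-instance magnitudes
    crit = instances[norms.argmax()]              # critical instance feature
    return (instances - crit.mean()) / (crit.std() + eps)

bag = torch.randn(500, 512)       # 500 patch features from one slide
bag_recal = recalibrate_bag(bag)
```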
S2TNet: Spatio-Temporal Transformer Networks for Trajectory Prediction in Autonomous Driving
Authors: Weihuang Chen, Fangfang Wang, Hongbin Sun
Abstract
To safely and rationally participate in dense and heterogeneous traffic, autonomous vehicles need to sufficiently analyze the motion patterns of surrounding traffic agents and accurately predict their future trajectories. This is challenging because the trajectories of traffic agents are influenced not only by the agents themselves but also by their spatial interactions with each other. Previous methods usually rely on the sequential step-by-step processing of Long Short-Term Memory networks (LSTMs) and merely extract the interactions between spatial neighbors for single-type traffic agents. We propose the Spatio-Temporal Transformer Networks (S2TNet), which model the spatio-temporal interactions with a spatio-temporal Transformer and handle the temporal sequences with a temporal Transformer. We feed additional category, shape, and heading information into our networks to handle the heterogeneity of traffic agents. The proposed method outperforms state-of-the-art methods on the ApolloScape Trajectory dataset by more than 7\% on both the weighted sum of Average and Final Displacement Error. Our code is available at https://github.com/chenghuang66/s2tnet.
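The evaluation metrics mentioned, Average and Final Displacement Error, are easy to state precisely:

```python
import numpy as np

def ade_fde(pred, gt):
    """Average and Final Displacement Error for predicted trajectories.

    pred, gt: (N, T, 2) arrays of N agents over T future timesteps.
    ADE averages the per-step error; FDE takes only the last step.
    """
    dist = np.linalg.norm(pred - gt, axis=-1)      # (N, T) per-step distances
    return dist.mean(), dist[:, -1].mean()

pred = np.random.rand(8, 12, 2)
gt = np.random.rand(8, 12, 2)
ade, fde = ade_fde(pred, gt)
```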
SpA-Former: Transformer image shadow detection and removal via spatial attention
Abstract
In this paper, we propose an end-to-end SpA-Former to recover a shadow-free image from a single shaded image. Unlike traditional methods that require two steps for shadow detection and then shadow removal, SpA-Former unifies these steps into one: a one-stage network that directly learns the mapping function between shadow and shadow-free images without a separate shadow detection step. Thus, SpA-Former is adaptable to real image de-shadowing for shadows projected on different semantic regions. SpA-Former consists of a transformer layer, a series of joint Fourier transform residual blocks, and two-wheel joint spatial attention. The network is able to handle the task while achieving very fast processing efficiency. Our code is released at https://github.com/zhangbaijin/Spatial-Transformer-shadow-removal
Polar Parametrization for Vision-based Surround-View 3D Detection
Abstract
3D detection based on a surround-view camera system is a critical technique in autonomous driving. In this work, we present Polar Parametrization for 3D detection, which reformulates position parametrization, velocity decomposition, perception range, label assignment, and loss function in a polar coordinate system. Polar Parametrization establishes explicit associations between image patterns and prediction targets, exploiting the view symmetry of surround-view cameras as an inductive bias to ease optimization and boost performance. Based on Polar Parametrization, we propose a surround-view 3D DEtection TRansformer, named PolarDETR. PolarDETR achieves a promising performance-speed trade-off on different backbone configurations. Besides, PolarDETR ranked 1st on the nuScenes benchmark leaderboard in terms of both 3D detection and 3D tracking at submission time (Mar. 4th, 2022). Code will be released at \url{https://github.com/hustvl/PolarDETR}.
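The core coordinate change behind Polar Parametrization is the Cartesian-to-polar mapping of bird's-eye-view positions relative to the ego vehicle; a round-trip sketch:

```python
import numpy as np

def to_polar(xy):
    """Re-parametrize BEV object centers (x, y) relative to the ego vehicle
    as (range, azimuth), matching the radial symmetry of surround-view rigs."""
    r = np.linalg.norm(xy, axis=-1)
    theta = np.arctan2(xy[..., 1], xy[..., 0])
    return np.stack([r, theta], axis=-1)

def to_cartesian(rt):
    return np.stack([rt[..., 0] * np.cos(rt[..., 1]),
                     rt[..., 0] * np.sin(rt[..., 1])], axis=-1)

centers = np.array([[10.0, 5.0], [-3.0, 7.0]])
assert np.allclose(to_cartesian(to_polar(centers)), centers)
```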
Surgical-VQA: Visual Question Answering in Surgical Scenes using Transformer
Abstract
Visual question answering (VQA) in surgery is largely unexplored. Expert surgeons are scarce and are often overloaded with clinical and academic workloads. This overload often limits their time answering questionnaires from patients, medical students or junior residents related to surgical procedures. At times, students and junior residents also refrain from asking too many questions during classes to reduce disruption. While computer-aided simulators and recording of past surgical procedures have been made available for them to observe and improve their skills, they still hugely rely on medical experts to answer their questions. Having a Surgical-VQA system as a reliable 'second opinion' could act as a backup and ease the load on the medical experts in answering these questions. The lack of annotated medical data and the presence of domain-specific terms has limited the exploration of VQA for surgical procedures. In this work, we design a Surgical-VQA task that answers questionnaires on surgical procedures based on the surgical scene. Extending the MICCAI endoscopic vision challenge 2018 dataset and workflow recognition dataset further, we introduce two Surgical-VQA datasets with classification and sentence-based answers. To perform Surgical-VQA, we employ vision-text transformers models. We further introduce a residual MLP-based VisualBert encoder model that enforces interaction between visual and text tokens, improving performance in classification-based answering. Furthermore, we study the influence of the number of input image patches and temporal visual features on the model performance in both classification and sentence-based answering.
Transformer Neural Networks Attending to Both Sequence and Structure for Protein Prediction Tasks
Abstract
The increasing number of protein sequences decoded from genomes is opening up new avenues of research on linking protein sequence to function with transformer neural networks. Recent research has shown that the number of known protein sequences supports learning useful, task-agnostic sequence representations via transformers. In this paper, we posit that learning joint sequence-structure representations yields better representations for function-related prediction tasks. We propose a transformer neural network that attends to both sequence and tertiary structure. We show that such joint representations are more powerful than sequence-based representations only, and they yield better performance on superfamily membership across various metrics.
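One plausible way to attend to both sequence and tertiary structure is to bias attention logits with pairwise residue distances; the sketch below assumes this additive-bias form, which the paper may implement differently.

```python
import torch

def structure_biased_attention(q, k, v, dist, scale=10.0):
    """Scaled dot-product attention over residue tokens with an additive bias
    from tertiary structure: nearby residues (small pairwise distance)
    receive larger attention logits.

    q, k, v: (N, D) residue embeddings; dist: (N, N) pairwise distances.
    """
    logits = q @ k.T / q.shape[-1] ** 0.5 - dist / scale
    return torch.softmax(logits, dim=-1) @ v

n, d = 64, 32
q = k = v = torch.randn(n, d)
coords = torch.randn(n, 3)                 # stand-in C-alpha coordinates
dist = torch.cdist(coords, coords)
out = structure_biased_attention(q, k, v, dist)
```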
Answer Fast: Accelerating BERT on the Tensor Streaming Processor
Authors: Ibrahim Ahmed, Sahil Parmar, Matthew Boyd, Michael Beidler, Kris Kang, Bill Liu, Kyle Roach, John Kim, Dennis Abts
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Abstract
Transformers have become a predominant machine learning workload: they are not only the de facto standard for natural language processing tasks, but they are also being deployed in other domains such as vision and speech recognition. Many transformer-based applications are real-time systems such as machine translation and web search. These real-time systems often come with strict end-to-end inference latency requirements. Unfortunately, while the majority of the transformer computation comes from matrix multiplications, transformers also include several non-linear components that tend to become the bottleneck during inference. In this work, we accelerate the inference of BERT models on the tensor streaming processor. By carefully fusing all the non-linear components with the matrix multiplication components, we are able to efficiently utilize the on-chip matrix multiplication units, resulting in a deterministic tail latency of 130 $\mu$s for a batch-1 inference through BERT-base, which is 6x faster than the current state of the art.
A Unified and Biologically-Plausible Relational Graph Representation of Vision Transformers
Authors: Yuzhong Chen, Yu Du, Zhenxiang Xiao, Lin Zhao, Lu Zhang, David Weizhong Liu, Dajiang Zhu, Tuo Zhang, Xintao Hu, Tianming Liu, Xi Jiang
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract
Vision transformer (ViT) and its variants have achieved remarkable successes in various visual tasks. The key characteristic of these ViT models is to adopt different aggregation strategies of spatial patch information within the artificial neural networks (ANNs). However, there is still a lack of a unified representation of different ViT architectures for systematic understanding and assessment of model representation performance. Moreover, how those well-performing ViT ANNs are similar to real biological neural networks (BNNs) is largely unexplored. To answer these fundamental questions, we, for the first time, propose a unified and biologically-plausible relational graph representation of ViT models. Specifically, the proposed relational graph representation consists of two key sub-graphs: an aggregation graph and an affine graph. The former considers ViT tokens as nodes and describes their spatial interaction, while the latter regards network channels as nodes and reflects the information communication between channels. Using this unified relational graph representation, we found that: a) a sweet spot of the aggregation graph leads to ViTs with significantly improved predictive performance; b) the graph measures of clustering coefficient and average path length are two effective indicators of model prediction performance, especially when applied to datasets with small samples; c) our findings are consistent across various ViT architectures and multiple datasets; d) the proposed relational graph representation of ViT has high similarity with real BNNs derived from brain science data. Overall, our work provides a novel unified and biologically-plausible paradigm for more interpretable and effective representation of ViT ANNs.
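The two graph measures found predictive, clustering coefficient and average path length, can be computed directly with networkx on any relational graph; the small-world graph below is just a stand-in for a graph extracted from a ViT.

```python
import networkx as nx

# Build a small stand-in relational graph and compute the two measures
# the paper identifies as indicators of model prediction performance.
g = nx.connected_watts_strogatz_graph(n=64, k=6, p=0.3, seed=0)

clustering = nx.average_clustering(g)
path_length = nx.average_shortest_path_length(g)
print(f"clustering coefficient: {clustering:.3f}, "
      f"average path length: {path_length:.3f}")
```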
Traffic-Twitter Transformer: A Nature Language Processing-joined Framework For Network-wide Traffic Forecasting
Authors: Meng-Ju Tsai, Zhiyong Cui, Hao (Frank) Yang, Yinhai Wang
Abstract
With accurate and timely traffic forecasting, impacted traffic conditions can be predicted in advance to guide agencies and residents to respond appropriately to changes in traffic patterns. However, existing work on traffic forecasting has mainly relied on historical traffic patterns, confining it to short-term prediction (under one hour, for instance). To better manage future roadway capacity and accommodate social and human impacts, it is crucial to propose a flexible and comprehensive framework to predict physical-aware long-term traffic conditions for public users and transportation agencies. In this paper, the gap in robust long-term traffic forecasting is bridged by taking social media features into consideration. A correlation study and a linear regression model were first implemented to evaluate the significance of the correlation between two time-series, traffic intensity and Twitter data intensity. The two time-series were then fed into our proposed social-aware framework, Traffic-Twitter Transformer, which integrates natural language representations into time-series records for long-term traffic prediction. Experimental results in the Greater Seattle Area show that our proposed model outperforms baseline models on all evaluation metrics. This NLP-joined social-aware framework can become a valuable tool for network-wide traffic prediction and management by traffic agencies.
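The correlation study and linear regression step can be sketched in a few lines on synthetic stand-in series (the real study uses traffic intensity and Twitter counts from the Greater Seattle Area):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hourly series (illustrative): Twitter post counts and traffic intensity.
rng = np.random.default_rng(0)
tweets = rng.poisson(50, size=500).astype(float)
traffic = 0.8 * tweets + rng.normal(0, 5, size=500)

r = np.corrcoef(traffic, tweets)[0, 1]                 # correlation study
model = LinearRegression().fit(tweets.reshape(-1, 1), traffic)
print(f"Pearson r = {r:.2f}, slope = {model.coef_[0]:.2f}")
```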
Towards Unsupervised Content Disentanglement in Sentence Representations via Syntactic Roles
Authors: Ghazi Felhi, Joseph Le Roux, Djamé Seddah
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract
Linking neural representations to linguistic factors is crucial in order to build and analyze NLP models interpretable by humans. Among these factors, syntactic roles (e.g. subjects, direct objects,$\dots$) and their realizations are essential markers since they can be understood as a decomposition of predicative structures and thus the meaning of sentences. Starting from a deep probabilistic generative model with attention, we measure the interaction between latent variables and realizations of syntactic roles and show that it is possible to obtain, without supervision, representations of sentences where different syntactic roles correspond to clearly identified different latent variables. The probabilistic model we propose is an Attention-Driven Variational Autoencoder (ADVAE). Drawing inspiration from Transformer-based machine translation models, ADVAEs enable the analysis of the interactions between latent variables and input tokens through attention. We also develop an evaluation protocol to measure disentanglement with regard to the realizations of syntactic roles. This protocol is based on attention maxima for the encoder and on latent variable perturbations for the decoder. Our experiments on raw English text from the SNLI dataset show that $\textit{i)}$ disentanglement of syntactic roles can be induced without supervision, $\textit{ii)}$ ADVAE separates syntactic roles better than classical sequence VAEs and Transformer VAEs, $\textit{iii)}$ realizations of syntactic roles can be separately modified in sentences by mere intervention on the associated latent variables. Our work constitutes a first step towards unsupervised controllable content generation. The code for our work is publicly available.
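The encoder-side evaluation protocol based on attention maxima reduces to reading off, for each latent variable, the input token that receives its maximum attention weight; a minimal sketch with random attention weights:

```python
import numpy as np

def latent_to_token(attn):
    """Encoder-side disentanglement probe: for each latent variable, the
    input token receiving its maximum attention weight.

    attn: (n_latents, n_tokens) attention weights from latents to tokens.
    """
    return attn.argmax(axis=1)

attn = np.random.dirichlet(np.ones(12), size=4)   # 4 latents over 12 tokens
print(latent_to_token(attn))   # token index most attended by each latent
```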
Behavior Transformers: Cloning $k$ modes with one stone
Authors: Nur Muhammad Mahi Shafiullah, Zichen Jeff Cui, Ariuntuya Altanzaya, Lerrel Pinto
Abstract
While behavior learning has made impressive progress in recent times, it lags behind computer vision and natural language processing due to its inability to leverage large, human-generated datasets. Human behaviors have wide variance, multiple modes, and human demonstrations typically do not come with reward labels. These properties limit the applicability of current methods in Offline RL and Behavioral Cloning to learn from large, pre-collected datasets. In this work, we present Behavior Transformer (BeT), a new technique to model unlabeled demonstration data with multiple modes. BeT retrofits standard transformer architectures with action discretization coupled with a multi-task action correction inspired by offset prediction in object detection. This allows us to leverage the multi-modal modeling ability of modern transformers to predict multi-modal continuous actions. We experimentally evaluate BeT on a variety of robotic manipulation and self-driving behavior datasets. We show that BeT significantly improves over prior state-of-the-art work on solving demonstrated tasks while capturing the major modes present in the pre-collected datasets. Finally, through an extensive ablation study, we analyze the importance of every crucial component in BeT. Videos of behavior generated by BeT are available at https://notmahi.github.io/bet
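BeT's action discretization can be sketched with k-means: each continuous action becomes a discrete bin plus a continuous offset from the bin center (the transformer heads that predict these two quantities are omitted):

```python
import numpy as np
from sklearn.cluster import KMeans

# Discretize continuous actions into k bins; an action is then represented
# by its bin index plus a continuous offset from the bin center, mirroring
# BeT's action discretization with offset correction.
actions = np.random.rand(10000, 7)                 # e.g., 7-DoF commands
km = KMeans(n_clusters=64, n_init=10, random_state=0).fit(actions)

bins = km.predict(actions)                         # discrete token per action
offsets = actions - km.cluster_centers_[bins]      # residual to regress
reconstructed = km.cluster_centers_[bins] + offsets
assert np.allclose(reconstructed, actions)
```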
Towards Robust Blind Face Restoration with Codebook Lookup Transformer
Abstract
Blind face restoration is a highly ill-posed problem that often requires auxiliary guidance to 1) improve the mapping from degraded inputs to desired outputs, or 2) complement high-quality details lost in the inputs. In this paper, we demonstrate that a learned discrete codebook prior in a small proxy space largely reduces the uncertainty and ambiguity of restoration mapping by casting blind face restoration as a code prediction task, while providing rich visual atoms for generating high-quality faces. Under this paradigm, we propose a Transformer-based prediction network, named CodeFormer, to model the global composition and context of low-quality faces for code prediction, enabling the discovery of natural faces that closely approximate the target faces even when the inputs are severely degraded. To enhance adaptiveness to different degradations, we also propose a controllable feature transformation module that allows a flexible trade-off between fidelity and quality. Thanks to the expressive codebook prior and global modeling, CodeFormer outperforms the state of the art in both quality and fidelity, showing superior robustness to degradation. Extensive experimental results on synthetic and real-world datasets verify the effectiveness of our method.
Keyword: autonomous driving
Multi-level Domain Adaptation for Lane Detection
Abstract
We focus on bridging the domain discrepancy in lane detection among different scenarios to greatly reduce the extra annotation and re-training costs for autonomous driving. A critical factor hindering performance improvement in cross-domain lane detection is that conventional methods only focus on pixel-wise loss while ignoring the shape and position priors of lanes. To address the issue, we propose the Multi-level Domain Adaptation (MLDA) framework, a new perspective for handling cross-domain lane detection at three complementary semantic levels: pixel, instance, and category. Specifically, at the pixel level, we propose applying cross-class confidence constraints in self-training to tackle the imbalanced confidence distribution of lane and background. At the instance level, we go beyond pixels to treat segmented lanes as instances and facilitate discriminative features in the target domain with triplet learning, which effectively rebuilds the semantic context of lanes and helps alleviate feature confusion. At the category level, we propose an adaptive inter-domain embedding module to utilize the position prior of lanes during adaptation. On two challenging datasets, i.e., TuSimple and CULane, our approach improves lane detection performance by a large margin, with gains of 8.8% in accuracy and 7.4% in F1-score respectively, compared with state-of-the-art domain adaptation algorithms.
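The pixel-level cross-class confidence constraint can be read as per-class pseudo-label thresholds in self-training; the sketch below assumes this interpretation, with illustrative threshold values (lower for the rare lane class than for the dominant background).

```python
import numpy as np

def pseudo_labels(probs, thresholds):
    """Self-training with per-class confidence thresholds, so an imbalanced
    confidence distribution does not suppress rare-class pseudo-labels.
    Unthresholded pixels are marked ignore (-1)."""
    conf = probs.max(axis=-1)
    labels = probs.argmax(axis=-1)
    keep = conf >= thresholds[labels]
    return np.where(keep, labels, -1)

probs = np.random.dirichlet([1, 1], size=(64, 64))   # background vs. lane
out = pseudo_labels(probs, thresholds=np.array([0.9, 0.6]))
```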
Fighting Fire with Fire: Avoiding DNN Shortcuts through Priming
Abstract
Across applications spanning supervised classification and sequential control, deep learning has been reported to find "shortcut" solutions that fail catastrophically under minor changes in the data distribution. In this paper, we show empirically that DNNs can be coaxed to avoid poor shortcuts by providing an additional "priming" feature computed from key input features, usually a coarse output estimate. Priming relies on approximate domain knowledge of these task-relevant key input features, which is often easy to obtain in practical settings. For example, one might prioritize recent frames over past frames in a video input for visual imitation learning, or salient foreground over background pixels for image classification. On NICO image classification, MuJoCo continuous control, and CARLA autonomous driving, our priming strategy works significantly better than several popular state-of-the-art approaches for feature selection and data augmentation. We connect these empirical findings to recent theoretical results on DNN optimization, and argue theoretically that priming distracts the optimizer away from poor shortcuts by creating better, simpler shortcuts.
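A minimal sketch of priming, assuming the simplest injection mechanism of concatenating a coarse output estimate to the input features (names and architecture are illustrative, not the paper's exact design):

```python
import torch
import torch.nn as nn

class PrimedModel(nn.Module):
    """Priming sketch: concatenate a coarse output estimate computed from
    key input features to the input, steering the network away from
    shortcut features."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim + out_dim, 128),
                                 nn.ReLU(), nn.Linear(128, out_dim))

    def forward(self, x, coarse_estimate):
        return self.net(torch.cat([x, coarse_estimate], dim=-1))

model = PrimedModel(in_dim=16, out_dim=4)
x = torch.randn(8, 16)
prime = torch.randn(8, 4)    # e.g., a rough prediction from recent frames
y = model(x, prime)
```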
New submissions for Thu, 23 Jun 22
Keyword: SLAM
There is no result
Keyword: odometry
There is no result
Keyword: livox
There is no result
Keyword: loam
There is no result
Keyword: lidar
There is no result
Keyword: loop detection
There is no result
Keyword: nerf
KiloNeuS: Implicit Neural Representations with Real-Time Global Illumination
Keyword: mapping
SCIM: Simultaneous Clustering, Inference, and Mapping for Open-World Semantic Scene Understanding
SpA-Former: Transformer image shadow detection and removal via spatial attention
A High Resolution Multi-exposure Stereoscopic Image & Video Database of Natural Scenes
40 Years of Designing Code Comprehension Experiments: A Systematic Mapping Study
Multi-party Secure Broad Learning System for Privacy Preserving
Attack Techniques and Threat Identification for Vulnerabilities
Automated Compliance Blueprint Optimization with Artificial Intelligence
Towards Robust Blind Face Restoration with Codebook Lookup Transformer
Keyword: localization
A Feature Memory Rearrangement Network for Visual Inspection of Textured Surface Defects Toward Edge Intelligent Manufacturing
Exploration of User Privacy in 802.11 Probe Requests with MAC Address Randomization Using Temporal Pattern Analysis
Weakly-supervised Action Localization via Hierarchical Mining
Business Document Information Extraction: Towards Practical Benchmarks
Keyword: transformer
CoCoPIE XGen: A Full-Stack AI-Oriented Optimizing Framework
Generative Pretraining for Black-Box Optimization
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
Parallel Pre-trained Transformers (PPT) for Synthetic Data-based Instance Segmentation
NVIDIA-UNIBZ Submission for EPIC-KITCHENS-100 Action Anticipation Challenge 2022
Feature Re-calibration based MIL for Whole Slide Image Classification
S2TNet: Spatio-Temporal Transformer Networks for Trajectory Prediction in Autonomous Driving
SpA-Former: Transformer image shadow detection and removal via spatial attention
Polar Parametrization for Vision-based Surround-View 3D Detection
Surgical-VQA: Visual Question Answering in Surgical Scenes using Transformer
Transformer Neural Networks Attending to Both Sequence and Structure for Protein Prediction Tasks
Answer Fast: Accelerating BERT on the Tensor Streaming Processor
A Unified and Biologically-Plausible Relational Graph Representation of Vision Transformers
Traffic-Twitter Transformer: A Nature Language Processing-joined Framework For Network-wide Traffic Forecasting
Towards Unsupervised Content Disentanglement in Sentence Representations via Syntactic Roles
Behavior Transformers: Cloning $k$ modes with one stone
Towards Robust Blind Face Restoration with Codebook Lookup Transformer
Keyword: autonomous driving
Multi-level Domain Adaptation for Lane Detection
Fighting Fire with Fire: Avoiding DNN Shortcuts through Priming