yyf17 opened this issue 1 year ago
CVPR 2022
Audio-Adaptive Activity Recognition Across Video Domains
Wnet: Audio-Guided Video Semantic Segmentation via Wavelet-Based Cross-Modal Denoising Networks
Finding Fallen Objects Via Asynchronous Audio-Visual Integration
Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis
Egocentric Deep Multi-Channel Audio-Visual Active Speaker Localization
Learning to Answer Questions in Dynamic Audio-Visual Scenarios
Audio-visual Generalised Zero-shot Learning with Cross-modal Attention and Language
Self-supervised object detection from audio-visual correspondence
Mix and Localize: Localizing Sound Sources from Mixtures
Self-Supervised Predictive Learning: A Negative-Free Method for Sound Source Localization in Visual Scenes
A Proposal-based Paradigm for Self-supervised Sound Source Localization in Videos
Sound and Visual Representation Learning with Multiple Pretraining Tasks
PoseKernelLifter: Metric Lifting of 3D Human Pose using Sound
Weakly Paired Associative Learning for Sound and Image Representations via Bimodal Associative Memory
Sound-Guided Semantic Image Manipulation
MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound
Continuous Scene Representations for Embodied AI
Interactron: Embodied Adaptive Object Detection
Simple but Effective: CLIP Embeddings for Embodied AI
Learning Embodied Object-Search Strategies from 50k Human Demonstrations
Symmetry-aware Neural Architecture for Embodied Visual Exploration
Bridging the Gap Between Learning in Discrete and Continuous Environments for Vision-and-Language Navigation
Reinforced Structured State-Evolution for Vision-Language Navigation
Online Learning of Reusable Abstract Models for Object Goal Navigation
Zero Experience Required: Plug & Play Modular Transfer Learning for Semantic Visual Navigation
Counterfactual Cycle-Consistent Learning for Instruction Following and Generation in Vision-Language Navigation
Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation
Towards real-world navigation with deep differentiable planners
ADAPT: Vision-Language Navigation with Modality-Aligned Action Prompts
Coupling Vision and Proprioception for Navigation of Legged Robots
Less is More: Generating Grounded Navigation Instructions from Landmarks
What do navigation agents learn about their environment?
HOP: History-and-Order Aware Pre-training for Vision-and-Language Navigation
EnvEdit: Environment Editing for Vision-and-Language Navigation
PONI: Potential Functions for ObjectGoal Navigation with Interaction-free Learning
Is Mapping Necessary for Realistic PointGoal Navigation?
Cross-modal Map Learning for Vision and Language Navigation
One Step at a Time: Long-Horizon Vision-and-Language Navigation with Milestones
Meta Agent Teaming Active Learning for Pose Estimation
HiVT: Hierarchical Vector Transformer for Multi-Agent Motion Prediction
IFOR: Iterative Flow Minimization for Robotic Object Rearrangement
Hire-MLP: Vision MLP via Hierarchical Rearrangement
sound-spaces project: RLR-Audio-Propagation
Audio Sensor