Cascade Transformers for End-to-End Person Search |
Compositional Temporal Grounding with Structured Variational Cross-Graph Correspondence Learning |
Long-Tailed Recognition via Weight Balancing |
InfoGCN: Representation Learning for Human Skeleton-based Action Recognition |
Interactive Geometry Editing of Neural Radiance Fields |
MLSLT: Towards Multilingual Sign Language Translation |
360MonoDepth: High-Resolution 360° Monocular Depth Estimation |
Generating Diverse and Natural 3D Human Motions from Text |
Masked-attention Mask Transformer for Universal Image Segmentation |
Pointly-Supervised Instance Segmentation |
A Closer Look at Few-shot Image Generation |
Learning Local-Global Contextual Adaptation for Multi-Person Pose Estimation |
Neural 3D Scene Reconstruction with the Manhattan-world Assumption |
Masked Autoencoders Are Scalable Vision Learners |
De-rendering 3D Objects in the Wild |
Direct Voxel Grid Optimization: Super-fast Convergence for Radiance Fields Reconstruction |
Finding Badly Drawn Bunnies |
GradViT: Gradient Inversion of Vision Transformers |
On the Importance of Asymmetry for Siamese Representation Learning |
Stacked Hybrid-Attention and Group Collaborative Learning for Unbiased Scene Graph Generation |
Self-Supervised Material and Texture Representation Learning for Remote Sensing Tasks |
Rethinking Efficient Lane Detection via Curve Modeling |
StyleT2I: Toward Compositional and High-Fidelity Text-to-Image Synthesis |
Learning Fair Classifiers with Partially Annotated Group Labels |
Demystifying the Neural Tangent Kernel from a Practical Perspective: Can it be trusted for Neural Architecture Search without training? |
Self-Supervised Pre-Training of Swin Transformers for 3D Medical Image Analysis |
A ConvNet for the 2020s |
Consistent 3D Scene Stylization as Stylized NeRF via 2D-3D Mutual Learning |
Weakly Supervised Semantic Segmentation by Pixel-to-Prototype Contrast |
Connecting the Complementary-view Videos: Joint Camera Identification and Subject Association |
Decoupled Knowledge Distillation |
Maximum Spatial Perturbation Consistency for Unpaired Image-to-Image Translation |
Compound Domain Generalization via Meta-Knowledge Encoding |
Bilateral Video Magnification Filter |
EDTER: Edge Detection with Transformer |
Structure-Aware Motion Transfer with Deformable Anchor Model |
Attentive Fine-Grained Structured Sparsity for Image Restoration |
Sign Language Video Retrieval with Free-Form Textual Queries |
SplitNets: Designing Neural Architectures for Efficient Distributed Computing on Head-Mounted Systems |
Neural Mean Discrepancy for Efficient Out-of-Distribution Detection |
LAKe-Net: Topology-Aware Point Cloud Completion by Localizing Aligned Keypoints |
Focal and Global Knowledge Distillation for Detectors |
Enhancing Adversarial Robustness for Deep Metric Learning |
Novel Class Discovery in Semantic Segmentation |
IDEA-Net: Dynamic 3D Point Cloud Interpolation via Deep Embedding Alignment |
WarpingGAN: Warping Multiple Uniform Priors for Adversarial 3D Point Cloud Generation |
Rethinking Reconstruction Autoencoder-Based Out-of-Distribution Detection |
HyperDet3D: Learning a Scene-Conditioned 3D Object Detector |
Deep Decomposition for Stochastic Normal-Abnormal Transport |
Signing at Scale: Learning to Co-Articulate Signs for Large-Scale Photo-Realistic Sign Language Production |
Self-supervised Video Transformers |
HLRTF: Hierarchical Low-Rank Tensor Factorization for Inverse Problems in Multi-Dimensional Imaging |
φ-SfT: Shape-from-Template with a Physics-based Deformation Model |
Boosting View Synthesis with Residual Transfer |
DINE: Domain Adaptation from Single and Multiple Black-box Predictors |
Occluded Human Mesh Recovery |
Understanding Uncertainty Maps in Vision with Statistical Testing |
Equivariance Allows Handling Multiple Nuisance Variables When Analyzing Pooled Neuroimaging Datasets |
Learning from Pixel-Level Label Noise: A New Perspective for Light Field Salient Object Detection |
Self-Supervised Global-Local Structure Modeling for Point Cloud Domain Adaptation with Reliable Voted Pseudo Labels |
Towards An End-to-End Framework for Flow-Guided Video Inpainting |
E-CIR: Event-Enhanced Continuous Intensity Recovery |
Beyond Cross-view Image Retrieval: Highly Accurate Vehicle Localization using Satellite Image |
Unsupervised Hierarchical Semantic Segmentation with Multiview Cosegmentation and Clustering Transformers |
Forward Propagation, Backward Regression and Pose Association for Hand Tracking in the Wild |
FERV39k: A Large-Scale Multi-Scene Dataset for Facial Expression Recognition in Videos |
Efficient Neural Radiance Fields |
Robust Equivariant Imaging: a fully unsupervised framework for learning to image from noisy and partial measurements |
HumanNeRF: Efficiently Generated Human Radiance Field from Sparse Inputs |
Attributable Visual Similarity Learning |
Efficient Multi-view Stereo by Iterative Dynamic Cost Volume |
Replacing Labeled Real-image Datasets with Auto-generated Contours |
SOMSI: Spherical Novel View Synthesis with Soft Occlusion Multi-Sphere Images |
AutoSDF: Shape Priors for 3D Completion, Reconstruction, and Generation |
MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions |
PIE-Net: Photometric Invariant Edge Guided Network for Intrinsic Image Decomposition |
DST: Dynamic Substitute Training for Data-free Black-box Attack |
HCSC: Hierarchical Contrastive Selective Coding |
Towards Diverse and Natural Scene-aware 3D Human Motion Synthesis |
Inertia-Guided Flow Completion and Style Fusion for Video Inpainting |
PlaneMVS: 3D Plane Reconstruction from Multi-View Stereo |
Ref-NeRF: Structured View-Dependent Appearance for Neural Radiance Fields |
Interactiveness Field of Human-Object Interactions |
Learning Memory-Augmented Unidirectional Metrics for Cross-modality Person Re-identification |
Event-based Video Reconstruction via Potential-assisted Spiking Neural Network |
SIGMA: Semantic-complete Graph Matching for Domain Adaptive Object Detection |
Surface Reconstruction from Point Clouds by Learning Predictive Context Priors |
Active Teacher for Semi-Supervised Object Detection |
Self-Sustaining Representation Expansion for Non-Exemplar Class-Incremental Learning |
RCL: Recurrent Continuous Localization for Temporal Action Detection |
GroupNet: Multiscale Hypergraph Neural Networks for Trajectory Prediction with Relational Reasoning |
SPAMs: Structured Implicit Parametric Models |
A Keypoint-based Global Association Network for Lane Detection |
Weakly Supervised Semantic Segmentation using Out-of-Distribution Data |
BasicVSR++: Improving Video Super-Resolution with Enhanced Propagation and Alignment |
Investigating Tradeoffs in Real-World Video Super-Resolution |
OakInk: A Large-scale Knowledge Repository for Understanding Hand-Object Interaction |
Bending Graphs: Hierarchical Shape Matching using Gated Optimal Transport |
The Norm Must Go On: Dynamic Unsupervised Domain Adaptation by Normalization |
SimT: Handling Open-set Noise for Domain Adaptive Semantic Segmentation |
Language-Bridged Spatial-Temporal Interaction for Referring Video Object Segmentation |
Graph Sampling Based Deep Metric Learning for Generalizable Person Re-Identification |
Stochastic Trajectory Prediction via Motion Indeterminacy Diffusion |
Unbiased Subclass Regularization for Semi-Supervised Semantic Segmentation |
Stratified Transformer for 3D Point Cloud Segmentation |
Cloning Outfits from Real-World Images to 3D Characters for Generalizable Person Re-Identification |
ImplicitAtlas: Learning Deformable Shape Templates in Medical Imaging |
Sparse Instance Activation for Real-Time Instance Segmentation |
Pastiche Master: Exemplar-Based High-Resolution Portrait Style Transfer |
Unsupervised Image-to-Image Translation with Generative Prior |
Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation |
Versatile Multi-Modal Pre-Training for Human-Centric Perception |
Instance-wise Occlusion and Depth Orders in Natural Scenes |
Degradation-agnostic Correspondence from Resolution-asymmetric Stereo |
No Pain, Big Gain: Classify Dynamic Point Cloud Sequences with Static Models by Fitting Feature-level Space-time Surfaces |
Multi-Dimensional with Intensity: A Crowd-sourced Method for Measuring the Perception of Facial Expression |
Class-Incremental Learning with Strong Pretrained Models |
A Patch-centric Error Analysis of Image Super-Resolution |
IFOR: Iterative Flow Minimization for Robotic Object Rearrangement |
3D-aware Image Synthesis via Learning Structural and Textural Representations |
DeeCap: Dynamic Early Exiting for Efficient Image Captioning |
GAN-Supervised Dense Visual Alignment |
Multilayer GAN Inversion and Editing |
On Aliased Resizing and Surprising Subtleties in GAN Evaluation |
Learning Pixel Trajectories with Multiscale Contrastive Random Walks |
Comparing Correspondences: Video Prediction with Correspondence-wise Losses |
Mix and Localize: Localizing Sound Sources from Mixtures |
AziNorm: Exploiting the Radial Symmetry of Point Cloud for Azimuth-Normalized 3D Perception |
Fourier PlenOctrees for Dynamic Radiance Field Rendering in Real-time |
Point Cloud Pre-training with Natural 3D Structures |
Shifting More Attention to Visual Backbone: Query-modulated Refinement Networks for End-to-End Visual Grounding |
Video K-Net: A Simple, Strong, and Unified Baseline for Video Segmentation |
Mr.BiQ: Post-Training Non-Uniform Quantization based on Minimizing the Reconstruction Error |
Drop the GAN: In Defense of Patches Nearest Neighbors as Single Image Generative Models |
MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition |
MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection |
Reversible Vision Transformers |
RigNeRF: Fully Controllable Neural 3D Portraits |
Rethinking Depth Estimation for Multi-View Stereo: A Unified Representation |
Integrative Few-Shot Learning for Classification and Segmentation |
Learning Affordance Grounding from Exocentric Images |
Progressive Attention on Multi-Level Dense Difference Maps for Generic Event Boundary Detection |
Exploring Geometry Consistency for monocular 3D object detection |
Visual Abductive Reasoning |
Putting People in their Place: Monocular Regression of 3D People in Depth |
Exploiting Explainable Metrics for Augmented SGD |
Rethinking Bayesian Deep Learning Methods for Semi-Supervised Volumetric Medical Image Segmentation |
A Hybrid Quantum-Classical Algorithm for Robust Fitting |
Dataset Distillation by Matching Training Trajectories |
DiLiGenT10^2: A Photometric Stereo Benchmark Dataset with Controlled Shape and Material Variation |
Scene Representation Transformer |
ConDor: Self-Supervised Canonicalization of 3D Pose for Partial Shapes |
Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion |
Injecting Visual Concepts into End-to-End Image Captioning |
Learning Neural Light Fields with Ray-Space Embedding Networks |
What's in your hands? 3D Reconstruction of Generic Objects in Hands |
Virtual Correspondences: Human as a Cue for Extreme-View Geometry |
Unsupervised Activity Segmentation by Joint Representation Learning and Online Clustering |
TransRank: Self-supervised Video Representation Learning via Ranking-based Transformation Recognition |
SketchEdit: Mask-Free Local Image Manipulation with Partial Sketches |
GroupViT: Zero-Shot Transfer to Semantic Segmentation with Text Supervision |
LSVC: A Learning-based Stereo Video Compression Framework |
BEHAVE: Dataset and Method for Tracking Human Object Interactions |
Learning to Align Sequential Actions in the Wild |
Motion-from-Blur: 3D Shape and Motion Estimation of Motion-blurred Objects in Videos |
Fixing Malfunctional Objects With Learned Physical Simulation and Functional Prediction |
Simulated Adversarial Testing of Face Recognition Models |
GOAL: Generating 4D Whole-Body Motion for Hand-Object Grasping |
Ensembling Off-the-shelf Models for GAN Training |
Global Tracking Transformers |
Visible-Thermal UAV Tracking: A Large-Scale Benchmark and New Baseline |
Joint Global and Local Hierarchical Priors for Learned Image Compression |
D-Grasp: Physically Plausible Dynamic Grasp Synthesis for Hand-Object Interactions |
Human-Aware Object Placement for Visual Environment Reconstruction |
Dual-path Image Inpainting with Auxiliary GAN Inversion |
Accurate 3D Body Shape Regression using Metric and Semantic Attributes |
BARC: Learning to Regress 3D Dog Shape from Images by Exploiting Breed Information |
Capturing and Inferring Dense Full-Body Human-Scene Contact |
Not All Labels Are Equal: Rationalizing The Labeling Costs for Training Object Detection |
Background Activation Suppression for Weakly Supervised Object Localization |
Attribute Group Editing for Reliable Few-shot Image Generation |
Negative-aware Attention for Image-Text Matching |
Watch It Move: Unsupervised Discovery of 3D Joints for Re-Posing of Articulated Objects |
TransWeather: Transformer-based Restoration of Images Degraded by Adverse Weather Conditions |
HyperTransformer: A Textural and Spectral Feature Fusion Transformer for Pansharpening |
gDNA: Towards Generative Detailed Neural Avatars |
CaDeX: Learning Canonical Deformation Coordinate Space for Dynamic Surface Representation via Neural Homeomorphism |
BACON: Band-limited Coordinate Networks for Multiscale Scene Representation |
Revisiting Near/Remote Sensing with Geospatial Attention |
Simple multi-dataset detection |
Generalizable Cross-modality Medical Image Segmentation via Style Augmentation and Dual Normalization |
Point-to-Voxel Knowledge Distillation for LiDAR Semantic Segmentation |
Online Convolutional Re-parameterization |
Neural Inertial Localization |
MNSRNet: Multimodal Transformer Network for 3D Surface Super-Resolution |
Unsupervised Pre-training for Temporal Action Localization Tasks |
Augmented Geometric Distillation for Data-Free Incremental Person ReID |
HEAT: Holistic Edge Attention Transformer for Structured Reconstruction |
NomMer: Nominate Synergistic Context in Vision Transformer for Visual Recognition |
ContrastMask: Contrastive Learning to Segment Every Thing |
Unified Multivariate Gaussian Mixture for Efficient Neural Image Compression |
CoordGAN: Self-Supervised Dense Correspondences Emerge from GANs |
MAT: Mask-Aware Transformer for Large Hole Image Inpainting |
A Comprehensive Study of End-to-End Temporal Action Detection |
Rethinking Image Cropping: Exploring Diverse Compositions from Global Views |
OcclusionFusion: Occlusion-aware Motion Estimation for Real-time Dynamic 3D Reconstruction |
MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation |
Asynchronous Event-based Graph-Neural Networks |
RAMA: A Rapid Multicut Algorithm on GPU |
EvUnroll: Neuromorphic Events based Rolling Shutter Image Correction |
Cycle-Consistent Counterfactuals by Latent Transformations |
Understanding 3D Object Articulation in Internet Videos |
Synthetic Generation of Face Videos with Plethysmograph Physiology |
MonoJSG: Joint Semantic and Geometric Cost Volume for Monocular 3D Object Detection |
Neural Architecture Search with Representation Mutual Information |
Weakly Supervised Temporal Sentence Grounding with Gaussian-based Contrastive Proposal Learning |
Blind2Unblind: Self-Supervised Image Denoising with Visible Blind Spots |
Semi-Supervised Object Detection via Multi-instance Alignment with Global Class Prototypes |
Fine-Grained Predicates Learning for Scene Graph Generation |
Meta Distribution Alignment for Generalizable Person Re-Identification |
Align Representations with Base: A New Approach to Self-Supervised Learning |
Style-Based Global Appearance Flow for Virtual Try-On |
Learning Semantic Associations for Mirror Detection |
Task Decoupled Framework for Reference-based Super-Resolution |
Beyond Semantic to Instance Segmentation: Weakly-Supervised Instance Segmentation via Semantic Knowledge Transfer and Self-Refinement |
Mask-guided Spectral-wise Transformer for Efficient Hyperspectral Image Reconstruction |
GLAMR: Global Occlusion-Aware Human Mesh Recovery with Dynamic Cameras |
Fast and Unsupervised Action Boundary Detection for Action Segmentation |
Neural MoCon: Neural Motion Control for Physically Plausible Human Motion Capture |
Unified Transformer Tracker for Object Tracking |
NeuralHOFusion: Neural Volumetric Rendering under Human-object Interactions |
H$^2$FA R-CNN: Holistic and Hierarchical Feature Alignment for Cross-domain Weakly Supervised Object Detection |
ICON: Implicit Clothed humans Obtained from Normals |
Semantic-Aware Domain Generalized Segmentation |
ZebraPose: Coarse to Fine Surface Encoding for 6DoF Object Pose Estimation |
Detecting Deepfakes with Self-Blended Images |
Exact Feature Distribution Matching for Arbitrary Style Transfer and Domain Generalization |
FreeSOLO: Learning to Segment Objects without Annotations |
Auditing Privacy Defenses in Federated Learning via Generative Gradient Leakage |
Differentially Private Federated Learning with Local Regularization and Sparsification |
Modeling 3D Layout For Group Re-Identification |
DASO: Distribution-Aware Semantics-Oriented Pseudo-label for Imbalanced Semi-Supervised Learning |
Structured Local Radiance Fields for Human Avatar Modeling |
Contrastive Regression for Domain Adaptation on Gaze Estimation |
Cross-Model Pseudo-Labeling for Semi-Supervised Action Recognition |
Joint Distribution Matters: Deep Brownian Distance Covariance for Few-Shot Classification |
Tree Energy Loss: Towards Sparsely Annotated Semantic Segmentation |
Learning Second Order Local Anomaly for General Face Forgery Detection |
LGT-Net: Indoor Panoramic Room Layout Estimation with Geometry-Aware Transformer Network |
Audio-Adaptive Activity Recognition Across Video Domains |
Towards Robust and Adaptive Motion Forecasting: A Causal Representation Perspective |
Joint Hand Motion and Interaction Hotspots Prediction from Egocentric Videos |
Omnivore: A Single Model for Many Visual Modalities |
Multi-Frame Self-Supervised Depth with Transformers |
Voice-Face Homogeneity Tells Deepfake |
Representation Compensation Networks for Continual Semantic Segmentation |
Bridging the Gap Between Learning in Discrete and Continuous Environments for Vision-and-Language Navigation |
FLAVA: A Foundational Language And Vision Alignment Model |
Vision Prompt Tuning |
Vehicle trajectory prediction works, but not everywhere |
Camera-Conditioned Stable Feature Generation for Isolated Camera Supervised Person Re-IDentification |
ReSTR: Convolution-free Referring Image Segmentation Using Transformers |
DATA: Domain-Aware and Task-Aware Self-supervised Learning |
Sketching without Worrying: Noise-Tolerant Sketch-Based Image Retrieval |
Balanced MSE for Imbalanced Visual Regression |
The Devil Is in the Details: Window-based Attention for Image Compression |
DeltaCNN: End-to-End CNN Inference of Sparse Frame Differences in Videos |
CrossPoint: Self-Supervised Cross-Modal Contrastive Learning for 3D Point Cloud Understanding |
Video Frame Interpolation Transformer |
Open-Vocabulary Instance Segmentation via Robust Cross-Modal Pseudo-Labeling |
LASER: LAtent SpacE Rendering for 2D Visual Localization |
LaTr: Layout-Aware Transformer for Scene-Text VQA |
Universal Photometric Stereo Network using Global Lighting Contexts |
Bootstrapping ViTs: Towards Liberating Vision Transformers from Pre-training |
Stochastic Backpropagation: A Memory Efficient Strategy for Training Video Models |
Bailando: 3D Dance Generation by Actor-Critic GPT with Choreographic Memory |
Multi-View Consistent Generative Adversarial Networks for 3D-aware Image Synthesis |
AdaViT: Adaptive Tokens for Efficient Vision Transformer |
Neural Template: Topology-aware Reconstruction and Disentangled Generation of 3D Meshes |
CRAFT: Cross-Attentional Flow Transformer for Robust Optical Flow |
Decoupling and Recoupling Spatiotemporal Representation for RGB-D-based Motion Recognition |
Cross-Modal Transferable Adversarial Attacks from Images to Videos |
PTTR: Relational 3D Point Cloud Object Tracking with Transformer |
Deformation and Correspondence Aware Unsupervised Synthetic-to-Real Scene Flow Estimation for Point Clouds |
Lifelong Unsupervised Domain Adaptive Person Re-identification with Coordinated Anti-forgetting and Adaptation |
Object Localization under Single Coarse Point Supervision |
Weakly Supervised Temporal Action Localization via Representative Snippet Knowledge Propagation |
TubeDETR: Spatio-Temporal Video Grounding with Transformers |
Reinforced Structured State-Evolution for Vision-Language Navigation |
Learning to Anticipate Future with Dynamic Context Removal |
Learning Program Representations for Food Images and Cooking Recipes |
Transferability Estimation using Bhattacharyya Class Separability |
LiDAR Snowfall Simulation for Robust 3D Object Detection |
Masked Feature Prediction for Vision Self-Supervised Pre-Training |
Unbiased Teacher v2: Semi-supervised Object Detection for Anchor-free and Anchor-based Detectors |
Shape from Polarization for Complex Scenes in the Wild |
PhotoScene: Physically-Based Material and Lighting Transfer for Indoor Scenes |
Node Representation Learning in Graph via Node-to-Neighbourhood Mutual Information Maximization |
Selective-Supervised Contrastive Learning with Noisy Labels |
LAVT: Language-Aware Vision Transformer for Referring Image Segmentation |
L2G: A Simple Local-to-Global Knowledge Transfer Framework for Weakly Supervised Semantic Segmentation |
TransEditor: Transformer-Based Dual-Space GAN for Highly Controllable Facial Editing |
Leveraging Self-Supervision for Cross-Domain Crowd Counting |
Learning from Untrimmed Videos: Self-Supervised Video Representation Learning with Hierarchical Consistency |
TimeReplayer: Unlocking the Potential of Event Cameras for Video Interpolation |
Self-supervised Image-specific Prototype Exploration for Weakly Supervised Semantic Segmentation |
Class-Balanced Pixel-Level Self-Labeling for Domain Adaptive Semantic Segmentation |
Probabilistic Warp Consistency for Weakly-Supervised Semantic Correspondences |
DIFNet: Boosting Visual Information Flow for Image Captioning |
ScaleNet: A Shallow Architecture for Scale Estimation |
HODOR: High-level Object Descriptors for Object Re-segmentation in Video Learned from Static Images |
Density-preserving Deep Point Cloud Compression |
Exploring Dual-task Correlation for Pose Guided Person Image Generation |
Exploring Endogenous Shift for Cross-domain Detection: A Large-scale Benchmark and Perturbation Suppression Network |
Transferability metrics for selecting Source Model Ensembles |
The Auto Arborist Dataset: A Large-Scale Benchmark for Multimodal Urban Forest Monitoring Under Domain Shift |
EPro-PnP: Generalized End-to-End Probabilistic Perspective-n-Points for Monocular Object Pose Estimation |
Lidar-Camera Deep Fusion for Multi-Modal 3D Object Detection |
Learning from Temporal Gradient for Semi-supervised Action Recognition |
JoinABLe: Learning Bottom-up Assembly of Parametric CAD Joints |
DanceTrack: Multi-Object Tracking in Uniform Appearance and Diverse Motion |
Defensive Patches for Robust Recognition in the Physical World |
UniCoRN: A Unified Conditional Image Repainting Network |
APES: Articulated Part Extraction from Sprite Sheets |
Learning Deep Implicit Functions for 3D Shapes with Dynamic Code Clouds |
Neural Rays for Occlusion-aware Image-based Rendering |
DisARM: Displacement Aware Relation Module for 3D Detection |
A Stitch in Time Saves Nine: A Train-Time Regularizing Loss for Improved Neural Network Calibration |
RIM-Net: Recursive Implicit Fields for Unsupervised Learning of Hierarchical Shape Structures |
Weakly Supervised Object Localization as Domain Adaption |
Reflash Dropout in Image Super-Resolution |
Semantic Segmentation by Early Region Proxy |
EyePAD++: A Distillation-based approach for joint Eye Authentication and Presentation Attack Detection using Periocular Images |
Online Learning of Reusable Abstract Models for Object Goal Navigation |
Time Microscope: Event-based Frame Interpolation with Parametric Non-linear Flow and Multi-scale Fusion |
OSOP: A Multi-Stage One Shot Object Pose Estimation Framework |
Localization Distillation for Dense Object Detection |
RegNeRF: Regularizing Neural Radiance Fields for View Synthesis from Sparse Inputs |
Cross-Image Relational Knowledge Distillation for Semantic Segmentation |
Trustworthy Long-tailed Classification |
Episodic Memory Question Answering |
REX: Reasoning-aware and Grounded Explanation |
Query and Attention Augmentation for Knowledge-Based Explainable Reasoning |
LOLNeRF: Learn from One Look |
Bongard-HOI: Benchmarking Few-Shot Visual Reasoning for Human-Object Interactions |
CoNeRF: Controllable Neural Radiance Fields |
Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space |
UnweaveNet: Unweaving Activity Stories |
MeMOT: Multi-Object Tracking with Memory |
VisualHow: Multimodal Problem Solving |
Affine Medical Image Registration with Coarse-to-Fine Vision Transformer |
Unpaired Deep Image Deraining Using Dual Contrastive Learning |
DiRA: Discriminative, Restorative, and Adversarial Learning for Self-supervised Medical Image Analysis |
Mask Transfiner for High-Quality Instance Segmentation |
GLASS: Geometric Latent Augmentation for Shape Spaces |
Global Convergence of MAML and Theory-Inspired Neural Architecture Search for Few-Shot Learning |
Multi-modal Extreme Classification |
CodedVTR: Codebook-Based Sparse Voxel Transformer in Geometric Regions |
Frequency-driven Imperceptible Adversarial Attack on Semantic Similarity |
Learning to Refactor Action and Co-occurrence Features for Temporal Action Localization |
Self-augmented Unpaired Image Dehazing via Density and Depth Decomposition |
QueryDet: Cascaded Sparse Query for Accelerating High-Resolution Small Object Detection |
Cross-modal Representation Learning for Zero-shot Action Recognition |
Nonuniform-to-Uniform Quantization: Towards Accurate Quantization via Generalized Straight-Through Estimation |
AUV-Net: Learning Aligned UV Maps for Texture Transfer and Synthesis |
Bijective Mapping Network for Shadow Removal |
ObjectFormer for Image Manipulation Detection and Localization |
GraFormer: Graph-oriented Transformer for 3D Pose Estimation |
Multi-Granularity Alignment Domain Adaptation for Object Detection |
Adaptive Hierarchical Representation Learning for Long-Tailed Object Detection |
Physical Inertial Poser (PIP): Physics-aware Real-time Human Motion Tracking from Sparse Inertial Sensors |
3D Scene Painting via Semantic Image Synthesis |
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection |
One-bit Active Query with Contrastive Pairs |
HOI4D: A 4D Egocentric Dataset for Category-Level Human-Object Interaction |
Leveraging Object-Level Rotation Equivariance for 3D Object Detection |
DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting |
JIFF: Jointly-aligned Implicit Face Function for High Fidelity Single View Clothed Human Reconstruction |
Prompt Distribution Learning |
CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows |
X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning |
Beyond 3D Siamese Tracking: A Motion-Centric Paradigm for 3D Single Object Tracking in Point Clouds |
Noisy Boundaries: Lemon or Lemonade for Semi-supervised Instance Segmentation? |
Interactive Image Synthesis with Panoptic Layout Generation |
Learning to Find Good Models in RANSAC |
Meta-attention for ViT-backed Continual Learning |
Deep Anomaly Discovery from Unlabeled Videos via Normality Advantage and Self-Paced Refinement |
Improving neural implicit surfaces geometry with patch warping |
Rope3D: Take A New Look from the 3D Roadside Perception Dataset for Autonomous Driving and Monocular 3D Object Detection Task |
AME: Attention and Memory Enhancement in Hyper-Parameter Optimization |
TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation |
Automated Progressive Learning for Efficient Training of Vision Transformers |
Templates for 3D Object Pose Estimation Revisited: Generalization to New Objects and Robustness to Occlusions |
Towards Implicit Text-Guided 3D Shape Generation |
Spatial-Temporal Parallel Transformer for Arm-Hand Dynamic Estimation |
Revisiting skeleton-based action recognition |
Mutual Quantization for Cross-Modal Search with Noisy Labels |
Revisiting Temporal Alignment for Video Restoration |
Learning Multi-View Aggregation In the Wild for Large-Scale 3D Semantic Segmentation |
Assembly101: A Large-Scale Multi-View Video Dataset for Understanding Procedural Activities |
Video Frame Interpolation with Transformer |
Autofocus for Event Cameras |
Event-based Direct Sparse Odometry |
OpenTAL: Towards Open Set Temporal Action Localization |
Programmatic Concept Learning for Human Motion Description and Synthesis |
MAXIM: Multi-Axis MLP for Image Processing |
Temporal Alignment Networks for Long-term Video |
Doodle It Yourself: Class Incremental Learning by Drawing a Few Sketches |
Registering Explicit to Implicit: Towards High-Fidelity Garment mesh Reconstruction from Single Images |
Progressive End-to-End Object Detection in Crowded Scenes |
Object-aware Video-language Pre-training for Retrieval |
Multi-Source Uncertainty Mining for Deep Unsupervised Saliency Detection |
Surface Representation for Point Clouds |
Context-Aware Video Reconstruction for Rolling Shutter Cameras |
MonoScene: Monocular 3D Semantic Scene Completion |
Weakly But Deeply Supervised Occlusion-Reasoned Parametric Road Layouts |
Point Cloud Color Constancy |
HDNet: High-resolution Dual-domain Learning for Spectral Compressive Imaging |
iPLAN: Interactive and Procedural Layout Planning |
End-to-End Multi-Person Pose Estimation with Transformers |
Reading to Listen at the Cocktail Party: Multi-Modal Speech Separation |
Adversarial Eigen Attack on Black-Box Models |
Domain-Aware Representation Learning for Unsupervised Domain Generalization |
Sub-word Level Lip Reading With Visual Attention |
Efficient Video Instance Segmentation via Tracklet Query and Proposal |
Towards cross-modal pose localization from text-based position descriptions |
Opening up Open World Tracking |
Dynamic Clustering Mask Transformers for Panoptic Segmentation |
Compressive Single-Photon 3D Cameras |
Style-ERD: Responsive and Coherent Online Motion Style Transfer |
MixFormer: Mixing Features across Windows and Dimensions |
Robust Image Forgery Detection over Online Social Network Shared Images |
Semantic-aligned Fusion Transformer for One-shot Object Detection |
Long-term Video Frame Interpolation Via Feature Propagation |
Open-Vocabulary One-Stage Detection with Hierarchical Visual-Language Knowledge Distillation |
GEN-VLKT: Simplify Association and Enhance Interaction Understanding for HOI Detection |
ETHSeg: An Amodel Instance Segmentation Network and a Real-world Dataset for X-Ray Waste Inspection |
SEEG: Semantic Energized Co-speech Gesture Generation |
Instance-Dependent Label-Noise Learning With Manifold-Regularized Transition Matrix Estimation |
Acquiring a Dynamic Light Field through a Single-Shot Coded Image |
How many Observations are Enough? Knowledge Distillation for Trajectory Forecasting |
FaceVerse: a Fine-grained and Detail-changeable 3D Neural Face Model from a Hybrid Dataset |
Learning Where to Learn in Cross-View Self-Supervised Learning |
Automatic Relation-aware Graph Network Proliferation |
CoSSL: Co-Learning of Representation and Classifier for Imbalanced Semi-Supervised Learning |
P3Depth: Monocular Depth Estimation with a Piecewise Planarity Prior |
Knowledge Distillation as Efficient Pre-training: Faster Convergence, Higher Data-efficiency, and Better Transferability |
En-Compactness: Self-Distillation Embedding & Contrastive Generation for Generalized Zero-Shot Learning |
Unsupervised Learning of Accurate Siamese Tracking |
Accelerating DETR Convergence via Semantic-Aligned Matching |
Co-advise: Cross Inductive Bias Distillation |
Medial Spectral Coordinates for 3D Shape Analysis |
Coupled Iterative Refinement for 6D Multi-Object Pose Estimation |
DeepCurrents: Learning Implicit Representations of Shapes with Boundaries |
Look Outside the Room: Synthesizing A Consistent Long-Term 3D Scene Video from A Single Image |
Zero Experience Required: Plug & Play Modular Transfer Learning for Semantic Visual Navigation |
Day-to-Night Image Synthesis for Training Nighttime Neural ISPs |
Playable Environments: Video Manipulation in Space and Time |
Unified Contrastive Learning in Image-Text-Label Space |
Many-to-many Splatting for Efficient Video Frame Interpolation |
Uncertainty-Aware Deep Multi-View Photometric Stereo |
Multi-Robot Active Mapping via Neural Bipartite Graph Matching |
Location-free Human Pose Estimation |
Multiview Transformers for Video Recognition |
RIO: Rotation-equivariance supervised learning of robust inertial odometry |
Few Shot Generative Model Adaption via Relaxed Spatial Structural Alignment |
MiniViT: Compressing Vision Transformers with Weight Multiplexing |
Pop-Out Motion: 3D-Aware Image Deformation via Learning Shape Laplacian |
On the Road to Online Adaptation for Semantic Image Segmentation |
Generalized Binary Search Network for Highly-Efficient Multi-View Stereo |
Counterfactual Cycle-Consistent Learning for Instruction Following and Generation in Vision-Language Navigation |
MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens |
Dynamic Kernel Selection for Improved Generalization and Memory Efficiency in Meta-learning |
Regional Semantic Contrast and Aggregation for Weakly Supervised Semantic Segmentation |
DLFormer: Discrete Latent Transformer for Video Inpainting |
Continuous Scene Representations for Embodied AI |
vCLIMB: A Novel Video Class Incremental Learning Benchmark |
NODEO: A Neural Ordinary Differential Equation Based Optimization Framework for Deformable Image Registration |
ONCE-3DLanes: Building Monocular 3D Lane Detection |
ObjectFolder 2.0: A Multisensory Object Dataset for Sim2Real Transfer |
HairMapper: Removing Hair from Portraits Using GANs |
Dist-PU: Positive-Unlabeled Learning from a Label Distribution Perspective |
Diversity Matters: Fully Exploiting Depth Clues for Reliable Monocular 3D Object Detection |
Interactive Multi-Class Tiny-Object Detection |
Generalizable Human Pose Triangulation |
Towards Discriminative Representation: Multi-view Trajectory Contrastive Learning for Online Multi-object Tracking |
A Simple Episodic Linear Probe Improves Visual Recognition in the Wild |
Learning to Learn by Jointly Optimizing Neural Architecture and Weights |
Attribute Surrogates Learning and Spectral Tokens Pooling in Transformers for Few-shot Learning |
Learning Soft Estimator of Keypoint Scale and Orientation with Probabilistic Covariant Loss |
Towards Semi-Supervised Deep Facial Expression Recognition with An Adaptive Confidence Margin |
Cross Domain Object Detection by Target-Perceived Dual Branch Distillation |
Depth-Aware Generative Adversarial Network for Talking Head Video Generation |
OccAM's Laser: Occlusion-based Attribution Maps for 3D Object Detectors on LiDAR Data |
Improving Adversarially Robust Few-shot Image Classification with Generalizable Representations |
DyTox: Transformers for Continual Learning with DYnamic TOken eXpansion |
Stable Long-Term Recurrent Video Super-Resolution |
Evading the Simplicity Bias: Training a Diverse Set of Models Discovers Solutions with Superior OOD Generalization |
SelfD: Self-Learning Large-Scale Driving Policies From the Web |
InstaFormer: Instance-Aware Image-to-Image Translation with Transformer |
AutoGPart: Intermediate Supervision Search for Generalizable 3D Part Segmentation |
GASP, a generalized framework for agglomerative clustering of signed graphs and its application to Instance Segmentation |
Exploring and Evaluating Image Restoration Potential in Dynamic Scenes |
Multi-level Feature Learning for Contrastive Multi-view Clustering |
Commonality in Natural Images Rescues GANs: Pretraining GANs with Generic and Privacy-free Synthetic Data |
Threshold Matters in WSSS: Manipulating the Activation for the Robust and Accurate Segmentation Model Against Thresholds |
StyleSwin: Transformer-based GAN for High-resolution Image Generation |
Semi-Supervised Learning of Semantic Correspondence with Pseudo-Labels |
Divide and Conquer: Compositional Experts for Generalized Novel Class Discovery |
Splicing ViT Features for Semantic Appearance Transfer |
Optimizing Video Prediction via Video Frame Interpolation |
Iterative Corresponding Geometry: Fusing Region and Depth for Highly Efficient 3D Tracking of Textureless Objects |
HARA: A Hierarchical Approach for Robust Rotation Averaging |
Revisiting Weakly Supervised Pre-Training of Visual Perception Models |
Safe-Student for Safe Deep Semi-Supervised Learning with Unseen-Class Unlabeled Data |
PatchFormer: An Efficient Point Transformer with Patch Attention |
Locality-Aware Inter- and Intra-Video Reconstruction for Self-Supervised Correspondence Learning |
Neural Global Shutter: Learn to Restore Video from a Rolling Shutter Camera with Global Reset Feature |
Conditional Prompt Learning for Vision-Language Models |
Stability-driven Contact Reconstruction From Monocular Color Images |
SharpContour: A Contour-based Boundary Refinement Approach for Efficient and Accurate Instance Segmentation |
MSDN: Mutually Semantic Distillation Network for Zero-Shot Learning |
GeneralDepth: Unsupervised Learning of Single-Image Depth Estimation in General Scenes |
Revisiting AP Loss for Dense Object Detection: Adaptive Ranking Pair Selection |
No-Reference Point Cloud Quality Assessment via Domain Adaptation |
DArch: Dental Arch Prior-assisted 3D Tooth Instance Segmentation with Weak Annotations |
Self-Supervised Keypoint Discovery in Behavioral Videos |
Toward Practical Self-Supervised Monocular Indoor Depth Estimation |
Cross-Modal Perceptionist: Can Face Geometry be Gleaned from Voices? |
DPGEN: Differentially Private Generative Energy-Guided Network for Natural Image Synthesis |
Learning the Degradation Distribution for Blind Image Super-Resolution |
ASM-Loc: Action-aware Segment Modeling for Weakly-Supervised Temporal Action Localization |
Exploiting Rigidity Constraints for LiDAR Scene Flow Estimation |
Democracy Does Matter: Comprehensive Feature Mining for Co-Salient Object Detection |
Unsupervised Domain Adaptation for Nighttime Aerial Tracking |
UDA-COPE: Unsupervised Domain Adaptation for Category-level Object Pose Estimation |
3D Shape Reconstruction from 2D Images with Disentangled Attribute Flow |
Multimodal Dynamics: Dynamical Fusion for Trustworthy Multimodal Classification |
Towards Weakly-Supervised Text Spotting using a Multi-Task Transformer |
StyTr2: Image Style Transfer with Transformers |
BokehMe: When Neural Rendering Meets Classical Rendering |
Memory-augmented Deep Conditional Unfolding Network for Pan-sharpening |
Learning Object Context for Novel-view Scene Layout Generation |
FineDiving: A Fine-grained Dataset for Procedure-aware Action Quality Assessment |
TCTrack: Temporal Contexts for Aerial Tracking |
RBGNet: Ray-based Grouping for 3D Object Detection |
3PSDF: Three-Pole Signed Distance Function for Learning Surfaces with Arbitrary Topologies |
PanopticNeRF: A Semantic Object-Aware Neural Scene Representation |
Bending Reality: Distortion-aware Transformers for Adapting to Panoramic Semantic Segmentation |
Efficient Two-Stage Detection of Human-Object Interactions with a Novel Unary-Pairwise Transformer |
Reconstructing Surfaces for Sparse Point Clouds with On-Surface Priors |
Unsupervised Vision-Language Parsing: Seamlessly Bridging Visual Scene Graphs with Language Structures via Dependency Relationships |
Details or Artifacts: A Locally Discriminative Learning Approach to Realistic Image Super-Resolution |
Learning Motion-Dependent Appearance for High-Fidelity Rendering of Dynamic Humans from a Single Camera |
A Voxel Graph CNN for Object Classification with Event Cameras |
How Good Is Aesthetic Ability of a Fashion Model? |
Recurrent Dynamic Embedding for Video Object Segmentation |
Self-Distillation from the Last Mini-Batch for Consistency Regularization |
Group Contextualization for Video Recognition |
Bridge-Prompt: Towards Ordinal Action Understanding in Instructional Videos |
Dual Adversarial Adaptation for Cross-Device Real-World Image Super-Resolution |
Urban Radiance Fields |
Practical Evaluation of Adversarial Robustness via Adaptive Auto Attack |
PINA: Learning a Personalized Implicit Neural Avatar from a Single RGB-D Video Sequence |
Disentangled3D: Learning a 3D Generative Model with Disentangled Geometry and Appearance from Monocular Images |
Global Sensing and Measurements Reuse for Image Compressed Sensing |
AKB-48: A Real-World Articulated Object Knowledge Base |
Structured Sparse R-CNN for Direct Scene Graph Generation |
Photorealistic Monocular 3D Reconstruction of Humans Wearing Clothing |
Spectral Unsupervised Domain Adaptation for Visual Recognition |
SimMatch: Semi-supervised Learning with Similarity Matching |
Multi-grained Spatio-Temporal Features Perceived Network for Event-based Lip-Reading |
POCO: Point Convolution for Surface Reconstruction |
HerosNet: Hyperspectral Explicable Reconstruction and Optimal Sampling Deep Network for Snapshot Compressive Imaging |
Towards Robust Rain Removal Against Adversarial Attacks: A Comprehensive Benchmark Analysis and Beyond |
FedDC: Federated Learning with Non-IID Data via Local Drift Decoupling and Correction |
Open-set Text Recognition via Character-Context Decoupling |
Generalized Few-shot Semantic Segmentation |
Causal Transportability for Neural Representations |
Uncertainty-Guided Probabilistic Transformer for Complex Action Recognition |
Matching Feature Sets for Few-Shot Image Classification |
Interactron: Embodied Adaptive Object Detection |
It’s About Time: Analog Clock Reading in the Wild |
A Graph Matching Perspective with Transformers on Video Instance Segmentation |
GIF: Neural Implicit Function for General Shape Representation |
AdaViT: Adaptive Vision Transformers for Efficient Image Recognition |
Language as Queries for Referring Video Object Segmentation |
Federated Class-Incremental Learning |
Human Hands as Probes for Interactive Object Understanding |
STIF: Learning Continuous Video Representation for Space-Time Super-Resolution |
Bridging Video-text Retrieval with Multiple Choice Questions |
FoggyStereo: Stereo Matching with Fog Volume Representation |
MonoGround: Detecting Monocular 3D Objects from the Ground |
CLIMS: Cross Language Image Matching for Weakly Supervised Semantic Segmentation |
ELIC: Efficient Learned Image Compression with Unevenly Grouped Space-Channel Contextual Adaptive Coding |
Local Texture Estimator for Implicit Representation Function |
Neural Recognition of Dashed Curves with Gestalt Law of Continuity |
Voxel Field Fusion for 3D Object Detection |
Panoptic SegFormer: Delving Deeper into Panoptic Segmentation with Transformers |
Both Style and Fog Matter: Cumulative Domain Adaptation for Semantic Foggy Scene Understanding |
SCS-Co: Self-Consistent Style Contrastive Learning for Image Harmonization |
H4D: Human 4D Modeling by Learning Neural Compositional Representation |
PhysFormer: Facial Video-based Physiological Measurement with Temporal Difference Transformer |
A Unified Query-based Paradigm for Point Cloud Understanding |
AdaInt: Learning Adaptive Intervals for 3D Lookup Tables on Real-time Image Enhancement |
FS6D: Few-Shot 6D Pose Estimation of Novel Objects |
CLIP-Event: Connecting Text and Images with Event Structures |
Category Contrast for Unsupervised Domain Adaptation in Visual Tasks |
GateHUB: Gated History Unit with Background Suppression for Online Action Detection |
MixSTE: Seq2seq Mixed Spatio-Temporal Encoder for 3D Human Pose Estimation in Video |
Learning 3D Object Shape and Layout without 3D Supervision |
Discrete Cosine Transform Network for Guided Depth Super-Resolution |
DTFD-MIL: Double-Tier Feature Distillation Multiple Instance Learning for Histopathology Whole Slide Image Classification |
Recurrent Glimpse-based Decoder for Detection with Transformer |
HSC4D: Human-centered 4D Scene Capture in Large-scale Indoor-outdoor Space Using Wearable IMUs and LiDAR |
Multi-Object Tracking Meets Moving UAV |
Estimating Fine-Grained Noise Model via Contrastive Learning |
ProposalCLIP: Unsupervised Open-Category Object Proposal Generation via Exploiting CLIP Cues |
Task-specific Inconsistency Alignment for Domain Adaptive Object Detection |
Fine-grained Temporal Contrastive Learning for Weakly-supervised Temporal Action Localization |
Global-Aware Registration of Less-Overlap RGB-D Scans |
XMP-Font: Self-Supervised Cross-Modality Pre-training for Few-Shot Font Generation |
A Simple Data Mixing Prior for Improving Self-Supervised Vision Transformer |
Dense Learning based Semi-Supervised Object Detection |
RNNPose: Recurrent 6-DoF Object Pose Refinement with Robust Correspondence Field Estimation and Pose Optimization |
Global Context with Discrete Diffusion in Vector Quantised Modelling for Image Generation |
Collaborative Learning for Hand and Object Reconstruction with Attention-guided Graph Convolution |
End-to-end Generative Pretraining for Multimodal Video Captioning |
Exposure Normalization and Compensation for Multiple Exposure Correction |
Interpretable part-whole hierarchies and conceptual-semantic relationships in neural networks |
Multi-label Classification with Partial Annotations using Class-aware Selective Loss |
Fire Together Wire Together: A Dynamic Pruning Approach with Self-Supervised Mask Prediction |
IterMVS: Iterative Probability Estimation for Efficient Multi-View Stereo |
Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation |
Hierarchical Nearest Neighbor Graph Embedding for Efficient Dimensionality Reduction |
Decoupling Makes Weakly Supervised Local Feature Better |
Not All Points Are Equal: Learning Highly Efficient Point-based Detectors for 3D LiDAR Point Clouds |
Expanding Large Pre-trained Unimodal Models with Multimodal Information Injection for Image-Text Multimodal Classification |
Semi-Weakly-Supervised Learning of Complex Actions from Instructional Videos |
Set-Supervised Action Learning in Procedural Videos via Pairwise Order Consistency |
SHIFT: A Synthetic Driving Dataset for Continuous Multi-Task Domain Adaptation |
BANMo: Building Animatable 3D Neural Models from Many Casual Videos |
HD-CSE: Learning Dense Correspondence of Clothed Humans with Vision Transformers |
Efficient Geometry-aware 3D Generative Adversarial Networks |
CAPRI-Net: Learning Compact CAD Shapes with Adaptive Primitive Assembly |
HL-Net: Heterophily Learning Network for Scene Graph Generation |
Towards Efficient Data Free Black-box Adversarial Attack |
Neural Collaborative Graph Machines for Table Structure Recognition |
Dimension Embeddings for Monocular 3D Object Detection |
Nested Collaborative Learning for Long-Tailed Visual Recognition |
Scalable Penalized Regression for Noise Detection in Learning with Noisy Labels |
Calibrating Deep Neural Networks by Pairwise Constraints |
HybridCR: Weakly-Supervised 3D Point Cloud Semantic Segmentation via Hybrid Contrastive Regularization |
Few-Shot Font Generation by Learning Fine-Grained Local Styles |
Point-NeRF: Point-based Neural Radiance Fields |
Spatial-Temporal Space Hand-in-Hand: Spatial-Temporal Video Super-Resolution via Cycle-Projected Mutual Learning |
Learning from All Vehicles |
Gait Recognition in the Wild with Dense 3D Representations and A Benchmark |
DETReg: Unsupervised Pretraining with Region Priors for Object Detection |
Rethinking Semantic Segmentation: A Prototype View |
Distillation Using Oracle Queries for Transformer-based Human-Object Interaction Detection |
MobRecon: Mobile-Friendly Hand Mesh Reconstruction from Monocular Image |
Spatio-temporal Relation Modeling for Few-shot Action Recognition |
RestoreFormer: High-Quality Blind Face Restoration from Undegraded Key-Value Pairs |
DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis |
Domain-Agnostic Prior for Unsupervised Transfer Segmentation |
Unimodal-Concentrated Loss: Fully Adaptive Label Distribution Learning for Ordinal Regression |
Pyramid Grafting Network for One-Stage High Resolution Saliency Detection |
Pseudo-Q: Generating Pseudo Language Queries for Visual Grounding |
Keypoint Transformer: Solving Joint Identification in Challenging Hands and Object Interactions for Accurate 3D Pose Estimation |
Towards Discovering the Effectiveness of Moderately Confident Samples for Semi-Supervised Learning |
Semi-Supervised Video Semantic Segmentation with Inter-Frame Feature Reconstruction |
Revisiting the "Video" in Video-Language Understanding |
SNUG: Self-Supervised Neural Dynamic Garments |
FocalClick: Towards Practical Interactive Image Segmentation |
DAFormer: Improving Network Architectures and Training Strategies for Domain-Adaptive Semantic Segmentation |
GRAM: Generative Radiance Manifolds for 3D-Aware Image Generation |
Temporally Efficient Vision Transformer for Video Instance Segmentation |
C-CAM: Causal CAM for Weakly Supervised Semantic Segmentation on Medical Image |
Adversarial Texture for Fooling Person Detectors in the Physical World |
Automatic Color Image Stitching Using Quaternion Rank-1 Alignment |
TemporalUV: Capturing Loose Clothing with Temporally Coherent UV Coordinates |
Kernelized Few-shot Object Detection by Integral Aggregation |
Image-to-Lidar Self-Supervised Distillation for Autonomous Driving Data |
Amodal Segmentation through Out-of-Task and Out-of-Distribution Generalization with a Bayesian Model |
FocusCut: Diving into a Focus View in Interactive Segmentation |
Mutual Information-driven Pan-sharpening |
Gradient-SDF: A Semi-Implicit Surface Representation for 3D Reconstruction |
Neural Head Avatars from Monocular RGB Videos |
Point-Level Region Contrast for Object Detection Pre-Training |
HODEC: Towards Efficient High-Order DEcomposed Convolutional Neural Networks |
Bridging Global Context Interactions for High-Fidelity Image Completion |
CDGNet: Class Distribution Guided Network for Human Parsing |
Primitive3D: Learning from 3D Objects Assembled with Random Primitives |
HumanNeRF: Free-viewpoint Rendering of Moving People from Monocular Video |
TransMix: Attend to Mix for Vision Transformers |
JRDB-Act: A Large-scale Dataset for Spatio-temporal Action, Social Group and Activity Detection |
Few-shot Head Swapping in the Wild |
Neural Texture Extraction and Distribution for Controllable Person Image Synthesis |
Embracing Single Stride 3D Object Detector with Sparse Transformer |
Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning |
Portrait Eyeglasses and Shadow Removal by Leveraging 3D Synthetic Data |
Expanding Low-Density Latent Regions for Open-Set Object Detection |
GMFlow: Learning Optical Flow via Global Matching |
Source-Free Domain Adaptation via Distribution Estimation |
Aesthetic Text Logo Synthesis via Content-aware Layout Inferring |
An Image Patch is a Wave: Phase-Aware Vision MLP |
FisherMatch: Semi-Supervised Rotation Regression via Entropy-based Filtering |
BE-STI: Spatial-Temporal Integrated Network for Class-agnostic Motion Prediction with Bidirectional Enhancement |
DC-SSL: Addressing Mismatched Class Distribution in Semi-supervised Learning |
Deterministic Point Cloud Registration via Novel Transformation Decomposition |
Look for the Change: Learning Object States and State-Modifying Actions from Untrimmed Web Videos |
Deep Visual Geo-localization Benchmark |
LC-FDNet: Learned Lossless Image Compression with Frequency Decomposition Network |
Towards Robust Vision Transformer |
Volumetric Bundle Adjustment for Photorealistic Real-time Reconstruction |
Continual Test-Time Domain Adaptation |
Scribble-Supervised LiDAR Semantic Segmentation |
TableFormer: Table Structure Understanding with Transformers |
Focal Sparse Convolutional Networks for 3D Object Detection |
CLRNet: Cross Layer Refinement Network for Lane Detection |
Transformer Based Line Segment Classifier with Image Context for Real-Time Vanishing Point Detection in Manhattan World |
NeRFReN: Neural Radiance Fields with Reflections |
HyperStyle: StyleGAN Inversion with HyperNetworks for Real Image Editing |
Ditto: Building Digital Twins of Articulated Objects from Interaction |
CroMo: Cross-Modal Learning for Monocular Depth Estimation |
Mobile-Former: Bridging MobileNet and Transformer |
MetaFormer is Actually What You Need for Vision |
RU-Net: Regularized Unrolling Network for Scene Graph Generation |
Dreaming to Prune Image Deraining Networks |
Salvage of Supervision in Weakly Supervised Object Detection |
Lagrange Motion Analysis and View Embeddings for Improved Gait Recognition |
Lite Pose: Efficient Architecture Design for 2D Human Pose Estimation |
SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning |
FMCNet: Feature-Level Modality Compensation for Visible-Infrared Person Re-Identification |
Generalizing Gaze Estimation with Rotation Consistency |
SIOD: Single Instance Annotated Per Category Per Image for Object Detection |
Temporal Complementarity-Guided Reinforcement Learning for Image-to-Video Person Re-Identification |
A Differentiable Two-stage Alignment Scheme for Burst Image Reconstruction with Large Shift |
Manifold Learning Benefits GANs |
Domain Generalization via Shuffled Style Assembly for Face Anti-Spoofing |
OW-DETR: Open-world Detection Transformer |
Learning Optimal K-space Acquisition and Reconstruction using Physics-Informed Neural Networks |
Global Tracking via Ensemble of Local Trackers |
Robust Region Feature Synthesizer for Zero-Shot Object Detection |
Confidence Propagation Cluster: Unleash Full Potential of Object Detectors |
PartGlot: Learning Shape Part Segmentation from Language Reference Games |
Self-Taught Metric Learning without Labels |
GPV-Pose: Category-level Object Pose Estimation via Geometry-guided Point-wise Voting |
OmniFusion: 360 Monocular Depth Estimation via Geometry-Aware Fusion |
3D Common Corruptions and Data Augmentation |
DIVeR: Real-time and Accurate Neural Radiance Fields with Deterministic Integration for Volume Rendering |
Boosting Robustness of Image Matting with Context Assembling and Strong Data Augmentation |
Cross-modal Clinical Graph Transformer For Ophthalmic Report Generation |
Correlation-Aware Deep Tracking |
Learning to Imagine: Diversify Memory for Incremental Learning using Unlabeled Data |
Block-NeRF: Scalable Large Scene Neural View Synthesis |
Vector Quantized Diffusion Model for Text-to-Image Synthesis |
Boosting Crowd Counting via Multifaceted Attention |
Physically-guided Disentangled Implicit Rendering for 3D Face Modeling |
IFRNet: Intermediate Feature Refine Network for Efficient Frame Interpolation |
TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers |
Back to Reality: Weakly-supervised 3D Detection with Shape-guided Label Enhancement |
Incremental Transformer Structure Enhanced Image Inpainting with Masking Positional Encoding |
Blind Image Super-resolution with Elaborate Degradation Modeling on Noise and Kernel |
Reduce Information Loss in Transformers for Pluralistic Image Inpainting |
OCSampler: Compressing Videos to One Clip with Single-step Sampling |
Masking Adversarial Damage: Finding Adversarial Saliency for Robust and Sparse Network |
SemAffiNet: Semantic-Affine Transformation for Point Cloud Segmentation |
High-resolution Face Swapping via Latent Semantics Disentanglement |
Deep Rectangling for Image Stitching: A Learning Baseline |
Detector-Free Weakly Supervised Group Activity Recognition |
Unsupervised Domain Generalization by learning a Bridge Across Domains |
RSCFed: Random Sampling Consensus Federated Semi-supervised Learning |
IntraQ: Learning Synthetic Images with Intra-Class Heterogeneity for Zero-Shot Network Quantization |
A Text Attention Network for Spatial Deformation Robust Scene Text Image Super-resolution |
Learned Queries for Efficient Local Attention |
Look Back and Forth: Video Super-Resolution with Explicit Temporal Difference Modeling |
HVH: Learning a Hybrid Neural Volumetric Representation for Dynamic Hair Performance Capture |
Robust Contrastive Learning against Noisy Views |
Discovering Objects that Can Move |
TubeFormer-DeepLab: Video Mask Transformer |
Sparse and Complete Latent Organization for Geospatial Semantic Segmentation |
ITSA: An Information Theoretic Approach to Automatic Shortcut Avoidance and Domain Generalization in Stereo Matching Networks |
Few-shot Backdoor Defense Using Shapley Estimation |
Exploring Domain-Invariant Parameters for Source Free Domain Adaptation |
Ev-TTA: Test-Time Adaptation for Event-Based Object Recognition |
Likert Scoring with Grade Decoupling for Long-term Action Assessment |
Unpaired Cartoon Image Synthesis via Gated Cycle Mapping |
Contextual Instance Decoupling for Robust Multi-Person Pose Estimation |
Self-Supervised Predictive Learning: A Negative-Free Method for Sound Source Localization in Visual Scenes |
Modulated Contrast for Versatile Image Translation |
Oriented RepPoints for Aerial Object Detection |
INS-Conv: Incremental Sparse Convolution for Online 3D Segmentation |
PanopticDepth: Instance-Decoupled Depth Estimation for Unified Depth-Aware Panoptic Segmentation |
Point-BERT: Pre-Training 3D Point Cloud Transformers with Masked Point Modeling |
Implicit Sample Extension for Unsupervised Person Re-Identification |
Incorporating Semi-Supervised and Positive-Unlabeled learning for Boosting Full Reference Image Quality Assessment |
HairCLIP: Design Your Hair by Text and Reference Image |
C2AM Loss: Chasing a Better Decision Boundary for Long-Tail Object Detection |
MogFace: Towards a Deeper Appreciation on Face Detection |
RegionCLIP: Region-based Language-Image Pretraining |
HP-Capsule: Unsupervised Face Part Discovery by Hierarchical Parsing Capsule Network |
Structure-Aware Flow Generation for Human Body Reshaping |
Revisiting Document Image Dewarping by Grid Regularization |
GANSeg: Learning to Segment by Unsupervised Hierarchical Image Generation |
Align and Prompt: Video-and-Language Pre-training with Entity Prompts |
Bridging the Gap between Classification and Localization for Weakly Supervised Object Localization |
Shunted Self-Attention via Multi-Scale Token Aggregation |
VISTA: Boosting 3D Object Detection via Dual Cross-VIew SpaTial Attention |
MonoDTR: Monocular 3D Object Detection with Depth-Aware Transformer |
YouMVOS: An Actor-centric Multi-shot Video Object Segmentation Dataset |
Single-Stage is Enough: Multi-Person Absolute 3D Pose Estimation |
UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection |
DiSparse: Disentangled Sparsification for Multitask Model Compression |
Coarse-to-fine Deep Video Coding with Hyperprior-guided Mode Prediction |
Weakly Supervised High-Fidelity Clothing Model Generation |
Deep Generalized Unfolding Networks for Image Restoration |
Panoptic-PHNet: Towards Real-Time and High-Precision LiDAR Panoptic Segmentation via Clustering Pseudo Heatmap |
ES6D: A Computation Efficient and Symmetry-Aware 6D Pose Regression Framework |
Iterative Deep Homography Estimation |
Homography Loss for Monocular 3D Object Detection |
Infrared Invisible Clothing: Hiding from Infrared Detectors at Multiple Angles in Real World |
Deep Stereo Image Compression via Bi-directional Coding |
Degree-of-linear-polarization-based Color Constancy |
Unleashing Potential of Unsupervised Pre-Training with Intra-Identity Regularization for Person Re-Identification |
Aladdin: Joint Atlas Building and Diffeomorphic Registration Learning with Pairwise Alignment |
Learning Transferable Human-Object Interaction Detector with Natural Language Supervision |
PNP: Robust Learning from Noisy Labels by Probabilistic Noise Prediction |
RayMVSNet: Learning Ray-based 1D Implicit Fields for Accurate Multi-View Stereo |
Shapley-NAS: Discovering Operation Contribution for Neural Architecture Search |
Few-shot Keypoint Detection with Uncertainty Learning for Unseen Species |
Reusing the Task-specific Classifier as a Discriminator: Discriminator-free Adversarial Domain Adaptation |
"The Pedestrian next to the Lamppost" Adaptive Object Graphs for Better Instantaneous Mapping |
Point2Seq: Detecting 3D Objects as Sequences |
Towards Noiseless Object Contours for Weakly Supervised Semantic Segmentation |
Syntax-Aware Network for Handwritten Mathematical Expression Recognition |
RAGO: Recurrent Graph Optimizer For Multiple Rotation Averaging |
A Brand New Dance Partner: Music-Conditioned Pluralistic Dancing Controlled by Multiple Dance Genres |
BNVF: Dense 3D Reconstruction using Bi-level Neural Volume Fusion |
AutoLoss-Zero: Searching Loss Functions from Scratch for Generic Tasks |
Exploring the Equivalence of Siamese Self-Supervised Learning via A Unified Gradient Framework |
Cross-domain Few-shot Learning with Task-specific Adapters |
Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks |
Geometric and Textural Augmentation for Domain Gap Reduction |
Geometric Transformer for Fast and Robust Point Cloud Registration |
Group R-CNN for Point-based Weakly Semi-supervised Object Detection |
Wnet: Audio-Guided Video Semantic Segmentation via Wavelet-Based Cross-Modal Denoising Networks |
3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds |
ELSR: Efficient Line Segment Reconstruction with Planes and Points Guidance |
A Proposal-based Paradigm for Self-supervised Sound Source Localization in Videos |
Semi-Supervised Wide-Angle Portraits Correction by Multi-Scale Transformer |
End-to-End Referring Video Object Segmentation with Multimodal Transformers |
Neural fields as learnable kernels for 3D reconstruction |
IDR: Self-Supervised Image Denoising via Iterative Data Refinement |
TransMVSNet: Global Context-aware Multi-view Stereo Network with Transformers |
SimAN: Exploring Self-Supervised Representation Learning of Scene Text via Similarity-Aware Normalization |
Deep vanishing point detection: Geometric priors make dataset variations vanish |
On Adversarial Robustness of Trajectory Prediction for Autonomous Vehicles |
Learning Multiple Dense Prediction Tasks from Partially Annotated Data |
Quarantine: Sparsity Can Uncover the Trojan Attack Trigger for Free |
Video Demoireing with Relation-based Temporal Consistency |
FLAG: Flow-based 3D Avatar Generation from Sparse Observations |
Learning an Optimal Linear Program for Multi-Target Tracking |
IRON: Inverse Rendering by Optimizing Neural SDFs and Materials from Photometric Images |
Stereoscopic Universal Perturbations across Different Architectures and Datasets |
The Flag Median and FlagIRLS |
NeRF in the Dark: High Dynamic Range View Synthesis from Noisy Raw Images |
BoxeR: Box-Attention for 2D and 3D Transformers |
DynamicEarthNet: Daily Multi-Spectral Satellite Dataset for Semantic Change Segmentation |
UBnormal: New Benchmark for Supervised Open-Set Video Anomaly Detection |
Self-Supervised Predictive Convolutional Attentive Block for Anomaly Detection |
CADTransformer: Panoptic Symbol Spotting Transformer for CAD Drawings |
The Principle of Diversity: Training Stronger Vision Transformers Calls for Reducing All Levels of Redundancy |
Learning To Recognize Procedural Activities with Distant Supervision |
Audio-driven Neural Gesture Reenactment with Video Motion Graphs |
Towards Bidirectional Arbitrary Image Rescaling: Joint Optimization and Cycle Idempotence |
Hire-MLP: Vision MLP via Hierarchical Rearrangement |
Escaping Data Scarcity for High-Resolution Heterogeneous Face Hallucination |
DeepDPM: Deep Clustering With an Unknown Number of Clusters |
ZeroWaste Dataset: Towards Deformable Object Segmentation in Cluttered Scenes |
Context-Aware Sequence Alignment using 4D Skeletal Augmentation |
COAP: Compositional Articulated Occupancy of People |
Sound and Visual Representation Learning with Multiple Pretraining Tasks |
The Wanderings of Odysseus in 3D Scenes |
Deblurring via Stochastic Refinement |
SMPL-A: Modeling Person-Specific Deformable Anatomy |
Neural Point Light Fields |
FedCor: Correlation-Based Active Client Selection Strategy for Heterogeneous Federated Learning |
ADeLA: Automatic Dense Labeling with Attention for Viewpoint Shift in Semantic Segmentation |
Adversarial Parametric Pose Prior |
Generating Useful Accident-Prone Driving Scenarios via a Learned Traffic Prior |
Pre-Training meets Self-Training for Supersizing 3D Reconstruction |
Safe Self-Refinement for Transformer-based Domain Adaptation |
ElePose: Unsupervised 3D Human Pose Estimation by Predicting Camera Elevation and Learning Normalizing Flows on 2D Poses |
Towards Multimodal Depth Estimation from Light Fields |
Deformable Sprites for Unsupervised Video Decomposition |
Can You Spot the Chameleon? Adversarially Camouflaging Images from Co-Salient Object Detection |
MISF: Multi-level Interactive Siamese Filtering for High-Fidelity Image Inpainting |
Aug-NeRF: Training Stronger Neural Radiance Fields with Triple-Level Physically-Grounded Augmentations |
Semi-supervised Semantic Segmentation with Error Localization Network |
Quantization-aware Deep Optics for Snapshot Hyperspectral Imaging |
Gravitationally Lensed Black Hole Emission Tomography |
Improving Video Model Transfer with Dynamic Representation Learning |
FWD: Real-time Novel View Synthesis with Forward Warping and Depth |
Enhancing Adversarial Training with Second-Order Statistics of Weights |
Patch Slimming for Efficient Vision Transformers |
3DAC: Learning Attribute Compression for Point Clouds |
SNR-Aware Low-light Image Enhancement |
Temporal Feature Alignment and Mutual Information Maximization for Video-Based Human Pose Estimation |
Motion-modulated Temporal Fragment Alignment Network For Few-Shot Action Recognition |
Self-Supervised Bulk Motion Artifact Removal in Optical Coherence Tomography Angiography |
Salient-to-Broad Transition for Video Person Re-identification |
Which images to label for few-shot medical landmark detection? |
Hybrid Relation Guided Set Matching for Few-shot Action Recognition |
Progressively Generating Better Initial Guesses Towards Next Stages for High-Quality Human Motion Prediction |
Bringing Old Films Back to Life |
Face Relighting with Geometrically Consistent Shadows |
Learning Cloth-Irrelevant Features for Cloth-Changing Person Re-identification |
DPICT: Deep Progressive Image Compression Using Trit-Planes |
From Representation to Reasoning: Towards both Evidence and Commonsense Reasoning for Video Question-Answering |
Simple but Effective: CLIP Embeddings for Embodied AI |
Scene Consistency Representation Learning for Video Scene Segmentation |
Neural Data-Dependent Transform for Learned Image Compression |
CamLiFlow: Bidirectional Camera-LiDAR Fusion for Joint Optical Flow and Scene Flow Estimation |
Global Matching with Overlapping Attention for Optical Flow Estimation |
Meta Agent Teaming Active Learning for Pose Estimation |
Robust Combination of Distributed Gradients Under Adversarial Perturbations |
Toward Fast, Flexible, and Robust Low-Light Image Enhancement |
Motion-aware Contrastive Video Representation Learning via Foreground-background Merging |
ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval |
L-Verse: Bidirectional Generation Between Image and Text |
GANORCON: Are Generative Models Useful for Few-shot Segmentation? |
Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation |
Towards Robust Adaptive Object Detection under Noisy Annotations |
Point2Cyl: Reverse Engineering 3D Objects -- from Point Clouds to Extrusion Cylinders |
MM-TTA: Multi-Modal Test-Time Adaptation for 3D Semantic Segmentation |
Subspace Adversarial Training |
Structural and Statistical Texture Knowledge Distillation for Semantic Segmentation |
UniVIP: A Unified Framework for Self-Supervised Visual Pre-training |
MUM: Mix Image Tiles and UnMix Feature Tiles for Semi-Supervised Object Detection |
SS3D: Sparsely-Supervised 3D Object Detection from Point Cloud |
On the Integration of Self-Attention and Convolution |
Single-Domain Generalized Object Detection in Urban Scene via Cyclic-Disentangled Self-Distillation |
Human Instance Matting via Mutual Guidance and Multi-Instance Refinement |
Delving Deep into the Generalization of Vision Transformers under Distribution Shifts |
Causality Inspired Representation Learning for Domain Generalization |
Learning Local Displacements for Point Cloud Completion |
Remember Intentions: Retrospective-Memory-based Trajectory Prediction |
Contextual Similarity Distillation for Asymmetric Image Retrieval |
Self-Supervised Models are Continual Learners |
High-Fidelity Human Avatars from a Single RGB Camera |
Not All Relations are Equal: Mining Informative Labels for Scene Graph Generation |
TWIST: Two-Way Inter-label Self-Training for Semi-supervised 3D Instance Segmentation |
Focal length and object pose estimation via render and compare |
Kubric: A scalable dataset generator |
VRDFormer: End-to-End Video Visual Relation Detection with Transformers |
A Large-scale Comprehensive Dataset and Copy-overlap Aware Evaluation Protocol for Segment-level Video Copy Detection |
Brain-inspired Multilayer Perceptron with Spiking Neurons |
Few Could Be Better Than All: Feature Sampling and Grouping for Scene Text Detection |
High Quality Segmentation for Ultra High-resolution Images |
Physically Disentangled Intra- and Inter-domain Adaptation for Varicolored Haze Removal |
HandOccNet: Occlusion-Robust 3D Hand Mesh Estimation Network |
Future Transformer for Long-term Action Anticipation |
Decoupling Zero-Shot Semantic Segmentation |
Long-tail Recognition via Compositional Knowledge Transfer |
Open Challenges in Deep Stereo: the Booster Dataset |
BigDatasetGAN: Synthesizing ImageNet with Pixel-wise Annotations |
Recall@k Surrogate Loss with Large Batches and Similarity Mixup |
PoseTriplet: Co-evolving 3D Human Pose Estimation, Imitation, and Hallucination under Self-supervision |
Dynamic Dual-Output Diffusion Models |
End-to-End Human-Gaze-Target Detection with Transformers |
EMOCA: Emotion Driven Monocular Face Capture and Animation |
R(Det)$^2$: Randomized Decision Routing for Object Detection |
Diffusion Autoencoders: Toward a Meaningful and Decodable Representation |
PatchNet: A Simple Face Anti-Spoofing Framework via Fine-Grained Patch Recognition |
NeurMiPs: Neural Mixture of Planar Experts for View Synthesis |
Learning to generate line drawings that convey geometry and semantics |
AlignQ: Alignment Quantization with ADMM-based Correlation Preservation |
Learning Embodied Object-Search Strategies from 50k Human Demonstrations |
Longitudinal Self-Supervision for Learning 2D Amodal Representation |
Controllable Dynamic Multi-Task Architectures |
Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation |
Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning |
Depth-supervised NeRF: Fewer Views and Faster Training for Free |
Learning to Detect Mobile Objects from LiDAR Scans Without Labels |
Revisiting Random Channel Pruning for Neural Network Compression |
ActiveZero: Mixed Domain Learning for Active Stereovision with Zero Annotation |
Learning sRGB-to-Raw De-rendering with Content-Aware Metadata |
SimVQA: Exploring Simulated Environments for Visual Question Answering |
Cross-Domain Adaptive Teacher for Object Detection |
Modality-Agnostic Learning for Radar-Lidar Fusion in Vehicle Detection |
A Thousand Words Are Worth More Than a Picture: Natural Language-Centric Outside-Knowledge Visual Question Answering |
Towards General Purpose Vision Systems: An End-to-End Task-Agnostic Vision-Language Architecture |
Holocurtains: Programming Light Curtains via Binary Holography |
Leverage Your Local and Global Representations: A New Self-Supervised Learning Strategy |
3D human tongue reconstruction from single "in-the-wild" images |
Pushing the Performance Limit of Scene Text Recognizer without Human Annotation |
SAR-Net: Shape Alignment and Recovery Network for Category-level 6D Object Pose and Size Estimation |
Improving Subgraph Recognition with Variational Graph Information Bottleneck |
Towards Multi-domain Single Image Dehazing via Test-time Training |
EMScore: Evaluating Video Captioning via Coarse-Grained and Fine-Grained Embedding Matching |
CHEX: CHannel EXploration for CNN Model Compression |
ImFace: A Nonlinear 3D Morphable Face Model with Implicit Neural Representations |
Deblur-NeRF: Neural Radiance Fields from Blurry images |
An MIL-Derived Transformer for Weakly Supervised Point Cloud Segmentation |
Distribution Consistent Neural Architecture Search |
Training Object Detectors from Scratch: An Empirical Study in the Era of Vision Transformer |
Glass Segmentation using Intensity and Spectral Polarization Cues |
GAT-CADNet: Graph Attention Network for Panoptic Symbol Spotting in CAD Drawings |
Unsupervised Deraining: Where Contrastive Learning Meets Self-similarity |
Delving into the Estimation Shift of Batch Normalization in a Network |
Depth Estimation by Combining Binocular Stereo and Monocular Structured-Light |
Full-Range Virtual Try-On with Recurrent Tri-Level Transformation |
Class Re-Activation Maps for Weakly-Supervised Semantic Segmentation |
Generalizing Interactive Backpropagating Refinement for Dense Prediction Networks |
Protecting Celebrities from DeepFake with Identity Consistency Transformer |
SVIP: Sequence VerIfication for Procedures in Videos |
Cannot See the Forest for the Trees: Aggregating Multiple Viewpoints to Better Classify Objects in Videos |
Deep Saliency Prior for Reducing Visual Distraction |
ClothFormer: Taming Video Virtual Try-on in All Module |
FLARF: Fast LArge-scale Radiance Field Reconstruction |
Estimating Structural Disparities in Face Models |
Faithful Extreme Rescaling via Generative Prior Reciprocated Invertible Representations |
Animal Kingdom: A Large and Diverse Dataset for Animal Behavior Understanding |
Uniform Subdivision of Omnidirectional Camera Space for Efficient Spherical Stereo Matching |
COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval |
Scene Graph Expansion for Semantics-Guided Image Outpainting |
Deep Constrained Least Squares for Blind Image Super-Resolution |
MaskGIT: Masked Generative Image Transformer |
CMT: Convolutional Neural Networks Meet Vision Transformers |
GraftNet: Towards Domain Generalized Stereo Matching with a Broad-Spectrum and Task-Oriented Feature |
SoftGroup for 3D Instance Segmentation on Point Clouds |
Partial Class Activation Attention for Semantic Segmentation |
AnyFace: Free-style Text-to-Face Synthesis and Manipulation |
PoseKernelLifter: Metric Lifting of 3D Human Pose using Sound |
LIFT: Learning 4D LiDAR Image Fusion Transformer for 3D Object Detection |
Make It Move: Controllable Image-to-Video Generation with Text Descriptions |
Semi-Supervised Semantic Segmentation Using Unreliable Pseudo-Labels |
Learning What Not to Segment: A New Perspective on Few-Shot Segmentation |
TT-VSR: Learning Trajectory-Aware Transformer for Video Super-Resolution |
Canonical Voting: Towards Robust Oriented Bounding Box Detection in 3D Scenes |
DyRep: Bootstrapping Training with Dynamic Re-parameterization |
VGSE: Visually-Grounded Semantic Embeddings for Zero-Shot Learning |
GreedyNASv2: Greedier Search with a Greedy Path Filter |
HDR-NeRF: High Dynamic Range Neural Radiance Fields |
Novel-View Object Selection in Neural Volumetric Representations |
Relieving Long-tailed Instance Segmentation via Pairwise Class Balance |
Complex Video Action Reasoning via Learnable Markov Logic Network |
PCL: Proxy-based Contrastive Learning for Domain Generalization |
Unifying Motion Deblurring and Frame Interpolation with Events |
Shape-invariant 3D Adversarial Point Clouds |
Learning Pixel-Level Distinctions for Video Highlight Detection |
Wavelet Knowledge Distillation: Towards Efficient Image-to-Image Translation |
ADAS: A Direct Adaptation Strategy for Multi-Target Domain Adaptive Semantic Segmentation |
PSTR: End-to-End One-Step Person Search With Transformers |
Towards real-world navigation with deep differentiable planners |
Multi-class Token Transformer for Weakly Supervised Semantic Segmentation |
Fourier Document Restoration for Robust Document Dewarping and Recognition |
Neural RGB-D Surface Reconstruction |
LMGP: Lifted Multicut Meets Geometry Projections for Multi-Camera Multi-Object Tracking |
ManiTrans: Entity-Level Text-Guided Image Manipulation via Token-wise Semantic Alignment and Generation |
Spatio-Temporal Gating-Adjacency GCN for Human Motion Prediction |
What Matters For Meta-Learning Vision Regression Tasks? |
Self-supervised Learning of Adversarial Examples: Towards Good Generalizations for Deepfake Detection |
Ray Priors through Reprojection: Improving Neural Radiance Fields for Novel View Extrapolation |
Perception Prioritized Training of Diffusion Models |
Pseudo-Stereo for Monocular 3D Object Detection in Autonomous Driving |
Human Trajectory Prediction with Momentary Observation |
General Facial Representation Learning in a Visual-Linguistic Manner |
Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions |
Predict, Prevent, and Evaluate: Disentangled Text-Driven Image Manipulation Empowered by Pre-Trained Vision-Language Model |
Contextual Outpainting with Object-level Contrastive Learning |
Optical Flow Estimation for Spiking Camera |
PointCLIP: Point Cloud Understanding by CLIP |
Large scale pre-training for person re-identification with noisy labels |
Zoom In and Out: A Mixed-scale Triplet Network for Camouflaged Object Detection |
Blended Diffusion for Text-driven Editing of Natural Images |
CREAM: Weakly Supervised Object Localization via Class RE-Activation Mapping |
Finding Fallen Objects Via Asynchronous Audio-Visual Integration |
HeadNeRF: A Real-time NeRF-Based Parametric Head Model |
Interacting Attention Graph for Single Image Two-Hand Reconstruction |
Learning based Multi-modality Image and Video Compression |
DR.VIC: Decomposition and Reasoning for Video Individual Counting |
End-to-End Compressed Video Representation Learning for Generic Event Boundary Detection |
BaLeNAS: Differentiable Architecture Search via Bayesian Learning Rule |
Task Adaptive Parameter Sharing for Multi-Task Learning |
ViM: Out-Of-Distribution with Virtual-logit Matching |
Pyramid Adversarial Training Improves ViT Performance |
Depth-Guided Sparse Structure-from-Motion for Movies and TV Shows |
Part-based Pseudo Label Refinement for Unsupervised Person Re-identification |
Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment |
MVS2D: Efficient Multi-view Stereo via Attention-Driven 2D Convolutions |
Consistent Explanations by Contrastive Learning |
FvOR: Robust Joint Shape and Pose Optimization for Few-view Object Reconstruction |
Contextualized Spatio-Temporal Contrastive Learning with Self-Supervision |
Frame Averaging for Equivariant Shape Space Learning |
iFS-RCNN: An Incremental Few-shot Instance Segmenter |
Bring Evanescent Representations to Life in Lifelong Class Incremental Learning |
Text to Image Generation with Semantic-Spatial Aware GAN |
Real-Time Light-Weight Near-Field Photometric Stereo |
DESTR: Object Detection with Split Transformer |
Backdoor Attacks on Self-Supervised Learning |
Diverse Image Outpainting via GAN Inversion |
High-Resolution Image Synthesis with Latent Diffusion Models |
NFormer: Robust Person Re-identification with Neighbor Transformer |
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality |
CrossLoc: Scalable Aerial Localization Assisted by Multimodal Synthetic Data |
SceneSqueezer: Learning to Compress Scene for Camera Relocalization |
Dancing under the stars: video denoising in starlight |
Tracking People by Predicting 3D Appearance, Location and Pose |
BCOT: A Markerless High-Precision 3D Object Tracking Benchmark |
Continual Stereo Matching of Continuous Driving Scenes with Growing Architecture |
CVF-SID: Cyclic multi-Variate Function for Self-Supervised Image Denoising by Disentangling Noise from Image |
Unknown-Aware Object Detection: Learning What You Don’t Know from Videos in the Wild |
BodyGAN: General-purpose Controllable Neural Human Body Generation |
Training-free Transformer Architecture Search |
Learning to Affiliate: Mutual Centralized Learning for Few-shot Classification |
Single-Photon Structured Light |
Towards Practical Certifiable Patch Defense with Vision Transformer |
On Generalizing Beyond Domains in Cross-Domain Continual Learning |
Practical Learned Lossless JPEG Recompression with Multi-Level Cross-Channel Entropy Model in the DCT Domain |
GazeOnce: Real-Time Multi-Person Gaze Estimation |
RendNet: Unified 2D/3D Recognizer with Latent Space Rendering |
Identifying Ambiguous Similarity Conditions via Semantic Matching |
Learn from Others and Be Yourself in Heterogeneous Federated Learning |
Enhancing Face Recognition with Self-Supervised 3D Reconstruction |
Visual Vibration Tomography: Estimating Interior Material Properties from Monocular Video |
ACPL: Anti-curriculum Pseudo-labelling for Semi-supervised Medical Image Classification |
The Two Dimensions of Worst-case Training and the Integrated Effect for Out-of-domain Generalization |
Perturbed and Strict Mean Teachers for Semi-supervised Semantic Segmentation |
Directional Self-supervised Learning for Heavy Image Augmentations |
CPPF: Towards Robust Category-Level 9D Pose Estimation in the Wild |
Cross-patch Dense Contrastive Learning for Semi-supervised Segmentation of Cellular Nuclei in Histopathologic Images |
Dual-AI: Dual-path Actor Interaction Learning for Group Activity Recognition |
UCC: Uncertainty guided Cross-head Co-training for Semi-Supervised Semantic Segmentation |
Few-Shot Object Detection with Fully Cross-Transformer |
Exploiting Temporal Relations on Radar Perception for Autonomous Driving |
Unsupervised Visual Representation Learning by Online Constrained K-Means |
Contextual Debiasing for Visual Recognition with Causal Mechanisms |
Learning to Estimate Robust 3D Human Mesh from In-the-Wild Crowded Scenes |
Towards Accurate Facial Landmark Detection via Cascaded Transformers |
DIP: Deep Inverse Patchmatch for High-Resolution Optical Flow |
Critical Regularizations for Neural Surface Reconstruction in the Wild |
Per-Clip Video Object Segmentation |
CAFE: Learning to Condense Dataset by Aligning Features |
ArtiBoost: Boosting Articulated 3D Hand-Object Pose Estimation via Online Exploration and Synthesis |
SphereSR: 360° Image Super-Resolution with Arbitrary Projection via Continuous Spherical Image Representation |
Learning to Restore 3D Face from In-the-Wild Degraded Images |
BEVT: BERT Pretraining of Video Transformers |
A Hybrid Egocentric Activity Anticipation Framework via Memory-Augmented Recurrent and One-shot Representation Forecasting |
Sparse Fuse Dense: Towards High Quality 3D Detection with Depth Completion |
MSTR: Multi-Scale Transformer for End-to-End Human-Object Interaction Detection |
Synthetic Aperture Imaging with Events and Frames |
AP-BSN: Self-Supervised Denoising for Real-World Images via Asymmetric PD and Blind-Spot Network |
Dynamic MLP for Fine-Grained Image Classification by Leveraging Geographical and Temporal Information |
Lepard: Learning partial point cloud matching in rigid and deformable scenes |
Neural Compression-Based Feature Learning for Video Restoration |
Learning to Collaborate in Decentralized Learning of Personalized Models |
Rethinking Parsing Branch for Human Densepose Estimation |
Collaborative Transformers for Grounded Situation Recognition |
ISNet: Shape Matters for Infrared Small Target Detection |
Bi-level Doubly Variational Learning for Energy-based Latent Variable Models |
PSMNet: Position-aware Stereo Merging Network for Room Layout Estimation |
Bi-level Alignment for Cross-Domain Crowd Counting |
Unsupervised Homography Estimation with Coplanarity-Aware GAN |
Real-time Object Detection for Streaming Perception |
Neural Window Fully-connected CRFs for Monocular Depth Estimation |
Deep Hyperspectral-Depth Reconstruction Using Single Color-Dot Projection |
Decoupled Multi-task Learning with Cyclical Self-Regulation for Face Parsing |
Shadows can be Dangerous: Stealthy and Effective Physical-world Adversarial Attack by Natural Phenomenon |
Towards Understanding Adversarial Robustness of Optical Flow Networks |
Class-Incremental Learning by Knowledge Distillation with Adaptive Feature Consolidation |
A Continuous Video Generator with the Price, Quality and Perks of StyleGAN2 |
Self-Supervised Learning of Object Parts for Semantic Segmentation |
High-Resolution Image Harmonization via Collaborative Dual Transformations |
Slot-VPS: Object-centric Representation Learning for Video Panoptic Segmentation |
FIFO: Learning Fog-invariant Features for Foggy Scene Segmentation |
Forecasting Characteristic 3D Poses of Human Actions |
Equalized Focal Loss for Dense Long-tailed Object Detection |
Style Neophile: Constantly Seeking Novel Styles for Domain Generalization |
Mining Multi-View Information: A Strong Self-Supervised Framework for Depth-based 3D Hand Pose and Mesh Estimation |
The Devil is in the Labels: Noisy Label Correction for Robust Scene Graph Generation |
Correlation Verification for Image Retrieval |
Exploring Denoised Cross-video Contrast for Weakly-supervised Temporal Action Localization |
UBoCo: Unsupervised Boundary Contrastive Learning for Generic Event Boundary Detection |
Multi-View Mesh Reconstruction with Neural Deferred Shading |
SoftCollage: A Differentiable Probabilistic Tree Generator for Image Collage |
OVE6D: Object Viewpoint Encoding For Depth-based 6D Object Pose Estimation |
Smooth-Swap: A Simple Enhancement for Face-Swapping with Smoothness |
3D-SPS: Single-Stage 3D Visual Grounding via Referred Point Progressive Selection |
Image Disentanglement Autoencoder for Steganography without Embedding |
Gated2Gated: Self-Supervised Depth Estimation from Gated Images |
Interact before Align: Leveraging Cross-Modal Knowledge for Domain Adaptive Action Recognition |
DN-DETR: Accelerate DETR Training by Introducing Query DeNoising |
The Probabilistic Normal Epipolar Constraint for Frame-To-Frame Rotation Optimization under Uncertain Feature Positions |
A Scalable Combinatorial Solver for Elastic Geometrically Consistent 3D Shape Matching |
Enhancing Classifier Conservativeness and Robustness by Polynomiality |
Raw High-Definition Radar for Multi-Task Learning |
Self-Supervised Image Representation Learning with Geometric Set Consistency |
Multi-View Transformer for 3D Visual Grounding |
Semiconductor Defect Detection by Hybrid Classical-Quantum Deep Learning |
Attention Reveals Occlusions |
Revisiting Domain Generalized Stereo Matching Networks from a Feature Consistency Perspective |
Chi-transformer: Towards Reliable Stereo From Cues |
NinjaDesc: Content-Concealing Visual Descriptors via Adversarial Learning |
SwapMix: Diagnosing and Regularizing the Over-reliance on Visual Context in Visual Question Answering |
Learning Part Segmentation through Unsupervised Domain Adaptation from Synthetic Vehicles |
CellTypeGraph: A New Geometric Computer Vision Benchmark |
Siamese Contrastive Embedding Network for Compositional Zero-Shot Learning |
Reference-based Video Super-Resolution Using Multi-Camera Video Triplets |
End-to-End Semi-Supervised Learning for Video Action Detection |
Parameter-free Online Test-time Adaptation |
3D Shape Variational Autoencoder Latent Disentanglement via Mini-Batch Feature Swapping for Bodies and Faces |
Dual-Key Multimodal Backdoors for Visual Question Answering |
Can Neural Nets Learn the Same Model Twice? Investigating Reproducibility and Double Descent from the Decision Boundary Perspective |
RePaint: Inpainting using Denoising Diffusion Probabilistic Models |
Improving GAN Equilibrium by Raising Spatial Awareness |
Beyond Supervised vs. Unsupervised: Representative Benchmarking and Analysis of Image Representation Learning |
A variational Bayesian method for similarity learning in non-rigid image registration |
Task2Sim: Towards Effective Pre-training and Transfer from Synthetic Data |
Adaptive Trajectory Prediction via Transferable GNN |
Learning to Learn across Diverse Data Biases in Deep Face Recognition |
RIDDLE: Lidar Data Compression with Range Image Deep Delta Encoding |
Total Variation Optimization Layers for Computer Vision |
Transforming Model Prediction for Tracking |
Human Mesh Recovery from Multiple Shots |
FastDOG: Fast Discrete Optimization on GPU |
Estimating Example Difficulty using Variance of Gradients |
Closing the Generalization Gap of Cross-silo Federated Medical Image Segmentation |
Scale-Equivalent Distillation for Semi-Supervised Object Detection |
Long-term Visual Map Sparsification with Heterogeneous GNN |
ResSFL: A Resistance Transfer Framework for Defending Model Inversion Attack in Split Federated Learning |
Fast Point Transformer |
Sketch3T: Test-time Training for Zero-Shot SBIR |
Generative Flows with Invertible Attentions |
ABO: Dataset and Benchmarks for Real-World 3D Object Understanding |
A Dual Weighting Label Assignment Scheme for Object Detection |
ADAPT: Vision-Language Navigation with Modality-Aligned Action Prompts |
Explore the Spatio-temporal Aggregation for Insubstantial Object Detection: Benchmark Dataset and Baseline |
A Deeper Dive Into What Deep Spatiotemporal Networks Encode: Quantifying Static vs. Dynamic Information |
DGECN: A Depth-Guided Edge Convolutional Network For End-to-End 6D Pose Estimation |
BNUDC: A Two-Branched Deep Neural Network for Restoring Images from Under-Display Cameras |
Towards Fewer Annotations: Active Learning via Region Impurity and Prediction Uncertainty for Domain Adaptive Semantic Segmentation |
Hallucinated Neural Radiance Fields in the Wild |
The Devil is in the Margin: Margin-based Label Smoothing for Network Calibration |
Deep Depth from Focus with Differential Focus Volume |
Towards Layer-wise Image Vectorization |
Robust Federated Learning with Noisy and Heterogeneous Clients |
Retrieval-based Spatially Adaptive Normalization for Semantic Image Synthesis |
Dynamic Prototype Convolution Network for Few-Shot Semantic Segmentation |
Video Shadow Detection via Spatio-Temporal Interpolation Consistency Training |
It's All In the Teacher: Zero-Shot Quantization Brought Closer to the Teacher |
VISOLO: Grid-Based Space-Time Aggregation for Efficient Online Video Instance Segmentation |
Rethinking Spatial Invariance of Convolutional Networks for Object Counting |
Self-supervised Correlation Mining Network for Person Image Generation |
ISDNet: Integrating Shallow and Deep Networks for Efficient Ultra-high Resolution Segmentation |
Exploring Effective Data for Surrogate Training Towards Black-box Attack |
Contrastive Learning for Space-Time Correspondence via Self-cycle Consistency |
Accelerating Video Object Segmentation with Compressed Video |
Weakly Paired Associative Learning for Sound and Image Representations via Bimodal Associative Memory |
Incremental Cross-view Mutual Distillation for Self-supervised Medical CT Synthesis |
Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer |
Non-parametric Depth Distribution Modelling based Depth Inference for Multi-view Stereo |
LISA: Learning Implicit Shape and Appearance of Hands |
GIQE: Generic Image Quality Enhancement via N$^{th}$ Order Iterative Degradation |
Continual Learning for Visual Search with Backward Consistent Feature Embedding |
STCrowd: A Multimodal Dataset for Pedestrian Perception in Crowded Scenes |
Differentiable Stereopsis: Meshes from multiple views using differentiable rendering |
ST++: Make Self-training Work Better for Semi-supervised Semantic Segmentation |
Arbitrary-Scale Image Synthesis |
CRIS: CLIP-Driven Referring Image Segmentation |
ShapeFormer: Transformer-based Shape Completion via Sparse Representation |
Quantifying Societal Bias Amplification in Image Captioning |
Omni-DETR: Omni-Supervised Object Detection with Transformers |
XYLayoutLM: Towards Layout-Aware Multimodal Networks For Visually-Rich Document Understanding |
Cross-Architecture Self-supervised Video Representation Learning |
Feature Erasing and Diffusion Network for Occluded Person Re-Identification |
Styleformer: Transformer based Generative Adversarial Networks with Style Vector |
A Re-Balancing Strategy for Class-Imbalanced Classification Based on Instance Difficulty |
360-Attack: Distortion-Aware Perturbations from Perspective-Views |
CLIP-NeRF: Text-and-Image Driven Manipulation of Neural Radiance Fields |
Learnable Irrelevant Modality Dropout for Multimodal Action Recognition on Modality-Specific Annotated Videos |
Cerberus Transformer: Joint Semantic, Affordance and Attribute Parsing |
NICE-SLAM: Neural Implicit Scalable Encoding for SLAM |
FIBA: Frequency-Injection based Backdoor Attack in Medical Image Analysis |
Learning Modal-Invariant and Temporal-Memory for Video-based Visible-Infrared Person Re-Identification |
Continual Predictive Learning from Videos |
BatchFormer: Learning to Explore Sample Relationships for Robust Representation Learning |
Learning to Zoom Inside Camera Imaging Pipeline |
TeachAugment: Data Augmentation Optimization Using Teacher Knowledge |
PhyIR: Physics-based Inverse Rendering for Panoramic Indoor Images |
Finding Good Configurations of Planar Primitives in Unorganized Point Clouds |
Towards Better Understanding Attribution Methods |
B-cos Networks: Alignment is All We Need for Interpretability |
TO-FLOW: Efficient Continuous Normalizing Flows with Temporal Optimization adjoint with Moving Speed |
Learning Invisible Markers for Hidden Codes in Offline-to-online Photography |
Learning Distinctive Margin toward Active Domain Adaptation |
Adiabatic Quantum Computing for Multi Object Tracking |
Learnable Lookup Table for Neural Network Quantization |
Artistic Style Discovery With Independent Components |
Occlusion-Aware Cost Constructor for Light Field Depth Estimation |
Frame-wise Action Representations for Long Videos via Sequence Contrastive Learning |
Which Model to Transfer? Finding the Needle in the Growing Haystack |
Using 3D Topological Connectivity for Ghost Particle Reduction in Flow Reconstruction |
Neural Points: Point Cloud Representation with Neural Fields |
C$^2$AM: Contrastive learning of Class-agnostic Activation Map for Weakly Supervised Object Localization and Semantic Segmentation |
RCP: Recurrent Closest Point for Point Cloud |
Label, Verify, Correct: A Simple Few-Shot Object Detection Method |
Weakly-supervised Action Transition Learning for Stochastic Human Motion Prediction |
Dual-Generator Face Reenactment |
BoostMIS: Boosting Medical Image Semi-supervised Learning with Adaptive Pseudo Labeling and Informative Active Annotation |
InfoNeRF: Ray Entropy Minimization for Few-Shot Neural Volume Rendering |
Balanced Contrastive Learning for Long-Tailed Visual Recognition |
The Devil is in the Pose: Ambiguity-free 3D Rotation-invariant Learning via Pose-aware Convolution |
Partially Does It: Towards Scene-Level FG-SBIR with Partial Input |
Source-Free Object Detection by Learning to Overlook Domain Style |
Region-Aware Face Swapping |
COOPERNAUT: End-to-End Driving with Cooperative Perception for Networked Vehicles |
NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks |
SkinningNet: Two-Stream Graph Convolutional Neural Network for Skinning Prediction of Synthetic Characters |
Efficient Large-scale Localization by Global Instance Recognition |
All-photon Polarimetric Time-of-Flight Imaging |
Parametric Scattering Networks |
MuKEA: Multimodal Knowledge Extraction and Accumulation for Knowledge-based Visual Question Answering |
Coarse-to-Fine Feature Mining for Video Semantic Segmentation |
Practical Stereo Matching via Cascaded Recurrent Network with Adaptive Correlation |
Robust Egocentric Photo-realistic Facial Expression Transfer for Virtual Reality |
Rethinking Visual Geo-localization for Large-Scale Applications |
Polymorphic-GAN: Generating Aligned Samples across Multiple Domains with Learned Morph Maps |
Balanced and Hierarchical Relation Learning for One-shot Object Detection |
High-Fidelity GAN Inversion for Image Attribute Editing |
Killing Two Birds with One Stone: Efficient and Robust Training of Face Recognition CNNs by Partial FC |
I M Avatar: Implicit Morphable Head Avatars from Videos |
Proactive Image Manipulation Detection |
Text Spotting Transformers |
Learning a Structured Latent Space for Unsupervised Point Cloud Completion |
PCA-Based Knowledge Distillation Towards Lightweight and Content-Style Balanced Photorealistic Style Transfer Models |
Grounding Answers for Visual Questions Asked by Visually Impaired People |
Efficient Classification of Very Large Images with Tiny Objects |
Leveraging Adversarial Examples to Quantify Membership Information Leakage |
Towards Practical Deployment-Stage Backdoor Attack on Deep Neural Networks |
When to Prune? A Policy towards Early Structural Pruning |
Robust Optimization as Data Augmentation for Large-scale Graphs |
Sylph: A Hypernetwork Framework for Incremental Few-shot Object Detection |
Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis |
Harmony: A Generic Unsupervised Approach for Disentangling Semantic Content from Parameterized Transformations |
The Implicit Values of A Good Hand Shake: Handheld Multi-Frame Neural Depth Refinement |
Noise2NoiseFlow: Realistic Camera Noise Modeling without Clean Images |
MetaPose: Fast 3D Pose from Multiple Views without 3D Supervision |
Virtual Elastic Objects |
StyleSDF: High-Resolution 3D-Consistent Image and Geometry Generation |
Rethinking Architecture Design for Tackling Data Heterogeneity in Federated Learning |
Self-supervised Neural Articulated Shape and Appearance Models |
A Self-Supervised Descriptor for Image Copy Detection |
Rethinking Deep Face Restoration |
Dense Vision Transformers for Single-Image Inverse Rendering in Indoor Scenes |
Rethinking Controllable Variational Autoencoders |
Convolutions for Spatial Interaction Modeling |
Egocentric Deep Multi-Channel Audio-Visual Active Speaker Localization |
AdaFace: Quality Adaptive Margin for Face Recognition |
Towards End-to-End Unified Scene Text Detection and Layout Analysis |
Active Learning by Feature Mixing |
Classification-Then-Grounding: Reformulating Video Scene Graphs as Temporal Bipartite Graphs |
Towards Better Plasticity-Stability Trade-off in Incremental Learning: A Simple Linear Connector |
Cloth-Changing Person Re-identification from A Single Image with Gait Prediction and Regularization |
SpaceEdit: Learning a Unified Editing Space for Open-Domain Image Editing |
Learning to Answer Questions in Dynamic Audio-Visual Scenarios |
Non-generative Generalized Zero-shot Learning via Task-correlated Disentanglement and Controllable Samples Synthesis |
Knowledge-Driven Self-Supervised Representation Learning for Facial Action Unit Recognition |
Coupling Vision and Proprioception for Navigation of Legged Robots |
URetinex-Net: Retinex-based Deep Unfolding Network for Low-light Image Enhancement |
Modeling Image Composition for Complex Scene Generation |
Think Twice Before Detecting GAN-generated Fake Images from their Spectral Domain Imprints |
Undoing the Damage of Label Shift for Cross-domain Semantic Segmentation |
Implicit Motion Handling for Video Camouflaged Object Detection |
Contrastive Conditional Neural Processes |
Exploring Set Similarity for Dense Self-supervised Representation Learning |
E2V-SDE: From Asynchronous Events to Fast and Continuous Video Reconstruction via Neural Stochastic Differential Equations |
Catching Both Gray and Black Swans: Open-set Supervised Anomaly Detection |
M5Product: Self-harmonized Contrastive Learning for E-commercial Multi-modal Pretraining |
CycleMix: A Holistic Strategy for Medical Image Segmentation from Scribble Supervision |
Mixed Multimodal Tokens for Vision Transformers |
Rethinking the Augmentation Module in Contrastive Learning: Learning Hierarchical Augmentation Invariance with Expanded Views |
AirObject: A Temporally Evolving Graph Embedding for Object Identification |
Balanced Multimodal Learning via On-the-fly Gradient Modulation |
Ray3D: ray-based 3D human pose estimation for monocular absolute 3D localization |
Computing Wasserstein-$p$ Distance Between Images with Linear Cost |
Capturing Humans in Motion: Temporal-Attentive 3D Human Pose and Shape Estimation from Monocular Video |
Feature Statistics Mixing Regularization for Generative Adversarial Networks |
Expressive Talking Head Generation with Granular Audio-Visual Control |
Geometric Anchor Correspondence Mining with Uncertainty Modelling for Universal Domain Adaptation |
OSSO: Obtaining Skeletal Shape from Outside |
How Do You Do It? Fine-Grained Action Understanding with Pseudo-Adverbs |
GIRAFFE HD: A High-Resolution 3D-aware Generative Model |
Continual Object Detection via Prototypical Task Correlation Guided Gating Mechanism |
Pixel screening based intermediate correction for blind deblurring |
LAS-AT: Adversarial Training with Learnable Attack Strategy |
Eigenlanes: Data-Driven Lane Descriptors for Structurally Diverse Lanes |
Moving Window Regression: A Novel Approach to Ordinal Regression |
SC^2-PCR: A Second Order Spatial Compatibility for Efficient and Robust Point Cloud Registration |
APRIL: Finding the Achilles' Heel on Privacy Leakage for Vision Transformers |
Eigencontours: Novel Contour Descriptors Based on Low-Rank Approximation |
Cross-modal Background Suppression for Audio-Visual Event Localization |
WebQA: Multihop and Multimodal QA |
Fairness-aware Adversarial Perturbation Towards Bias Mitigation for Deployed Deep Models |
Distribution-Aware Single-Stage Models for Multi-Person 3D Pose Estimation |
Active Learning for Open-set Annotation |
E2EC: An End-to-End Contour-based Method for High-Quality High-Speed Instance Segmentation |
Self-Supervised Arbitrary-Scale Point Clouds Upsampling via Implicit Neural Representation |
Relative Pose from a Calibrated and an Uncalibrated Smartphone Image |
Learning Optical Flow with Kernel Patch Attention |
Contrastive Learning for Unsupervised Video Highlight Detection |
ISNAS-DIP: Image-Specific Neural Architecture Search for Deep Image Prior |
MVSE: A Large-Scale Benchmark Dataset for Multi-Modal Videos Similarity Evaluation |
Discrete time convolution for fast event-based stereo |
Proper Reuse of Image Classification Features Improves Object Detection |
Object-Region Video Transformers |
Vision-Language Pre-Training for Boosting Scene Text Detectors |
Bandits for Structure Perturbation-based Black-box Attacks to Graph Neural Networks with Theoretical Guarantees |
Revisiting Large Kernel Design in Convolutional Neural Networks |
Generating High Fidelity Data from Low-density Regions using Diffusion Models |
Colar: Effective and Efficient Online Action Detection by Consulting Exemplars |
Learning Visual-Semantic Explanations of Deep Visual Latent Representations |
StyleMesh: Style Transfer for Indoor 3D Scene Reconstructions |
Probing Representation Forgetting in Supervised and Unsupervised Continual Learning |
Light Field Neural Rendering |
ROCA: Robust CAD Model Retrieval and Alignment from a Single Image |
Pix2NeRF: Unsupervised Conditional pi-GAN for Single Image to Neural Radiance Fields Translation |
Non-Iterative Recovery from Nonlinear Observations using Generative Models |
Forecasting from LiDAR via Future Object Detection |
Towards Total Recall in Industrial Anomaly Detection |
Low-Resource Adaptation for Personalized Co-Speech Gesture Generation |
Integrating Language Guidance into Vision-based Deep Metric Learning |
Non-isotropy Regularization for Proxy-based Deep Metric Learning |
Estimating Egocentric 3D Human Pose in the Wild with External Weak Supervision |
Less is More: Generating Grounded Navigation Instructions from Landmarks |
Automatic Synthesis of Diverse Weak Supervision Sources for Behavior Analysis |
Performance-Aware Mutual Knowledge Distillation for Improving Neural Architecture Search |
End-to-End Reconstruction-Classification Learning for Face Forgery Detection |
UKPGAN: A General Self-Supervised Keypoint Detector |
C2SLR: Consistency-enhanced Continuous Sign Language Recognition |
Boosting Black-Box Attack with Partially Transferred Conditional Adversarial Distribution |
Style Transformer for Image Inversion and Editing |
Uformer: A General U-Shaped Transformer for Image Restoration |
Speech Driven Tongue Animation |
DO-GAN: A Double Oracle Framework for Generative Adversarial Networks |
IntentVizor: Towards Generic Query Guided Interactive Video Summarization |
Self-supervised Deep Image Restoration via Adaptive Stochastic Gradient Langevin Dynamics |
Sound-Guided Semantic Image Manipulation |
Adaptive Gating for Single-Photon 3D Imaging |
Target-aware Dual Adversarial Learning and a Multi-scenario Multi-Modality Benchmark to Fuse Infrared and Visible for Object Detection |
GaTector: A Unified Framework for Gaze Object Prediction |
Multi-Level Representation Learning with Semantic Alignment for Referring Video Object Segmentation |
Anomaly Detection via Reverse Distillation from One-Class Embedding |
Dynamic 3D Gaze from Afar: Deep Gaze Estimation from Temporal Eye-Head-Body Coordination |
Maximum Consensus by Weighted Influences of Monotone Boolean Functions |
Beyond Fixation: Dynamic Window Visual Transformer |
Dressing in the Wild by Watching Dance Videos |
Learning Affinity from Attention: End-to-End Weakly-Supervised Semantic Segmentation with Transformers |
Contrastive Boundary Learning for Point Cloud Segmentation |
Proto2Proto: Can you recognize the car, the way I do? |
Bridged Transformer for Vision and Point Cloud 3D Object Detection |
V2C: Visual Voice Cloning |
An Efficient Training Approach for Very Large Scale Face Recognition |
SemanticStyleGAN: Learning Compositional Generative Priors for Controllable Image Synthesis and Editing |
SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text Recognition |
Task Discrepancy Maximization for Fine-grained Few-Shot Classification |
Reflection and Rotation Symmetry Detection via Equivariant Learning |
Self-Supervised Equivariant Learning for Oriented Keypoint Detection |
Improving the Transferability of Targeted Adversarial Examples through Object-Based Diverse Input |
3DeformRS: Certifying Spatial Deformations on Point Clouds |
DiGS: Divergence guided shape implicit neural representation for unoriented point clouds |
UNICON: Combating Label Noise Through Uniform Selection and Contrastive Learning |
Vision Transformer with Deformable Attention |
Diverse Plausible 360-Degree Image Outpainting for Efficient 3DCG Background Creation |
Industrial Style Transfer with Large-scale Geometric Warping and Content Preservation |
Hierarchical Modular Network for Video Captioning |
Optimal LED Spectral Multiplexing for NIR2RGB Translation |
Exploring Frequency Adversarial Attacks for Face Forgery Detection |
LAR-SR: A Local Autoregressive Model for Image Super Resolution |
What do navigation agents learn about their environment? |
HOP: History-and-Order Aware Pre-training for Vision-and-Language Navigation |
Entropy-based Active Learning for Object Detection with Progressive Diversity Constraint |
Class Similarity Weighted Knowledge Distillation for Continual Semantic Segmentation |
Swin Transformer V2: Scaling Up Capacity and Resolution |
Knowledge Distillation via the Target-aware Transformer |
Sparse Object-level Supervision for Instance Segmentation with Pixel Embeddings |
Open-Domain, Content-based, Multi-modal Fact-checking of Out-of-Context Images via Online Resources |
Exemplar-based Pattern Synthesis with Implicit Periodic Field Network |
RigidFlow: Self-Supervised Scene Flow Learning on Point Clouds by Local Rigidity Prior |
Weakly Supervised Segmentation on Outdoor 4D point clouds with Temporal Matching and Spatial Graph Propagation |
E^2(GO)MOTION: Motion Augmented Event Stream for Egocentric Action Recognition |
Ego4D: Around the World in 3,000 Hours of Egocentric Video |
Spiking Transformers for Event-based Single Object Tracking |
Few-Shot Incremental Learning for Label-to-Image Translation |
CD^2-pFed: Cyclic Distillation-guided Channel Decoupling for Model Personalization in Federated Learning |
OoD-Bench: Quantifying and Understanding Two Dimensions of Out-of-Distribution Generalization |
Speed up Object Detection on Gigapixel-level Image with Patch Arrangement |
Learning Adaptive Warping for Real-World Rolling Shutter Correction |
Robust and Accurate Superquadric Recovery: a Probabilistic Approach |
SimVP: Simpler yet Better Video Prediction |
Hyperspherical Consistency Regularization |
Dense Depth Priors for Neural Radiance Fields from Sparse Input Views |
HyperInverter: Improving StyleGAN Inversion via Hypernetwork |
Target-Relevant Knowledge Preservation for Multi-Source Domain Adaptive Object Detection |
Whose Hands are These? Hand Detection and Hand-Body Association in the Wild |
Blind Face Restoration via Integrating Face Shape and Generative Priors |
Multimodal Material Segmentation |
Do explanation methods explain? Model knows best |
Deep Hybrid Models for Out-of-Distribution Detection |
Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetics |
Detecting Camouflaged Object in Frequency Domain |
Exploring Structure-aware Transformer over Interaction Proposals for Human-Object Interaction Detection |
Appearance and Structure Aware Robust Deep Visual Graph Matching: Attack, Defense and Beyond |
PhoCaL: A Multi-Modal Dataset for Category-Level Object Pose Estimation with Photometrically Challenging Objects |
HINT: Hierarchical Neuron Concept Explainer |
Vox2Cortex: Fast Explicit Reconstruction of Cortical Surfaces from 3D MRI Scans with Geometric Deep Neural Networks |
Generative Cooperative Learning for Unsupervised Video Anomaly Detection |
Panoptic, Instance and Semantic Relations: A Relational Context Encoder to Enhance Panoptic Segmentation |
Object-Relation Reasoning Graph for Action Recognition |
Lifelong Graph Learning |
A Simple Multi-Modality Transfer Learning Baseline for Sign Language Translation |
Arch-Graph: Acyclic Architecture Relation Predictor for Task-Transferable Neural Architecture Search |
Rethinking Minimal Sufficient Representation in Contrastive Learning |
Physical Simulation Layer for Accurate 3D Modeling |
Image Animation with Perturbed Masks |
Sparse to Dense Dynamic 3D Facial Expression Generation |
AIM: an Auto-Augmenter for Images and Meshes |
PlanarRecon: Real-time 3D Plane Detection and Reconstruction from Posed Monocular Videos |
Modular Action Concept Grounding in Semantic Video Prediction |
Generating Representative Samples for Few-Shot Classification |
SurfEmb: Dense and Continuous Correspondence Distributions for Object Pose Estimation with Learnt Surface Embeddings |
Sequential Voting with Relational Box Fields for Active Object Detection |
Are Multimodal Transformers Robust to Missing Modality? |
Debiased Learning from Naturally Imbalanced Pseudo-Labels |
Weakly-Supervised Online Action Segmentation in Multi-View Instructional Videos |
Learning to deblur using light field generated and real defocus images |
TOAD: Topologically-Aware Deformation Fields for Single-view 3D Reconstruction |
An Empirical Study of Training End-to-End Vision-and-Language Transformers |
PLAD: Learning to Infer Shape Programs with Pseudo-Labels and Approximate Distributions |
The Neurally-Guided Shape Parser: Grammar-based Labeling of 3D Shape Regions with Approximate Inference |
Imposing Consistency for Optical Flow Estimation |
Generating Diverse 3D Reconstructions from a Single Occluded Face Image |
RecDis-SNN: Rectifying Membrane Potential Distribution for Directly Training Spiking Neural Networks |
3D Moments from Near-Duplicate Photos |
CLIP-Forge: Towards Zero-Shot Text-to-Shape Generation |
MatteFormer: Transformer-Based Image Matting via Prior-Tokens |
Deformable ProtoPNet: An Interpretable Image Classifier Using Deformable Prototypes |
Learning Bayesian Sparse Networks with Full Experience Replay for Continual Learning |
Category-Aware Transformer Network for Better Human-Object Interaction Detection |
Segment, Magnify and Reiterate: Detecting Camouflaged Objects the Hard Way |
UNIST: Unpaired Neural Implicit Shape-to-Shape Translation |
REGTR: End-to-end Point Cloud Correspondences with Transformers |
Show, Deconfound and Tell: Image Captioning with Causal Inference |
DeepFake Disrupter: The Detector of DeepFake Is My Friend |
Lite Vision Transformer with Enhanced Self-Attention |
Bi-directional Object-context Prioritization Learning for Saliency Ranking |
OSKDet: Orientation-sensitive Keypoint Localization for Rotated Object Detection |
Dual Cross-Attention Learning for Fine-Grained Visual Categorization and Object Re-Identification |
Invariant Grounding for Video Question Answering |
Fine-tuning Global Model via Data-Free Knowledge Distillation for Non-IID Federated Learning |
Learning Robust Image-Based Rendering on Sparse Scene Geometry via Depth Completion |
FENeRF: Face Editing in Neural Radiance Fields |
A Probabilistic Graphical Model Based on Neural-symbolic Reasoning for Visual Relationship Detection |
CVNet: Contour Vibration Network for Building Extraction |
What to Look at and Where: Semantic and Spatial Refined Transformer for Detecting Human-Object Interactions |
Nested Hyperbolic Spaces for Dimensionality Reduction and Hyperbolic NN Design |
ABPN: Adaptive Blend Pyramid Network for Real-Time Local Retouching of Ultra High-Resolution Photo |
Does Robustness on ImageNet Transfer to Downstream Tasks? |
Crowd Counting in the Frequency Domain |
SimMIM: A Simple Framework for Masked Image Modeling |
GrainSpace: A Large-scale Dataset for Fine-grained and Domain-adaptive Recognition of Cereal Grains |
End-to-End Trajectory Distribution Prediction Based on Occupancy Grid Maps |
MPViT: Multi-Path Vision Transformer for Dense Prediction |
Remember the Difference: Cross-Domain Few-Shot Semantic Segmentation via Meta-Memory Transfer |
ARCS: Accurate Rotation and Correspondence Search |
Ranking Distance Calibration for Cross-Domain Few-Shot Learning |
MetaFSCIL: A Meta-Learning Approach for Few-Shot Class Incremental Learning |
Fisher Information Guidance for Learned Time-of-Flight Imaging |
Joint Video Summarization and Moment Localization by Cross-Task Sample Transfer |
MotionAug: Augmentation with Physical Correction for Human Motion Prediction |
Deep Color Consistent Network for Low-Light Image Enhancement |
Non-Probability Sampling Network for Stochastic Human Trajectory Prediction |
GCFSR: a Generative and Controllable Face Super Resolution Method Without Facial and GAN Priors |
Improving Adversarial Transferability via Neuron Attribution-Based Attacks |
HiVT: Hierarchical Vector Transformer for Multi-Agent Motion Prediction |
Pooling Revisited: Your Receptive Field is Sub-optimal |
Compressing Models with Few Samples: Mimicking then Replacing |
Shape from Thermal Radiation: Passive Ranging Using Multi-spectral LWIR Measurements |
Layered Depth Refinement with Mask Guidance |
Highly-efficient Incomplete Large-scale Multi-view Clustering with Consensus Bipartite Graph |
Scaling Up Vision-Language Pretraining for Image Captioning |
Optimal Correction Cost for Object Detection Evaluation |
Deformable Video Transformer |
High-fidelity Monocular Human Reconstruction by Combining Implicit and Explicit Representations |
Nonlocal Sparse CRF |
Long-Short Temporal Contrastive Learning of Video Transformers |
QS-Attn: Query-Selected Attention for Contrastive Learning in I2I Translation |
All-In-One Image Restoration for Unknown Corruption |
Learning to Detect Scene Landmarks for Camera Localization |
WildNet: Learning Domain Generalized Semantic Segmentation from the Wild |
Pushing the Envelope of Gradient Boosting Forests via Globally-Optimized Oblique Trees |
Egocentric Scene Understanding via Multimodal Spatial Rectifier |
OSSGAN: Open-Set Semi-Supervised Image Generation |
Large-scale Video Panoptic Segmentation in the Wild: A Benchmark |
Unsupervised Representation Learning for Binary Networks by Joint Classifier Learning |
β-DARTS: Beta-Decay Regularization for Differentiable Architecture Search |
Stereo Depth from Events Cameras: Concentrate and Focus on the Future |
Transferable Sparse Adversarial Attack |
FAM: Visual Explanations for the Feature Representations from Deep Convolutional Networks |
Noise-Aware NeRFs for Burst-Denoising |
Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection from Point Clouds |
Bayesian Invariant Risk Minimization |
Extracting Triangular 3D Models, Materials, and Lighting From Images |
RelTransformer: A Transformer-Based Long-Tail Visual Relationship Recognition |
Transformer-empowered Multi-scale Contextual Matching and Aggregation for Multi-contrast MRI Super-resolution |
SphericGAN: Semi-supervised Hyper-spherical Generative Adversarial Networks for Fine-grained Image Synthesis |
LD-ConGR: A Large RGB-D Video Dataset for Long-Distance Continuous Gesture Recognition |
Unifying Panoptic Segmentation for Autonomous Driving |
VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning |
Interspace Pruning: Using Adaptive Filter Representations to Improve Training of Sparse CNNs |
NightLab: A Dual-level Architecture with Hardness Detection for Segmentation at Night |
Learning to Memorize Feature Hallucination for One-Shot Image Generation |
FedCorr: Multi-Stage Federated Learning for Label Noise Correction |
GeoNeRF: Generalizing NeRF with Geometry Priors |
Neural 3D Video Synthesis |
TransforMatcher: Match-to-Match Attention for Semantic Correspondence |
Represent, Compare, and Learn: A Similarity-Aware Framework for Class-Agnostic Counting |
AxIoU: An Axiomatically Justified Measure for Video Moment Retrieval |
Deep Safe Multi-view Clustering: Reducing the Risk of Clustering Performance Degradation Caused by View Increase |
Burst Image Restoration and Enhancement |
Modeling Indirect Illumination for Inverse Rendering |
Knowledge Mining with Scene Text for Fine-Grained Recognition |
FlexIT: Towards Flexible Semantic Image Translation |
Surpassing the Human Accuracy: Detecting Gallbladder Cancer from USG Images with Curriculum Learning |
More than Words: In-the-Wild Visually-Driven Prosody for Text-to-Speech |
Mimicking the Oracle: An Initial Phase Decorrelation Approach for Class Incremental Learning |
Multi-Person Extreme Motion Prediction |
Does text attract attention on e-commerce images: A novel saliency prediction dataset and method |
Instance-Aware Dynamic Neural Network Quantization |
Energy-based Latent Aligner for Incremental Learning |
Semi-supervised Video Paragraph Grounding with Contrastive Encoder |
Personalized Image Aesthetics Assessment with Rich Attributes |
Attention Concatenation Volume for Accurate and Efficient Stereo Matching |
Split Hierarchical Variational Compression |
MS2DG-Net: Progressive Correspondence Learning via Multi Sparse Semantic Dynamic Graph |
Large Loss Matters in Weakly Supervised Multi-Label Classification |
Recurring the Transformer for Video Action Recognition |
Look Closer to Supervise Better: One-Shot Font Generation via Component-Based Discriminator |
KG-SP: Knowledge Guided Simple Primitives for Open World Compositional Zero-Shot Learning |
Hyperbolic Vision Transformers: Combining Improvements in Metric Learning |
Camera Pose Estimation using Implicit Distortion Models |
A Structured Dictionary Perspective on Implicit Neural Representations |
ST-MFNet: A Spatio-Temporal Multi-Flow Network for Frame Interpolation |
Geometric Structure Preserving Warp for Natural Image Stitching |
Slimmable Domain Adaptation |
Meta Convolutional Neural Networks for Single Domain Generalization |
Label Matching Semi-Supervised Object Detection |
Sparse Local Patch Transformer for Robust Face Alignment and Landmarks Inherent Relation Learning |
Abandoning the Bayer-Filter to See in the Dark |
Deep Hierarchical Semantic Segmentation |
MixFormer: End-to-End Tracking with Iterative Mixed Attention |
ContIG: Self-supervised Multimodal Contrastive Learning for Medical Imaging with Genetics |
Occlusion-robust Face Alignment using A Viewpoint-invariant Hierarchical Network Architecture |
Segment-Fusion: Hierarchical Context Fusion for Robust 3D Semantic Segmentation |
STRPM: A Spatiotemporal Residual Predictive Model for High-Resolution Video Prediction |
Boosting 3D Object Detection by Simulating Multimodality on Point Clouds |
RADU: Ray-Aligned Depth Update Convolutions for ToF Data Denoising |
Auto-Encoder is All You Need |
Whose Track Is It Anyway? Improving Robustness to Tracking Errors with Affinity-Based Prediction |
Multi-marginal Contrastive Learning for Multi-label Subcellular Protein Localization |
Stand-Alone Inter-Frame Attention in Video Models |
Hyperbolic Image Segmentation |
RepMLPNet: Hierarchical Vision MLP with Re-parameterized Locality |
Time3D: End-to-End Joint Monocular 3D Object Detection and Tracking for Autonomous Driving |
SWEM: Towards Real-Time Video Object Segmentation with Sequential Weighted Expectation-Maximization |
ART-Point: Improving Rotation Robustness of Point Cloud Classifiers via Adversarial Rotation |
Super-Fibonacci Spirals: Fast, Low-Discrepancy Sampling of SO(3) |
Learning to Learn and Remember Super Long Multi-Domain Task Sequence |
Noise Is Also Useful: Negative Correlation-Steered Latent Contrastive Learning |
FLOAT: Factorized Learning of Object Attributes for Improved Multi-object Multi-part Scene Parsing |
Surface-Aligned Neural Radiance Fields for Controllable 3D Human Synthesis |
Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model |
Real World Self-Supervised Multi-Image Super-Resolution for Multi-Exposure Push-Frame Satellites |
Knowledge Distillation with the Reused Teacher Classifier |
Geometry-Aware Guided Loss for Deep Crack Recognition |
AdaMixer: A Simple and Accurate Query-based Object Detector |
Learning Structured Gaussians to Approximate Deep Ensembles |
Input-level Inductive Biases for 3D Reconstruction |
BTS: A Bi-lingual Benchmark for Text Segmentation in the Wild |
Stereo Magnification with Multi-Layer Images |
Segment and Complete: Defending Object Detectors against Adversarial Patch Attacks with Robust Patch Detection |
Coherent Point Drift Revisited for Non-rigid Shape Matching and Registration |
Alleviating Semantics Distortion in Unsupervised Low-Level Image-to-Image Translation via Structure Consistency Constraint |
CNN Filter DB: An Empirical Investigation of Trained Convolutional Filters |
Text2Mesh: Text-Driven Neural Stylization for Meshes |
RFNet: Unsupervised Network for Mutually Reinforcing Multi-modal Image Registration and Fusion |
Image Dehazing Transformer with Transmission-Aware 3D Position Embedding |
Label Relation Graphs Enhanced Hierarchical Residual Network for Hierarchical Multi-Granularity Classification |
RGB-Multispectral Matching: Dataset, Learning Methodology, Evaluation |
Maintaining Reasoning Consistency in Compositional Visual Question Answering |
PolyWorld: Polygonal Building Extraction with Graph Neural Networks in Satellite Images |
Fast Algorithm for Low-rank Tensor Completion in Delay-embedded Space |
Dynamic Sparse R-CNN |
Improving Robustness Against Stealthy Weight Bit-Flip Attacks by Output Code Matching |
NPBG++: Accelerating Neural Point-Based Graphics |
Forward Compatible Few-Shot Class-Incremental Learning |
Weakly-supervised Metric Learning with Cross-Module Communications for the Classification of Anterior Chamber Angle Images |
Learning Canonical F-Correlation Projection for Compact Multiview Representation |
Learning Non-target Knowledge for Few-shot Semantic Segmentation |
Towards Low-Cost and Efficient Malaria Detection |
PoseTrack21: A Dataset for Person Search, Multi-Object Tracking and Multi-Person Pose Tracking |
NeuralHDHair: Automatic High-fidelity Hair Modeling from a Single Image Using Implicit Neural Representations |
ClusterGNN: Cluster-based Coarse-to-fine Graph Neural Network for Efficient Feature Matching |
An Iterative Quantum Approach for Transformation Estimation from Point Sets |
ATPFL: Automatic Trajectory Prediction Model Design under Federated Learning Framework |
Understanding and Increasing Efficiency of Frank-Wolfe Adversarial Training |
Targeted Supervised Contrastive Learning for Long-Tailed Recognition |
Optimizing Elimination Templates by Greedy Parameter Search |
M3T: three-dimensional Medical image classifier using Multi-plane and Multi-slice Transformer |
Projective Manifold Gradient Layer for Deep Rotation Regression |
PUMP: Pyramidal and Uniqueness Matching Priors for Unsupervised Learning of Local Descriptors |
Deep orientation-aware functional maps: Tackling symmetry issues in Shape Matching |
A Versatile Multi-View Framework for LiDAR-based 3D Object Detection with Guidance from Panoptic Segmentation |
Lite-MDETR: A Lightweight Multi-Modal Detector |
Cross Modal Retrieval with Querybank Normalisation |
On Learning Contrastive Representations for Learning with Noisy Labels |
Cross-view transformers for real-time map-view semantic segmentation |
Towards Data-Free Model Stealing in a Hard Label Setting |
The DEVIL is in the Details: A Diagnostic Evaluation Benchmark for Video Inpainting |
Unseen Classes at a Later Time? No Problem |
Channel Balancing for Accurate Quantization of Winograd Convolutions |
Instance masks are what you need: Segmentation parity from object boundaries |
TVConv: Efficient Translation Variant Convolution for Layout-aware Visual Processing |
Scanline Homographies for Rolling-Shutter Plane Absolute Pose |
Dual-Shutter Optical Vibration Sensing |
DoubleField: Bridging the Neural Surface and Radiance Fields for High-fidelity Human Reconstruction and Rendering |
Robust Structured Declarative Classifiers for 3D Point Clouds: Defending Adversarial Attacks with Implicit Gradients |
TubeR: Tubelet Transformer for Video Action Detection |
Data-Free Network Compression via Parametric Non-uniform Mixed Precision Quantization |
Contour-Hugging Heatmaps for Landmark Detection |
Local Attention Pyramid for Scene Image Generation |
Implicit Feature Decoupling with Depthwise Quantization |
InsetGAN for Full-Body Image Generation |
Recurrent Variational Network: A Deep Learning Inverse Problem Solver applied to the task of Accelerated MRI Reconstruction |
Robust Invertible Image Steganography |
Disentangling visual and written concepts in CLIP |
Causal CLIP Fine-tuning for Fashion Product Retrieval |
Accelerating Neural Network Optimization Through an Automated Control Theory Lens |
Comprehending and Ordering Semantics for Image Captioning |
Grounded Language-Image Pre-training |
Hierarchical Self-supervised Representation Learning for Movie Understanding |
RSTT: Real-time Spatial Temporal Transformer for Space-Time Video Super-Resolution |
DirecFormer: A Directed Attention in Transformer Approach to Robust Action Recognition |
Consistency driven Sequential Transformers Attention Model for Partially Observable Scenes |
Paramixer: Parameterizing Mixing Links in Sparse Factors Works Better than Dot-Product Self-Attention |
How Well Do Sparse ImageNet Models Transfer? |
Towards Principled Disentanglement for Domain Generalization |
Task-Adaptive Negative Class Envision for Few-Shot Open-Set Recognition |
Path-CNN: Topology-Aware Centerline Segmentation Using Sparse Annotation |
Image Based Reconstruction of Liquids from 2D Surface Detections |
Neural Convolutional Surfaces |
Graph-context Attention Networks for Size-varied Deep Graph Matching |
Learning to Solve Hard Minimal Problems |
Neural Mesh Simplification |
SPAct: Self-supervised Privacy Preservation for Action Recognition |
Towards Language-free Training for Text-to-Image Generation |
Rep-Net: Efficient On-Device Learning via Feature Reprogramming |
3D-VField: Learning to Adversarially Deform Point Clouds for Robust 3D Object Detection |
TrackFormer: Multi-Object Tracking with Transformers |
Deep 3D-to-2D Watermarking: Embedding Messages in 3D Meshes and Extracting Them from 2D Renderings |
A Comprehensive Study of Image Classification Model Sensitivity to Foregrounds, Backgrounds, and Visual Attributes |
EnvEdit: Environment Editing for Vision-and-Language Navigation |
DeepFace-EMD: Re-ranking using Patch-wise Earth Mover's Distance Improves Out-of-Distribution Face Identification |
Mega-NERF: Scalable Construction of Large-Scale NeRFs for Virtual Fly-Throughs |
MulT: An End-to-End Multitask Learning Transformer |
Mip-NeRF 360: Unbounded Anti-Aliased Neural Radiance Fields |
Leveraging Real Talking Faces via Self-Supervision for Robust Forgery Detection |
Use All The Labels: A Hierarchical Multi-Label Contrastive Learning Framework |
Plenoxels: Radiance Fields without Neural Networks |
Pushing the Limits of Simple Pipelines for Practical Few-Shot Learning |
PONI: Potential Functions for ObjectGoal Navigation with Interaction-free Learning |
CO-SNE: Dimensionality Reduction and Visualization for Hyperbolic Data |
EASE: Unsupervised Discriminant Subspace Learning for Transductive Few-Shot Learning |
3D Photo Stylization: Learning to Generate Stylized Novel Views from a Single Image |
SIMBAR: Single Image-Based Scene Relighting For Effective Data Augmentation For Automated Driving Vision Tasks |
VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks |
VALHALLA: Visual Hallucination for Machine Translation |
Learning Pairwise Affinity for Open-World Instance Segmentation |
CAD: Co-Adapting Discriminative Features for Improved Few-Shot Classification |
Investigating the Impact of Multi-LiDAR Placement on Object Detection for Autonomous Driving |
Hypergraph-Induced Semantic Tuplet Loss for Deep Metric Learning |
Generalized Category Discovery |
Deep Image-based Illumination Harmonization |
Mixed Differential Privacy in Computer Vision |
MUSE-VAE: Multi-Scale VAE for Environment-Aware Long Term Trajectory Prediction |
UTC: A Unified Transformer with Inter-Task Contrastive Learning for Visual Dialog |
Weakly Supervised Rotation-Invariant Aerial Object Detection Network |
Evaluation-oriented Knowledge Distillation for Deep Face Recognition |
Robust Cross-Modal Representation Learning with Progressive Self-Distillation |
Transformer Tracking with Cyclic Shifting Window Attention |
LTP: Lane-based Trajectory Prediction for Autonomous Driving |
Generating 3D Bio-Printable Patches Using Wound Segmentation and Reconstruction to Treat Diabetic Foot Ulcers |
Multi-instance Point Cloud Registration by Efficient Correspondence Clustering |
AdaFocus V2: End-to-End Training of Spatial Dynamic Networks for Video Recognition |
AutoLoss-GMS: Searching Generalized Margin-based Softmax Loss Function for Person Re-identification |
Convolution of Convolution: Let Kernels Spatially Collaborate |
DiffPoseNet: Direct Differentiable Camera Pose Estimation |
Modeling sRGB Camera Noise with Normalizing Flows |
Semantic-shape Adaptive Feature Modulation for Semantic Image Synthesis |
Federated Learning with Position-Aware Neurons |
Symmetry and Uncertainty-Aware Object SLAM for 6DoF Object Pose Estimation |
Point Density-Aware Voxels for LiDAR 3D Object Detection |
A Conservative Approach for Unbiased Learning on Unknown Biases |
The Majority Can Help the Minority: Context-rich Minority Oversampling for Long-tailed Classification |
Symmetry-aware Neural Architecture for Embodied Visual Exploration |
DearKD: Data-Efficient Early Knowledge Distillation for Vision Transformers |
Egocentric Prediction of Action Target in 3D |
What makes transfer learning work for medical images: feature reuse & other factors |
Alignment-Uniformity aware Representation Learning for Zero-shot Video Classification |
Unsupervised Learning of De-biased Representation with Pseudo-bias Attribute |
DECORE: Deep Compression with Reinforcement Learning |
RGB-Depth Fusion GAN for Indoor Depth Completion |
MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound |
Class-Aware Contrastive Semi-Supervised Learning |
Learning to Prompt for Continual Learning |
DEFEAT: Deep Hidden Feature Backdoor Attacks by Imperceptible Perturbation and Latent Representation Constraints |
Self-Supervised Dense Consistency Regularization for Image-to-Image Translation |
Forward Compatible Training for Large-Scale Embedding Retrieval Systems |
Joint Forecasting of Panoptic Segmentations with Difference Attention |
Revisiting the Transferability of Supervised Pretraining: an MLP Perspective |
Disentangling Visual Embeddings for Attributes and Objects |
SeeThroughNet: Resurrection of Auxiliary Loss by Preserving Class Probability Information |
Neural Reflectance for Shape Recovery with Shadow Handling |
Topology-Preserving Shape Reconstruction and Registration via Neural Diffeomorphic Flow |
XYDeblur: Divide and Conquer for Single Image Deblurring |
ScePT: Scene-consistent, Policy-based Trajectory Predictions for Planning |
Visual Acoustic Matching |
Fair Contrastive Learning for Facial Attribute Classification |
Neural Prior for Trajectory Estimation |
AutoMine: An Unmanned Mine Dataset |
SMARTADAPT: Multi-branch Object Detection Framework for Videos on Mobiles |
Neural Face Identification in a 2D Wireframe Projection of a Manifold Object |
AlignMixup: Improving Representations By Interpolating Aligned Features |
Memory-Augmented Non-Local Attention for Video Super-Resolution |
ESCNet: Gaze Target Detection with the Understanding of 3D Scenes |
AdaptPose: Cross-Dataset Adaptation for 3D Human Pose Estimation by Learnable Motion Generation |
Distinguishing Unseen from Seen for Generalized Zero-shot Learning |
When Does Contrastive Visual Representation Learning Work? |
Privacy-preserving Online AutoML for Domain-Specific Face Detection |
Robust outlier detection by de-biasing VAE likelihoods |
GridShift: A Faster Mode-seeking Algorithm for Image Segmentation and Object Tracking |
Continual Learning with Lifelong Vision Transformer |
M2I: From Factored Marginal Trajectory Prediction to Interactive Prediction |
Stochastic Variance Reduced Ensemble Adversarial Attack for Boosting the Adversarial Transferability |
Representing 3D Shapes with Probabilistic Directed Distance Fields |
Restormer: Efficient Transformer for High-Resolution Image Restoration |
Learning with Twin Noisy Labels for Visible-Infrared Person Re-Identification |
Few-shot Learning with Noisy Labels |
Co-Domain Symmetry for Complex-Valued Deep Learning |
Pyramid Architecture for Multi-Scale Processing in Point Cloud Segmentation |
GCR: Gradient Coreset based Replay Buffer Selection for Continual Learning |
Domain Adaptation on Point Clouds via Geometry-Aware Implicits |
Ranking-Based Siamese Visual Tracking |
Coarse-to-Fine Disentangling Transformer for Human-Object Interaction Detection |
MDAN: Multi-level Dependent Attention Network for Visual Emotion Analysis |
AdaSTE: An Adaptive Straight-Through Estimator to Train Binary Neural Networks |
DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation |
DTA: Physical Camouflage Attacks using Differentiable Transformation Network |
Layer-wised Model Aggregation for Personalized Federated Learning |
Video Swin Transformer |
Online Continual Learning on a Contaminated Data Stream with Blurry Task Boundaries |
General Incremental Learning with Domain-aware Categorical Representations |
Crafting Better Contrastive Views for Siamese Representation Learning |
A Style-aware Discriminator for Controllable Image Translation |
BoosterNet: Improving Domain Generalization of Deep Neural Nets using Culpability-Ranked Features |
A Unified Framework for Implicit Sinkhorn Differentiation |
Brain-Supervised Image Editing |
Neural Shape Mating: Self-Supervised Object Assembly with Adversarial Shape Priors |
Multimodal Colored Point Cloud to Image Alignment |
Graph-based Spatial Transformer with Memory Replay for Multi-future Pedestrian Trajectory Prediction |
Multi-Objective Diverse Human Motion Prediction with Knowledge Distillation |
Two Coupled Rejection Metrics Can Tell Adversarial Examples Apart |
Autoregressive Image Generation using Residual Quantization |
SGTR: End-to-end Scene Graph Generation with Transformer |
Protecting Facial Privacy: Generating Adversarial Identity Masks via Style-robust Makeup Transfer |
PPDL: Predicate Probability Distribution based Loss for Unbiased Scene Graph Generation |
Localized Adversarial Domain Generalization |
Patch-level Representation Learning for Self-supervised Vision Transformers |
KNN Local Attention for Image Restoration |
Overcoming Catastrophic Forgetting in Incremental Object Detection via Elastic Response Distillation |
PILC: Practical Image Lossless Compression with an End-to-end GPU Oriented Neural Framework |
DAD-3DHeads: A Large-scale Dense, Accurate and Diverse Dataset for 3D Dense Head Alignment from a Single Image |
Is Mapping Necessary for Realistic PointGoal Navigation? |
Cross-Domain Correlation Distillation for Unsupervised Domain Adaptation in Nighttime Semantic Segmentation |
LiT: Zero-Shot Transfer with Locked-image text Tuning |
Scaling Vision Transformers |
Spatial Commonsense Graph for Object Localisation in Partial Scenes |
Trajectory Optimization for Physics-Based Reconstruction of 3d Human Pose from Monocular Video |
3MASSIV: Multilingual, Multimodal and Multi-Aspect dataset of Social Media Short Videos |
Upright-Net: Learning Upright Orientation for 3D Point Cloud |
DAIR-V2X: A Large-Scale Dataset for Vehicle-Infrastructure Cooperative 3D Object Detection |
Differentiable Dynamics for Articulated 3d Human Motion Reconstruction |
Clean Implicit 3D Structure from Noisy 2D STEM Images |
MPC: Multi-view Probabilistic Clustering |
Node-aligned Graph Convolutional Network for Whole-slide Image Representation and Classification |
Multidimensional Belief Quantification for Label-Efficient Meta-Learning |
Bayesian Nonparametric Submodular Video Partition for Robust Anomaly Detection |
Uni6D: A Unified CNN Framework without Projection Breakdown in 6D Pose Estimation |
Exploring Patch-wise Semantic Relation for Contrastive Learning in Image-to-Image Translation Tasks |
Enabling Equivariance for Arbitrary Lie Groups |
Multi-Scale Memory-Based Video Deblurring |
Privacy Preserving Partial Localization |
Towards Robust and Reproducible Active Learning using Neural Networks |
Marginal Contrastive Correspondence for Exemplar-based Image Translation |
TransRAC: Encoding Multi-scale Temporal Correlation with Transformers for Repeated Action Counting |
Training High-Performance Low-Latency Spiking Neural Networks by Differentiation on Spike Representation |
FaceFormer: Speech-Driven 3D Facial Animation with Transformers |
LARGE: Latent-Based Regression Through GAN Semantics |
TransVPR: Transformer-Based Place Recognition with Multi-Level Attention Aggregation |
AR-NeRF: Unsupervised Learning of Depth and Defocus Effects from Natural Images with Aperture Rendering Neural Radiance Fields |
CAT-Det: Contrastively Augmented Transformer for Multi-modal 3D Object Detection |
SASIC: Stereo Image Compression with Latent Shifts and Stereo Attention |
Controllable Animation of Fluid Elements in Still Images |
Revisiting BatchNorm's Learnable Affines in Few-Shot Transfer Learning |
Learning Graph Regularisation for Guided Super-Resolution |
Topology Preserving Local Road Network Estimation from Single Onboard Camera Image |
Video-Text Representation Learning via Differentiable Weak Temporal Alignment |
BppAttack: Stealthy and Efficient Trojan Attacks against Deep Neural Networks via Image Quantization and Contrastive Adversarial Learning |
Face2Exp: Combating Data Biases for Facial Expression Recognition |
Leveraging Equivariant Features for Absolute Pose Regression |
Self-Supervised Transformers for Unsupervised Object Discovery using Normalized Cut |
Multi-View Depth Estimation by Fusing Single-View Depth Probability with Multi-View Geometry |
ZZ-Net: A Universal Rotation Equivariant Architecture for 2D Point Clouds |
Interactive Disentanglement: Learning Concepts by Interacting with their Prototype Representations |
Incremental Learning in Semantic Segmentation from Image Labels |
Complex Backdoor Detection by Symmetric Feature Differencing |
Constrained Few-shot Class-incremental Learning |
HyperSegNAS: Bridging One-Shot Neural Architecture Search with 3D Medical Image Segmentation using HyperNet |
Amodal Panoptic Segmentation |
Not Just Selection, but Exploration: Online Class-Incremental Continual Learning via Dual View Consistency |
Coarse-to-Fine Q-attention: Efficient Learning for Visual Robotic Manipulation via Discretisation |
Learning ABCs: Approximate Bijective Correspondence for isolating factors of variation |
Pin the Memory: Learning to Generalize Semantic Segmentation |
Long-tailed Visual Recognition via Gaussian Clouded Logit Adjustment |
Knowledge distillation: A good teacher is patient and consistent |
Audio-visual Generalised Zero-shot Learning with Cross-modal Attention and Language |
Searching the Deployable Convolution Neural Networks for GPUs |
MLP-3D: A MLP-like 3D Architecture with Grouped Time Mixing |
Condensing CNNs with Partial Differential Equations |
Adaptive Early-Learning Correction for Segmentation from Noisy Annotations |
Bounded Adversarial Attack on Deep Content Features |
Towards Driving-Oriented Metric for Lane Detection Models |
Give Me Your Attention: Dot-Product Attention Considered Harmful for Adversarial Patch Robustness |
Better Trigger Inversion Optimization in Backdoor Scanning |
Leveling Down in Computer Vision: Pareto Inefficiencies in Fair Deep Classifiers |
Towards Understanding and Simplifying MoCo: Dual Temperature Helps Contrastive Learning without Many Negative Samples |
Smooth Maximum Unit: Smooth Activation Function for Deep Networks using Smoothing Maximum Technique |
Text-to-Image Synthesis based on Object-Guided Joint-Decoding Transformer |
Image Segmentation Using Text and Image Prompts |
Uncertainty-Aware Adaptation for Self-Supervised 3D Human Pose Estimation |
Vision-Language Pre-Training with Triple Contrastive Learning |
Temporal Context Matters: Enhancing Single Image Prediction with Disease Progression Representations |
Globetrotter: Connecting Languages by Connecting Images |
Single-Stage 3D Geometry-Preserving Depth Estimation Model Training on Dataset Mixtures with Uncalibrated Stereo Data |
It’s Time for Artistic Correspondence in Music and Video |
Equivariant Point Set Analysis via Learning Orientations for Message Passing |
KeyTr: Keypoint Transporter for 3D Reconstruction of Deformable Objects in Videos |
P3IV: Probabilistic Procedure Planning from Instructional Videos with Weak Supervision |
GlideNet: Global, Local and Intrinsic based Dense Embedding NETwork for Multi-category Attributes Prediction |
MatchFAME: Fast, Accurate and Memory-Efficient Multi-Object Matching |
Neural Emotion Director: Speech-preserving semantic control of facial expressions in “in-the-wild” videos |
Id-Free Person Similarity Learning |
Alleviating Emotional bias in Affective Image Captioning by Contrastive Data Collection |
A study on the distribution of social biases in self-supervised learning visual models |
Motron: Multimodal Probabilistic Human Motion Forecasting |
Gaussian Process Modeling of Approximate Inference Errors for Variational Autoencoders |
Real-time hyperspectral imaging in hardware via trained metasurface encoders |
SmartPortraits: Depth Powered Handheld Smartphone Dataset of Human Portraits for State Estimation, Reconstruction and Synthesis |
Improving Segmentation of the Inferior Alveolar Nerve through Deep Label Propagation |
SLIC: Self-Supervised Learning with Iterative Clustering for Human Action Videos |
Self-supervised Spatial Reasoning on Multi-View Line Drawings |
Contrastive Test-Time Adaptation |
Why Discard if You can Recycle?: A Recycling Max Pooling Module for 3D Point Cloud Analysis |
Do learned representations respect causal relationships? |
Zero-Query Transfer Attacks on Context-Aware Object Detectors |
Training Quantised Neural Networks with STE Variants: the Additive Noise Annealing Algorithm |
Contrastive Dual Gating: Learning Sparse Features With Contrastive Learning |
Efficient Maximal Coding Rate Reduction by Variational Forms |
Everything at Once - Multi-modal Fusion Transformer for Video Retrieval |
Towards Efficient and Scalable Sharpness-Aware Minimization |
X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval |
Merry Go Round: Rotate a Frame and Fool a DNN |
Label-Only Model Inversion Attacks via Boundary Repulsion |
Style-Structure Disentangled Features and Normalizing Flows for Diverse Icon Colorization |
How Much More Data Do I Need? Estimating Requirements For Downstream Tasks |
A sampling-based approach for efficient clustering in large datasets |
Deep Equilibrium Optical Flow Estimation |
Polarity Sampling: Quality and Diversity Control of Pre-Trained Generative Networks via Singular Values |
Multi-label Iterated Learning for Image Classification with Label Ambiguity |
Cross-modal Map Learning for Vision and Language Navigation |
Learning with Neighbor Consistency for Noisy Labels |
Measuring Compositional Consistency for Video Question Answering |
Failure Modes of Domain Generalization Algorithms |
AutoRF: Learning 3D Object Radiance Fields from Single View Observations |
A Unified Model for Line Projections in Catadioptric Cameras |
OrphicX: A Causality-Inspired Latent Variable Model for Interpreting Graph Neural Networks |
Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning |
Cluster-guided Image Synthesis with Unconditional Models |
Self-supervised object detection from audio-visual correspondence |
Clipped Hyperbolic Classifiers Are Super-Hyperbolic Classifiers |
Local Learning Matters: Rethinking Data Heterogeneity in Federated Learning |
Weakly-Supervised Generation and Grounding of Visual Descriptions with Conditional Generative Models |
How much does input data type impact final face model accuracy? |
Certified Patch Robustness via Smoothed Vision Transformers |
PubTables-1M: Towards comprehensive table extraction from unstructured documents |
Fine-tuning Image Transformers using Learnable Memory |
GuideFormer: Transformers for Image Guided Depth Completion |
Motion-Adjustable Neural Implicit Video Representation |
LiDARCap: Long-range Marker-less 3D Human Motion Capture with LiDAR Point Clouds |
Multi-modal Alignment using Representation Codebook |
NOC-REK: Novel Object Captioning with Retrieved Vocabulary from External Knowledge |
Investigating Top-$k$ White-Box and Transferable Black-box Attack |
GPU-Based Homotopy Continuation for Minimal Problems in Computer Vision |
On the Instability of Relative Pose Estimation and RANSAC’s Role |
Dual Task Learning by Leveraging Both Dense Correspondence and Mis-Correspondence for Robust Change Detection With Imperfect Matches |
M3L: Language-based Video Editing via Multi-Modal Multi-Level Transformers |
Dynamic Scene Graph Generation via Anticipatory Pre-training |
ScanQA: 3D Question Answering for Spatial Scene Understanding |
PixMix: Dreamlike Pictures Comprehensively Improve Safety Measures |
Large Images as Long Documents: Hierarchical ViTs with Self-Supervised Pretraining in Gigapixel Image Pyramids |
Consistency Learning via Decoding Path Augmentation for Transformers in Human Object Interaction Detection |
On Guiding Visual Attention with Language Specification |
OnePose: One-Shot Object Pose Estimation without CAD Models |
Thin-Plate Spline Motion Model for Image Animation |
PokeBNN: A Binary Pursuit of Lightweight Accuracy |
Semi-Supervised Few-shot Learning via Multi-Factor Clustering |
FashionVLP: Vision Language Transformer for Fashion Retrieval with Feedback |
CLIPstyler: Image Style Transfer with a Single Text Condition |
Ithaca365: Dataset and Driving Perception under Repeated and Challenging Weather Conditions |
Out-of-distribution Generalization with Causal Invariant Transformations |
Zero-Shot Text-Guided Object Generation with Dream Fields |
Noise Distribution Adaptive Self-Supervised Image Denoising using Tweedie Distribution and Score Matching |
TransGeo: Transformer Is All You Need for Cross-view Image Geo-localization |
NICGSlowDown: Evaluating the Efficiency Robustness of Neural Image Caption Generation Models |
Deep Unlearning via Randomized Conditionally Independent Hessians |
Multi-Modal Dynamic Graph Transformer for Visual Grounding |
Propagation Regularizer for Semi-supervised Learning with Extremely Scarce Labeled Samples |
Discrete Wasserstein Distributional Matching for Quantization in Image Hashing |
Robust fine-tuning of zero-shot models |
Probabilistic Representations for Video Contrastive Learning |
Come-Closer-Diffuse-Faster: Accelerating Conditional Diffusion Models for Inverse Problems through Stochastic Contraction |
Fine-Grained Object Classification via Self-Supervised Pose Alignment |
One Step at a Time: Long-Horizon Vision-and-Language Navigation with Milestones |
A Framework for Learning Ante-hoc Explainable Models via Concepts |
Retrieval Augmented Classification for Long Tail Visual Recognition |
Deep Spectral Methods: A Surprisingly Strong Baseline for Unsupervised Semantic Segmentation and Localization |
Learning Video Representations of Human Motion from Synthetic Data |
Exploiting Pseudo Labels in a Self-Supervised Learning Framework for Improved Monocular Depth Estimation |
Efficient Deep Embedded Subspace Clustering |
Local-Adaptive Face Recognition via Graph-based Meta-Clustering and Regularized Adaptation |
GenDR: A Generalized Differentiable Renderer |
Fingerprinting Deep Neural Networks Globally via Universal Adversarial Perturbations |
Learning Multiple Adverse Weather Removal via Two-stage Knowledge Learning and Multi-contrastive Regularization: Toward a Unified Model |