A curated list of prompt/adapter learning methods for vision-language models (e.g., CLIP).
Use text-based prompts/adapters.
Use image-based prompts/adapters.
Use text- and image-based prompts/adapters.
Base-to-Novel Generalization. (ViT-B/16 CLIP)
Methods | Pub | Base | Novel | HM (main) | Code |
---|---|---|---|---|---|
CLIP | ICML 21 | 69.34 | 74.22 | 71.70 | Link |
CoOp | IJCV 22 | 82.69 | 63.22 | 71.66 | Link |
CoCoOp | CVPR 22 | 80.47 | 71.69 | 75.83 | Link |
ProDA | CVPR 22 | 81.56 | 72.30 | 76.65 | Link |
KgCoOp | CVPR 23 | 80.73 | 73.60 | 77.00 | Link |
RPO | ICCV 23 | 81.13 | 75.00 | 77.78 | Link |
MaPLe | CVPR 23 | 82.28 | 75.14 | 78.55 | Link |
DePT | CVPR 24 | 83.62 | 75.04 | 79.10 | Link |
TCP | CVPR 24 | 84.13 | 75.36 | 79.51 | Link |
MMA | CVPR 24 | 83.20 | 76.80 | 79.87 | Link |
PromptSRC | ICCV 23 | 84.26 | 76.10 | 79.97 | Link |
HPT | AAAI 24 | 84.32 | 76.86 | 80.23 | Link |
CoPrompt | ICLR 24 | 84.00 | 77.23 | 80.48 | Link |
CasPL | ECCV 24 | 86.11 | 79.54 | 82.69 | Link |
PromptKD | CVPR 24 | 86.96 | 80.73 | 83.73 | Link |
Table 1. Average results on 11 datasets. (Only works with open-source code are listed.)
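The HM column in Table 1 is consistent with the harmonic mean of the Base and Novel accuracies. The following is a minimal sketch of that arithmetic, checked against the CLIP and CoOp rows; the helper name `harmonic_mean` is an illustrative assumption and does not come from any of the listed codebases.

```python
def harmonic_mean(base: float, novel: float) -> float:
    """Harmonic mean of base-class and novel-class accuracy (both in percent)."""
    return 2 * base * novel / (base + novel)


# Base/Novel values copied from the CLIP and CoOp rows of Table 1.
clip_hm = harmonic_mean(69.34, 74.22)
coop_hm = harmonic_mean(82.69, 63.22)

print(f"CLIP HM: {clip_hm:.2f}")  # CLIP HM: 71.70
print(f"CoOp HM: {coop_hm:.2f}")  # CoOp HM: 71.66
```

Because the harmonic mean is dominated by the smaller of the two values, CoOp's large gain on base classes does not offset its drop on novel classes, which is why its HM ends up slightly below zero-shot CLIP.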
CoOp
Learning to Prompt for Vision-Language Models. IJCV 2022.
CoCoOp
Conditional Prompt Learning for Vision-Language Models. CVPR 2022.
ProDA
Prompt Distribution Learning. CVPR 2022.
VPT
Visual Prompt Tuning. ECCV 2022.
VP
Exploring Visual Prompts for Adapting Large-Scale Models. Arxiv 2022.
MaPLe
MaPLe: Multi-modal Prompt Learning. CVPR 2023.
KgCoOp
Visual-Language Prompt Tuning with Knowledge-guided Context Optimization. CVPR 2023.
LASP
LASP: Text-to-Text Optimization for Language-Aware Soft Prompting of Vision & Language Models. CVPR 2023.
DAM-VP
Diversity-Aware Meta Visual Prompting. CVPR 2023.
TaskRes
Task Residual for Tuning Vision-Language Models. CVPR 2023.
RPO
Read-only Prompt Optimization for Vision-Language Few-shot Learning. ICCV 2023.
KAPT
Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models. ICCV 2023.
CuPL
What does a platypus look like? Generating customized prompts for zero-shot image classification. ICCV 2023.
ProGrad
Prompt-aligned Gradient for Prompt Tuning. ICCV 2023.
PromptSRC
Self-regulating Prompts: Foundational Model Adaptation without Forgetting. ICCV 2023.
DeFo
Learning to Decompose Visual Features with Latent Textual Prompts. ICLR 2023.
PLOT
PLOT: Prompt Learning with Optimal Transport for Vision-Language Models. ICLR 2023.
POMP
Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition. NeurIPS 2023.
MetaPrompt
Learning Domain Invariant Prompt for Vision-Language Models. TIP 2024.
ProVP
Progressive Visual Prompt Learning with Contrastive Feature Re-formation. IJCV 2024.
SA2VP
SA2VP: Spatially Aligned-and-Adapted Visual Prompt. AAAI 2024.
HPT
Learning Hierarchical Prompt with Structured Linguistic Knowledge for Vision-Language Models. AAAI 2024.
LaViP
LaViP: Language-Grounded Visual Prompts. AAAI 2024.
CoPrompt
Consistency-guided Prompt Learning for Vision-Language Models. ICLR 2024.
ProText
Learning to Prompt with Text Only Supervision for Vision-Language Models. Arxiv 2024.
PromptKD
PromptKD: Unsupervised Prompt Distillation for Vision Language Models. CVPR 2024.
DePT
DePT: Decoupled Prompt Tuning. CVPR 2024.
ArGue
ArGue: Attribute-Guided Prompt Tuning for Vision-Language Models. CVPR 2024.
TCP
TCP: Textual-based Class-aware Prompt tuning for Visual-Language Model. CVPR 2024.
MMA
MMA: Multi-Modal Adapter for Vision-Language Models. CVPR 2024.
KDPL
Improving Zero-shot Generalization of Learned Prompts via Unsupervised Knowledge Distillation. ECCV 2024.
CoCoLe
Conceptual Codebook Learning for Vision-Language Models. ECCV 2024.
CasPL
Cascade Prompt Learning for Vision-Language Model Adaptation. ECCV 2024.
AWT
AWT: Transferring Vision-Language Models via Augmentation, Weighting, and Transportation. NeurIPS 2024.
CPT
CPT: Colorful Prompt Tuning for pre-trained vision-language models. Arxiv 2021.
DetPro
Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model. CVPR 2022.
PromptDet
PromptDet: Towards Open-vocabulary Detection using Uncurated Images. ECCV 2022.
OVSeg
Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP. CVPR 2023.
LoGoPrompt
LoGoPrompt: Synthetic Text Images Can Be Good Visual Prompts for Vision-Language Models. ICCV 2023.
RedCircle
What does CLIP know about a red circle? Visual prompt engineering for VLMs. ICCV 2023.
FGVP
Fine-Grained Visual Prompting. NeurIPS 2023.
SoM
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V. Arxiv 2023.
Alpha-CLIP
Alpha-CLIP: A CLIP Model Focusing on Wherever You Want. CVPR 2024.
ViP-LLaVA
Making Large Multimodal Models Understand Arbitrary Visual Prompts. CVPR 2024.
SSC
Segment, Select, Correct: A Framework for Weakly-Supervised Referring Segmentation. ECCV 2024.
Methods | Pub | ImageNet | -A | -V2 | -R | -S | Avg. (main) | Code |
---|---|---|---|---|---|---|---|---|
CoOp | IJCV 22 | 71.51 | 49.71 | 64.20 | 75.21 | 47.99 | 59.28 | Link |
CoCoOp | CVPR 22 | 71.02 | 50.63 | 64.07 | 76.18 | 48.75 | 59.91 | Link |
TPT | NeurIPS 22 | 68.98 | 54.77 | 63.45 | 77.06 | 47.94 | 60.81 | Link |
TPT+CoOp | NeurIPS 22 | 73.61 | 57.95 | 66.83 | 77.27 | 49.29 | 62.84 | Link |
PromptAlign | NeurIPS 23 | --- | 59.37 | 65.29 | 79.33 | 50.23 | 63.55 | Link |
TPS+CoOp | Arxiv 24 | 73.73 | 60.49 | 66.84 | 77.44 | 49.08 | 65.52 | Link |
RLCF | ICLR 24 | 73.23 | 65.45 | 69.77 | 83.35 | 54.74 | 68.33 | Link |
RLCF+CoOp | ICLR 24 | 76.05 | 69.74 | 70.62 | 84.51 | 56.49 | 70.34 | Link |
Table 2. Test-time prompt tuning methods on OOD data.
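For rows such as CoOp and RLCF, the Avg. column in Table 2 matches the plain mean over the four out-of-distribution ImageNet variants (-A, -V2, -R, -S), with the source ImageNet accuracy left out. The sketch below reproduces that arithmetic; the helper name `ood_average` is a hypothetical name chosen for illustration.

```python
def ood_average(a: float, v2: float, r: float, s: float) -> float:
    """Mean accuracy over the four OOD ImageNet variants (source ImageNet excluded)."""
    return (a + v2 + r + s) / 4


# -A, -V2, -R, -S values copied from the CoOp and RLCF rows of Table 2.
coop_avg = ood_average(49.71, 64.20, 75.21, 47.99)
rlcf_avg = ood_average(65.45, 69.77, 83.35, 54.74)

print(f"CoOp OOD avg: {coop_avg:.2f}")  # CoOp OOD avg: 59.28
print(f"RLCF OOD avg: {rlcf_avg:.2f}")  # RLCF OOD avg: 68.33
```

When comparing a new method against this table, it is worth checking whether its reported average includes the source ImageNet column, since that changes the number by several points.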
TPT
Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models. NeurIPS 2022.
SwapPrompt
SwapPrompt: Test-Time Prompt Adaptation for Vision-Language Models. NeurIPS 2023.
PromptAlign
Align Your Prompts: Test-Time Prompting with Distribution Alignment for Zero-Shot Generalization. NeurIPS 2023.
TPS
Just Shift It: Test-Time Prototype Shifting for Zero-Shot Generalization with Vision-Language Models. Arxiv 2024.
RLCF
Test-time Adaptation with CLIP reward for zero-shot generalization in Vision-Language Models. ICLR 2024.
InTTA
Invariant Test-Time Adaptation for Vision-Language Model Generalization. Arxiv 2024.
CLIP-Adapter
CLIP-Adapter: Better Vision-Language Models with Feature Adapters. Arxiv 2021.
Tip-Adapter
Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification. ECCV 2022.
APE
Not All Features Matter: Enhancing Few-shot CLIP with Adaptive Prior Refinement. ICCV 2023.
CaFo
Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners. CVPR 2023.
Meta-Adapter
Meta-Adapter: An Online Few-shot Learner for Vision-Language Model. NeurIPS 2023.
Efficient-Prompt
Prompting visual-language models for efficient video understanding. ECCV 2022.
X-CLIP
Expanding Language-Image Pretrained Models for General Video Recognition. ECCV 2022.
RePro
Compositional Prompt Tuning with Motion Cues for Open-vocabulary Video Relation Detection. ICLR 2023.
L2P
Learning to Prompt for Continual Learning. CVPR 2022.
DualPrompt
DualPrompt: Complementary Prompting for Rehearsal-free Continual Learning. ECCV 2022.
EvoPrompt
Evolving Parameterized Prompt Memory for Continual Learning. AAAI 2024.
CPrompt
Consistent Prompting for Rehearsal-Free Continual Learning. CVPR 2024.
DIKI
Mind the Interference: Retaining Pre-trained Knowledge in Parameter Efficient Continual Learning of Vision-Language Models. ECCV 2024.