Awesome Image Captioning
A curated list of resources on image captioning and related areas. :-)
Contributing
Please feel free to send me pull requests or email me (chihung.chan@outlook.com) to add links.
Markdown format:
- [Paper Name](link) - Author 1 et al, `Conference Year`. [[code]](link)
Change Log
- May 25 An up-to-date paper list about vision-and-language pre-training is available here.
Table of Contents
Papers
Survey
2015
CVPR 2015
- Show and Tell: A Neural Image Caption Generator - Vinyals O et al, `CVPR 2015`. [code] [code]
- Deep Visual-Semantic Alignments for Generating Image Descriptions - Karpathy A et al, `CVPR 2015`. [project web] [code]
- Mind’s Eye: A Recurrent Visual Representation for Image Caption Generation - Chen X et al, `CVPR 2015`.
- Long-term Recurrent Convolutional Networks for Visual Recognition and Description - Donahue J et al, `CVPR 2015`. [code] [project web]
ICCV 2015
- Guiding the Long-Short Term Memory Model for Image Caption Generation - Jia X et al, `ICCV 2015`.
- Learning like a Child: Fast Novel Visual Concept Learning from Sentence Descriptions of Images - Mao J et al, `ICCV 2015`. [code]
NIPS 2015
- Expressing an Image Stream with a Sequence of Natural Sentences - Park C C et al, `NIPS 2015`. [code]
ICML 2015
- Show, Attend and Tell: Neural Image Caption Generation with Visual Attention - Xu K et al, `ICML 2015`. [project] [code] [code]
arXiv preprint 2015
- Order-Embeddings of Images and Language - Vendrov I et al, `arXiv preprint 2015`. [code]
- Generating Images from Captions with Attention - Mansimov E et al, `arXiv preprint 2015`. [code]
- Learning FRAME Models Using CNN Filters for Knowledge Visualization - Lu Y et al, `arXiv preprint 2015`. [code]
- Aligning where to see and what to tell: image caption with region-based attention and scene factorization - Jin J et al, `arXiv preprint 2015`.
2016
CVPR 2016
- Image captioning with semantic attention - You Q et al, `CVPR 2016`. [code]
- DenseCap: Fully Convolutional Localization Networks for Dense Captioning - Johnson J et al, `CVPR 2016`. [code]
- What value do explicit high level concepts have in vision to language problems? - Wu Q et al, `CVPR 2016`.
- Deep Compositional Captioning: Describing Novel Object Categories without Paired Training Data - Hendricks L A et al, `CVPR 2016`. [code]
ECCV 2016
- SPICE: Semantic Propositional Image Caption Evaluation - Anderson P et al, `ECCV 2016`. [code]
ACMMM 2016
- Image Captioning with Deep Bidirectional LSTMs - Wang C et al, `ACMMM 2016`. [code]
ACL 2016
- Multimodal Pivots for Image Caption Translation - Hitschler J et al, `ACL 2016`.
arXiv preprint 2016
- Image Caption Generation with Text-Conditional Semantic Attention - Zhou L et al, `arXiv preprint 2016`. [code]
- DeepDiary: Automatic Caption Generation for Lifelogging Image Streams - Fan C et al, `arXiv preprint 2016`.
- Learning to generalize to new compositions in image understanding - Atzmon Y et al, `arXiv preprint 2016`.
- Generating captions without looking beyond objects - Heuer H et al, `arXiv preprint 2016`.
- Bootstrap, Review, Decode: Using Out-of-Domain Textual Data to Improve Image Captioning - Chen W et al, `arXiv preprint 2016`. [code]
- Recurrent Image Captioner: Describing Images with Spatial-Invariant Transformation and Attention Filtering - Liu H et al, `arXiv preprint 2016`.
- Recurrent Highway Networks with Language CNN for Image Captioning - Gu J et al, `arXiv preprint 2016`.
2017
CVPR 2017
- Captioning Images with Diverse Objects - Venugopalan S et al, `CVPR 2017`. [code]
- Top-down Visual Saliency Guided by Captions - Ramanishka V et al, `CVPR 2017`. [code]
- Self-Critical Sequence Training for Image Captioning - Rennie S J et al, `CVPR 2017`. [code]
- Dense Captioning with Joint Inference and Visual Context - Yang L et al, `CVPR 2017`. [code]
- Skeleton Key: Image Captioning by Skeleton-Attribute Decomposition - Wang Y et al, `CVPR 2017`. [code]
- A Hierarchical Approach for Generating Descriptive Image Paragraphs - Krause J et al, `CVPR 2017`. [code]
- Deep Reinforcement Learning-based Image Captioning with Embedding Reward - Ren Z et al, `CVPR 2017`.
- Incorporating Copying Mechanism in Image Captioning for Learning Novel Objects - Yao T et al, `CVPR 2017`.
- Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning - Lu J et al, `CVPR 2017`. [code]
- Attend to You: Personalized Image Captioning with Context Sequence Memory Networks - Park C C et al, `CVPR 2017`. [code]
- SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning - Chen L et al, `CVPR 2017`. [code]
- Bidirectional Beam Search: Forward-Backward Inference in Neural Sequence Models for Fill-In-The-Blank Image Captioning - Sun Q et al, `CVPR 2017`.
ICCV 2017
- Areas of Attention for Image Captioning - Pedersoli M et al, `ICCV 2017`.
- Boosting Image Captioning with Attributes - Yao T et al, `ICCV 2017`.
- An Empirical Study of Language CNN for Image Captioning - Gu J et al, `ICCV 2017`.
- Improved Image Captioning via Policy Gradient Optimization of SPIDEr - Liu S et al, `ICCV 2017`.
- Towards Diverse and Natural Image Descriptions via a Conditional GAN - Dai B et al, `ICCV 2017`. [code]
- Paying Attention to Descriptions Generated by Image Captioning Models - Tavakoli H R et al, `ICCV 2017`.
- Show, Adapt and Tell: Adversarial Training of Cross-domain Image Captioner - Chen T H et al, `ICCV 2017`. [code]
AAAI 2017
- Image Caption with Global-Local Attention - Li L et al, `AAAI 2017`.
- Reference Based LSTM for Image Captioning - Chen M et al, `AAAI 2017`.
- Attention Correctness in Neural Image Captioning - Liu C et al, `AAAI 2017`.
- Text-guided Attention Model for Image Captioning - Mun J et al, `AAAI 2017`. [code]
NIPS 2017
- Contrastive Learning for Image Captioning - Dai B et al, `NIPS 2017`. [code]
TPAMI 2017
- Show and Tell: Lessons learned from the 2015 MSCOCO Image Captioning Challenge - Vinyals O et al, `TPAMI 2017`. [code]
arXiv preprint 2017
- MAT: A Multimodal Attentive Translator for Image Captioning - Liu C et al, `arXiv preprint 2017`.
- Actor-Critic Sequence Training for Image Captioning - Zhang L et al, `arXiv preprint 2017`.
- What is the Role of Recurrent Neural Networks (RNNs) in an Image Caption Generator? - Tanti M et al, `arXiv preprint 2017`. [code]
- Self-Guiding Multimodal LSTM - when we do not have a perfect training dataset for image captioning - Xian Y et al, `arXiv preprint 2017`.
- Phrase-based Image Captioning with Hierarchical LSTM Model - Tan Y H et al, `arXiv preprint 2017`.
- Show-and-Fool: Crafting Adversarial Examples for Neural Image Captioning - Chen H et al, `arXiv preprint 2017`.
2018
CVPR 2018
- Neural Baby Talk - Lu J et al, `CVPR 2018`. [code]
- Convolutional Image Captioning - Aneja J et al, `CVPR 2018`.
- Learning to Evaluate Image Captioning - Cui Y et al, `CVPR 2018`. [code]
- Discriminability Objective for Training Descriptive Captions - Luo R et al, `CVPR 2018`. [code]
- SemStyle: Learning to Generate Stylised Image Captions using Unaligned Text - Mathews A et al, `CVPR 2018`.
- Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering - Anderson P et al, `CVPR 2018`. [code]
- GroupCap: Group-Based Image Captioning With Structured Relevance and Diversity Constraints - Chen F et al, `CVPR 2018`.
ECCV 2018
- Unpaired Image Captioning by Language Pivoting - Gu J et al, `ECCV 2018`.
- Recurrent Fusion Network for Image Captioning - Jiang W et al, `ECCV 2018`.
- Exploring Visual Relationship for Image Captioning - Yao T et al, `ECCV 2018`.
- Rethinking the Form of Latent States in Image Captioning - Dai B et al, `ECCV 2018`. [code]
- Boosted Attention: Leveraging Human Attention for Image Captioning - Chen S et al, `ECCV 2018`.
- "Factual" or "Emotional": Stylized Image Captioning with Adaptive Learning and Attention - Chen T et al, `ECCV 2018`.
AAAI 2018
- Learning to Guide Decoding for Image Captioning - Jiang W et al, `AAAI 2018`.
- Stack-Captioning: Coarse-to-Fine Learning for Image Captioning - Gu J et al, `AAAI 2018`. [code]
- Temporal-difference Learning with Sampling Baseline for Image Captioning - Chen H et al, `AAAI 2018`.
NeurIPS 2018
- Partially-Supervised Image Captioning - Anderson P et al, `NeurIPS 2018`.
- A Neural Compositional Paradigm for Image Captioning - Dai B et al, `NeurIPS 2018`.
NAACL 2018
- Defoiling Foiled Image Captions - Wang J et al, `NAACL 2018`.
- Punny Captions: Witty Wordplay in Image Descriptions - Chandrasekaran A et al, `NAACL 2018`. [code]
- Object Counts! Bringing Explicit Detections Back into Image Captioning - Wang J et al, `NAACL 2018`.
ACL 2018
- Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning - Sharma P et al, `ACL 2018`. [code]
- Attacking visual language grounding with adversarial examples: A case study on neural image captioning - Chen H et al, `ACL 2018`. [code]
EMNLP 2018
- simNet: Stepwise Image-Topic Merging Network for Generating Detailed and Comprehensive Image Captions - Liu et al, `EMNLP 2018`. [code]
arXiv preprint 2018
- Improved Image Captioning with Adversarial Semantic Alignment - Melnyk I et al, `arXiv preprint 2018`.
- Improving Image Captioning with Conditional Generative Adversarial Nets - Chen C et al, `arXiv preprint 2018`.
- CNN+CNN: Convolutional Decoders for Image Captioning - Wang Q et al, `arXiv preprint 2018`.
- Diverse and Controllable Image Captioning with Part-of-Speech Guidance - Deshpande A et al, `arXiv preprint 2018`.
2019
CVPR 2019
- Unsupervised Image Captioning - Yang F et al, `CVPR 2019`. [code]
- Engaging Image Captioning Via Personality - Shuster K et al, `CVPR 2019`.
- Pointing Novel Objects in Image Captioning - Li Y et al, `CVPR 2019`.
- Auto-Encoding Scene Graphs for Image Captioning - Yang X et al, `CVPR 2019`.
- Context and Attribute Grounded Dense Captioning - Yin G et al, `CVPR 2019`.
- Look Back and Predict Forward in Image Captioning - Qin Y et al, `CVPR 2019`.
- Self-critical n-step Training for Image Captioning - Gao J et al, `CVPR 2019`.
- Intention Oriented Image Captions with Guiding Objects - Zheng Y et al, `CVPR 2019`.
- Describing like humans: on diversity in image captioning - Wang Q et al, `CVPR 2019`.
- Adversarial Semantic Alignment for Improved Image Captions - Dognin P et al, `CVPR 2019`.
- MSCap: Multi-Style Image Captioning With Unpaired Stylized Text - Gao L et al, `CVPR 2019`.
- Fast, Diverse and Accurate Image Captioning Guided By Part-of-Speech - Deshpande A et al, `CVPR 2019`.
- Good News, Everyone! Context driven entity-aware captioning for news images - Biten A F et al, `CVPR 2019`. [code]
- CapSal: Leveraging Captioning to Boost Semantics for Salient Object Detection - Zhang L et al, `CVPR 2019`. [code]
- Dense Relational Captioning: Triple-Stream Networks for Relationship-Based Captioning - Kim D et al, `CVPR 2019`. [code]
- Show, Control and Tell: A Framework for Generating Controllable and Grounded Captions - Cornia M et al, `CVPR 2019`. [code]
- Exact Adversarial Attack to Image Captioning via Structured Output Learning With Latent Variables - Xu Y et al, `CVPR 2019`.
AAAI 2019
- Meta Learning for Image Captioning - Li N et al, `AAAI 2019`.
- Learning Object Context for Dense Captioning - Li X et al, `AAAI 2019`.
- Hierarchical Attention Network for Image Captioning - Wang W et al, `AAAI 2019`.
- Deliberate Residual based Attention Network for Image Captioning - Gao L et al, `AAAI 2019`.
- Improving Image Captioning with Conditional Generative Adversarial Nets - Chen C et al, `AAAI 2019`.
- Connecting Language to Images: A Progressive Attention-Guided Network for Simultaneous Image Captioning and Language Grounding - Song L et al, `AAAI 2019`.
ACL 2019
- Dense Procedure Captioning in Narrated Instructional Videos - Shi B et al, `ACL 2019`.
- Informative Image Captioning with External Sources of Information - Zhao S et al, `ACL 2019`.
- Bridging by Word: Image Grounded Vocabulary Construction for Visual Captioning - Fan Z et al, `ACL 2019`.
BMVC 2019
- Image Captioning with Unseen Objects - Demirel et al, `BMVC 2019`.
- Look and Modify: Modification Networks for Image Captioning - Sammani et al, `BMVC 2019`. [code]
- Show, Infer and Tell: Contextual Inference for Creative Captioning - Khare et al, `BMVC 2019`. [code]
- SC-RANK: Improving Convolutional Image Captioning with Self-Critical Learning and Ranking Metric-based Reward - Yan et al, `BMVC 2019`.
ICCV 2019
- Hierarchy Parsing for Image Captioning - Yao T et al, `ICCV 2019`.
- Entangled Transformer for Image Captioning - Li G et al, `ICCV 2019`.
- Attention on Attention for Image Captioning - Huang L et al, `ICCV 2019`. [code]
- Reflective Decoding Network for Image Captioning - Ke L et al, `ICCV 2019`.
- Learning to Collocate Neural Modules for Image Captioning - Yang X et al, `ICCV 2019`.
NeurIPS 2019
- Image Captioning: Transforming Objects into Words - Herdade S et al, `NeurIPS 2019`.
- Adaptively Aligned Image Captioning via Adaptive Attention Time - Huang L et al, `NeurIPS 2019`. [code]
- Variational Structured Semantic Inference for Diverse Image Captioning - Chen F et al, `NeurIPS 2019`.
- Aligning Visual Regions and Textual Concepts for Semantic-Grounded Image Representations - Liu F et al, `NeurIPS 2019`. [code]
IJCAI 2019
- Image Captioning with Compositional Neural Module Networks - Tian J et al, `IJCAI 2019`.
- Exploring and Distilling Cross-Modal Information for Image Captioning - Liu F et al, `IJCAI 2019`.
- Swell-and-Shrink: Decomposing Image Captioning by Transformation and Summarization - Wang H et al, `IJCAI 2019`.
- Hornet: a hierarchical offshoot recurrent network for improving person re-ID via image captioning - Yan S et al, `IJCAI 2019`.
EMNLP 2019
- Image Captioning with Very Scarce Supervised Data: Adversarial Semi-Supervised Learning Approach - Kim D J et al, `EMNLP 2019`.
- TIGEr: Text-to-Image Grounding for Image Caption Evaluation - Jiang M et al, `EMNLP 2019`.
- REO-Relevance, Extraness, Omission: A Fine-grained Evaluation for Image Captioning - Jiang M et al, `EMNLP 2019`.
- Decoupled Box Proposal and Featurization with Ultrafine-Grained Semantic Labels Improve Image Captioning and Visual Question Answering - Changpinyo S et al, `EMNLP 2019`.
CoNLL 2019
- Compositional Generalization in Image Captioning - Nikolaus M et al, `CoNLL 2019`. [code]
2020
AAAI 2020
- MemCap: Memorizing Style Knowledge for Image Captioning - Zhao et al, `AAAI 2020`.
- Unified Vision-Language Pre-Training for Image Captioning and VQA - Zhou L et al, `AAAI 2020`.
- Show, Recall, and Tell: Image Captioning with Recall Mechanism - Wang L et al, `AAAI 2020`.
- Reinforcing an Image Caption Generator using Off-line Human Feedback - Seo P H et al, `AAAI 2020`.
- Interactive Dual Generative Adversarial Networks for Image Captioning - Liu et al, `AAAI 2020`.
- Feature Deformation Meta-Networks in Image Captioning of Novel Objects - Cao et al, `AAAI 2020`.
- Joint Commonsense and Relation Reasoning for Image and Video Captioning - Hou et al, `AAAI 2020`.
- Learning Long- and Short-Term User Literal-Preference with Multimodal Hierarchical Transformer Network for Personalized Image Caption - Zhang et al, `AAAI 2020`.
CVPR 2020
ACL 2020
ECCV 2020
- Length-Controllable Image Captioning - Deng C et al, `ECCV 2020`.
- Captioning Images Taken by People Who Are Blind - Gurari D et al, `ECCV 2020`.
- Towards Unique and Informative Captioning of Images - Wang Z et al, `ECCV 2020`.
- Learning Visual Representations with Caption Annotations - Sariyildiz M et al, `ECCV 2020`.
- Comprehensive Image Captioning via Scene Graph Decomposition - Zhong Y et al, `ECCV 2020`.
- SODA: Story Oriented Dense Video Captioning Evaluation Framework - Fujita S et al, `ECCV 2020`.
- TextCaps: a Dataset for Image Captioning with Reading Comprehension - Sidorov O et al, `ECCV 2020`.
- Compare and Reweight: Distinctive Image Captioning Using Similar Images Sets - Wang J et al, `ECCV 2020`.
- Learning to Generate Grounded Visual Captions without Localization Supervision - Ma C et al, `ECCV 2020`.
- Fashion Captioning: Towards Generating Accurate Descriptions with Semantic Rewards - Yang X et al, `ECCV 2020`.
- Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos - Chen S et al, `ECCV 2020`.
EMNLP 2020
NeurIPS 2020
Dataset
Image Captioning Challenge
Popular Implementations
PyTorch
Licenses
To the extent possible under law, Zhihong Chen has waived all copyright and related or neighboring rights to this work.