
Awesome-Multimodal-Large-Language-Models

This is a repository for organizing articles related to Multimodal Large Language Models, Large Language Models, and Diffusion Models. Most papers are linked to my reading notes. Feel free to visit my personal homepage and contact me for collaboration and discussion.

About Me :high_brightness:

I'm a third-year Ph.D. student at the State Key Laboratory of Pattern Recognition, University of Chinese Academy of Sciences, advised by Prof. Tieniu Tan. I have also spent time at Microsoft, advised by Prof. Jingdong Wang, and at Alibaba DAMO Academy, working with Prof. Rong Jin.

🔥 Updated 2024-09-22

Table of Contents (ongoing)

Survey and Outlook

  1. A Long-Form Survey of Recent Progress in Multimodal Large Models (Modality Bridging)
  2. A Long-Form Survey of Recent Progress in Multimodal Large Models (Video)
  3. Aligning Large Language Models with Human

Multimodal Large Language Models

  1. Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution(fine-grained dynamic-resolution strategy + multimodal rotary position embedding)
  2. LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture(handles nearly a thousand images on a single A100 80GB GPU)
  3. MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?(the hardest multimodal benchmark; Qwen2-VL ranks first yet still fails to reach a passing score!)
  4. VITA: Towards Open-Source Interactive Omni Multimodal LLM(VITA: the first open-source omni-modal large language model supporting natural human-computer interaction)
  5. Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models(a multimodal large model that handles high-resolution images efficiently)
  6. Matryoshka Multimodal Models(how to answer visual questions correctly while using the fewest visual tokens?)
  7. Chameleon: Mixed-Modal Early-Fusion Foundation Models(Meta: every modality is reduced to token regression for flexible understanding/generation)
  8. Flamingo: a Visual Language Model for Few-Shot Learning(extra blocks at every LLM layer to process visual information)
  9. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models(Q-Former fuses vision-language information)
  10. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning(Q-Former + instruction tuning)
  11. Visual Instruction Tuning(an MLP projector aligns visual features; GPT-4 generates the instruction-tuning data; a minimal projector sketch follows this list)
  12. Improved Baselines with Visual Instruction Tuning(preliminary scaling of the LLaVA dataset and model size)
  13. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge(4x resolution, larger dataset)
  14. Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models(an end-to-end optimization scheme that connects the image encoder and the LLM with lightweight adapters)
  15. MIMIC-IT: Multi-Modal In-Context Instruction Tuning(MIMIC-IT contains inputs with multiple images or videos and supports multimodal in-context information)
  16. LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding(uses publicly available OCR tools to collect results on 422K text-rich images from the LAION dataset)
  17. SVIT: Scaling up Visual Instruction Tuning(a dataset of 4.2 million visual instruction-tuning data points)
  18. Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond(cross-attention aligns features; much larger first-stage training data)
  19. NExT-GPT: Any-to-Any Multimodal LLM(an end-to-end, general-purpose any-to-any MM-LLM (Multimodal Large Language Model) system)
  20. InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition(compressive sampling of visual information)
  21. CogVLM: Visual Expert for Pretrained Language Models(adds a visual expert to each LLM layer, with its own QKV and FFN parameters)
  22. OtterHD: A High-Resolution Multi-modality Model(designed specifically to interpret high-resolution visual inputs with fine-grained precision)
  23. Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models(Monkey proposes an effective way to increase input resolution, up to 896 x 1344 pixels)
  24. LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models(LLaMA-VID lets existing frameworks support hour-long videos and pushes their upper limit with an additional context token)
  25. MoE-LLaVA: Mixture of Experts for Large Vision-Language Models(addresses the performance degradation in multimodal sparse learning)
  26. LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images(efficiently handles images of any aspect ratio and high resolution)
  27. Yi-VL(Yi-VL adopts the LLaVA architecture with a comprehensive three-stage training process to align visual information well with the semantic space of the Yi LLM)
  28. Mini-Gemini(dual vision encoders: low-resolution encoder features serve as queries, high-resolution features as keys and values for token mining)
  29. Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding(uses a set of dynamic visual tokens to represent images and videos uniformly, so the model makes efficient use of a limited number of visual tokens while capturing the spatial detail needed for images and the comprehensive temporal relations needed for videos)
  30. VILA: On Pre-training for Visual Language Models(interleaved pre-training data is beneficial; plain image-text pairs are not optimal)
  31. ST-LLM: Large Language Models Are Effective Temporal Learners(ST-LLM proposes a dynamic masking strategy with customized training objectives; for especially long videos, a global-local input module balances efficiency and effectiveness)
  32. Video-LLaVA: Learning United Visual Representation by Alignment Before Projection(improves video understanding with a video-specific encoder rather than an image encoder)
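
Several of the entries above (e.g., Visual Instruction Tuning and its Improved Baselines) connect a frozen vision encoder to an LLM through a small projector. Below is a minimal sketch of that connector idea in PyTorch; all dimensions and module names are chosen for illustration rather than taken from any specific checkpoint.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Two-layer MLP that maps vision-encoder patch features into the
    LLM embedding space (the LLaVA-1.5-style connector idea)."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features):
        # patch_features: (batch, num_patches, vision_dim)
        return self.proj(patch_features)  # (batch, num_patches, llm_dim)

# The projected visual tokens are simply concatenated with the embedded
# text tokens before the LLM forward pass.
visual = torch.randn(1, 576, 1024)   # e.g. ViT-L/14 patch features (assumed shape)
text = torch.randn(1, 32, 4096)      # embedded text tokens (assumed shape)
llm_inputs = torch.cat([VisionProjector()(visual), text], dim=1)
```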

Benchmark and Dataset

  1. MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?(the hardest multimodal benchmark; Qwen2-VL ranks first yet still fails to reach a passing score!)

  2. MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark(an advanced version of MMMU that puts more weight on how image perception affects the questions)

  3. From Pixels to Prose: A Large Dataset of Dense Image Captions(16 million generated image-text pairs, with detailed and accurate captions produced by a cutting-edge vision-language model (Gemini 1.0 Pro Vision))

  4. ShareGPT4Video: Improving Video Understanding and Generation with Better Captions(40K captions from GPT-4V, 4,814K generated by their own trained model)

  5. OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents(141 million web pages extracted from Common Crawl, 353 million associated images, and 115 billion text tokens)

  6. Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning(on the data side, human feedback is collected as fine-grained segment-level corrections; on the method side, Dense Direct Preference Optimization (DDPO) is proposed)

  7. Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model(synthesizes abstract charts with code as the medium, and benchmarks how current multimodal models fall short at understanding abstract images)

    Unify Multimodal Understanding and Generation

  8. Chameleon: Mixed-Modal Early-Fusion Foundation Models(the "early-fusion" approach lets the model reason across modalities and generate genuinely mixed documents)

  9. Show-o: One Single Transformer to Unify Multimodal Understanding and Generation(text is modeled autoregressively as discrete tokens, while continuous image pixels are modeled with denoising diffusion)

  10. Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model(uses next-token prediction for text and diffusion for images as the training objectives; the model achieves better modality integration and generation without extra compute cost; a toy loss sketch follows this list)

    MLLM Alignment

  11. Aligning Large Multimodal Models with Factually Augmented RLHF
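
Show-o and Transfusion above both train one transformer with an autoregressive loss on text and a diffusion-style loss on images. A toy sketch of such a mixed objective, assuming a sequence that interleaves discrete text positions with continuous image-latent positions (shapes, masks, and the weighting are illustrative assumptions, not either paper's exact recipe):

```python
import torch
import torch.nn.functional as F

def mixed_modal_loss(text_logits, text_targets, noise_pred, noise,
                     text_mask, image_mask, image_loss_weight=1.0):
    """Next-token cross-entropy on text positions plus a noise-prediction
    MSE (diffusion-style) loss on image positions of the same sequence."""
    # text_logits: (B, L, vocab); text_targets: (B, L) int64
    # noise_pred / noise: (B, L, D) continuous latents
    # text_mask / image_mask: (B, L) bool, disjoint
    ce = F.cross_entropy(text_logits[text_mask], text_targets[text_mask])
    mse = F.mse_loss(noise_pred[image_mask], noise[image_mask])
    return ce + image_loss_weight * mse

# Toy usage with random tensors
B, L, V, D = 2, 16, 100, 8
text_mask = torch.zeros(B, L, dtype=torch.bool)
text_mask[:, :8] = True
loss = mixed_modal_loss(torch.randn(B, L, V), torch.randint(V, (B, L)),
                        torch.randn(B, L, D), torch.randn(B, L, D),
                        text_mask, ~text_mask)
```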

Alignment With Human Preference

  1. ChatGLM-Math: Improving Math Problem-Solving in Large Language Models with a Self-Critique Pipeline(ChatGLM-Math: iterative Self-Critique alignment markedly improves math ability)
  2. Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization(multi-objective alignment for large language models)
  3. Direct Preference Optimization: Your Language Model is Secretly a Reward Model(direct preference optimization sidesteps the instability of RLHF; a minimal loss sketch follows this list)
  4. KTO: Model Alignment as Prospect Theoretic Optimization(preference optimization that does not require paired data)
  5. Direct Preference Optimization with an Offset(offset DPO: the likelihood gap between the preferred and dispreferred response must exceed an offset value)
  6. Contrastive preference learning: Learning from human feedback without reinforcement learning(Contrastive Preference Learning (CPL) learns the optimal policy directly from preferences without learning a reward function, removing the need for RL)
  7. Statistical Rejection Sampling Improves Preference Optimization(uses rejection sampling to draw preference data from the target optimal policy, giving a more accurate estimate of that policy)
  8. Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study(PPO consistently outperforms DPO across all experiments; on the most challenging code-competition tasks in particular, PPO achieves state-of-the-art results)
  9. Fine-tuning Aligned Language Models Compromises Safety(fine-tuning aligned language models degrades their safety)
  10. ChatGLM-Math: Improving Math Problem-Solving in Large Language Models with a Self-Critique Pipeline(reward model, rejective fine-tuning, then DPO, iteratively improving the model's math performance)
  11. SimPO: Simple Preference Optimization with a Reference-Free Reward(length regularization + removing the reference model)
  12. Towards Analyzing and Understanding the Limitations of DPO: A Theoretical Perspective(why DPO's optimization process is sensitive to the initial alignment quality of SFT-ed LLMs)
  13. Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level(shows that carefully designed iterative DPO (iDPO) can lift a 7B model's LC win rate to GPT-4 level)
  14. Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs(proposes an effective and economical pipeline for collecting pairwise preference data on math problems; introduces Step-DPO, which maximizes the probability that the next reasoning step is correct and minimizes the probability that it is wrong)
  15. CLIP-DPO: Vision-Language Models as a Source of Preference for Fixing Hallucinations in LVLMs(uses a pre-trained CLIP model to rank captions self-generated by the LVLM, building positive/negative pairs for DPO)
  16. ODE: Open-Set Evaluation of Hallucinations in Multimodal Large Language Models(adopts a dynamic generation approach to build an open-set benchmark; introduces the Open-set Dynamic Evaluation protocol (ODE), designed specifically to evaluate object-existence hallucination in MLLMs)
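
Many of the entries above are variants of DPO. For reference, here is a minimal sketch of the vanilla DPO objective, taking per-sequence summed log-probabilities as inputs; the tensor shapes and toy numbers are illustrative only.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Vanilla DPO: increase the policy/reference log-ratio of the chosen
    response relative to that of the rejected response."""
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratios - rejected_logratios)).mean()

# Toy usage: one (chosen, rejected) pair, log-probs from two forward passes
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
```

SimPO (item 11) drops the reference model and length-normalizes the policy log-probabilities instead, while KTO (item 4) works from unpaired desirable/undesirable examples rather than preference pairs.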