
Awesome-Multimodal-Large-Language-Models

This is a repository for organizing articles related to Multimodal Large Language Models, Large Language Models, and Diffusion Models. Most papers are linked to my reading notes. Feel free to visit my personal homepage and contact me for collaboration and discussion.

About Me :high_brightness:

I'm a third-year Ph.D. student at the State Key Laboratory of Pattern Recognition, University of Chinese Academy of Sciences, advised by Prof. Tieniu Tan. I have also spent time at Microsoft, advised by Prof. Jingdong Wang, and at Alibaba DAMO Academy, working with Prof. Rong Jin.

🔥 Updated 2024-09-22

Table of Contents (ongoing)

Survey and Outlook

  1. A Long-Form Survey of Recent Progress in Multimodal Large Models (Modality Bridging)
  2. A Long-Form Survey of Recent Progress in Multimodal Large Models (Video)
  3. Aligning Large Language Models with Human

Multimodal Large Language Models

  1. Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution(fine-grained dynamic-resolution strategy + multimodal rotary position embedding)
  2. LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture(handles nearly a thousand images on a single A100 80GB GPU)
  3. MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?(the hardest multimodal benchmark; Qwen2-VL ranks first yet still fails to reach a passing score!)
  4. VITA: Towards Open-Source Interactive Omni Multimodal LLM(VITA: the first open-source omni-modal large language model supporting natural human-computer interaction)
  5. Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models(a multimodal large model that handles high-resolution images efficiently)
  6. Matryoshka Multimodal Models(how to answer visual questions correctly while using the fewest visual tokens?)
  7. Chameleon: Mixed-Modal Early-Fusion Foundation Models(Meta: every modality is reduced to token regression for flexible understanding/generation)
  8. Flamingo: a Visual Language Model for Few-Shot Learning(extra blocks at every LLM layer to process visual information)
  9. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models(Q-Former fuses vision-language information)
  10. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning(Q-Former + instruction tuning)
  11. Visual Instruction Tuning(an MLP projector aligns visual features; GPT-4 generates the instruction-tuning data; a minimal projector sketch follows this list)
  12. Improved Baselines with Visual Instruction Tuning(preliminary scaling of the LLaVA dataset and model size)
  13. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge(4x resolution, larger dataset)
  14. Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models(an end-to-end optimization scheme that connects the image encoder and the LLM with lightweight adapters)
  15. MIMIC-IT: Multi-Modal In-Context Instruction Tuning(MIMIC-IT contains inputs with multiple images or videos and supports multimodal in-context information)
  16. LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding(uses publicly available OCR tools to collect results on 422K text-rich images from the LAION dataset)
  17. SVIT: Scaling up Visual Instruction Tuning(a dataset of 4.2 million visual instruction-tuning data points)
  18. Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond(cross-attention aligns features; much larger first-stage training data)
  19. NExT-GPT: Any-to-Any Multimodal LLM(an end-to-end, general-purpose any-to-any MM-LLM (Multimodal Large Language Model) system)
  20. InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition(compressive sampling of visual information)
  21. CogVLM: Visual Expert for Pretrained Language Models(adds a visual expert to each LLM layer, with its own QKV and FFN parameters)
  22. OtterHD: A High-Resolution Multi-modality Model(designed specifically to interpret high-resolution visual inputs with fine-grained precision)
  23. Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models(Monkey proposes an effective way to increase input resolution, up to 896 x 1344 pixels)
  24. LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models(LLaMA-VID lets existing frameworks support hour-long videos and pushes their upper limit with an additional context token)
  25. MoE-LLaVA: Mixture of Experts for Large Vision-Language Models(addresses the performance degradation in multimodal sparse learning)
  26. LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images(efficiently handles images of any aspect ratio and high resolution)
  27. Yi-VL(Yi-VL adopts the LLaVA architecture with a comprehensive three-stage training process to align visual information well with the semantic space of the Yi LLM)
  28. Mini-Gemini(dual vision encoders: low-resolution encoder features serve as queries, high-resolution features as keys and values for token mining)
  29. Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding(uses a set of dynamic visual tokens to represent images and videos uniformly, so the model makes efficient use of a limited number of visual tokens while capturing the spatial detail needed for images and the comprehensive temporal relations needed for videos)
  30. VILA: On Pre-training for Visual Language Models(interleaved pre-training data is beneficial; plain image-text pairs are not optimal)
  31. ST-LLM: Large Language Models Are Effective Temporal Learners(ST-LLM proposes a dynamic masking strategy with customized training objectives; for especially long videos, a global-local input module balances efficiency and effectiveness)
  32. Video-LLaVA: Learning United Visual Representation by Alignment Before Projection(improves video understanding with a video-specific encoder rather than an image encoder)
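
Several of the entries above (e.g., Visual Instruction Tuning and its Improved Baselines) connect a frozen vision encoder to an LLM through a small projector. Below is a minimal sketch of that connector idea in PyTorch; all dimensions and module names are chosen for illustration rather than taken from any specific checkpoint.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Two-layer MLP that maps vision-encoder patch features into the
    LLM embedding space (the LLaVA-1.5-style connector idea)."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features):
        # patch_features: (batch, num_patches, vision_dim)
        return self.proj(patch_features)  # (batch, num_patches, llm_dim)

# The projected visual tokens are simply concatenated with the embedded
# text tokens before the LLM forward pass.
visual = torch.randn(1, 576, 1024)   # e.g. ViT-L/14 patch features (assumed shape)
text = torch.randn(1, 32, 4096)      # embedded text tokens (assumed shape)
llm_inputs = torch.cat([VisionProjector()(visual), text], dim=1)
```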

Benchmark and Dataset

  1. MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?(the hardest multimodal benchmark; Qwen2-VL ranks first yet still fails to reach a passing score!)

  2. MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark(an advanced version of MMMU that puts more weight on how image perception affects the questions)

  3. From Pixels to Prose: A Large Dataset of Dense Image Captions(16 million generated image-text pairs, with detailed and accurate captions produced by a cutting-edge vision-language model (Gemini 1.0 Pro Vision))

  4. ShareGPT4Video: Improving Video Understanding and Generation with Better Captions(40K captions from GPT-4V, 4,814K generated by their own trained model)

  5. OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents(141 million web pages extracted from Common Crawl, 353 million associated images, and 115 billion text tokens)

  6. Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning(on the data side, human feedback is collected as fine-grained segment-level corrections; on the method side, Dense Direct Preference Optimization (DDPO) is proposed)

  7. Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model(synthesizes abstract charts with code as the medium, and benchmarks how current multimodal models fall short at understanding abstract images)

    Unify Multimodal Understanding and Generation

  8. Chameleon: Mixed-Modal Early-Fusion Foundation Models(the "early-fusion" approach lets the model reason across modalities and generate genuinely mixed documents)

  9. Show-o: One Single Transformer to Unify Multimodal Understanding and Generation(text is modeled autoregressively as discrete tokens, while continuous image pixels are modeled with denoising diffusion)

  10. Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model(uses next-token prediction for text and diffusion for images as the training objectives; the model achieves better modality integration and generation without extra compute cost; a toy loss sketch follows this list)

    MLLM Alignment

  11. Aligning Large Multimodal Models with Factually Augmented RLHF
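
Show-o and Transfusion above both train one transformer with an autoregressive loss on text and a diffusion-style loss on images. A toy sketch of such a mixed objective, assuming a sequence that interleaves discrete text positions with continuous image-latent positions (shapes, masks, and the weighting are illustrative assumptions, not either paper's exact recipe):

```python
import torch
import torch.nn.functional as F

def mixed_modal_loss(text_logits, text_targets, noise_pred, noise,
                     text_mask, image_mask, image_loss_weight=1.0):
    """Next-token cross-entropy on text positions plus a noise-prediction
    MSE (diffusion-style) loss on image positions of the same sequence."""
    # text_logits: (B, L, vocab); text_targets: (B, L) int64
    # noise_pred / noise: (B, L, D) continuous latents
    # text_mask / image_mask: (B, L) bool, disjoint
    ce = F.cross_entropy(text_logits[text_mask], text_targets[text_mask])
    mse = F.mse_loss(noise_pred[image_mask], noise[image_mask])
    return ce + image_loss_weight * mse

# Toy usage with random tensors
B, L, V, D = 2, 16, 100, 8
text_mask = torch.zeros(B, L, dtype=torch.bool)
text_mask[:, :8] = True
loss = mixed_modal_loss(torch.randn(B, L, V), torch.randint(V, (B, L)),
                        torch.randn(B, L, D), torch.randn(B, L, D),
                        text_mask, ~text_mask)
```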

Alignment With Human Preference

  1. ChatGLM-Math: Improving Math Problem-Solving in Large Language Models with a Self-Critique Pipeline(ChatGLM-Math: iterative Self-Critique alignment markedly improves math ability)
  2. Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization(multi-objective alignment for large language models)
  3. Direct Preference Optimization: Your Language Model is Secretly a Reward Model(direct preference optimization sidesteps the instability of RLHF; a minimal loss sketch follows this list)
  4. KTO: Model Alignment as Prospect Theoretic Optimization(preference optimization that does not require paired data)
  5. Direct Preference Optimization with an Offset(offset DPO: the likelihood gap between the preferred and dispreferred response must exceed an offset value)
  6. Contrastive preference learning: Learning from human feedback without reinforcement learning(Contrastive Preference Learning (CPL) learns the optimal policy directly from preferences without learning a reward function, removing the need for RL)
  7. Statistical Rejection Sampling Improves Preference Optimization(uses rejection sampling to draw preference data from the target optimal policy, giving a more accurate estimate of that policy)
  8. Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study(PPO consistently outperforms DPO across all experiments; on the most challenging code-competition tasks in particular, PPO achieves state-of-the-art results)
  9. Fine-tuning Aligned Language Models Compromises Safety(fine-tuning aligned language models degrades their safety)
  10. ChatGLM-Math: Improving Math Problem-Solving in Large Language Models with a Self-Critique Pipeline(reward model, rejective fine-tuning, then DPO, iteratively improving the model's math performance)
  11. SimPO: Simple Preference Optimization with a Reference-Free Reward(length regularization + removing the reference model)
  12. Towards Analyzing and Understanding the Limitations of DPO: A Theoretical Perspective(why DPO's optimization process is sensitive to the initial alignment quality of SFT-ed LLMs)
  13. Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level(shows that carefully designed iterative DPO (iDPO) can lift a 7B model's LC win rate to GPT-4 level)
  14. Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs(proposes an effective and economical pipeline for collecting pairwise preference data on math problems; introduces Step-DPO, which maximizes the probability that the next reasoning step is correct and minimizes the probability that it is wrong)
  15. CLIP-DPO: Vision-Language Models as a Source of Preference for Fixing Hallucinations in LVLMs(uses a pre-trained CLIP model to rank captions self-generated by the LVLM, building positive/negative pairs for DPO)
  16. ODE: Open-Set Evaluation of Hallucinations in Multimodal Large Language Models(adopts a dynamic generation approach to build an open-set benchmark; introduces the Open-set Dynamic Evaluation protocol (ODE), designed specifically to evaluate object-existence hallucination in MLLMs)
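
Many of the entries above are variants of DPO. For reference, here is a minimal sketch of the vanilla DPO objective, taking per-sequence summed log-probabilities as inputs; the tensor shapes and toy numbers are illustrative only.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Vanilla DPO: increase the policy/reference log-ratio of the chosen
    response relative to that of the rejected response."""
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratios - rejected_logratios)).mean()

# Toy usage: one (chosen, rejected) pair, log-probs from two forward passes
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
```

SimPO (item 11) drops the reference model and length-normalizes the policy log-probabilities instead, while KTO (item 4) works from unpaired desirable/undesirable examples rather than preference pairs.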