
MMInstruct

The official implementation of the paper "MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity". The MMInstruct dataset includes 973K instructions from 24 domains and four instruction types.

The dataset is available on Hugging Face at 🤗 yuecao0119/MMInstruct.
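To experiment with the data locally, the dataset repository can be fetched from the Hub. The snippet below is a minimal sketch using `huggingface_hub`; it only assumes the repo id shown above and makes no assumptions about the file layout or split names inside the dataset.

```python
# Minimal sketch: download a local snapshot of the MMInstruct dataset.
# Only the repo id comes from this README; everything else is generic.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="yuecao0119/MMInstruct",
    repo_type="dataset",  # it is a dataset repo, not a model repo
)
print("dataset files downloaded to:", local_dir)
```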

📣 News

Todo List

Introduction

Vision-language supervised fine-tuning effectively enhances VLLM performance, but existing visual instruction tuning datasets have two limitations:

  1. Instruction annotation quality: despite their strong performance, advanced VLLMs may generate instructions containing inaccuracies such as hallucinations.
  2. Instruction and image diversity: limited instruction types and a lack of diverse image data restrict a fine-tuned model's ability to produce varied and realistic outputs.

MMInstruct Dataset

To address these challenges, we created the MMInstruct dataset, which pairs high-quality instruction annotations with extensive instruction and image diversity:

*(Figure: overview of the MMInstruct dataset.)*

The open-source datasets on Hugging Face 🤗 yuecao0119/MMInstruct include:

We also expand MMInstruct with other open-source data, including:

| Domain | Datasets |
| --- | --- |
| mathematics | GEOS; UniGeo; GeoQA+; Geometry3k; CLEVR-Math; Super-CLEVR; TabMWP |
| charts and plots | DVQA (100K); FigureQA |
| scientific figure | TQA |
| map chart | MapQA |

Data Engine

We developed an instruction generation data engine that leverages GPT-4V, GPT-3.5, and manual correction. The engine enables semi-automatic, low-cost, multi-domain instruction generation at roughly 1/6 the cost of fully manual construction.

*(Figure: the six-step pipeline of the instruction generation data engine.)*

As described in our paper, the data engine consists of six steps: (a) image collection, (b) image caption generation, (c) seed question collection, (d) automatic instruction generation, (e) dataset expansion, and (f) manual correction.

(a) First, we collect a large number of images from diverse sources. Starting from a set of selected seed images, additional images are retrieved via web crawlers and CLIP-based similarity search, as shown in image_retrieval_bing_spider.py and image_retrieval_clip.py.
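As an illustration of the CLIP-based step, the sketch below keeps crawled candidates whose CLIP image embedding is close to a seed image. It uses the `transformers` CLIP API; the file names and similarity threshold are illustrative assumptions, and the actual logic in image_retrieval_clip.py may differ.

```python
# Hedged sketch of CLIP-based image filtering: keep crawled candidates
# that are semantically close to a domain seed image.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(images):
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)  # unit-normalize

seed = embed([Image.open("seed.jpg")])                   # hypothetical paths
candidates = [Image.open(p) for p in ["cand_1.jpg", "cand_2.jpg"]]
sims = (embed(candidates) @ seed.T).squeeze(-1)          # cosine similarity

threshold = 0.8                                          # assumed cutoff
kept = [img for img, s in zip(candidates, sims) if s.item() >= threshold]
print(f"kept {len(kept)} of {len(candidates)} candidate images")
```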

(b) Next, we use GPT-4V to generate a detailed caption for each image, as shown in gpt4v_caption.py.
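A call in the spirit of gpt4v_caption.py might look like the sketch below. The prompt wording, model name, and file path are assumptions; only the overall idea, sending each image to a GPT-4V-class model and requesting a detailed caption, comes from the README.

```python
# Hedged sketch of GPT-4V captioning; the authors' actual prompt and
# parameters in gpt4v_caption.py are not reproduced here.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def caption_image(path: str) -> str:
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    resp = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model; the paper used GPT-4V
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this image in detail, covering objects, "
                         "text, layout, and any domain-specific content."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

print(caption_image("example.jpg"))  # hypothetical path
```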

(c) Then, domain experts design seed questions for each domain.
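For illustration only, domain-specific seed questions could be organized as below; the questions are invented examples, not the experts' actual seed sets.

```python
# Invented examples of per-domain seed questions, for illustration only.
SEED_QUESTIONS = {
    "charts and plots": [
        "What is the highest value shown in the chart?",
        "Which category changed the most between the two years?",
    ],
    "map chart": [
        "Which region has the largest value on this map?",
    ],
}
```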

(d) Using the image captions and seed questions, we automatically generate a rich and diverse set of instruction data with GPT-3.5, as shown in gpt35_qa.py.
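This step can be sketched as a single GPT-3.5 call that conditions on a caption and a few seed questions. The prompt below is an assumption; gpt35_qa.py may structure it differently.

```python
# Hedged sketch of gpt35_qa.py's role: turn a caption plus seed questions
# into new instruction data with GPT-3.5. The prompt here is illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_instructions(caption: str, seed_questions: list[str]) -> str:
    prompt = (
        "You are given a detailed image caption and example questions.\n"
        f"Caption: {caption}\n"
        "Example questions:\n"
        + "\n".join(f"- {q}" for q in seed_questions)
        + "\nWrite five new, diverse question-answer pairs grounded in the caption."
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(generate_instructions(
    "A bar chart comparing quarterly sales in 2022 and 2023.",  # invented caption
    ["What is the highest value shown in the chart?"],
))
```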

(e), (f) We additionally expand the dataset through several methods, and finally apply manual correction to ensure data quality and accuracy.

Performance

*(Figure: evaluation results on multi-modal benchmarks.)*

Citation

If this work is helpful for your research, please consider citing the following BibTeX entry.

@article{liu2024mminstruct,
  title={MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity},
  author={Liu, Yangzhou and Cao, Yue and Gao, Zhangwei and Wang, Weiyun and Chen, Zhe and Wang, Wenhai and Tian, Hao and Lu, Lewei and Zhu, Xizhou and Lu, Tong and others},
  journal={arXiv preprint arXiv:2407.15838},
  year={2024}
}