The official implementation of the paper "MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity".
The dataset is available on Hugging Face at 🤗 yuecao0119/MMInstruct.
Vision-language supervised fine-tuning effectively enhances VLLM performance, but existing visual instruction tuning datasets have limitations:
To address these challenges, we created the MMInstruct dataset, featuring:
The open source datasets on Hugging Face 🤗 yuecao0119/MMInstruct include:
- `caption_cn`: 144K Chinese detailed image caption data generated using gpt-4-vision-preview.
- `caption_en`: 18.2K English detailed image caption data generated using gpt-4-vision-preview.
- `qa_en`: 216K instruction data generated using GPT-3.5-turbo, including 161K multi-round long questions and answers and 55K manually corrected instruction data from 23 fields, as shown in the figure below.

We also expand MMInstruct with other open-source data, including:
| Domain | Dataset |
|---|---|
| mathematics | GEOS; UniGeo; GeoQA+; Geometry3k; CLEVR-Math; Super-CLEVR; TabMWP |
| charts and plots | DVQA (100K); FigureQA |
| scientific figure | TQA |
| map chart | MapQA |
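To get a quick look at the released splits listed above, the data can be pulled with the Hugging Face `datasets` library. This is a minimal sketch; the repository id follows the link above, but the config and split names (`qa_en`, `train`) are assumptions, so check the dataset card for the exact layout.

```python
from datasets import load_dataset

# Load one split of MMInstruct from the Hugging Face Hub.
# NOTE: the config/split names below are assumptions based on the list above;
# consult the dataset card on 🤗 yuecao0119/MMInstruct for the authoritative layout.
dataset = load_dataset("yuecao0119/MMInstruct", name="qa_en", split="train")

# Inspect the fields of a single instruction sample.
sample = dataset[0]
print(sample.keys())
```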
We developed an instruction generation data engine leveraging GPT-4V, GPT-3.5, and manual correction. This engine allows semi-automatic, low-cost, multi-domain instruction generation at 1/6 the cost of manual construction.
As described in our paper, we propose a semi-automatic, low-cost instruction generation data engine built on GPT-4V, GPT-3.5, and manual correction. The engine consists of six steps: (a) image collection, (b) image caption generation, (c) seed question collection, (d) automatic instruction generation, (e) dataset expansion, and (f) manual correction.
(a) First, we collect a large number of diverse images from various sources, starting from a set of selected seed images and expanding the pool with web crawlers and CLIP-based retrieval, as shown in image_retrieval_bing_spider.py and image_retrieval_clip.py.
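The CLIP-based filtering idea can be sketched as follows. This is an illustrative example using the `transformers` CLIP model; the model name, query text, and similarity threshold are assumptions, and the released image_retrieval_clip.py is the authoritative implementation.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Score crawled images against a domain query and keep only the relevant ones.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def filter_images(image_paths, query, threshold=0.25):
    kept = []
    for path in image_paths:
        image = Image.open(path).convert("RGB")
        inputs = processor(text=[query], images=image,
                           return_tensors="pt", padding=True)
        with torch.no_grad():
            outputs = model(**inputs)
        # Cosine similarity between the image and text embeddings.
        sim = torch.nn.functional.cosine_similarity(
            outputs.image_embeds, outputs.text_embeds
        ).item()
        if sim >= threshold:  # illustrative threshold, not the released value
            kept.append(path)
    return kept
```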
(b) Next, we use GPT-4V to generate detailed image captions, as shown in gpt4v_caption.py.
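For reference, a caption request to gpt-4-vision-preview might look roughly like this. It is a sketch using the OpenAI Python SDK with a base64-encoded local image; the prompt wording and parameters in the released gpt4v_caption.py may differ.

```python
import base64
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def caption_image(image_path: str) -> str:
    # Encode the local image so it can be embedded in the chat request.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    # The prompt here is illustrative; the released script defines its own.
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in detail."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
        max_tokens=512,
    )
    return response.choices[0].message.content
```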
(c) Then, domain experts design seed questions for each field.
(d) We use image captions and seed questions to automatically generate a rich and diverse set of instruction data through GPT-3.5, as shown in gpt35_qa.py.
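The idea of step (d) can be sketched as below. The prompt template and the seed-question format are assumptions for illustration; the actual prompt construction and answer parsing live in gpt35_qa.py.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def generate_qa(caption: str, seed_questions: list[str]) -> str:
    # Combine the GPT-4V caption with domain seed questions so that GPT-3.5
    # can produce diverse question-answer pairs grounded in the image content.
    # The prompt below is an illustrative assumption, not the released template.
    prompt = (
        "You are given a detailed description of an image:\n"
        f"{caption}\n\n"
        "Based on the description, write question-answer pairs in the style of "
        "these seed questions:\n- " + "\n- ".join(seed_questions)
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return response.choices[0].message.content
```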
(e), (f) In addition, we expand the dataset through several methods, and finally perform manual correction to ensure data quality and accuracy.
If this work is helpful for your research, please consider citing the following BibTeX entry.
@article{liu2024mminstruct,
title={MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity},
author={Liu, Yangzhou and Cao, Yue and Gao, Zhangwei and Wang, Weiyun and Chen, Zhe and Wang, Wenhai and Tian, Hao and Lu, Lewei and Zhu, Xizhou and Lu, Tong and others},
journal={arXiv preprint arXiv:2407.15838},
year={2024}
}