A collection of visual instruction tuning datasets.

BAAI-DCAI/DataOptim

DataOptim

DataOptim is a data repository that aims to make training data for Multimodal Large Language Models (MLLMs) easier and more efficient to use.

News

  • [2024.03.14] Data of TextOCR-GPT4V is now available!
  • [2023.12.15] Data of ShareGPT4V is now available!
  • [2023.11.06] Data of LLaVA-v1.5 is now available!
  • [2023.10.26] VGQA, DocVQA and DVQA are now available!
  • [2023.10.17] ScienceQA is now available!

Introduction

Currently, the visual instruction tuning data contain 20 public datasets. More datasets are coming in the future! 🔥🔥🔥

| Category | Dataset | Images | Samples | Split |
|---|---|---|---|---|
| Image captioning | COCO | 82783 | 414113 | train |
| Image captioning | Flickr30K | 29000 | 145000 | Karpathy train split |
| Image captioning | TextCaps | 21953 | 109765 | train |
| Image captioning | TextOCR-GPT4V | 25114 | 25114 | train |
| Visual question answering | VQAv2 | 82783 | 443757 | train |
| Visual question answering | OKVQA | 8998 | 9009 | train |
| Visual question answering | OCRVQA | 166041 | 801673 | train |
| Visual question answering | GQA | 72140 | 943000 | train |
| Visual question answering | TextVQA | 21953 | 34602 | train |
| Visual question answering | A-OKVQA | 16540 | 17056 | train |
| Visual question answering | ScienceQA | 6218 | 6218 | train |
| Visual question answering | Visual Genome QA (VGQA) | 99280 | 1445322 | - |
| Visual question answering | DocVQA | 10194 | 39463 | train |
| Visual question answering | DVQA | 200000 | 2325316 | train |
| Grounding | RefCOCO/RefCOCO+/RefCOCOg | 24407 | 287604 | train |
| Grounding | Shikra-RD | 883 | 5922 | train |
| GPT-4 generated | LLaVA-Instruct-150K | 81479 | 157712 | - |
| GPT-4 generated | SVIT | 108076 | 2992799 | - |
| GPT-4V generated | ShareGPT-4V | 87296 | 102025 | - |
| Mixed | LLaVA-v1.5 [1] | 291684 | 665298 | - |
| Total | | 974K [2] | 11.2M | |

[1] The bounding boxes in LLaVA-v1.5 are based on the padded image. You can find the discussion here.

[2] Images are counted by image ID, so there might be duplicate images across different image sources.

We use different strategies to collect the prompts for different tasks.

  • Image captioning. We carefully collect 5 manually written instructions and randomly sample one as the prompt for each caption. The fourth and fifth instructions are from InstructBLIP.
  • Open-ended VQA. As the answers in VQA datasets are generally short, we add an instruction after the question to ask the model to provide answers with a short sentence or phrase.
  • Multiple-choice VQA. For A-OKVQA, we add an instruction before the question to ask the model to provide answers with correct options. For ScienceQA, we use the instructions and templates designed by M3IT and randomly sample one to format the prompt. Only data with image context are involved.
  • Grounding. For RefCOCO/RefCOCO+/RefCOCOg, we use the data and templates in Shikra and randomly sample one to format the prompt.
  • GPT-4/GPT-4V generated & mixed datasets. We keep the prompts unchanged.

| Category | Data | Prompts |
|---|---|---|
| Image captioning | COCO, Flickr30K, TextCaps, TextOCR-GPT4V | Describe the image as simply as possible with a sentence or phrase. |
| | | Give a brief summary of what you see. |
| | | Provide a short description of the image. |
| | | Write a short description for the image. |
| | | Briefly describe the content of the image. |
| Open-ended VQA | VQAv2, OKVQA, OCRVQA, GQA, TextVQA, VGQA, DocVQA, DVQA | <question> Answer the question directly with a short sentence or phrase. |
| Multiple-choice VQA | A-OKVQA | Choose the correct option for the following question: <question> |
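
The prompt strategies above can be sketched in a few lines of Python. This is only an illustration, not the code used to build the released data: the function names are hypothetical, and joining the question and the instruction with a single space is an assumption. The templates for ScienceQA (from M3IT) and for grounding (from Shikra) are not reproduced here.

```python
import random

# The five captioning instructions listed in the prompts table.
CAPTION_PROMPTS = [
    "Describe the image as simply as possible with a sentence or phrase.",
    "Give a brief summary of what you see.",
    "Provide a short description of the image.",
    "Write a short description for the image.",
    "Briefly describe the content of the image.",
]
OPEN_ENDED_SUFFIX = "Answer the question directly with a short sentence or phrase."
MULTIPLE_CHOICE_PREFIX = "Choose the correct option for the following question:"


def caption_prompt(rng=random):
    """Randomly pick one captioning instruction for a caption sample."""
    return rng.choice(CAPTION_PROMPTS)


def open_ended_vqa_prompt(question):
    """Append the short-answer instruction after the question.

    Assumption: the two parts are joined by a single space.
    """
    return f"{question} {OPEN_ENDED_SUFFIX}"


def multiple_choice_prompt(question):
    """Prepend the option-selection instruction (A-OKVQA style).

    Assumption: `question` already contains the answer options.
    """
    return f"{MULTIPLE_CHOICE_PREFIX} {question}"
```

Passing a seeded `random.Random` instance as `rng` makes the sampled captioning prompt reproducible across runs.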

Quickstart

You can download the images from our HuggingFace repository or from the original websites. If you already have the images, you can skip this step, as the image IDs and file names are unchanged.

Then unzip and organize the images in the following structure.

|- images
  |- coco
    |- COCO_train2014_000000000009.jpg
    |- ...
  |- coco_2017
    |- 000000274591.jpg
    |- ...
  |- docvqa
    |- ffbf0023_4.png
    |- ...
  |- dvqa
    |- ...
  |- filckr30k
    |- 36979.jpg
    |- ...
  |- llava
    |- llava_pretrain
      |- images
  |- ocrvqa
    |- 13714.jpg
    |- ...
  |- open_images
    |- 0a0bc91825468c45.jpg
    |- ...
  |- sam
    |- images
  |- scienceqa
    |- 1
      |- image.png
    |- 2
      |- image.png
    |- ...
  |- share_textvqa
    |- images
  |- visual_genome
    |- 1.jpg
    |- ...
  |- web-celebrity
    |- images
  |- web-landmark
    |- images
  |- wikiart
    |- images

After that, you can use this directory as the --image_folder in LLaVA's training script.

For the visual instruction tuning QAs, all of the data mentioned above are already converted to the training format of LLaVA in our HuggingFace repository. You can download them directly from HuggingFace.
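
Before training, it can help to confirm that every image referenced by a downloaded annotation file exists under the image folder. The sketch below assumes the file is a JSON list of records in the LLaVA training format, where an optional `image` field holds a path relative to the image folder. The function name is hypothetical and not part of this repository.

```python
import json
from pathlib import Path


def find_missing_images(annotation_file, image_folder):
    """List image paths referenced in a LLaVA-format JSON file that are
    not present under image_folder.

    Assumption: the file is a JSON list of records whose optional
    "image" field is a path relative to image_folder.
    """
    with open(annotation_file, encoding="utf-8") as f:
        samples = json.load(f)

    root = Path(image_folder)
    missing = set()
    for sample in samples:
        image = sample.get("image")  # text-only records have no image
        if image is not None and not (root / image).is_file():
            missing.add(image)
    return sorted(missing)
```

For example, `find_missing_images("data.json", "images")` returns an empty list when the folder is complete. Here `data.json` stands in for whichever annotation file you downloaded.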

For referring QAs, the bounding box is in the form of [x1, y1, x2, y2], corresponding to the top left x, top left y, bottom right x and bottom right y. The values are float numbers normalized to [0, 1], based on the size of original images, except LLaVA-v1.5, which is based on the padded image (see more discussion here). We provide a script here to expand the bounding boxes to square.
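
The coordinate conventions can be illustrated with a minimal sketch. It is not the script provided in this repository. It assumes the padded image is the original centered on a square canvas whose side equals the longer edge, and the function names are hypothetical.

```python
def to_pixels(box, width, height):
    """Convert a normalized [x1, y1, x2, y2] box to pixel coordinates
    of an image with the given width and height."""
    x1, y1, x2, y2 = box
    return [x1 * width, y1 * height, x2 * width, y2 * height]


def expand_box_to_square(box, width, height):
    """Re-normalize a box from the original image to the square-padded image.

    Assumption: padding is split evenly on both sides of the shorter
    dimension, so the original image sits in the center of the square.
    """
    side = max(width, height)
    pad_x = (side - width) / 2   # horizontal padding on each side
    pad_y = (side - height) / 2  # vertical padding on each side
    x1, y1, x2, y2 = box
    return [
        (x1 * width + pad_x) / side,
        (y1 * height + pad_y) / side,
        (x2 * width + pad_x) / side,
        (y2 * height + pad_y) / side,
    ]
```

For a 200x100 image, a box covering the whole image, `[0.0, 0.0, 1.0, 1.0]`, becomes `[0.0, 0.25, 1.0, 0.75]` on the padded square, because 50 pixels of padding are added above and below.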

Contact

If you have any questions, you can open an issue in the GitHub repository or contact zhaobo@baai.ac.cn for more information.
