
# Datasets

We provide links for downloading the raw datasets used in our paper (see the Data Availability section), and we share our preprocessing Python scripts in the `./scripts/preprocess/` folder. Before processing the data, put the downloaded compressed file into `./datasets/` and uncompress it (renaming the folder if required). You can also process the data on your own by following the instructions given by OFA. Several useful notes are listed below. Additionally, for convenience, we provide some preprocessed data for fine-tuning and evaluation. However, for restricted datasets such as MIMIC-CXR, please follow our preprocessing code to handle them yourself.
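As a minimal sketch of the "download, place, uncompress" step (this is not a repo script; the archive and folder names below are hypothetical examples, and the sketch assumes a `.tar.gz` archive):

```python
# Sketch: place a downloaded archive into ./datasets/ and uncompress it.
import tarfile
from pathlib import Path

datasets_dir = Path("./datasets")
datasets_dir.mkdir(exist_ok=True)

archive = datasets_dir / "vqa_rad.tar.gz"  # hypothetical archive name
with tarfile.open(archive, "r:gz") as tar:
    tar.extractall(path=datasets_dir)

# Rename the extracted folder if the preprocessing scripts expect a
# different name (both names here are hypothetical).
extracted = datasets_dir / "VQA_RAD"
if extracted.exists():
    extracted.rename(datasets_dir / "vqa-rad")
```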

## Pre-processed Downstream Datasets (tsv files)

## Data Preparation Notes

- Before preprocessing the VQA-RAD dataset, inspect the data for any embedded `\t` characters and remove them manually, e.g., changing `slee\tn` to `sleen`. Since the preprocessed files are tab-separated, stray tabs corrupt the columns, and skipping this step can lead to errors during training. A sketch for spotting such characters appears after this list.
- Preprocessing the MedMNIST dataset takes two steps: first, convert the `.npy` files to `.png` images with `python medmnist.py --mode 0`; then convert those `.png` images into a `.tsv` file with `python medmnist.py --mode 1`. A rough sketch of the first step is shown after this list.
- For pretraining, we provide sample scripts for preprocessing the image-infilling, text-only, captioning, and VQA data, respectively. You can process any data you want by following the same logic; remember to concatenate the captioning and VQA datasets into `vision_language.tsv`. Shuffling is a good choice for pretraining, e.g., `shuf -o your_file.tsv your_file.tsv` (a Python equivalent is sketched after this list).
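
A minimal sketch for the VQA-RAD note above, flagging embedded tab characters before they break the tab-separated output. It assumes the raw annotations are a JSON list of records; the file path is a hypothetical example:

```python
# Sketch: report any string field that contains an embedded tab character.
import json

with open("datasets/vqa-rad/trainset.json", encoding="utf-8") as f:  # hypothetical path
    records = json.load(f)

for i, rec in enumerate(records):
    for key, value in rec.items():
        if isinstance(value, str) and "\t" in value:
            print(f"record {i}, field '{key}': {value!r}")
            # Fix by hand as recommended above, or programmatically:
            # rec[key] = value.replace("\t", "")
```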
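A rough sketch of what the MedMNIST `--mode 0` step does (this is not the repo's actual `medmnist.py`; the input file name and output layout are assumptions):

```python
# Sketch: convert a MedMNIST image array (.npy) into individual .png files.
import numpy as np
from PIL import Image
from pathlib import Path

out_dir = Path("datasets/medmnist/images")  # hypothetical output folder
out_dir.mkdir(parents=True, exist_ok=True)

images = np.load("datasets/medmnist/train_images.npy")  # hypothetical file
for idx, arr in enumerate(images):
    # Works for grayscale (H, W) and RGB (H, W, 3) uint8 arrays alike.
    Image.fromarray(arr.astype(np.uint8)).save(out_dir / f"train_{idx}.png")
```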
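And a minimal sketch of the final pretraining step: concatenating the captioning and VQA `.tsv` files into `vision_language.tsv` and shuffling the rows (the input paths are hypothetical; `shuf` on the command line achieves the same shuffle):

```python
# Sketch: concatenate two .tsv files and shuffle the combined rows.
import random

rows = []
for path in ["datasets/pretrain/caption.tsv", "datasets/pretrain/vqa.tsv"]:  # hypothetical
    with open(path, encoding="utf-8") as f:
        rows.extend(f.readlines())

random.shuffle(rows)  # same effect as `shuf -o ...` shown above
with open("datasets/pretrain/vision_language.tsv", "w", encoding="utf-8") as f:
    f.writelines(rows)
```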