We provide links to download the raw datasets in our paper (the Data Availability section), and we share our preprocessing Python scripts in the `./scripts/preprocess/` folder. Before processing the data, put the downloaded compressed file into `./datasets/` and uncompress it (rename the folder if required). You can also process the data on your own according to the instructions given by OFA. There are several useful notes below. Additionally, for convenience, we also provide some preprocessed data for fine-tuning and evaluation. However, for restricted datasets such as MIMIC-CXR, please follow our processing code to handle them yourself.
- VQA: PathVQA, SLAKE, VQA-RAD
- Image Captioning: IU X-Ray, Peir Gross
- Image Classification: MedMNIST (224*224)
- Conversation Summarization: HealthcareMagic, MeQSum
- Text Understanding: TREC'22 (clinical trial matching)
- Before preprocessing the VQA-RAD dataset, inspect the data and search for any instances of `\t`. These instances might cause issues, and it's recommended to remove them manually, e.g., changing `slee\t n` to `sleen`. Skipping this step and proceeding with preprocessing can lead to errors during training (a minimal cleaning sketch is given after these notes).
- For preprocessing the MedMNIST dataset, two steps are employed: first, the `.npy` files are converted to `.png` images using `python medmnist.py --mode 0`; then, these `.png` images are converted into a `.tsv` file using the same script with `--mode 1` (see the sketch after these notes for the rough conversion logic).
- For pretraining, we provide sample code for preprocessing image-infilling, text-only, captioning, and VQA data, respectively. You can process any data you want by following the same logic; remember to concatenate the captioning and VQA datasets into `vision_language.tsv` (a small concatenation/shuffling sketch is given below). Shuffling is a good choice for pretraining, e.g., `shuf -o your_file.tsv your_file.tsv`.
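
For the VQA-RAD note above, here is a minimal cleaning sketch. It assumes the raw annotations live in a single text/JSON file (the path below is a placeholder) and that the stray sequences may be either real tab characters or the literal two-character text `\t`; adjust to match your download.

```python
# Minimal sketch: strip stray tab sequences from the raw VQA-RAD annotations
# before running the preprocessing script. This rewrites the file in place.
import re

raw_path = "datasets/vqa-rad/vqa_rad_raw.json"  # hypothetical path to the raw annotations

with open(raw_path, "r", encoding="utf-8") as f:
    text = f.read()

# Drop real tab characters as well as the literal two-character sequence "\t",
# together with any trailing spaces, e.g. "slee\t n" -> "sleen".
cleaned = re.sub(r"(\t|\\t) *", "", text)

with open(raw_path, "w", encoding="utf-8") as f:
    f.write(cleaned)
```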
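For the MedMNIST note, the sketch below only illustrates the rough logic of the two steps, assuming each `.npy` file holds an image array of shape (N, H, W) or (N, H, W, 3) with a matching label array, and that the target `.tsv` stores one base64-encoded image plus its label per row; the file names and the exact column layout are assumptions, so treat the provided `medmnist.py` as the reference implementation.

```python
# Rough sketch of the two MedMNIST conversion steps (assumed layout; see medmnist.py).
import base64
import csv
import os

import numpy as np
from PIL import Image

# Hypothetical input files: one array of images and one of labels.
images = np.load("datasets/medmnist/pathmnist_images.npy")
labels = np.load("datasets/medmnist/pathmnist_labels.npy")

os.makedirs("datasets/medmnist/png", exist_ok=True)

# Step 1 (--mode 0): save every array slice as a .png image,
# upsampled to the 224x224 resolution mentioned above.
png_paths = []
for i, arr in enumerate(images):
    img = Image.fromarray(arr.astype(np.uint8)).convert("RGB").resize((224, 224))
    path = f"datasets/medmnist/png/{i}.png"
    img.save(path)
    png_paths.append(path)

# Step 2 (--mode 1): pack the .png images into a tab-separated file,
# one base64-encoded image and its label per row.
with open("datasets/medmnist/medmnist.tsv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="\t")
    for path, label in zip(png_paths, labels):
        with open(path, "rb") as img_f:
            b64 = base64.b64encode(img_f.read()).decode("utf-8")
        writer.writerow([b64, int(np.ravel(label)[0])])
```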
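Finally, for the pretraining note, a simple way to build `vision_language.tsv` is to concatenate the captioning and VQA `.tsv` files and shuffle the rows. The snippet below is a Python equivalent of `cat` followed by the `shuf` command above; the input file names are placeholders.

```python
# Concatenate the captioning and VQA tsv files into vision_language.tsv and shuffle the rows.
import random

sources = ["caption_train.tsv", "vqa_train.tsv"]  # hypothetical input files

rows = []
for path in sources:
    with open(path, "r", encoding="utf-8") as f:
        # Normalize line endings so files without a trailing newline don't merge rows.
        rows.extend(line.rstrip("\n") + "\n" for line in f if line.strip())

random.shuffle(rows)

with open("vision_language.tsv", "w", encoding="utf-8") as f:
    f.writelines(rows)
```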