Medical AI for Synthetic Imaging (MAISI) Data Preparation

Disclaimer: We are not the hosts of the data. Please make sure to read the requirements and usage policies of the data and give credit to the authors of the datasets!

1 VAE training Data

For the released foundation autoencoder model weights in MAISI, we used 37243 CT training samples and 1963 CT validation samples from the chest, abdomen, head and neck regions, and 17887 MRI training samples and 940 MRI validation samples from the brain, skull-stripped brain, chest, and below-abdomen regions. The training data come from TCIA Covid 19 Chest CT, TCIA Colon Abdomen CT, MSD03 Liver Abdomen CT, LIDC chest CT, TCIA Stony Brook Covid Chest CT, NLST Chest CT, TCIA Upenn GBM Brain MR, Aomic Brain MR, QTIM Brain MR, TCIA Acrin Chest MR, and TCIA Prostate MR Below-Abdomen MR.

In total, we included:

| Index | Dataset Name | Number of Training Data | Number of Validation Data |
|---|---|---|---|
| 1 | Covid 19 Chest CT | 722 | 49 |
| 2 | TCIA Colon Abdomen CT | 1522 | 77 |
| 3 | MSD03 Liver Abdomen CT | 104 | 0 |
| 4 | LIDC chest CT | 450 | 24 |
| 5 | TCIA Stony Brook Covid Chest CT | 2644 | 139 |
| 6 | NLST Chest CT | 31801 | 1674 |
| 7 | TCIA Upenn GBM Brain MR (skull-stripped) | 2550 | 134 |
| 8 | Aomic Brain MR | 2630 | 138 |
| 9 | QTIM Brain MR | 1275 | 67 |
| 10 | Acrin Chest MR | 6599 | 347 |
| 11 | TCIA Prostate MR Below-Abdomen MR | 928 | 49 |
| 12 | Aomic Brain MR, skull-stripped | 2630 | 138 |
| 13 | QTIM Brain MR, skull-stripped | 1275 | 67 |
| | Total CT | 37243 | 1963 |
| | Total MRI | 17887 | 940 |
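
The CT and MRI totals can be re-derived from the per-dataset rows. The following is a minimal sketch that copies the counts from the table above and sums them per modality; the dictionary layout and the `totals` helper are illustrative choices, not part of MAISI.

```python
# (training, validation) counts copied from the table above.
ct_rows = {
    "Covid 19 Chest CT": (722, 49),
    "TCIA Colon Abdomen CT": (1522, 77),
    "MSD03 Liver Abdomen CT": (104, 0),
    "LIDC chest CT": (450, 24),
    "TCIA Stony Brook Covid Chest CT": (2644, 139),
    "NLST Chest CT": (31801, 1674),
}
mri_rows = {
    "TCIA Upenn GBM Brain MR (skull-stripped)": (2550, 134),
    "Aomic Brain MR": (2630, 138),
    "QTIM Brain MR": (1275, 67),
    "Acrin Chest MR": (6599, 347),
    "TCIA Prostate MR Below-Abdomen MR": (928, 49),
    "Aomic Brain MR, skull-stripped": (2630, 138),
    "QTIM Brain MR, skull-stripped": (1275, 67),
}


def totals(rows):
    """Return (total training, total validation) for one modality."""
    train = sum(t for t, _ in rows.values())
    val = sum(v for _, v in rows.values())
    return train, val


ct_total = totals(ct_rows)
mri_total = totals(mri_rows)
print("CT:", ct_total, "MRI:", mri_total)
```

Both sums match the "Total CT" and "Total MRI" rows of the table.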

2 Diffusion model training Data

The training dataset for the Diffusion model used in MAISI comprises 10,277 CT volumes from 24 distinct datasets, encompassing various body regions and disease patterns.

The table below provides a summary of the number of volumes for each dataset.

| Index | Dataset name | Number of volumes |
|---|---|---|
| 1 | AbdomenCT-1K | 789 |
| 2 | AeroPath | 15 |
| 3 | AMOS22 | 240 |
| 4 | autoPET23 | 200 |
| 5 | Bone-Lesion | 223 |
| 6 | BTCV | 48 |
| 7 | COVID-19 | 524 |
| 8 | CRLM-CT | 158 |
| 9 | CT-ORG | 94 |
| 10 | CTPelvic1K-CLINIC | 94 |
| 11 | LIDC | 422 |
| 12 | MSD Task03 | 88 |
| 13 | MSD Task06 | 50 |
| 14 | MSD Task07 | 224 |
| 15 | MSD Task08 | 235 |
| 16 | MSD Task09 | 33 |
| 17 | MSD Task10 | 87 |
| 18 | Multi-organ-Abdominal-CT | 65 |
| 19 | NLST | 3109 |
| 20 | Pancreas-CT | 51 |
| 21 | StonyBrook-CT | 1258 |
| 22 | TCIA_Colon | 1437 |
| 23 | TotalSegmentatorV2 | 654 |
| 24 | VerSe | 179 |

3 ControlNet model training Data

3.1 Example preprocessed dataset

We provide the preprocessed subset of the C4KC-KiTS dataset used in the finetuning config environment_maisi_controlnet_train.json. The dataset and the corresponding JSON data list can be downloaded and should be saved in the maisi/dataset/ folder.

The structure of an example folder in the preprocessed dataset is:

            |-*arterial*.nii.gz               # original image
            |-*arterial_emb*.nii.gz           # encoded image embedding
KiTS-000* --|-mask*.nii.gz                    # original labels
            |-mask_pseudo_label*.nii.gz       # pseudo labels
            |-mask_combined_label*.nii.gz     # combined mask of original and pseudo labels

An example combined mask of original and pseudo labels is shown below:

[Figure: example_combined_mask]

Please note that the Kidney Tumor label is mapped to index 129 in this preprocessed dataset. The encoded image embedding is generated during preprocessing by the provided autoencoder in ./models/autoencoder_epoch273.pt, which reduces memory usage during training. The pseudo labels are generated by VISTA 3D. In addition, the dimensions of each volume and its corresponding pseudo label are resampled to the closest multiple of 128 (e.g., 128, 256, 384, 512, ...).
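
To illustrate the "closest multiple of 128" rule, here is a minimal sketch that computes a target shape from an input shape. It only covers the size arithmetic, not the resampling itself. The function names are hypothetical, and two details are assumptions the text does not specify: exact ties round up, and sizes never drop below 128.

```python
def closest_multiple(size, base=128):
    """Round size to the nearest multiple of base.

    Assumptions for this sketch: exact ties round up, and the
    result is never smaller than base.
    """
    rounded = (size + base // 2) // base * base
    return max(base, rounded)


def target_dim(shape, base=128):
    """Apply closest_multiple to every axis of a volume shape."""
    return [closest_multiple(s, base) for s in shape]


# Example: a 500 x 500 x 130 volume
print(target_dim((500, 500, 130)))
```

Under these assumptions, a 500 x 500 x 130 volume maps to a target shape of 512 x 512 x 128.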

The training workflow requires one JSON file that specifies the image embedding and segmentation pairs. An example file is located at maisi/dataset/C4KC-KiTS_subset.json.

The JSON file has the following structure:

{
    "training": [
        {
            "image": "*/*arterial_emb*.nii.gz",  # relative path to the image embedding file
            "label": "*/mask_combined_label*.nii.gz",  # relative path to the combined label file
            "dim": [512, 512, 512],  # the dimension of image
            "spacing": [1.0, 1.0, 1.0],  # the spacing of image
            "top_region_index": [0, 1, 0, 0],  # the top region index of the image
            "bottom_region_index": [0, 0, 0, 1],  # the bottom region index of the image
            "fold": 0  # fold index for cross validation, fold 0 is used for training
        },

        ...
    ]
}
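
Standard JSON does not permit comments, so the `#` annotations above are presumably for documentation only and must not appear in a file that is parsed with a JSON parser. The sketch below shows one way to load such a data list, check that each entry carries the keys shown above, and group entries by fold. The helper names and the placeholder paths are hypothetical, and treating every key as required is an assumption.

```python
import json
from collections import defaultdict

# Keys taken from the example entry above; treating all of them as
# required is an assumption made for this sketch.
REQUIRED_KEYS = {
    "image", "label", "dim", "spacing",
    "top_region_index", "bottom_region_index", "fold",
}


def load_data_list(text):
    """Parse a data list and check each entry under "training"."""
    entries = json.loads(text)["training"]
    for i, entry in enumerate(entries):
        missing = REQUIRED_KEYS - set(entry)
        if missing:
            raise ValueError(f"entry {i} is missing keys: {sorted(missing)}")
        if len(entry["dim"]) != 3 or len(entry["spacing"]) != 3:
            raise ValueError(f"entry {i}: dim and spacing need 3 values each")
    return entries


def group_by_fold(entries):
    """Map each fold index to the entries assigned to it."""
    folds = defaultdict(list)
    for entry in entries:
        folds[entry["fold"]].append(entry)
    return dict(folds)


# Hypothetical two-entry data list; the paths are placeholders.
EXAMPLE = """
{
    "training": [
        {"image": "case_a/arterial_emb.nii.gz",
         "label": "case_a/mask_combined_label.nii.gz",
         "dim": [512, 512, 512], "spacing": [1.0, 1.0, 1.0],
         "top_region_index": [0, 1, 0, 0],
         "bottom_region_index": [0, 0, 0, 1], "fold": 0},
        {"image": "case_b/arterial_emb.nii.gz",
         "label": "case_b/mask_combined_label.nii.gz",
         "dim": [512, 512, 256], "spacing": [1.0, 1.0, 1.0],
         "top_region_index": [0, 1, 0, 0],
         "bottom_region_index": [0, 0, 0, 1], "fold": 1}
    ]
}
"""

entries = load_data_list(EXAMPLE)
folds = group_by_fold(entries)
print({fold: len(items) for fold, items in folds.items()})
```

For a real data list, the string would be read from the JSON file instead of the inline `EXAMPLE`.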

3.2 ControlNet full training datasets

The ControlNet training dataset used in MAISI contains 6330 CT volumes (5058 and 1272 volumes are used for training and validation, respectively) across 20 datasets and covers different body regions and diseases.

The table below summarizes the number of volumes for each dataset.

| Index | Dataset name | Number of volumes |
|---|---|---|
| 1 | AbdomenCT-1K | 789 |
| 2 | AeroPath | 15 |
| 3 | AMOS22 | 240 |
| 4 | Bone-Lesion | 237 |
| 5 | BTCV | 48 |
| 6 | CT-ORG | 94 |
| 7 | CTPelvic1K-CLINIC | 94 |
| 8 | LIDC | 422 |
| 9 | MSD Task03 | 105 |
| 10 | MSD Task06 | 50 |
| 11 | MSD Task07 | 225 |
| 12 | MSD Task08 | 235 |
| 13 | MSD Task09 | 33 |
| 14 | MSD Task10 | 101 |
| 15 | Multi-organ-Abdominal-CT | 64 |
| 16 | Pancreas-CT | 51 |
| 17 | StonyBrook-CT | 1258 |
| 18 | TCIA_Colon | 1436 |
| 19 | TotalSegmentatorV2 | 654 |
| 20 | VerSe | 179 |

4 Questions and bugs

  • For questions relating to the use of MONAI, please use our Discussions tab on the main repository of MONAI.
  • For bugs relating to MONAI functionality, please create an issue on the main repository.
  • For bugs relating to the running of a tutorial, please create an issue in this repository.
