This repository contains the instructions of how to download the diagnostic slides for the lung portion of the TCGA dataset. It will require ~800GB of space.
TCGA lung also has tissue slides which are were not diagnostic. Experimental strategy can be Tissue Slide (non-diagnostic) or/and Diagnostic Slide.
Important note
- Patient ID is the first 12 characters of the slide name, e.g.
TCGA-50-5066
- Case ID is the first 15 characters of the slide name, e.g.
TCGA-50-5066-01
orTCGA-50-5066-02
- Slide name, e.g.
TCGA-50-5066-01Z-00-DX1.e161df31-84a4-40a4-a6a2-748b60820f77
contains the slide nameTCGA-50-5066-01Z-00-DX1
and some uide161df31-84a4-40a4-a6a2-748b60820f77
; all slide names in the downloaded dataset are unique and so are the uid's so that there is a one-to-one mapping between the slide names and the uid's.
Source: https://docs.gdc.cancer.gov/Encyclopedia/pages/TCGA_Barcode/#creating-barcodes
This explains why the web page refers to 478 cases for LUAD and 478 cases for LUSC, while the manifest files contain 479 cases for LUAD and 478 cases for LUSC. The web page really refers to the patients, while the manifest files refer to the cases.
Patient TCGA-50-5066
has 2 cases for LUAD. The case IDs are:
TCGA-50-5066-01
with diagnostic slide:TCGA-50-5066-01Z-00-DX1
TCGA-50-5066-02
with diagnostic slide:TCGA-50-5066-02Z-00-DX1
Every other patient has only 1 case per patient.
- Make sure you have enough disk space: ~800GB.
- Download the
gdc-client
Data Transfer Tool binaries from https://gdc.cancer.gov/access-data/gdc-data-transfer-tool and add it to your PATH or into/usr/local/bin
if you have this directory (it's usually already added to the PATH). I did not have any problems with it on Linux, however on Apple devices you can run into "MacOS cannot verify app is free from malware" which can be solved as described here: https://gadgetstouse.com/blog/2021/04/08/fix-macos-cannot-verify-app-is-free-from-malware/ - Clone this repository and
cd
into it.
If you want to make sure that the manifests have not changed, download new ones from TCGA data portal and check them. Example of checking the versions from 2023-10-03 vs 2021-11-03. The manifests have not changed.
cd tcga-download
for file in gdc_manifest.2023-10-03-TCGA-LUSC.txt gdc_manifest.2023-10-03-TCGA-LUAD.txt gdc_manifest.2021-11-03-TCGA-LUSC.txt gdc_manifest.2021-11-03-TCGA-LUAD.txt; do
sorted_file="${file%.txt}-sorted.txt"
echo -e "id\tfilename\tmd5\tsize\tstate" > "$sorted_file"
tail -n +2 "$file" | sort -k2,2 >> "$sorted_file"
done
diff gdc_manifest.2023-10-03-TCGA-LUAD-sorted.txt gdc_manifest.2021-11-03-TCGA-LUAD-sorted.txt
diff gdc_manifest.2023-10-03-TCGA-LUSC-sorted.txt gdc_manifest.2021-11-03-TCGA-LUSC-sorted.txt
cd ..
- Create
./WSI/LUSC/
and./WSI/LUAD
folders. - If you choose to change the folder structure, make changes to
./tcga-download/config-LUSC.dtt
./tcga-download/config-LUAD.dtt
- Run:
bash ./0-download-LUSC-and-LUAD.sh
to download the files. It will take a while. Restarting the download is not advisable. I am not sure, but I think the manifest file will need to be modified: already downloaded files fill need to be excluded. See: https://docs.gdc.cancer.gov/Data_Transfer_Tool/Users_Guide/Data_Download_and_Upload/#resuming-a-failed-download
Tip: I used tmux
for the process to continue on a remote surver after I
closed the connection. See this
how-to
on StackExchange.
- Check that the downloaded slides were not currupted during the download:
md5sum ./WSI/*/*.svs > downloaded_md5sum_hashes.txt
The hashes should match the ones in ./tcga_download/
manifest files for LUAD and LUSC.
The code to parse the manifest files and downloaded_md5sum_hashes.txt
and check the matches is in 2-check-names.ipynb.
If you already have a file with md5 checksums, you can use a trick shown here: https://askubuntu.com/questions/318530/generate-md5-checksum-for-all-files-in-a-directory
-
./tcga-download/ folder was originally copied from https://github.com/binli123/dsmil-wsi/tree/master/tcga-download. Some names present in the manifest files were not available for download with the
gdc-client
. So new manifest files were downloaded from these web pages on on 03/11/2021 (date is in the names). The manifest files are:- TCGA-LUAD
- manifest on TCGA portal
- downloaded manifest from 2021-11-03: 541 slides from 478 patients with 479 cases
- TCGA-LUSC
- manifest on TCGA portal
- downloaded manifest from 2021-11-03: 512 slides from 478 patients with 478 cases
- TCGA-LUAD
-
./0-download-LUSC-and-LUAD.sh contains commands to download 3 diagnostic slides (check that everything is fine first) from both the LUAD (https://portal.gdc.cancer.gov/projects/TCGA-LUAD) and the LUSC (https://portal.gdc.cancer.gov/projects/TCGA-LUSC) sets of the TCGA into
./WSI/LUSC/
and./WSI/LUAD/
respectively. To download all files, remove "-pilot" from the commands in./0-download-LUSC-and-LUAD.sh
. The destinations can be changed in the configuration files:./tcga-download/config-LUSC.dtt
./tcga-download/config-LUAD.dtt
-
./WSI/
folder contains 2 subfolders./WSI/LUSC/
and./WSI/LUAD
, which in turn contain the diagnostic slides. These folders are not present in this repository and will have to be made. -
./dsmil-split/ directory contains the information from the DSMIL-WSI (paper, code) on this dataset. See section "Corrupted Slides Excluded in DSMIL-WSI work" of this README for more details.
-
./2-check-names.ipynb contains code to check that the downloaded slides are not corrupted and that the names of the slides match the names in the manifest files. It also creates ./classes_extended_info.csv file.
-
./classes_extended_info.csv was created using ./2-check-names.ipynb contains the patient ID, case ID, slide ID, slide md5sum, for each slide. The file was created by combining
- list of the downloaded slides
- md5sum hashes of the downloaded slides
- manifest files for LUAD and LUSC
Corrupted Slides Excluded in DSMIL-WSI work
There seem to be some corrupted files that were excluded from the dataset in DSMIL-WSI work. see issue that gives a Google Drive Link to the TCGA-lung dataset. When using the code from the dsmil-wsi repo to download pre-trained features for TCGA-lung, the excluded set is different. The names of the folders within the google drive folder have changed, however, the slide names contain the patient ID (first 12 characters) and case ID (first 15 characters). See ./classes_extended_info.csv. Use ./2-check-names.ipynb code to investigate and choose which of the slides you want to exclude.
My investigation results:
-
All of the slides have a significantly darker background around the tissue.
-
In Google Drive version, 11 LUAD parients 1 case per patient and 1 slide per case were excluded
patient_id | case_id | slide_id_short |
---|---|---|
TCGA-05-4384 | TCGA-05-4384-01 | TCGA-05-4384-01Z-00-DX1 |
TCGA-05-4390 | TCGA-05-4390-01 | TCGA-05-4390-01Z-00-DX1 |
TCGA-05-4410 | TCGA-05-4410-01 | TCGA-05-4410-01Z-00-DX1 |
TCGA-05-4425 | TCGA-05-4425-01 | TCGA-05-4425-01Z-00-DX1 |
TCGA-05-5420 | TCGA-05-5420-01 | TCGA-05-5420-01Z-00-DX1 |
TCGA-05-5423 | TCGA-05-5423-01 | TCGA-05-5423-01Z-00-DX1 |
TCGA-05-5425 | TCGA-05-5425-01 | TCGA-05-5425-01Z-00-DX1 |
TCGA-05-5428 | TCGA-05-5428-01 | TCGA-05-5428-01Z-00-DX1 |
TCGA-05-5429 | TCGA-05-5429-01 | TCGA-05-5429-01Z-00-DX1 |
TCGA-05-5715 | TCGA-05-5715-01 | TCGA-05-5715-01Z-00-DX1 |
TCGA-44-7661 | TCGA-44-7661-01 | TCGA-44-7661-01Z-00-DX1 |
- In GitHub version, 7 out of the Google Drive's 11 patients with their 7 cases and 7 slides were exluded. The remaining 4 patients with 4 cases and 4 slides were not excluded.
patient_id | case_id | slide_id_short |
---|---|---|
TCGA-05-4384 | TCGA-05-4384-01 | TCGA-05-4384-01Z-00-DX1 |
TCGA-05-4410 | TCGA-05-4410-01 | TCGA-05-4410-01Z-00-DX1 |
TCGA-05-4425 | TCGA-05-4425-01 | TCGA-05-4425-01Z-00-DX1 |
TCGA-05-5420 | TCGA-05-5420-01 | TCGA-05-5420-01Z-00-DX1 |
TCGA-05-5423 | TCGA-05-5423-01 | TCGA-05-5423-01Z-00-DX1 |
TCGA-05-5425 | TCGA-05-5425-01 | TCGA-05-5425-01Z-00-DX1 |
TCGA-05-5715 | TCGA-05-5715-01 | TCGA-05-5715-01Z-00-DX1 |
- Test set form google drive has 1 slide that is also in the excluded set from google drive:
TCGA-05-4390-01Z-00-DX1
. However, all slides from the test set on google drive are included in the slides on GitHub.
Decision: I will use the GitHub version of the dataset (excludes the 7 patients with 7 cases and 7 slides). I will use the test set from google drive as the test set to be able to make a direct comparison to the DSMIL-WSI results since this test set is fully included in the slides on GitHub.