Scripts PanCancer Specific
These scripts are only relevant for users who are members of the ICGC/TCGA PanCancer working groups.
This tool allows you to selectively download either results or alignment data from the GNOS repositories (provided you have the relevant access keys).
It uses the Elasticsearch JSON file donor_p*.jsonl.gz hosted on pancancer.info as its source of information. You can override the default if you want to use a specific version during large-scale pulls.
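If you want to inspect a downloaded copy of the donor file before a large pull, a minimal Python sketch such as the following may help. It only assumes the file is gzip-compressed with one JSON object per line (as the .jsonl.gz extension suggests); the function names are hypothetical, and no field names are assumed because the record layout is not described here.

```python
import gzip
import json


def iter_jsonl_gz(path):
    """Yield one decoded JSON object per non-empty line of a gzipped JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_records(path):
    """Count the records in the file without holding them all in memory."""
    return sum(1 for _ in iter_jsonl_gz(path))
```

Calling count_records on a local copy of the file gives a quick sanity check that the version you pinned contains roughly the number of donors you expect.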
The main utility of this tool is the ability to subset and filter what you are retrieving by setting filters in an ini file. The example included in the codebase details all the available options and filters. For example:
[COMPOSITE_FILTERS]
multi_tumour=1
This results in the retrieval of variant calling data for all donors that have multiple tumour files:
$ ./gnos_pull.pl -o $SCRATCH -c .../gnos_pull.ini -a CALLS -i
Retained donors: 21
Rejected donors: 2471
Project Distribution
CMDI-UK: 16 (avg ? GB)
EOPC-DE: 3 (avg ? GB)
PACA-CA: 2 (avg ? GB)
NOTE: Sizes are shown as "avg ? GB" because this data was not populated in the jsonl at the time of writing.
Alternatively, setting the command line arg -a to ALIGNMENTS will give details regarding the BAM files:
$ ./gnos_pull.pl -o $SCRATCH -c .../gnos_pull.ini -a ALIGNMENTS -i
Retained donors: 54
Rejected donors: 2438
Project Distribution
CMDI-UK: 31 (avg 376 GB)
EOPC-DE: 8 (avg 679 GB)
LIRI-JP: 10 (avg 333 GB)
PACA-CA: 2 (avg 444 GB)
PRAD-UK: 3 (avg 756 GB)
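The summary above can be turned into a rough estimate of the total transfer volume before committing to a pull. The following Python sketch is not part of gnos_pull.pl; it assumes the reported average is per retained donor and that the output keeps the "PROJECT: N (avg X GB)" layout shown above. The function name is hypothetical.

```python
import re

# Summary text copied from the ALIGNMENTS example output.
SUMMARY = """\
Project Distribution
CMDI-UK: 31 (avg 376 GB)
EOPC-DE: 8 (avg 679 GB)
LIRI-JP: 10 (avg 333 GB)
PACA-CA: 2 (avg 444 GB)
PRAD-UK: 3 (avg 756 GB)
"""

# Matches lines such as "CMDI-UK: 31 (avg 376 GB)" or "CMDI-UK: 16 (avg ? GB)".
LINE_RE = re.compile(
    r"^(?P<project>[\w-]+):\s+(?P<donors>\d+)\s+\(avg\s+(?P<avg>\d+|\?)\s+GB\)$"
)


def estimate_total_gb(summary):
    """Sum donors * average size per project; return None if any size is unknown."""
    total = 0
    for line in summary.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # header or donor-count lines
        if match["avg"] == "?":
            return None  # size not populated, no estimate possible
        total += int(match["donors"]) * int(match["avg"])
    return total


print(estimate_total_gb(SUMMARY))  # 23574
```

Under those assumptions the ALIGNMENTS example corresponds to roughly 23,574 GB, which is worth checking against the space available in the output directory.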
For data hosted in multiple GNOS repositories you can give the script a list of the repositories in order of descending transfer rate (at present this has to be constructed by hand using the appropriate transfer table on pancancer.info):
[TRANSFER]
order=<<EOT
https://gtrepo-dkfz.annailabs.com/
https://gtrepo-ebi.annailabs.com/
https://gtrepo-etri.annailabs.com/
EOT
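Since the ordering has to be built by hand, a small helper can reduce transcription mistakes. The Python sketch below sorts repositories by a transfer rate you have read off the transfer table and prints a [TRANSFER] section in the format shown above. The rates are made-up placeholder numbers, not real measurements, and the function name is hypothetical.

```python
def transfer_section(rates):
    """Build a [TRANSFER] section with repositories in descending transfer rate.

    rates maps repository URL -> transfer rate (any unit, as long as it is
    consistent across repositories).
    """
    ordered = sorted(rates, key=rates.get, reverse=True)
    return "\n".join(["[TRANSFER]", "order=<<EOT", *ordered, "EOT"])


# Hypothetical rates for illustration only; take real values from the
# transfer table for your own site.
example_rates = {
    "https://gtrepo-ebi.annailabs.com/": 40.0,
    "https://gtrepo-dkfz.annailabs.com/": 55.0,
    "https://gtrepo-etri.annailabs.com/": 12.5,
}

print(transfer_section(example_rates))
```

With these placeholder rates the output reproduces the example section above (dkfz, then ebi, then etri).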
Please see the command line help for the most current details of the command line options:
./gnos_pull.pl -h
NOTE: If GNOS hangs you will have to kill the process and restart the script; it will resume from the last completed dataset.
Generates submission XML files required to submit data to the PanCancer GNOS instances. Details of how to prepare your BAM/FASTQ data can be found on the OICR-PanCancer Wiki (requires login).
Regenerates a *.bas file from an analysisFull.xml URI. PCAP::Bam::Stats is able to parse this correctly regardless of column ordering (which may be inconsistent).
e.g.
$ xml_to_bas.pl -d https://gtrepo-ebi.annailabs.com/cghub/metadata/analysisFull/4e183691-ba1f-4103-a517-948f363928b8
read_group_id #_divergent_bases #_divergent_bases_r1 #_divergent_bases_r2 #_duplicate_reads #_gc_bases_r1 #_gc_bases_r2 #_mapped_bases #_mapped_bases_r1 #_mapped_bases_r2 #_mapped_reads #_mapped_reads_properly_paired #_mapped_reads_r1 #_mapped_reads_r2 #_total_reads #_total_reads_r1 #_total_reads_r2 bam_filename insert_size_sd library mean_insert_size median_insert_size platform platform_unit read_length_r1 read_length_r2 readgroup sample
WTSI25941 109656822 53400157 56256665 0 4900621926 4913073665 23957071382 12092611993 11864459389 242992645 91741465 121667967 121324678 243357940 121678970 121678970 out_4.bam 83.731 WGS:WTSI:12490 408.841 400.000 ILLUMINA WTSI:7119_5 100 100 WTSI25941 f8467ec8-2d5e-ba21-e040-11ac0c483584
WTSI25938 105448163 50962194 54485969 0 5059285020 5071727818 24756788121 12492499609 12264288512 251054082 94753203 125696465 125357617 251410540 125705270 125705270 out_1.bam 83.591 WGS:WTSI:12490 409.087 400.000 ILLUMINA WTSI:7119_2 100 100 WTSI25938 f8467ec8-2d5e-ba21-e040-11ac0c483584
WTSI25937 124757189 68266593 56490596 0 5003807799 5007016723 24418124464 12326534108 12091590356 247931492 93471492 124212116 123719376 248447342 124223671 124223671 out_0.bam 83.328 WGS:WTSI:12490 408.817 400.000 ILLUMINA WTSI:7119_1 100 100 WTSI25937 f8467ec8-2d5e-ba21-e040-11ac0c483584
WTSI25940 120711275 58333598 62377677 0 4926375001 4938756287 24057793106 12144140521 11913652585 244147184 92150919 122241408 121905776 244501826 122250913 122250913 out_3.bam 83.724 WGS:WTSI:12490 408.665 400.000 ILLUMINA WTSI:7119_4 100 100 WTSI25940 f8467ec8-2d5e-ba21-e040-11ac0c483584
WTSI25939 110785952 52341213 58444739 0 5110036344 5124651899 24997394847 12618022324 12379372523 253555020 95683446 126945749 126609271 253911232 126955616 126955616 out_2.bam 83.557 WGS:WTSI:12490 408.982 400.000 ILLUMINA WTSI:7119_3 100 100 WTSI25939 f8467ec8-2d5e-ba21-e040-11ac0c483584