Simplify spark_eval scripts and improve documentation. (#3580)
tomwhite authored Sep 27, 2017
1 parent 61542a0 commit 847048d
Showing 35 changed files with 314 additions and 308 deletions.
102 changes: 77 additions & 25 deletions scripts/spark_eval/README.md
@@ -1,69 +1,121 @@
# Spark Evaluation

This directory contains scripts for testing Spark command lines. It is based on this [test definition document](https://docs.google.com/document/d/1OEfV2XNXdbGQQdW-gRaQYY2QRgGsWHUNKJUSAj3qFlE/edit), but also has scripts for running full pipelines.
This directory contains scripts for testing GATK pipelines on Spark - either on a dedicated cluster or on Google Cloud Dataproc.

## TL;DR

```bash
export API_KEY=...
export GCS_CLUSTER=...

# Sanity check on small data (a few mins)
./run_gcs_cluster.sh small_reads-pipeline_gcs.sh

# Run on exome (<1hr)
nohup ./run_gcs_cluster.sh exome_reads-pipeline_gcs.sh &

# Run on genome (a few hrs)
NUM_WORKERS=20 nohup ./run_gcs_cluster.sh copy_genome_to_hdfs_on_gcs.sh genome_md-bqsr-hc_hdfs.sh &

# Check results
cat results/*
```

## Obtaining and preparing the data

Most of the data can be obtained from the [GATK resource bundle](https://software.broadinstitute.org/gatk/download/bundle).
There are three main datasets of increasing size: _small_, _exome_, and _genome_ (WGS). The _small_ data is useful for sanity checking command lines before running them on the larger _exome_ and whole _genome_ datasets.

There is also some data in a GCS bucket for this evaluation: _gs://hellbender/q4_spark_eval/_.
The datasets are stored in GCS buckets, so if you run using GCS input and output then there is no initial data preparation.
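
You can browse these datasets directly with `gsutil`; the bucket paths below are the ones used by the copy scripts in this directory:

```bash
# List the evaluation data in GCS
gsutil ls gs://broad-spark-eval-test-data/small/
gsutil ls gs://broad-spark-eval-test-data/exome/
gsutil ls gs://broad-spark-eval-test-data/genome/
```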

To copy the exome data into the cluster, run:

```bash
./prep_data_exome_gcs.sh
./copy_exome_to_hdfs.sh
```

For the whole genome data, run:

```bash
./prep_data_genome_gcs.sh
./copy_genome_to_hdfs.sh
```

By default these scripts copy the data into a directory under the current user's home directory. To copy to a different directory, add an argument like this:

```bash
./prep_data_genome_gcs.sh /data/shared/spark_eval
./copy_genome_to_hdfs.sh /data/shared/spark_eval
```

## Running test cases

If you want to run tests from the [test definition document](https://docs.google.com/document/d/1OEfV2XNXdbGQQdW-gRaQYY2QRgGsWHUNKJUSAj3qFlE/edit), then run a command like the following:
Most of the data was obtained from the [GATK resource bundle](https://software.broadinstitute.org/gatk/download/bundle).

```bash
nohup ./test_case_2.sh &
```

The output is saved to a CSV file (one per test case type), which can be analysed using _spark_eval.R_ to create plots.
There is also some data in a GCS bucket for this evaluation: _gs://hellbender/q4_spark_eval/_.

## Running pipelines

The following shows how to run pipelines - from aligned reads to variants.
The following shows how to run pipelines - from aligned reads to variants. The scripts follow a naming convention to make it easier to understand what they do:

### Running the exome pipeline on GCS (with data in HDFS)
```
<dataset>_<GATK tools>_<source/sink>.sh
```

So `small_reads-pipeline_gcs.sh` will run `ReadsPipelineSpark` on the `small` dataset in GCS (writing output to GCS).
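
A couple more names, decoded the same way (both scripts are used in the examples below):

```
exome_md-bqsr-hc_hdfs.sh    # Mark Duplicates, BQSR and Haplotype Caller on the exome dataset in HDFS
genome_md-bqsr-hc_hdfs.sh   # the same pipeline on the whole genome dataset in HDFS
```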

For exome data, try `n1-standard-16` GCS worker instances, which have 60GB of memory and 16 vCPUs. 2000GB of disk space per worker should be sufficient. Use 1 master + 5 workers. The master has lower resource requirements, so an `n1-standard-4` with a 500GB disk is enough.
To run on Dataproc, make sure you set `API_KEY` and `GCS_CLUSTER` environment variables:

```bash
export API_KEY=...
export GCS_CLUSTER=...
```

nohup ./exome_pipeline_gcs_hdfs.sh &
### Running the exome pipeline on Dataproc (with data in HDFS)

For exome data, try `n1-standard-16` Dataproc worker instances, which have 60GB of memory and 16 vCPUs. 2000GB of disk space per worker should be sufficient. Use 1 master + 5 workers. The master has lower resource requirements, so an `n1-standard-4` with a 500GB disk is enough.
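
If you prefer to create the cluster yourself rather than use `run_gcs_cluster.sh` (described below), a cluster of roughly this shape can be created with `gcloud`; the command here is only a sketch and the flags shown are illustrative:

```bash
gcloud dataproc clusters create "$GCS_CLUSTER" \
    --master-machine-type n1-standard-4 --master-boot-disk-size 500 \
    --worker-machine-type n1-standard-16 --worker-boot-disk-size 2000 \
    --num-workers 5
```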

```bash
nohup ./exome_md-bqsr-hc_hdfs.sh &
```

This will take less than an hour.

### Running the whole genome pipeline on GCS (with data in HDFS)
### Running the whole genome pipeline on Dataproc (with data in HDFS)

For whole genome data, use the same instance types but try 10 workers.

```bash
export API_KEY=...
export GCS_CLUSTER=...

nohup ./genome_md_gcs_hdfs.sh &
nohup ./genome_bqsr_gcs_hdfs.sh &
nohup ./genome_hc_gcs_hdfs.sh &
nohup ./genome_md-bqsr-hc_hdfs.sh &
```

This will take a few hours.
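
For long runs it can be useful to watch progress in the YARN ResourceManager UI on the Dataproc master (port 8088 by default). One way to reach it is an SSH tunnel; the command below is a sketch and may need a `--zone` flag depending on your gcloud configuration:

```bash
# Forward the YARN ResourceManager UI from the master node to localhost:8088
gcloud compute ssh "${GCS_CLUSTER}-m" -- -N -L 8088:localhost:8088
```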

### Running end-to-end

The following starts a GCS cluster, runs the given pipeline, then deletes the cluster.

```bash
nohup ./run_gcs_cluster.sh small_reads-pipeline_gcs.sh &
```

To copy the dataset to HDFS, use a copy script first:

```bash
nohup ./run_gcs_cluster.sh copy_small_to_hdfs_on_gcs.sh small_reads-pipeline_hdfs.sh &
```

### More examples

```bash
# Exome ReadsPipelineSpark on HDFS
nohup ./run_gcs_cluster.sh copy_exome_to_hdfs_on_gcs.sh exome_reads-pipeline_hdfs.sh &

# Genome Mark Duplicates, BQSR, Haplotype Caller on HDFS using 20 workers
NUM_WORKERS=20 nohup ./run_gcs_cluster.sh copy_genome_to_hdfs_on_gcs.sh genome_md-bqsr-hc_hdfs.sh &
```

## Running test cases

If you want to run tests from the [test definition document](https://docs.google.com/document/d/1OEfV2XNXdbGQQdW-gRaQYY2QRgGsWHUNKJUSAj3qFlE/edit), then run a command like the following:

```bash
nohup ./test_case_2.sh &
```

The output is saved to a CSV file (one per test case type), which can be analysed using _spark_eval.R_ to create plots.
38 changes: 38 additions & 0 deletions scripts/spark_eval/copy_exome_to_hdfs.sh
@@ -0,0 +1,38 @@
#!/usr/bin/env bash

# Download all required data for exomes and store in HDFS. Use this for non-GCS clusters.
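# Usage: ./copy_exome_to_hdfs.sh [target_dir]   (target_dir defaults to exome_spark_eval under the current user's HDFS home directory)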

TARGET_DIR=${1:-exome_spark_eval}

hadoop fs -stat $TARGET_DIR > /dev/null 2>&1
if [ $? -eq 0 ]; then
echo "$TARGET_DIR already exists. Delete it and try again."
exit 1
fi

set -e
set -x

# Create data directory in HDFS
hadoop fs -mkdir -p $TARGET_DIR

# Download exome BAM (ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/working/20101201_cg_NA12878/)
gsutil cp gs://broad-spark-eval-test-data/data/NA12878.ga2.exome.maq.raw.bam - | hadoop fs -put - $TARGET_DIR/NA12878.ga2.exome.maq.raw.bam

# Download reference (hg18) (ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg18/)
gsutil cp gs://broad-spark-eval-test-data/data/Homo_sapiens_assembly18.2bit - | hadoop fs -put - $TARGET_DIR/Homo_sapiens_assembly18.2bit
gsutil cp gs://broad-spark-eval-test-data/data/Homo_sapiens_assembly18.dict - | hadoop fs -put - $TARGET_DIR/Homo_sapiens_assembly18.dict
gsutil cp gs://broad-spark-eval-test-data/data/Homo_sapiens_assembly18.fasta.fai - | hadoop fs -put - $TARGET_DIR/Homo_sapiens_assembly18.fasta.fai
gsutil cp gs://broad-spark-eval-test-data/data/Homo_sapiens_assembly18.fasta - | hadoop fs -put - $TARGET_DIR/Homo_sapiens_assembly18.fasta

# (Code for generating 2bit)
#hadoop fs -get $TARGET_DIR/Homo_sapiens_assembly18.fasta
#curl -O http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/faToTwoBit
#chmod +x faToTwoBit
#./faToTwoBit Homo_sapiens_assembly18.fasta Homo_sapiens_assembly18.2bit

# Download known sites VCF (hg18) (ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg18/)
gsutil cp gs://broad-spark-eval-test-data/data/dbsnp_138.hg18.vcf - | hadoop fs -put - $TARGET_DIR/dbsnp_138.hg18.vcf

# List data
hadoop fs -ls -h $TARGET_DIR
11 changes: 11 additions & 0 deletions scripts/spark_eval/copy_exome_to_hdfs_on_gcs.sh
@@ -0,0 +1,11 @@
#!/usr/bin/env bash

# Copy exome data to HDFS on a GCS cluster.
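# Requires the API_KEY and GCS_CLUSTER environment variables described in the README, e.g.
#   export API_KEY=... ; export GCS_CLUSTER=my-cluster ; ./copy_exome_to_hdfs_on_gcs.sh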

${GATK_HOME:-../..}/gatk-launch ParallelCopyGCSDirectoryIntoHDFSSpark \
--inputGCSPath gs://broad-spark-eval-test-data/exome/ \
--outputHDFSDirectory hdfs://${GCS_CLUSTER}-m:8020/user/$USER/exome_spark_eval \
-apiKey $API_KEY \
-- \
--sparkRunner GCS \
--cluster $GCS_CLUSTER
44 changes: 44 additions & 0 deletions scripts/spark_eval/copy_genome_to_hdfs.sh
@@ -0,0 +1,44 @@
#!/usr/bin/env bash

# Download all required data for genomes and store in HDFS. Use this for non-GCS clusters.

TARGET_DIR=${1:-q4_spark_eval}

hadoop fs -stat $TARGET_DIR > /dev/null 2>&1
if [ $? -eq 0 ]; then
echo "$TARGET_DIR already exists. Delete it and try again."
exit 1
fi

set -e
set -x

# Create data directory in HDFS
hadoop fs -mkdir -p $TARGET_DIR

# Download WGS BAM
#gsutil cp gs://hellbender/q4_spark_eval/WGS-G94982-NA12878.bam - | hadoop fs -put - $TARGET_DIR/WGS-G94982-NA12878.bam
#gsutil cp gs://hellbender/q4_spark_eval/WGS-G94982-NA12878.bai - | hadoop fs -put - $TARGET_DIR/WGS-G94982-NA12878.bai
# BAM with NC_007605 reads removed since this contig is not in the reference
gsutil cp gs://broad-spark-eval-test-data/genome/WGS-G94982-NA12878-no-NC_007605.bam - | hadoop fs -put - $TARGET_DIR/WGS-G94982-NA12878-no-NC_007605.bam

# Download reference (b37) (ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/b37/)
gsutil cp gs://broad-spark-eval-test-data/genome/human_g1k_v37.2bit - | hadoop fs -put - $TARGET_DIR/human_g1k_v37.2bit
gsutil cp gs://broad-spark-eval-test-data/genome/human_g1k_v37.dict - | hadoop fs -put - $TARGET_DIR/human_g1k_v37.dict
gsutil cp gs://broad-spark-eval-test-data/genome/human_g1k_v37.fasta - | hadoop fs -put - $TARGET_DIR/human_g1k_v37.fasta
gsutil cp gs://broad-spark-eval-test-data/genome/human_g1k_v37.fasta.fai - | hadoop fs -put - $TARGET_DIR/human_g1k_v37.fasta.fai

# (Code for generating 2bit)
#hadoop fs -get $TARGET_DIR/human_g1k_v37.fasta
#curl -O http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/faToTwoBit
#chmod +x faToTwoBit
#./faToTwoBit human_g1k_v37.fasta human_g1k_v37.2bit

# Download known sites VCF (b37) (ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/b37/)
gsutil cp gs://broad-spark-eval-test-data/genome/dbsnp_138.b37.vcf - | hadoop fs -put - $TARGET_DIR/dbsnp_138.b37.vcf

# Download exome intervals
gsutil cp gs://broad-spark-eval-test-data/genome/Broad.human.exome.b37.interval_list Broad.human.exome.b37.interval_list

# List data
hadoop fs -ls -h $TARGET_DIR
11 changes: 11 additions & 0 deletions scripts/spark_eval/copy_genome_to_hdfs_on_gcs.sh
@@ -0,0 +1,11 @@
#!/usr/bin/env bash

# Copy genome data to HDFS on a GCS cluster.

${GATK_HOME:-../..}/gatk-launch ParallelCopyGCSDirectoryIntoHDFSSpark \
--inputGCSPath gs://broad-spark-eval-test-data/genome/ \
--outputHDFSDirectory hdfs://${GCS_CLUSTER}-m:8020/user/$USER/q4_spark_eval \
-apiKey $API_KEY \
-- \
--sparkRunner GCS \
--cluster $GCS_CLUSTER
33 changes: 33 additions & 0 deletions scripts/spark_eval/copy_small_to_hdfs.sh
@@ -0,0 +1,33 @@
#!/usr/bin/env bash

# Download all required data for the small dataset and store in HDFS. Use this for non-GCS clusters.

TARGET_DIR=${1:-small_spark_eval}

hadoop fs -stat $TARGET_DIR > /dev/null 2>&1
if [ $? -eq 0 ]; then
echo "$TARGET_DIR already exists. Delete it and try again."
exit 1
fi

set -e
set -x

# Create data directory in HDFS
hadoop fs -mkdir -p $TARGET_DIR

# Download small WGS BAM (NA12878, chromosomes 20 and 21)
gsutil cp gs://broad-spark-eval-test-data/small/CEUTrio.HiSeq.WGS.b37.NA12878.20.21.bam - | hadoop fs -put - $TARGET_DIR/CEUTrio.HiSeq.WGS.b37.NA12878.20.21.bam
gsutil cp gs://broad-spark-eval-test-data/small/CEUTrio.HiSeq.WGS.b37.NA12878.20.21.bam.bai - | hadoop fs -put - $TARGET_DIR/CEUTrio.HiSeq.WGS.b37.NA12878.20.21.bam.bai

# Download reference
gsutil cp gs://broad-spark-eval-test-data/small/human_g1k_v37.20.21.2bit - | hadoop fs -put - $TARGET_DIR/human_g1k_v37.20.21.2bit
gsutil cp gs://broad-spark-eval-test-data/small/human_g1k_v37.20.21.dict - | hadoop fs -put - $TARGET_DIR/human_g1k_v37.20.21.dict
gsutil cp gs://broad-spark-eval-test-data/small/human_g1k_v37.20.21.fasta.fai - | hadoop fs -put - $TARGET_DIR/human_g1k_v37.20.21.fasta.fai
gsutil cp gs://broad-spark-eval-test-data/small/human_g1k_v37.20.21.fasta - | hadoop fs -put - $TARGET_DIR/human_g1k_v37.20.21.fasta

# Download known sites VCF
gsutil cp gs://broad-spark-eval-test-data/small/dbsnp_138.b37.20.21.vcf - | hadoop fs -put - $TARGET_DIR/dbsnp_138.b37.20.21.vcf

# List data
hadoop fs -ls -h $TARGET_DIR
11 changes: 11 additions & 0 deletions scripts/spark_eval/copy_small_to_hdfs_on_gcs.sh
@@ -0,0 +1,11 @@
#!/usr/bin/env bash

# Copy small data to HDFS on a GCS cluster.

${GATK_HOME:-../..}/gatk-launch ParallelCopyGCSDirectoryIntoHDFSSpark \
--inputGCSPath gs://broad-spark-eval-test-data/small/ \
--outputHDFSDirectory hdfs://${GCS_CLUSTER}-m:8020/user/$USER/small_spark_eval \
-apiKey $API_KEY \
-- \
--sparkRunner GCS \
--cluster $GCS_CLUSTER
scripts/spark_eval/exome_md-bqsr-hc_hdfs.sh
@@ -1,9 +1,9 @@
#!/usr/bin/env bash

# Run the pipeline (Mark Duplicates, BQSR, Haplotype Caller) on exome data on a Spark cluster.
# Run the pipeline (Mark Duplicates, BQSR, Haplotype Caller) on exome data in HDFS.

. utils.sh
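# time_gatk arguments (as assumed from utils.sh): "<GATK tool and args>" <num executors> <executor cores> <executor memory> <driver memory>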

time_gatk "MarkDuplicatesSpark -I hdfs:///user/$USER/exome_spark_eval/NA12878.ga2.exome.maq.raw.bam -O hdfs:///user/$USER/exome_spark_eval/out/markdups-sharded --shardedOutput true" 48 1 4g 4g
time_gatk "BQSRPipelineSpark -I hdfs:///user/$USER/exome_spark_eval/out/markdups-sharded -O hdfs:///user/$USER/exome_spark_eval/out/bqsr-sharded --shardedOutput true -R hdfs:///user/$USER/exome_spark_eval/Homo_sapiens_assembly18.2bit --knownSites hdfs:///user/$USER/exome_spark_eval/dbsnp_138.hg18.vcf --joinStrategy OVERLAPS_PARTITIONER" 4 8 32g 4g
time_gatk "HaplotypeCallerSpark -I hdfs:///user/$USER/exome_spark_eval/out/bqsr-sharded -R hdfs:///user/$USER/exome_spark_eval/Homo_sapiens_assembly18.2bit -O hdfs:///user/$USER/exome_spark_eval/out/NA12878.ga2.exome.maq.raw.vcf -pairHMM AVX_LOGLESS_CACHING" 48 1 4g 4g
time_gatk "MarkDuplicatesSpark -I hdfs:///user/$USER/exome_spark_eval/NA12878.ga2.exome.maq.raw.bam -O hdfs:///user/$USER/exome_spark_eval/out/markdups-sharded --shardedOutput true" 96 1 4g 4g
time_gatk "BQSRPipelineSpark -I hdfs:///user/$USER/exome_spark_eval/out/markdups-sharded -O hdfs:///user/$USER/exome_spark_eval/out/bqsr-sharded --shardedOutput true -R hdfs:///user/$USER/exome_spark_eval/Homo_sapiens_assembly18.2bit --knownSites hdfs://${HDFS_HOST_PORT}/user/$USER/exome_spark_eval/dbsnp_138.hg18.vcf" 8 8 32g 4g
time_gatk "HaplotypeCallerSpark -I hdfs:///user/$USER/exome_spark_eval/out/bqsr-sharded -R hdfs:///user/$USER/exome_spark_eval/Homo_sapiens_assembly18.2bit -O hdfs://${HDFS_HOST_PORT}/user/$USER/exome_spark_eval/out/NA12878.ga2.exome.maq.raw.vcf -pairHMM AVX_LOGLESS_CACHING -maxReadsPerAlignmentStart 10" 64 1 6g 4g
10 changes: 0 additions & 10 deletions scripts/spark_eval/exome_pipeline_gcs_hdfs.sh

This file was deleted.

7 changes: 0 additions & 7 deletions scripts/spark_eval/exome_pipeline_single_gcs_hdfs.sh

This file was deleted.

scripts/spark_eval/exome_reads-pipeline_gcs.sh
@@ -1,7 +1,7 @@
#!/usr/bin/env bash

# Run the pipeline (ReadsPipelineSpark) on exome data on a GCS Dataproc cluster. Data is in GCS.
# Run the pipeline (ReadsPipelineSpark) on exome data in GCS.

. utils.sh

time_gatk "ReadsPipelineSpark -I gs://gatk-tom-testdata-exome/NA12878.ga2.exome.maq.raw.bam -O gs://gatk-tom-testdata-exome/NA12878.ga2.exome.maq.raw.vcf -R gs://gatk-tom-testdata-exome/Homo_sapiens_assembly18.2bit --knownSites gs://gatk-tom-testdata-exome/dbsnp_138.hg18.vcf -pairHMM AVX_LOGLESS_CACHING -maxReadsPerAlignmentStart 10" 4 8 32g 4g
time_gatk "ReadsPipelineSpark -I gs://gatk-tom-testdata-exome/NA12878.ga2.exome.maq.raw.bam -O gs://gatk-tom-testdata-exome/NA12878.ga2.exome.maq.raw.vcf -R gs://gatk-tom-testdata-exome/Homo_sapiens_assembly18.2bit --knownSites gs://gatk-tom-testdata-exome/dbsnp_138.hg18.vcf -pairHMM AVX_LOGLESS_CACHING -maxReadsPerAlignmentStart 10" 8 8 32g 4g
7 changes: 7 additions & 0 deletions scripts/spark_eval/exome_reads-pipeline_hdfs.sh
@@ -0,0 +1,7 @@
#!/usr/bin/env bash

# Run the pipeline (ReadsPipelineSpark) on exome data in HDFS.

. utils.sh

time_gatk "ReadsPipelineSpark -I hdfs:///user/$USER/exome_spark_eval/NA12878.ga2.exome.maq.raw.bam -O hdfs://${HDFS_HOST_PORT}/user/$USER/exome_spark_eval/out/NA12878.ga2.exome.maq.raw.vcf -R hdfs:///user/$USER/exome_spark_eval/Homo_sapiens_assembly18.2bit --knownSites hdfs://${HDFS_HOST_PORT}/user/$USER/exome_spark_eval/dbsnp_138.hg18.vcf -pairHMM AVX_LOGLESS_CACHING -maxReadsPerAlignmentStart 10" 8 8 32g 4g
10 changes: 0 additions & 10 deletions scripts/spark_eval/genome_bqsr.sh

This file was deleted.

7 changes: 0 additions & 7 deletions scripts/spark_eval/genome_bqsr_gcs_hdfs.sh

This file was deleted.

@@ -1,5 +1,7 @@
#!/usr/bin/env bash

# Run count reads on genome data in GCS.

. utils.sh

time_gatk "CountReadsSpark -I gs://hellbender/q4_spark_eval/WGS-G94982-NA12878.bam" 4 4 4g 4g
7 changes: 0 additions & 7 deletions scripts/spark_eval/genome_hc_gcs_hdfs.sh

This file was deleted.
