Simplify spark_eval scripts and improve documentation. (#3580)
tomwhite authored Sep 27, 2017
1 parent 61542a0 commit 847048d
Showing 35 changed files with 314 additions and 308 deletions.
102 changes: 77 additions & 25 deletions scripts/spark_eval/README.md
@@ -1,69 +1,121 @@
# Spark Evaluation

This directory contains scripts for testing Spark command lines. It is based on this [test definition document](https://docs.google.com/document/d/1OEfV2XNXdbGQQdW-gRaQYY2QRgGsWHUNKJUSAj3qFlE/edit), but also has scripts for running full pipelines.
This directory contains scripts for testing GATK pipelines on Spark - either on a dedicated cluster or on Google Cloud Dataproc.

## TL;DR

```bash
export API_KEY=...
export GCS_CLUSTER=...

# Sanity check on small data (a few mins)
./run_gcs_cluster.sh small_reads-pipeline_gcs.sh

# Run on exome (<1hr)
nohup ./run_gcs_cluster.sh exome_reads-pipeline_gcs.sh &

# Run on genome (a few hrs)
NUM_WORKERS=20 nohup ./run_gcs_cluster.sh copy_genome_to_hdfs_on_gcs.sh genome_md-bqsr-hc_hdfs.sh &

# Check results
cat results/*
```

## Obtaining and preparing the data

Most of the data can be obtained from the [GATK resource bundle](https://software.broadinstitute.org/gatk/download/bundle).
There are three main datasets of increasing size: _small_, _exome_, and _genome_ (WGS). The _small_ data is useful for sanity checking command lines before running them on the larger _exome_ and whole _genome_ datasets.

There is also some data in a GCS bucket for this evaluation: _gs://hellbender/q4_spark_eval/_.
The datasets are stored in GCS buckets, so if you run using GCS input and output then there is no initial data preparation.
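
You can browse these datasets directly with `gsutil`; the bucket paths below are the ones used by the copy scripts in this directory:

```bash
# List the evaluation data in GCS
gsutil ls gs://broad-spark-eval-test-data/small/
gsutil ls gs://broad-spark-eval-test-data/exome/
gsutil ls gs://broad-spark-eval-test-data/genome/
```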

To copy the exome data into the cluster, run:

```bash
./prep_data_exome_gcs.sh
./copy_exome_to_hdfs.sh
```

For the whole genome data, run:

```bash
./prep_data_genome_gcs.sh
./copy_genome_to_hdfs.sh
```

By default these scripts copy the data into a directory under the current user's home directory. To copy to a different directory, add an argument like this:

```bash
./prep_data_genome_gcs.sh /data/shared/spark_eval
./copy_genome_to_hdfs.sh /data/shared/spark_eval
```

## Running test cases

If you want to run tests from the [test definition document](https://docs.google.com/document/d/1OEfV2XNXdbGQQdW-gRaQYY2QRgGsWHUNKJUSAj3qFlE/edit), then run a command like the following:
Most of the data was obtained from the [GATK resource bundle](https://software.broadinstitute.org/gatk/download/bundle).

```bash
nohup ./test_case_2.sh &
```

The output is saved to a CSV file (one per test case type), which can be analysed using _spark_eval.R_ to create plots.
There is also some data in a GCS bucket for this evaluation: _gs://hellbender/q4_spark_eval/_.

## Running pipelines

The following shows how to run pipelines - from aligned reads to variants.
The following shows how to run pipelines - from aligned reads to variants. The scripts follow a naming convention to make it easier to understand what they do:

### Running the exome pipeline on GCS (with data in HDFS)
```
<dataset>_<GATK tools>_<source/sink>.sh
```

So `small_reads-pipeline_gcs.sh` will run `ReadsPipelineSpark` on the `small` dataset in GCS (writing output to GCS).
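
A couple more names, decoded the same way (both scripts are used in the examples below):

```
exome_md-bqsr-hc_hdfs.sh    # Mark Duplicates, BQSR and Haplotype Caller on the exome dataset in HDFS
genome_md-bqsr-hc_hdfs.sh   # the same pipeline on the whole genome dataset in HDFS
```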

For exome data, try `n1-standard-16` GCS worker instances, which have 60GB of memory and 16 vCPUs. 2000GB of disk space per worker should be sufficient. Use 1 master + 5 workers. The master has lower resource requirements, so an `n1-standard-4` with a 500GB disk is enough.
To run on Dataproc, make sure you set `API_KEY` and `GCS_CLUSTER` environment variables:

```bash
export API_KEY=...
export GCS_CLUSTER=...
```

nohup ./exome_pipeline_gcs_hdfs.sh &
### Running the exome pipeline on Dataproc (with data in HDFS)

For exome data, try `n1-standard-16` Dataproc worker instances, which have 60GB of memory and 16 vCPUs. 2000GB of disk space per worker should be sufficient. Use 1 master + 5 workers. The master has lower resource requirements, so an `n1-standard-4` with a 500GB disk is enough.
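
If you prefer to create the cluster yourself rather than use `run_gcs_cluster.sh` (described below), a cluster of roughly this shape can be created with `gcloud`; the command here is only a sketch and the flags shown are illustrative:

```bash
gcloud dataproc clusters create "$GCS_CLUSTER" \
    --master-machine-type n1-standard-4 --master-boot-disk-size 500 \
    --worker-machine-type n1-standard-16 --worker-boot-disk-size 2000 \
    --num-workers 5
```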

```bash
nohup ./exome_md-bqsr-hc_hdfs.sh &
```

This will take less than an hour.

### Running the whole genome pipeline on GCS (with data in HDFS)
### Running the whole genome pipeline on Dataproc (with data in HDFS)

For whole genome data, use the same instance types but try 10 workers.

```bash
export API_KEY=...
export GCS_CLUSTER=...

nohup ./genome_md_gcs_hdfs.sh &
nohup ./genome_bqsr_gcs_hdfs.sh &
nohup ./genome_hc_gcs_hdfs.sh &
nohup ./genome_md-bqsr-hc_hdfs.sh &
```

This will take a few hours.
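
For long runs it can be useful to watch progress in the YARN ResourceManager UI on the Dataproc master (port 8088 by default). One way to reach it is an SSH tunnel; the command below is a sketch and may need a `--zone` flag depending on your gcloud configuration:

```bash
# Forward the YARN ResourceManager UI from the master node to localhost:8088
gcloud compute ssh "${GCS_CLUSTER}-m" -- -N -L 8088:localhost:8088
```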

### Running end-to-end

The following starts a GCS cluster, runs the given pipeline, then deletes the cluster.

```bash
nohup ./run_gcs_cluster.sh small_reads-pipeline_gcs.sh &
```

To copy the dataset to HDFS, use a copy script first:

```bash
nohup ./run_gcs_cluster.sh copy_small_to_hdfs_on_gcs.sh small_reads-pipeline_hdfs.sh &
```

### More examples

```bash
# Exome ReadsPipelineSpark on HDFS
nohup ./run_gcs_cluster.sh copy_exome_to_hdfs_on_gcs.sh exome_reads-pipeline_hdfs.sh &

# Genome Mark Duplicates, BQSR, Haplotype Caller on HDFS using 20 workers
NUM_WORKERS=20 nohup ./run_gcs_cluster.sh copy_genome_to_hdfs_on_gcs.sh genome_md-bqsr-hc_hdfs.sh &
```

## Running test cases

If you want to run tests from the [test definition document](https://docs.google.com/document/d/1OEfV2XNXdbGQQdW-gRaQYY2QRgGsWHUNKJUSAj3qFlE/edit), then run a command like the following:

```bash
nohup ./test_case_2.sh &
```

The output is saved to a CSV file (one per test case type), which can be analysed using _spark_eval.R_ to create plots.
38 changes: 38 additions & 0 deletions scripts/spark_eval/copy_exome_to_hdfs.sh
@@ -0,0 +1,38 @@
#!/usr/bin/env bash

# Download all required data for exomes and store in HDFS. Use this for non-GCS clusters.
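# Usage: ./copy_exome_to_hdfs.sh [target_dir]   (target_dir defaults to exome_spark_eval under the current user's HDFS home directory)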

TARGET_DIR=${1:-exome_spark_eval}

hadoop fs -stat $TARGET_DIR > /dev/null 2>&1
if [ $? -eq 0 ]; then
echo "$TARGET_DIR already exists. Delete it and try again."
exit 1
fi

set -e
set -x

# Create data directory in HDFS
hadoop fs -mkdir -p $TARGET_DIR

# Download exome BAM (ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/working/20101201_cg_NA12878/)
gsutil cp gs://broad-spark-eval-test-data/data/NA12878.ga2.exome.maq.raw.bam - | hadoop fs -put - $TARGET_DIR/NA12878.ga2.exome.maq.raw.bam

# Download reference (hg18) (ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg18/)
gsutil cp gs://broad-spark-eval-test-data/data/Homo_sapiens_assembly18.2bit - | hadoop fs -put - $TARGET_DIR/Homo_sapiens_assembly18.2bit
gsutil cp gs://broad-spark-eval-test-data/data/Homo_sapiens_assembly18.dict - | hadoop fs -put - $TARGET_DIR/Homo_sapiens_assembly18.dict
gsutil cp gs://broad-spark-eval-test-data/data/Homo_sapiens_assembly18.fasta.fai - | hadoop fs -put - $TARGET_DIR/Homo_sapiens_assembly18.fasta.fai
gsutil cp gs://broad-spark-eval-test-data/data/Homo_sapiens_assembly18.fasta - | hadoop fs -put - $TARGET_DIR/Homo_sapiens_assembly18.fasta

# (Code for generating 2bit)
#hadoop fs -get $TARGET_DIR/Homo_sapiens_assembly18.fasta
#curl -O http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/faToTwoBit
#chmod +x faToTwoBit
#./faToTwoBit Homo_sapiens_assembly18.fasta Homo_sapiens_assembly18.2bit

# Download known sites VCF (hg18) (ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg18/)
gsutil cp gs://broad-spark-eval-test-data/data/dbsnp_138.hg18.vcf - | hadoop fs -put - $TARGET_DIR/dbsnp_138.hg18.vcf

# List data
hadoop fs -ls -h $TARGET_DIR
11 changes: 11 additions & 0 deletions scripts/spark_eval/copy_exome_to_hdfs_on_gcs.sh
@@ -0,0 +1,11 @@
#!/usr/bin/env bash

# Copy exome data to HDFS on a GCS cluster.
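# Requires the API_KEY and GCS_CLUSTER environment variables described in the README, e.g.
#   export API_KEY=... ; export GCS_CLUSTER=my-cluster ; ./copy_exome_to_hdfs_on_gcs.sh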

${GATK_HOME:-../..}/gatk-launch ParallelCopyGCSDirectoryIntoHDFSSpark \
--inputGCSPath gs://broad-spark-eval-test-data/exome/ \
--outputHDFSDirectory hdfs://${GCS_CLUSTER}-m:8020/user/$USER/exome_spark_eval \
-apiKey $API_KEY \
-- \
--sparkRunner GCS \
--cluster $GCS_CLUSTER
44 changes: 44 additions & 0 deletions scripts/spark_eval/copy_genome_to_hdfs.sh
@@ -0,0 +1,44 @@
#!/usr/bin/env bash

# Download all required data for genomes and store in HDFS. Use this for non-GCS clusters.

TARGET_DIR=${1:-q4_spark_eval}

hadoop fs -stat $TARGET_DIR > /dev/null 2>&1
if [ $? -eq 0 ]; then
echo "$TARGET_DIR already exists. Delete it and try again."
exit 1
fi

set -e
set -x

# Create data directory in HDFS
hadoop fs -mkdir -p $TARGET_DIR

# Download WGS BAM
#gsutil cp gs://hellbender/q4_spark_eval/WGS-G94982-NA12878.bam - | hadoop fs -put - $TARGET_DIR/WGS-G94982-NA12878.bam
#gsutil cp gs://hellbender/q4_spark_eval/WGS-G94982-NA12878.bai - | hadoop fs -put - $TARGET_DIR/WGS-G94982-NA12878.bai
# BAM with NC_007605 reads removed since this contig is not in the reference
gsutil cp gs://broad-spark-eval-test-data/genome/WGS-G94982-NA12878-no-NC_007605.bam - | hadoop fs -put - $TARGET_DIR/WGS-G94982-NA12878-no-NC_007605.bam

# Download reference (b37) (ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/b37/)
gsutil cp gs://broad-spark-eval-test-data/genome/human_g1k_v37.2bit - | hadoop fs -put - $TARGET_DIR/human_g1k_v37.2bit
gsutil cp gs://broad-spark-eval-test-data/genome/human_g1k_v37.dict - | hadoop fs -put - $TARGET_DIR/human_g1k_v37.dict
gsutil cp gs://broad-spark-eval-test-data/genome/human_g1k_v37.fasta - | hadoop fs -put - $TARGET_DIR/human_g1k_v37.fasta
gsutil cp gs://broad-spark-eval-test-data/genome/human_g1k_v37.fasta.fai - | hadoop fs -put - $TARGET_DIR/human_g1k_v37.fasta.fai

# (Code for generating 2bit)
#hadoop fs -get $TARGET_DIR/human_g1k_v37.fasta
#curl -O http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/faToTwoBit
#chmod +x faToTwoBit
#./faToTwoBit human_g1k_v37.fasta human_g1k_v37.2bit

# Download known sites VCF (b37) (ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/b37/)
gsutil cp gs://broad-spark-eval-test-data/genome/dbsnp_138.b37.vcf - | hadoop fs -put - $TARGET_DIR/dbsnp_138.b37.vcf

# Download exome intervals
gsutil cp gs://broad-spark-eval-test-data/genome/Broad.human.exome.b37.interval_list Broad.human.exome.b37.interval_list

# List data
hadoop fs -ls -h $TARGET_DIR
11 changes: 11 additions & 0 deletions scripts/spark_eval/copy_genome_to_hdfs_on_gcs.sh
@@ -0,0 +1,11 @@
#!/usr/bin/env bash

# Copy genome data to HDFS on a GCS cluster.

${GATK_HOME:-../..}/gatk-launch ParallelCopyGCSDirectoryIntoHDFSSpark \
--inputGCSPath gs://broad-spark-eval-test-data/genome/ \
--outputHDFSDirectory hdfs://${GCS_CLUSTER}-m:8020/user/$USER/q4_spark_eval \
-apiKey $API_KEY \
-- \
--sparkRunner GCS \
--cluster $GCS_CLUSTER
33 changes: 33 additions & 0 deletions scripts/spark_eval/copy_small_to_hdfs.sh
@@ -0,0 +1,33 @@
#!/usr/bin/env bash

# Download all required data for the small dataset and store in HDFS. Use this for non-GCS clusters.

TARGET_DIR=${1:-small_spark_eval}

hadoop fs -stat $TARGET_DIR > /dev/null 2>&1
if [ $? -eq 0 ]; then
echo "$TARGET_DIR already exists. Delete it and try again."
exit 1
fi

set -e
set -x

# Create data directory in HDFS
hadoop fs -mkdir -p $TARGET_DIR

# Download small WGS BAM (NA12878, chromosomes 20 and 21)
gsutil cp gs://broad-spark-eval-test-data/small/CEUTrio.HiSeq.WGS.b37.NA12878.20.21.bam - | hadoop fs -put - $TARGET_DIR/CEUTrio.HiSeq.WGS.b37.NA12878.20.21.bam
gsutil cp gs://broad-spark-eval-test-data/small/CEUTrio.HiSeq.WGS.b37.NA12878.20.21.bam.bai - | hadoop fs -put - $TARGET_DIR/CEUTrio.HiSeq.WGS.b37.NA12878.20.21.bam.bai

# Download reference
gsutil cp gs://broad-spark-eval-test-data/small/human_g1k_v37.20.21.2bit - | hadoop fs -put - $TARGET_DIR/human_g1k_v37.20.21.2bit
gsutil cp gs://broad-spark-eval-test-data/small/human_g1k_v37.20.21.dict - | hadoop fs -put - $TARGET_DIR/human_g1k_v37.20.21.dict
gsutil cp gs://broad-spark-eval-test-data/small/human_g1k_v37.20.21.fasta.fai - | hadoop fs -put - $TARGET_DIR/human_g1k_v37.20.21.fasta.fai
gsutil cp gs://broad-spark-eval-test-data/small/human_g1k_v37.20.21.fasta - | hadoop fs -put - $TARGET_DIR/human_g1k_v37.20.21.fasta

# Download known sites VCF
gsutil cp gs://broad-spark-eval-test-data/small/dbsnp_138.b37.20.21.vcf - | hadoop fs -put - $TARGET_DIR/dbsnp_138.b37.20.21.vcf

# List data
hadoop fs -ls -h $TARGET_DIR
11 changes: 11 additions & 0 deletions scripts/spark_eval/copy_small_to_hdfs_on_gcs.sh
@@ -0,0 +1,11 @@
#!/usr/bin/env bash

# Copy small data to HDFS on a GCS cluster.

${GATK_HOME:-../..}/gatk-launch ParallelCopyGCSDirectoryIntoHDFSSpark \
--inputGCSPath gs://broad-spark-eval-test-data/small/ \
--outputHDFSDirectory hdfs://${GCS_CLUSTER}-m:8020/user/$USER/small_spark_eval \
-apiKey $API_KEY \
-- \
--sparkRunner GCS \
--cluster $GCS_CLUSTER
scripts/spark_eval/exome_md-bqsr-hc_hdfs.sh
@@ -1,9 +1,9 @@
#!/usr/bin/env bash

# Run the pipeline (Mark Duplicates, BQSR, Haplotype Caller) on exome data on a Spark cluster.
# Run the pipeline (Mark Duplicates, BQSR, Haplotype Caller) on exome data in HDFS.

. utils.sh
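# time_gatk arguments (as assumed from utils.sh): "<GATK tool and args>" <num executors> <executor cores> <executor memory> <driver memory>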

time_gatk "MarkDuplicatesSpark -I hdfs:///user/$USER/exome_spark_eval/NA12878.ga2.exome.maq.raw.bam -O hdfs:///user/$USER/exome_spark_eval/out/markdups-sharded --shardedOutput true" 48 1 4g 4g
time_gatk "BQSRPipelineSpark -I hdfs:///user/$USER/exome_spark_eval/out/markdups-sharded -O hdfs:///user/$USER/exome_spark_eval/out/bqsr-sharded --shardedOutput true -R hdfs:///user/$USER/exome_spark_eval/Homo_sapiens_assembly18.2bit --knownSites hdfs:///user/$USER/exome_spark_eval/dbsnp_138.hg18.vcf --joinStrategy OVERLAPS_PARTITIONER" 4 8 32g 4g
time_gatk "HaplotypeCallerSpark -I hdfs:///user/$USER/exome_spark_eval/out/bqsr-sharded -R hdfs:///user/$USER/exome_spark_eval/Homo_sapiens_assembly18.2bit -O hdfs:///user/$USER/exome_spark_eval/out/NA12878.ga2.exome.maq.raw.vcf -pairHMM AVX_LOGLESS_CACHING" 48 1 4g 4g
time_gatk "MarkDuplicatesSpark -I hdfs:///user/$USER/exome_spark_eval/NA12878.ga2.exome.maq.raw.bam -O hdfs:///user/$USER/exome_spark_eval/out/markdups-sharded --shardedOutput true" 96 1 4g 4g
time_gatk "BQSRPipelineSpark -I hdfs:///user/$USER/exome_spark_eval/out/markdups-sharded -O hdfs:///user/$USER/exome_spark_eval/out/bqsr-sharded --shardedOutput true -R hdfs:///user/$USER/exome_spark_eval/Homo_sapiens_assembly18.2bit --knownSites hdfs://${HDFS_HOST_PORT}/user/$USER/exome_spark_eval/dbsnp_138.hg18.vcf" 8 8 32g 4g
time_gatk "HaplotypeCallerSpark -I hdfs:///user/$USER/exome_spark_eval/out/bqsr-sharded -R hdfs:///user/$USER/exome_spark_eval/Homo_sapiens_assembly18.2bit -O hdfs://${HDFS_HOST_PORT}/user/$USER/exome_spark_eval/out/NA12878.ga2.exome.maq.raw.vcf -pairHMM AVX_LOGLESS_CACHING -maxReadsPerAlignmentStart 10" 64 1 6g 4g
10 changes: 0 additions & 10 deletions scripts/spark_eval/exome_pipeline_gcs_hdfs.sh

This file was deleted.

7 changes: 0 additions & 7 deletions scripts/spark_eval/exome_pipeline_single_gcs_hdfs.sh

This file was deleted.

scripts/spark_eval/exome_reads-pipeline_gcs.sh
@@ -1,7 +1,7 @@
#!/usr/bin/env bash

# Run the pipeline (ReadsPipelineSpark) on exome data on a GCS Dataproc cluster. Data is in GCS.
# Run the pipeline (ReadsPipelineSpark) on exome data in GCS.

. utils.sh

time_gatk "ReadsPipelineSpark -I gs://gatk-tom-testdata-exome/NA12878.ga2.exome.maq.raw.bam -O gs://gatk-tom-testdata-exome/NA12878.ga2.exome.maq.raw.vcf -R gs://gatk-tom-testdata-exome/Homo_sapiens_assembly18.2bit --knownSites gs://gatk-tom-testdata-exome/dbsnp_138.hg18.vcf -pairHMM AVX_LOGLESS_CACHING -maxReadsPerAlignmentStart 10" 4 8 32g 4g
time_gatk "ReadsPipelineSpark -I gs://gatk-tom-testdata-exome/NA12878.ga2.exome.maq.raw.bam -O gs://gatk-tom-testdata-exome/NA12878.ga2.exome.maq.raw.vcf -R gs://gatk-tom-testdata-exome/Homo_sapiens_assembly18.2bit --knownSites gs://gatk-tom-testdata-exome/dbsnp_138.hg18.vcf -pairHMM AVX_LOGLESS_CACHING -maxReadsPerAlignmentStart 10" 8 8 32g 4g
7 changes: 7 additions & 0 deletions scripts/spark_eval/exome_reads-pipeline_hdfs.sh
@@ -0,0 +1,7 @@
#!/usr/bin/env bash

# Run the pipeline (ReadsPipelineSpark) on exome data in HDFS.

. utils.sh

time_gatk "ReadsPipelineSpark -I hdfs:///user/$USER/exome_spark_eval/NA12878.ga2.exome.maq.raw.bam -O hdfs://${HDFS_HOST_PORT}/user/$USER/exome_spark_eval/out/NA12878.ga2.exome.maq.raw.vcf -R hdfs:///user/$USER/exome_spark_eval/Homo_sapiens_assembly18.2bit --knownSites hdfs://${HDFS_HOST_PORT}/user/$USER/exome_spark_eval/dbsnp_138.hg18.vcf -pairHMM AVX_LOGLESS_CACHING -maxReadsPerAlignmentStart 10" 8 8 32g 4g
10 changes: 0 additions & 10 deletions scripts/spark_eval/genome_bqsr.sh

This file was deleted.

7 changes: 0 additions & 7 deletions scripts/spark_eval/genome_bqsr_gcs_hdfs.sh

This file was deleted.

@@ -1,5 +1,7 @@
#!/usr/bin/env bash

# Run count reads on genome data in GCS.

. utils.sh

time_gatk "CountReadsSpark -I gs://hellbender/q4_spark_eval/WGS-G94982-NA12878.bam" 4 4 4g 4g
7 changes: 0 additions & 7 deletions scripts/spark_eval/genome_hc_gcs_hdfs.sh

This file was deleted.
