
Add scripts to manage SV spark jobs and copy result #3370

Merged: 6 commits into master from tb_sv_convenience_scripts on Aug 2, 2017

Conversation

TedBrookings (Contributor):

Add new scripts to the gatk/scripts/sv/ folder, and alter the behavior (but not
the passed parameters) of older scripts to make running SV spark jobs
more convenient.

Added:
  -copy_sv_results.sh: copy files to a time- and git-stamped folder on
   Google Cloud Storage
     -> results folder on cluster
     -> command-line arguments to the SV discovery pipeline
     -> console log file (if present)

  -manage_sv_pipeline.sh: create cluster, run job, copy results, and
   delete cluster. Manages cluster naming, time- and git-stamping,
   and log file production.

Altered:
  -create_cluster.sh: control GCS zone and number of workers via
   environment variables. Defaults to the previous hard-coded values.

  -runWholePipeline.sh: accept command-line arguments for the SV
   discovery pipeline; work with clusters having NUM_WORKERS != 10
codecov-io commented Jul 27, 2017

Codecov Report

Merging #3370 into master will increase coverage by 0.048%.
The diff coverage is n/a.

@@               Coverage Diff               @@
##              master     #3370       +/-   ##
===============================================
+ Coverage     80.495%   80.542%   +0.048%     
- Complexity     17509     17583       +74     
===============================================
  Files           1173      1173               
  Lines          63368     63605      +237     
  Branches        9876      9945       +69     
===============================================
+ Hits           51008     51229      +221     
- Misses          8411      8420        +9     
- Partials        3949      3956        +7
Impacted Files                                        | Coverage Δ             | Complexity Δ
...rk/pathseq/PSPathogenReferenceTaxonProperties.java | 90% <0%> (-10%)        | 13% <0%> (+12%)
...ols/walkers/contamination/ContaminationRecord.java | 87.302% <0%> (-2.698%) | 8% <0%> (+3%)
...oadinstitute/hellbender/utils/gcs/BucketUtils.java | 72.368% <0%> (-1.974%) | 36% <0%> (ø)
...bender/tools/walkers/vqsr/VariantRecalibrator.java | 60.584% <0%> (-0.148%) | 58% <0%> (ø)
...s/spark/pathseq/PSBuildReferenceTaxonomyUtils.java | 90.541% <0%> (+1.579%) | 46% <0%> (+7%) ⬆️
...er/tools/walkers/variantutils/VariantsToTable.java | 95.968% <0%> (+1.885%) | 114% <0%> (+41%) ⬆️
.../walkers/contamination/CalculateContamination.java | 94.872% <0%> (+3.963%) | 26% <0%> (+7%) ⬆️
...s/spark/pathseq/PathSeqBuildReferenceTaxonomy.java | 75% <0%> (+6.818%)     | 9% <0%> (+4%) ⬆️

SHuang-Broad (Contributor) left a comment:

@TedBrookings done for the moment. My concern is about the shifts. I am going to run the script once.

eval "SV_ARGS=\"${SV_ARGS}\""

# Choose NUM_EXECUTORS = 2 * NUM_WORKERS
NUM_WORKERS=$(gcloud compute instances list --filter="name ~ ${CLUSTER_NAME}-[sw].*" | grep RUNNING | wc -l)
SHuang-Broad (Contributor):

I got warnings saying "This argument is deprecated".
How about:

gcloud dataproc clusters list | grep -F "${CLUSTER_NAME}" | awk '{print $2}'

TedBrookings (Contributor Author):

That would eliminate the warning. I think the warning itself is actually a bug (it corresponds to passing the NAME argument, not using a filter). I've avoided listing all instances and piping through grep, because that could download a large number of instance records, whereas a filter can, in principle, be applied before download, saving data. If the warning is disconcerting, I can make the fix you suggest and then revert once google gets their act together on this.

TedBrookings (Contributor Author):

I found a filter option for the gcloud dataproc clusters list command, so I changed and pushed that. It has the downside of not seeing preemptible instances, but preemptible instances actually crash the pipeline right now, so it's not a super-important problem.
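For reference, the filtered form might look like the sketch below; the filter key and --format path are assumptions based on gcloud's generic list-command syntax, not necessarily what the pushed script uses.

    # Hypothetical sketch: look up one cluster by name with a server-side
    # filter, then read its (non-preemptible) worker count.
    NUM_WORKERS=$(gcloud dataproc clusters list \
        --filter="clusterName = ${CLUSTER_NAME}" \
        --format="value(config.workerConfig.numInstances)")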

[4] GCS path to reference fasta (required)
OPTIONAL arguments:
[5] path to initialization script (local or GCS, defaults to
${GATK_DIR}/scripts/sv/default_init.sh if omitted or empty)
SHuang-Broad (Contributor):

I have a concern here: if the 5th and 6th arguments are omitted, is the shift 6 statement below going to cause two arguments that are intended to be sent to java to be "shifted" away?

TedBrookings (Contributor Author):

The user would have to pass empty strings (signaling default values) for arguments 4 and 5 in order to pass arguments 6 and above. I added a description of this to the documentation and pushed the fix; a sketch of the convention follows.
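For illustration, a hypothetical invocation that keeps the optional arguments at their defaults while still passing later arguments through (the paths and the trailing flag here are made up):

    ./manage_sv_pipeline.sh \
        /path/to/GATK \
        my-project \
        gs://my-bucket/sample.bam \
        gs://my-bucket/reference.fasta \
        "" \
        "" \
        --some-java-arg

Because "" is empty, the script's ${N:-default} expansions fall back to their defaults while the trailing arguments keep their intended positions.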

PATH="${GATK_DIR}/scripts/sv:${PATH}"

# configure caching .jar files
export GATK_GCS_STAGING="gs://${PROJECT_NAME}/${GCS_USER}/staging/"
SHuang-Broad (Contributor):

I would prefer a check of whether this environment variable has already been set, because a user may already have it set in their shell profile (I personally do), and would prefer the jars to be cached there.

TedBrookings (Contributor Author):

Agreed. I'll fix that.
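A minimal sketch of that check, reusing the variable names from the excerpt above: keep the user's value when it is already set, otherwise fall back to the default.

    # Respect a pre-existing GATK_GCS_STAGING from the user's shell profile;
    # only apply the default when the variable is unset or empty.
    export GATK_GCS_STAGING=${GATK_GCS_STAGING:-"gs://${PROJECT_NAME}/${GCS_USER}/staging/"}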

export GATK_GCS_STAGING="gs://${PROJECT_NAME}/${GCS_USER}/staging/"

# set cluster name based on user and target bam file
# (NOTE: can override by defining SV_CLUSTER_NAME)
SHuang-Broad (Contributor):

This doesn't seem to be exposed to the user (i.e. it is hidden knowledge).

TedBrookings (Contributor Author):

Sorry, I somehow missed this comment. I just added a line to inform the user what CLUSTER_NAME was set to. I'll push again.
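Something like the following one-liner would surface the derived name (a sketch, not necessarily the exact line that was pushed):

    # Tell the user which cluster name was derived (or overridden via SV_CLUSTER_NAME).
    echo "Using cluster name: ${CLUSTER_NAME}"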

CLUSTER_NAME=${SV_CLUSTER_NAME:-"${GCS_USER}-${SANITIZED_BAM}"}

# update gcloud
gcloud components update
SHuang-Broad (Contributor):

This requires interaction from the user and may be a surprise to someone.
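If non-interactive behavior is acceptable, gcloud's global --quiet flag suppresses the confirmation prompts; a possible form:

    # Update without prompting; --quiet accepts the default answer to any prompt.
    gcloud components update --quiet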

# copy results into gcloud
while true; do
echo "#############################################################" 2>&1 | tee -a ${LOCAL_LOG_FILE}
read -p "Copy results? (yes/no/cancel)" yn
SHuang-Broad (Contributor):

Sorry for being edge-case picky: if a user answered "no" to running the whole pipeline, this would end up copying nothing, correct?

TedBrookings (Contributor Author):

It depends. They may have run previously (perhaps not using this script, or perhaps there was an error). If there's data, the copy script should copy it off; regardless, it should copy any log files or command-line arguments.
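For context, a hypothetical completion of the prompt loop excerpted above (the actual script's branches may differ):

    # Loop until the user gives a recognizable answer.
    while true; do
        read -p "Copy results? (yes/no/cancel) " yn
        case "${yn}" in
            [Yy]*) copy_results=true;  break ;;   # proceed to the copy step
            [Nn]*) copy_results=false; break ;;   # skip copying
            [Cc]*) exit 0 ;;                      # abort entirely
            *) echo "Please answer yes, no, or cancel." ;;
        esac
    done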

GCS_USER=${4:-${USER}}
LOCAL_LOG_FILE=${5:-"/dev/null"}

shift 5
SHuang-Broad (Contributor):

Same concern: if we only provide the 3 required arguments and the rest are java arguments, are arguments 4 and 5 "shifted" away?

TedBrookings (Contributor Author):

Here as well, I just added a description of how to specify later values while leaving earlier ones at their defaults.
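The convention relies on how bash's ${N:-default} expansion treats empty strings; a minimal sketch using the variable names from the excerpt:

    # ":-" substitutes the default when the parameter is unset OR empty,
    # so passing "" for argument 4 or 5 still yields the default value.
    GCS_USER=${4:-${USER}}
    LOCAL_LOG_FILE=${5:-"/dev/null"}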

SHuang-Broad (Contributor):

It turns out a typo prevents running the "manage_sv_pipeline" script: it fails saying GATK_DIR is an unbound variable. Please fix.

TedBrookings (Contributor Author):

Unbound variable bug fixed.

SHuang-Broad (Contributor):

I cannot seem to run the pipeline when invoking it the following way:

./manage_sv_pipeline.sh \
    /Users/shuang/GATK \
    broad-dsde-methods \
    gs://broad-dsde-methods/sv/samples/G94797_CHM_MIX/WGS1/G94794.CHMI_CHMI3_WGS1.cram.bam \
    gs://broad-dsde-methods/sv/reference/GRCh38/Homo_sapiens_assembly38.fasta \
    gs://broad-dsde-methods/shuang/tmp/gatk-jars/default_init.sh

But if I run it the following way (only adding a user name):

./manage_sv_pipeline.sh \
    /Users/shuang/GATK \
    broad-dsde-methods \
    gs://broad-dsde-methods/sv/samples/G94797_CHM_MIX/WGS1/G94794.CHMI_CHMI3_WGS1.cram.bam \
    gs://broad-dsde-methods/sv/reference/GRCh38/Homo_sapiens_assembly38.fasta \
    gs://broad-dsde-methods/shuang/tmp/gatk-jars/default_init.sh \
    shuang

The script runs as expected.

I believe line 47 is the culprit:

SV_ARGS=${*:-${SV_ARGS:-""}} && SV_ARGS=${SV_ARGS:+" ${SV_ARGS}"}

SHuang-Broad (Contributor):

My test run has successfully finished. Awesome, @TedBrookings !

Two suggestions, which you can decide when to address:

  1. This script requires a lot of user interaction (gcloud update, whether to create a cluster, whether to copy results, whether to delete the cluster). I could imagine this being inconvenient for some, so we could have some upfront arguments specifying answers to these questions when the script is launched.
  2. The bucket to which the results are copied is not overridable by the caller of the script. A user might want to copy the results to a specific location.

TedBrookings (Contributor Author):

I addressed the crash (caused by calling shift with too large a number while strict error checking is set in bash).
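One common way to make such a shift safe, sketched here as an assumption rather than the exact change that was pushed: clamp the shift count to the number of remaining arguments.

    # Under "set -e", "shift 5" aborts the script when fewer than 5
    # arguments remain; clamp the count to what is actually there.
    shift $(( $# < 5 ? $# : 5 ))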

I also added the requested options. Due to the larger number of options, I decided that using only positional arguments would be impractical (it would require counting the number of empty "" default arguments), so I added a unix-like option parsing scheme to manage_sv_pipeline.sh:
--quiet or -q causes the script to run without any user prompting
--save [bucket/path] or -s [bucket/path] saves results and initialization scripts within the specified bucket/path
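A minimal sketch of such an option-parsing loop in bash, using the option names above (the loop itself is an illustration, not the script's actual implementation):

    QUIET=false
    SAVE_PATH=""
    while [[ $# -gt 0 ]]; do
        case "$1" in
            -q|--quiet) QUIET=true; shift ;;
            -s|--save)  SAVE_PATH="$2"; shift 2 ;;   # consume the flag and its value
            --) shift; break ;;                      # explicit end of options
            *)  break ;;                             # first positional argument
        esac
    done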

SHuang-Broad (Contributor):

Tested and working, thanks!
Please merge when you are ready.

TedBrookings merged commit bdaef5a into master on Aug 2, 2017
TedBrookings deleted the tb_sv_convenience_scripts branch on August 2, 2017 15:43