
Add scripts to manage SV spark jobs and copy result #3370

Merged: 6 commits into master from tb_sv_convenience_scripts on Aug 2, 2017

Conversation

TedBrookings (Contributor):

Add new scripts to the gatk/scripts/sv/ folder, and alter the behavior (but not
the passed parameters) of older scripts to make running SV spark jobs
more convenient.

Added:
  -copy_sv_results.sh: copy files to a time- and git-stamped folder on
   Google Cloud Storage
     -> results folder on cluster
     -> command-line arguments to the SV discovery pipeline
     -> console log file (if present)

  -manage_sv_pipeline.sh: create cluster, run job, copy results, and
   delete cluster. Manages cluster naming, time- and git-stamping,
   and log file production.

Altered:
  -create_cluster.sh: control GCS zone and number of workers via
   environment variables. Defaults to the previous hard-coded values.

  -runWholePipeline.sh: accept command-line arguments for the SV
   discovery pipeline; work with clusters having NUM_WORKERS != 10
codecov-io commented Jul 27, 2017

Codecov Report

Merging #3370 into master will increase coverage by 0.048%.
The diff coverage is n/a.

@@               Coverage Diff               @@
##              master     #3370       +/-   ##
===============================================
+ Coverage     80.495%   80.542%   +0.048%     
- Complexity     17509     17583       +74     
===============================================
  Files           1173      1173               
  Lines          63368     63605      +237     
  Branches        9876      9945       +69     
===============================================
+ Hits           51008     51229      +221     
- Misses          8411      8420        +9     
- Partials        3949      3956        +7
Impacted Files                                        | Coverage Δ             | Complexity Δ
...rk/pathseq/PSPathogenReferenceTaxonProperties.java | 90% <0%> (-10%)        | 13% <0%> (+12%)
...ols/walkers/contamination/ContaminationRecord.java | 87.302% <0%> (-2.698%) | 8% <0%> (+3%)
...oadinstitute/hellbender/utils/gcs/BucketUtils.java | 72.368% <0%> (-1.974%) | 36% <0%> (ø)
...bender/tools/walkers/vqsr/VariantRecalibrator.java | 60.584% <0%> (-0.148%) | 58% <0%> (ø)
...s/spark/pathseq/PSBuildReferenceTaxonomyUtils.java | 90.541% <0%> (+1.579%) | 46% <0%> (+7%) ⬆️
...er/tools/walkers/variantutils/VariantsToTable.java | 95.968% <0%> (+1.885%) | 114% <0%> (+41%) ⬆️
.../walkers/contamination/CalculateContamination.java | 94.872% <0%> (+3.963%) | 26% <0%> (+7%) ⬆️
...s/spark/pathseq/PathSeqBuildReferenceTaxonomy.java | 75% <0%> (+6.818%)     | 9% <0%> (+4%) ⬆️

SHuang-Broad (Contributor) left a comment:

@TedBrookings done for the moment. My concern is about the shifts. I am going to run the script once.

eval "SV_ARGS=\"${SV_ARGS}\""

# Choose NUM_EXECUTORS = 2 * NUM_WORKERS
NUM_WORKERS=$(gcloud compute instances list --filter="name ~ ${CLUSTER_NAME}-[sw].*" | grep RUNNING | wc -l)
SHuang-Broad (Contributor):

I got warnings saying "This argument is deprecated".
How about:

gcloud dataproc clusters list | grep -F "${CLUSTER_NAME}" | awk '{print $2}'

TedBrookings (Contributor Author):

That would eliminate the warning. I think the warning itself is actually a bug (it corresponds to passing the NAME argument, not using a filter). I've avoided listing all instances and piping through grep, because that could download a large number of instance records, whereas a filter can, in principle, be applied before download, saving data. If the warning is disconcerting, I can make the fix you suggest and then revert once google gets their act together on this.

TedBrookings (Contributor Author):

I found a filter option for the gcloud dataproc clusters list command, so I changed and pushed that. It has the downside of not seeing preemptible instances, but preemptible instances actually crash the pipeline right now, so it's not a super-important problem.
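For reference, the filtered form might look like the sketch below; the filter key and --format path are assumptions based on gcloud's generic list-command syntax, not necessarily what the pushed script uses.

    # Hypothetical sketch: look up one cluster by name with a server-side
    # filter, then read its (non-preemptible) worker count.
    NUM_WORKERS=$(gcloud dataproc clusters list \
        --filter="clusterName = ${CLUSTER_NAME}" \
        --format="value(config.workerConfig.numInstances)")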

[4] GCS path to reference fasta (required)
OPTIONAL arguments:
[5] path to initialization script (local or GCS, defaults to
${GATK_DIR}/scripts/sv/default_init.sh if omitted or empty)
SHuang-Broad (Contributor):

I have a concern here: if the 5th and 6th arguments are omitted, is the shift 6 statement below going to cause two arguments that are intended to be sent to java to be "shifted" away?

TedBrookings (Contributor Author):

The user would have to pass empty strings (signaling default values) for arguments 4 and 5 in order to pass arguments 6 and above. I added a description of this to the documentation and pushed the fix; a sketch of the convention follows.
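For illustration, a hypothetical invocation that keeps the optional arguments at their defaults while still passing later arguments through (the paths and the trailing flag here are made up):

    ./manage_sv_pipeline.sh \
        /path/to/GATK \
        my-project \
        gs://my-bucket/sample.bam \
        gs://my-bucket/reference.fasta \
        "" \
        "" \
        --some-java-arg

Because "" is empty, the script's ${N:-default} expansions fall back to their defaults while the trailing arguments keep their intended positions.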

PATH="${GATK_DIR}/scripts/sv:${PATH}"

# configure caching .jar files
export GATK_GCS_STAGING="gs://${PROJECT_NAME}/${GCS_USER}/staging/"
SHuang-Broad (Contributor):

I would prefer a check of whether this environment variable has already been set, because a user may already have it set in their shell profile (I personally do), and would prefer the jars to be cached there.

TedBrookings (Contributor Author):

Agreed. I'll fix that.
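A minimal sketch of that check, reusing the variable names from the excerpt above: keep the user's value when it is already set, otherwise fall back to the default.

    # Respect a pre-existing GATK_GCS_STAGING from the user's shell profile;
    # only apply the default when the variable is unset or empty.
    export GATK_GCS_STAGING=${GATK_GCS_STAGING:-"gs://${PROJECT_NAME}/${GCS_USER}/staging/"}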

export GATK_GCS_STAGING="gs://${PROJECT_NAME}/${GCS_USER}/staging/"

# set cluster name based on user and target bam file
# (NOTE: can override by defining SV_CLUSTER_NAME)
SHuang-Broad (Contributor):

This doesn't seem to be exposed to the user (i.e. it is hidden knowledge).

TedBrookings (Contributor Author):

Sorry, I somehow missed this comment. I just added a line to inform the user what CLUSTER_NAME was set to. I'll push again.
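Something like the following one-liner would surface the derived name (a sketch, not necessarily the exact line that was pushed):

    # Tell the user which cluster name was derived (or overridden via SV_CLUSTER_NAME).
    echo "Using cluster name: ${CLUSTER_NAME}"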

CLUSTER_NAME=${SV_CLUSTER_NAME:-"${GCS_USER}-${SANITIZED_BAM}"}

# update gcloud
gcloud components update
SHuang-Broad (Contributor):

This requires interaction from the user and may be a surprise to someone.
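If non-interactive behavior is acceptable, gcloud's global --quiet flag suppresses the confirmation prompts; a possible form:

    # Update without prompting; --quiet accepts the default answer to any prompt.
    gcloud components update --quiet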

# copy results into gcloud
while true; do
echo "#############################################################" 2>&1 | tee -a ${LOCAL_LOG_FILE}
read -p "Copy results? (yes/no/cancel)" yn
SHuang-Broad (Contributor):

Sorry for being edge-case picky: if a user answered "no" to running the whole pipeline, this would end up copying nothing, correct?

TedBrookings (Contributor Author):

It depends. They may have run previously (perhaps not using this script, or perhaps there was an error). If there's data, the copy script should copy it off; regardless, it should copy any log files or command-line arguments.
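For context, a hypothetical completion of the prompt loop excerpted above (the actual script's branches may differ):

    # Loop until the user gives a recognizable answer.
    while true; do
        read -p "Copy results? (yes/no/cancel) " yn
        case "${yn}" in
            [Yy]*) copy_results=true;  break ;;   # proceed to the copy step
            [Nn]*) copy_results=false; break ;;   # skip copying
            [Cc]*) exit 0 ;;                      # abort entirely
            *) echo "Please answer yes, no, or cancel." ;;
        esac
    done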

GCS_USER=${4:-${USER}}
LOCAL_LOG_FILE=${5:-"/dev/null"}

shift 5
SHuang-Broad (Contributor):

Same concern: if we only provide the 3 required arguments and the rest are java arguments, are arguments 4 and 5 "shifted" away?

TedBrookings (Contributor Author):

Here as well, I just added a description of how to specify later values while leaving earlier ones at their defaults.
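The convention relies on how bash's ${N:-default} expansion treats empty strings; a minimal sketch using the variable names from the excerpt:

    # ":-" substitutes the default when the parameter is unset OR empty,
    # so passing "" for argument 4 or 5 still yields the default value.
    GCS_USER=${4:-${USER}}
    LOCAL_LOG_FILE=${5:-"/dev/null"}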

SHuang-Broad (Contributor):

It turns out a typo prevents running the "manage_sv_pipeline" script: it fails saying GATK_DIR is an unbound variable. Please fix.

TedBrookings (Contributor Author):

Unbound variable bug fixed.

SHuang-Broad (Contributor):

I cannot seem to run the pipeline when invoking it the following way:

./manage_sv_pipeline.sh \
    /Users/shuang/GATK \
    broad-dsde-methods \
    gs://broad-dsde-methods/sv/samples/G94797_CHM_MIX/WGS1/G94794.CHMI_CHMI3_WGS1.cram.bam \
    gs://broad-dsde-methods/sv/reference/GRCh38/Homo_sapiens_assembly38.fasta \
    gs://broad-dsde-methods/shuang/tmp/gatk-jars/default_init.sh

But if I run it the following way (only adding a user name):

./manage_sv_pipeline.sh \
    /Users/shuang/GATK \
    broad-dsde-methods \
    gs://broad-dsde-methods/sv/samples/G94797_CHM_MIX/WGS1/G94794.CHMI_CHMI3_WGS1.cram.bam \
    gs://broad-dsde-methods/sv/reference/GRCh38/Homo_sapiens_assembly38.fasta \
    gs://broad-dsde-methods/shuang/tmp/gatk-jars/default_init.sh \
    shuang

The script runs as expected.

I believe line 47 is the culprit:

SV_ARGS=${*:-${SV_ARGS:-""}} && SV_ARGS=${SV_ARGS:+" ${SV_ARGS}"}

SHuang-Broad (Contributor):

My test run has successfully finished. Awesome, @TedBrookings !

Two suggestions, which you can decide when to address:

  1. This script requires a lot of user interaction (gcloud update, whether to create a cluster, whether to copy results, whether to delete the cluster). I could imagine this being inconvenient for some, so we could have some upfront arguments specifying answers to these questions when the script is launched.
  2. The bucket to which the results are copied is not overridable by the caller of the script. A user might want to copy the results to a specific location.

TedBrookings (Contributor Author):

I addressed the crash (caused by calling shift with too large a number while strict error checking is set in bash).
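One common way to make such a shift safe, sketched here as an assumption rather than the exact change that was pushed: clamp the shift count to the number of remaining arguments.

    # Under "set -e", "shift 5" aborts the script when fewer than 5
    # arguments remain; clamp the count to what is actually there.
    shift $(( $# < 5 ? $# : 5 ))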

I also added the requested options. Due to the larger number of options, I decided that using only positional arguments would be impractical (it would require counting the number of empty "" default arguments), so I added a unix-like option parsing scheme to manage_sv_pipeline.sh:
--quiet or -q causes the script to run without any user prompting
--save [bucket/path] or -s [bucket/path] saves results and initialization scripts within the specified bucket/path
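A minimal sketch of such an option-parsing loop in bash, using the option names above (the loop itself is an illustration, not the script's actual implementation):

    QUIET=false
    SAVE_PATH=""
    while [[ $# -gt 0 ]]; do
        case "$1" in
            -q|--quiet) QUIET=true; shift ;;
            -s|--save)  SAVE_PATH="$2"; shift 2 ;;   # consume the flag and its value
            --) shift; break ;;                      # explicit end of options
            *)  break ;;                             # first positional argument
        esac
    done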

SHuang-Broad (Contributor):

Tested and working, thanks!
Please merge when you are ready.

TedBrookings merged commit bdaef5a into master on Aug 2, 2017
TedBrookings deleted the tb_sv_convenience_scripts branch on August 2, 2017 15:43