Add scripts to manage SV spark jobs and copy result #3370
Conversation
Add new scripts to the gatk/scripts/sv/ folder, and alter the action (but not the passed parameters) of older scripts to make running SV spark jobs more convenient.

Added:
- copy_sv_results.sh: copy files to a time- and git-stamped folder on Google Cloud Storage:
  - results folder on the cluster
  - command line arguments to the SV discovery pipeline
  - console log file (if present)
- manage_sv_pipeline.sh: create cluster, run job, copy results, and delete cluster. Manages cluster naming, time- and git-stamping, and log file production.

Altered:
- create_cluster.sh: control GCS zone and number of workers via environment variables; defaults to the previous hard-coded values (see the sketch after this list).
- runWholePipeline.sh: accept command-line arguments for the SV discovery pipeline, and work with clusters having NUM_WORKERS != 10.
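As a rough illustration of the environment-variable defaults described for create_cluster.sh, the usual bash pattern is sketched below. The variable names and the zone value are assumptions; the 10-worker default matches the previous hard-coded value mentioned above.

    # Fall back to the previous hard-coded values when the caller has not
    # exported these variables (names and the zone value are assumed here).
    GCS_ZONE=${GCS_ZONE:-"us-central1-a"}
    NUM_WORKERS=${NUM_WORKERS:-10}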
Codecov Report
    @@             Coverage Diff              @@
    ##             master      #3370     +/-  ##
    ===============================================
    + Coverage     80.495%    80.542%   +0.048%
    - Complexity   17509      17583     +74
    ===============================================
      Files        1173       1173
      Lines        63368      63605     +237
      Branches     9876       9945      +69
    ===============================================
    + Hits         51008      51229     +221
    - Misses       8411       8420      +9
    - Partials     3949       3956      +7
@TedBrookings done for the moment. My concern is about the shifts. I am going to run the script once.
scripts/sv/runWholePipeline.sh (Outdated)

    eval "SV_ARGS=\"${SV_ARGS}\""

    # Choose NUM_EXECUTORS = 2 * NUM_WORKERS
    NUM_WORKERS=$(gcloud compute instances list --filter="name ~ ${CLUSTER_NAME}-[sw].*" | grep RUNNING | wc -l)
I got warnings saying "This argument is deprecated". How about:

    gcloud dataproc clusters list | grep -F "${CLUSTER_NAME}" | awk '{print $2}'
That would eliminate the warning. I think the warning itself is actually a bug (it corresponds to passing the NAME argument, not using a filter). I've avoided listing all instances and using grep because that corresponds to potentially downloading a large number of instances, whereas this in principle might filter before download, saving data. If the warning is disconcerting, I can make the fix you suggest and then revert once google gets their act together on this.
I found a filter command for the gcloud dataproc clusters list command, so I changed and pushed that. It has the downside of not seeing preemptible instances, but preemptible instances actually crash the pipeline right now, so it's not a super-important problem.
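For reference, the filter-based variant probably looks roughly like the sketch below; the exact filter expression and output format used in the pushed change are assumptions.

    # Sketch: look up the primary worker count via the --filter/--format flags
    # of `gcloud dataproc clusters list` (exact expressions assumed here).
    NUM_WORKERS=$(gcloud dataproc clusters list \
        --filter="clusterName = ${CLUSTER_NAME}" \
        --format="value(config.workerConfig.numInstances)")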
scripts/sv/manage_sv_pipeline.sh (Outdated)

    [4] GCS path to reference fasta (required)
    OPTIONAL arguments:
    [5] path to initialization script (local or GCS, defaults to
        ${GATK_DIR}/scripts/sv/default_init.sh if omitted or empty)
I have a concern here: if the 5th and 6th arguments are omitted, is the shift 6 statement below going to "shift" away two arguments that were intended to be passed on to java?
The user would have to pass empty strings (signaling default values) for arguments 4 and 5 in order to then pass arguments 6 and above. I added a description of this to the documentation and pushed the fix.
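For illustration, an invocation under that convention might look like the sketch below; the paths, project name, and trailing pipeline argument are hypothetical.

    # Hypothetical invocation: the empty strings keep the defaults for the
    # optional positional arguments (assumed here to be the init script and
    # GCS user), so the trailing arguments reach the SV discovery pipeline.
    ./manage_sv_pipeline.sh \
        /path/to/gatk \
        my-project \
        gs://my-bucket/sample.bam \
        gs://my-bucket/reference.fasta \
        "" \
        "" \
        --extra-pipeline-arg value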
scripts/sv/manage_sv_pipeline.sh (Outdated)

    PATH="${GATK_DIR}/scripts/sv:${PATH}"

    # configure caching .jar files
    export GATK_GCS_STAGING="gs://${PROJECT_NAME}/${GCS_USER}/staging/"
I would prefer a check of whether the environment variable has already been set, because a user may already have this set in their shell profile (I personally do) and would prefer that the jars be cached there.
Agreed. I'll fix that.
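A minimal sketch of that check, assuming the same variable name, would be:

    # Only set the staging location if the user has not already exported
    # GATK_GCS_STAGING (e.g. in their shell profile).
    export GATK_GCS_STAGING=${GATK_GCS_STAGING:-"gs://${PROJECT_NAME}/${GCS_USER}/staging/"}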
scripts/sv/manage_sv_pipeline.sh (Outdated)

    export GATK_GCS_STAGING="gs://${PROJECT_NAME}/${GCS_USER}/staging/"

    # set cluster name based on user and target bam file
    # (NOTE: can override by defining SV_CLUSTER_NAME)
This doesn't seem to be exposed to the user (i.e. it is hidden knowledge).
Sorry, I somehow missed this comment. I just added a line to inform the user what the CLUSTER_NAME was set to. I'll push again.
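The added line presumably looks something like the following; the exact wording is an assumption.

    # Tell the user which cluster name will be used and how to override it.
    echo "Using cluster name ${CLUSTER_NAME} (override by setting SV_CLUSTER_NAME)"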
scripts/sv/manage_sv_pipeline.sh (Outdated)

    CLUSTER_NAME=${SV_CLUSTER_NAME:-"${GCS_USER}-${SANITIZED_BAM}"}

    # update gcloud
    gcloud components update
This requires interaction from the user and may come as a surprise to someone.
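Not something proposed in the thread, but for readers wondering: gcloud's global --quiet flag suppresses interactive prompts by accepting the default answers, e.g.:

    # Non-interactive variant; --quiet answers prompts with their defaults.
    gcloud components update --quiet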
scripts/sv/manage_sv_pipeline.sh (Outdated)

    # copy results into gcloud
    while true; do
        echo "#############################################################" 2>&1 | tee -a ${LOCAL_LOG_FILE}
        read -p "Copy results? (yes/no/cancel)" yn
Sorry for being edge-case picky: if a user answered "no" to running the whole pipeline, this would end up copying nothing, correct?
It depends. They may have run previously (perhaps not using this script, or perhaps there was an error). If there's data, the copy script should copy it off, and regardless, it should copy any log files or command line arguments.
scripts/sv/copy_sv_results.sh (Outdated)

    GCS_USER=${4:-${USER}}
    LOCAL_LOG_FILE=${5:-"/dev/null"}

    shift 5
Same concern: if we only provide the 3 required arguments and the rest are java arguments, are arguments 4 and 5 "shifted" away?
Here as well, I just added a description of how to specify later values and leave early ones as default.
It turns out a typo prevents running the "manage_sv_pipeline" script; it fails saying GATK_DIR is an unbound variable. Please fix.
Unbound variable bug fixed.
I cannot seem to run the pipeline the following way:

    ./manage_sv_pipeline.sh \
        /Users/shuang/GATK \
        broad-dsde-methods \
        gs://broad-dsde-methods/sv/samples/G94797_CHM_MIX/WGS1/G94794.CHMI_CHMI3_WGS1.cram.bam \
        gs://broad-dsde-methods/sv/reference/GRCh38/Homo_sapiens_assembly38.fasta \
        gs://broad-dsde-methods/shuang/tmp/gatk-jars/default_init.sh

But if I run it the following way (only adding a user name), the script runs as expected:

    ./manage_sv_pipeline.sh \
        /Users/shuang/GATK \
        broad-dsde-methods \
        gs://broad-dsde-methods/sv/samples/G94797_CHM_MIX/WGS1/G94794.CHMI_CHMI3_WGS1.cram.bam \
        gs://broad-dsde-methods/sv/reference/GRCh38/Homo_sapiens_assembly38.fasta \
        gs://broad-dsde-methods/shuang/tmp/gatk-jars/default_init.sh \
        shuang

I believe line 47 is the culprit:

    SV_ARGS=${*:-${SV_ARGS:-""}} && SV_ARGS=${SV_ARGS:+" ${SV_ARGS}"}
My test run has successfully finished. Awesome, @TedBrookings! Two suggestions, which you can decide when to address:
I addressed the crash (caused by calling shift with too large a number while strict error checks were enabled via set in bash). I also added the requested options. Given the larger number of options, using only positional arguments would be impractical (it would require counting out the empty "" default arguments), so I added a unix-like option parsing scheme to manage_sv_pipeline.sh.
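For readers following along, the crash mode being described is roughly this (a minimal sketch, not the PR's actual code):

    #!/usr/bin/env bash
    # Under `set -e`, `shift` returns a non-zero status when asked to shift
    # more positional parameters than were passed, which aborts the script.
    set -eu

    echo "got $# argument(s)"
    shift 6        # exits here if fewer than 6 arguments were given
    echo "only reached when at least 6 arguments were passed"

And the unix-like option parsing might look roughly like the getopts loop below; the flag names here are hypothetical, not the script's actual options.

    # Hypothetical flags: -u <gcs user>, -i <init script>.
    while getopts "u:i:" opt; do
        case ${opt} in
            u) GCS_USER=${OPTARG} ;;
            i) INIT_SCRIPT=${OPTARG} ;;
            *) echo "usage: $0 [-u gcs_user] [-i init_script] ..." >&2; exit 1 ;;
        esac
    done
    shift $((OPTIND - 1))   # remaining arguments go to the pipeline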
Tested to be working. Thanks!