-
Notifications
You must be signed in to change notification settings - Fork 596
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add manifest summary file to GvsExtractCallset #7457
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, thanks. We should definitely check with spec ops for their thoughts; I think this should be generally useful but there may be other things they want to see or not see here.
@@ -188,16 +195,30 @@ task ExtractTask { | |||
~{true='--emit-pls' false='' emit_pls} \ | |||
${FILTERING_ARGS} | |||
|
|||
du -b ~{output_file} | cut -f1 > vcf_bytes.txt | |||
du -b ~{output_file}.tbi | cut -f1 > vcf_index_bytes.txt | |||
INTERVAL_NUMBER=$(echo ~{output_file} | grep -oEi '[0-9]+\.vcf\.gz' | cut -d'.' -f1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is complicated and also breaks layers of abstraction by assuming a specific filename format. If you need this, I would add a new input parameter like interval_index
|
||
# Drop trailing slash if one exists | ||
OUTPUT_GCS_DIR=$(echo ~{output_gcs_dir} | sed 's/\/$//') | ||
|
||
if [ -n "${OUTPUT_GCS_DIR}" ]; then | ||
gsutil cp ~{output_file} ${OUTPUT_GCS_DIR}/ | ||
gsutil cp ~{output_file}.tbi ${OUTPUT_GCS_DIR}/ | ||
OUTPUT_FILE_DEST=${OUTPUT_GCS_DIR}/~{output_file} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not for this PR, existing issue: a lot of these variable replacements should probably be quoted. This prevents potential word splitting bugs, an possibly security vulnerabilities.
|
||
# Parent Task will collect manifest lines and create a joined file | ||
# Currently, the schema is `[interval_number], [output_file_location], [output_file_size_bytes], [output_file_index_location], [output_file_size_bytes]` | ||
echo ${INTERVAL_NUMBER},${OUTPUT_FILE_DEST},${OUTPUT_FILE_BYTES},${OUTPUT_FILE_INDEX_DEST},${OUTPUT_FILE_INDEX_BYTES} >> manifest.txt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not totally sold on this column's necessity. What are your thoughts? Is it needed in order to produce an ordered/merged manifest, i.e. for the sort -n
later?
If we do need it, then I would use more generic terminology here, e.g. shard_index
or output_index
. I think the fact that this corresponds to a set of intervals currently is not relevant to or usable by an end consumer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm OK with dropping it. I mostly added it for the sort functionality but also didn't think it would hurt to add.
task CreateManifest { | ||
|
||
input { | ||
Array[String] manifest_lines |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are these ordered? If so, I suppose this task could also be responsible for numbering the rows.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah. I wanted them to be ordered which is why I run it through sort -n
. It's also why I have the index as the first column.. I can see if I can try something smarter by using a Map in WDL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't find a way to make this work with the built in WDL functions.. I can package the interval_index and manifest line with their Pair
object but I don't see a reasonable way to actually print the values out into the command block. Given that, I'm going to pass the index through as part of the manifest line like I was doing before but I am at least passing IN the index as part of the scatter call instead of parsing it from the filename.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks reasonable to me -- question about the non-output gcs bucket path, and agree with CHs comments code/style wise
|
||
# Parent Task will collect manifest lines and create a joined file | ||
# Currently, the schema is `[interval_number], [output_file_location], [output_file_size_bytes], [output_file_index_location], [output_file_size_bytes]` | ||
echo ~{interval_index},${OUTPUT_FILE_DEST},${OUTPUT_FILE_BYTES},${OUTPUT_FILE_INDEX_DEST},${OUTPUT_FILE_INDEX_BYTES} >> manifest.txt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe in the case where there is not an OUTPUT_GCS_DIR
theOUTPUT_FILE_DEST
is just going to be the filename on local filesystem not in the cromwell execution gcs bucket… is that what you intend?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah. I could try to add the cromwell execution gcs bucket though, didn't know if that was possible. Do you know how I can get the current execution directory from a cromwell task?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kcibul going to merge this for now so I can get this in but I can follow up
* add manifest file * proper sorting * fix regex * better comment * remove debug lines * upload manifest to GCS * pass through interval number * write as map * wdl 1.1 * use old method for sort * add quotes
The manifest file describes the files produced by the GvsExtractCallset task and is uploaded to GCS once all the shards have finished running.
The existence of the manifest.txt file can be used to determine if the extraction is complete or not by just using
gsutil
command and not going through Cromwell/Terra.Here's a snippet of what that looks like.