-
Notifications
You must be signed in to change notification settings - Fork 596
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add manifest summary file to GvsExtractCallset #7457
Changes from 6 commits
9e9f06b
3df6992
8b288e6
1c0f411
565e85f
a46cac6
bbc2508
4187062
b611e87
8231cb1
6ef3b04
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -97,10 +97,17 @@ workflow GvsExtractCallset { | |
file_sizes_bytes = flatten([ExtractTask.output_vcf_bytes, ExtractTask.output_vcf_index_bytes]) | ||
} | ||
|
||
call CreateManifest { | ||
input: | ||
manifest_lines = ExtractTask.manifest, | ||
output_gcs_dir = output_gcs_dir | ||
} | ||
|
||
output { | ||
Array[File] output_vcfs = ExtractTask.output_vcf | ||
Array[File] output_vcf_indexes = ExtractTask.output_vcf_index | ||
Float total_vcfs_size_mb = SumBytes.total_mb | ||
File manifest = CreateManifest.manifest | ||
} | ||
} | ||
|
||
|
@@ -188,16 +195,30 @@ task ExtractTask { | |
~{true='--emit-pls' false='' emit_pls} \ | ||
${FILTERING_ARGS} | ||
|
||
du -b ~{output_file} | cut -f1 > vcf_bytes.txt | ||
du -b ~{output_file}.tbi | cut -f1 > vcf_index_bytes.txt | ||
INTERVAL_NUMBER=$(echo ~{output_file} | grep -oEi '[0-9]+\.vcf\.gz' | cut -d'.' -f1) | ||
|
||
OUTPUT_FILE_BYTES=$(du -b ~{output_file} | cut -f1) | ||
echo ${OUTPUT_FILE_BYTES} > vcf_bytes.txt | ||
|
||
OUTPUT_FILE_INDEX_BYTES=$(du -b ~{output_file}.tbi | cut -f1) | ||
echo ${OUTPUT_FILE_INDEX_BYTES} > vcf_index_bytes.txt | ||
|
||
# Drop trailing slash if one exists | ||
OUTPUT_GCS_DIR=$(echo ~{output_gcs_dir} | sed 's/\/$//') | ||
|
||
if [ -n "${OUTPUT_GCS_DIR}" ]; then | ||
gsutil cp ~{output_file} ${OUTPUT_GCS_DIR}/ | ||
gsutil cp ~{output_file}.tbi ${OUTPUT_GCS_DIR}/ | ||
OUTPUT_FILE_DEST=${OUTPUT_GCS_DIR}/~{output_file} | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not for this PR, existing issue: a lot of these variable replacements should probably be quoted. This prevents potential word splitting bugs, an possibly security vulnerabilities. |
||
OUTPUT_FILE_INDEX_DEST=${OUTPUT_GCS_DIR}/~{output_file}.tbi | ||
else | ||
OUTPUT_FILE_DEST=~{output_file} | ||
OUTPUT_FILE_INDEX_DEST=~{output_file}.tbi | ||
fi | ||
|
||
# Parent Task will collect manifest lines and create a joined file | ||
# Currently, the schema is `[interval_number], [output_file_location], [output_file_size_bytes], [output_file_index_location], [output_file_size_bytes]` | ||
echo ${INTERVAL_NUMBER},${OUTPUT_FILE_DEST},${OUTPUT_FILE_BYTES},${OUTPUT_FILE_INDEX_DEST},${OUTPUT_FILE_INDEX_BYTES} >> manifest.txt | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not totally sold on this column's necessity. What are your thoughts? Is it needed in order to produce an ordered/merged manifest, i.e. for the If we do need it, then I would use more generic terminology here, e.g. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm OK with dropping it. I mostly added it for the sort functionality but also didn't think it would hurt to add. |
||
>>> | ||
|
||
# ------------------------------------------------ | ||
|
@@ -218,6 +239,7 @@ task ExtractTask { | |
Float output_vcf_bytes = read_float("vcf_bytes.txt") | ||
File output_vcf_index = "~{output_file}.tbi" | ||
Float output_vcf_index_bytes = read_float("vcf_index_bytes.txt") | ||
String manifest = read_string("manifest.txt") | ||
} | ||
} | ||
|
||
|
@@ -354,3 +376,37 @@ task SumBytes { | |
cpu: 1 | ||
} | ||
} | ||
|
||
task CreateManifest { | ||
|
||
input { | ||
Array[String] manifest_lines | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. are these ordered? If so, I suppose this task could also be responsible for numbering the rows. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. yeah. I wanted them to be ordered which is why I run it through There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I couldn't find a way to make this work with the built in WDL functions.. I can package the interval_index and manifest line with their |
||
String? output_gcs_dir | ||
} | ||
|
||
command <<< | ||
set -e | ||
MANIFEST_LINES_TXT=~{write_lines(manifest_lines)} | ||
echo "interval_number, vcf_file_location, vcf_file_bytes, vcf_index_location, vcf_index_bytes" >> manifest.txt | ||
sort -n ${MANIFEST_LINES_TXT} >> manifest.txt | ||
|
||
# Drop trailing slash if one exists | ||
OUTPUT_GCS_DIR=$(echo ~{output_gcs_dir} | sed 's/\/$//') | ||
|
||
if [ -n "${OUTPUT_GCS_DIR}" ]; then | ||
gsutil cp manifest.txt ${OUTPUT_GCS_DIR}/ | ||
fi | ||
>>> | ||
|
||
output { | ||
File manifest = "manifest.txt" | ||
} | ||
|
||
runtime { | ||
docker: "gcr.io/google.com/cloudsdktool/cloud-sdk:305.0.0" | ||
memory: "3 GB" | ||
disks: "local-disk 10 HDD" | ||
preemptible: 3 | ||
cpu: 1 | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is complicated and also breaks layers of abstraction by assuming a specific filename format. If you need this, I would add a new input parameter like
interval_index