Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add manifest summary file to GvsExtractCallset #7457

Merged
merged 11 commits into from
Sep 9, 2021
Merged
Show file tree
Hide file tree
Changes from 6 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
60 changes: 58 additions & 2 deletions scripts/variantstore/wdl/GvsExtractCallset.wdl
Original file line number Diff line number Diff line change
Expand Up @@ -97,10 +97,17 @@ workflow GvsExtractCallset {
file_sizes_bytes = flatten([ExtractTask.output_vcf_bytes, ExtractTask.output_vcf_index_bytes])
}

call CreateManifest {
input:
manifest_lines = ExtractTask.manifest,
output_gcs_dir = output_gcs_dir
}

output {
Array[File] output_vcfs = ExtractTask.output_vcf
Array[File] output_vcf_indexes = ExtractTask.output_vcf_index
Float total_vcfs_size_mb = SumBytes.total_mb
File manifest = CreateManifest.manifest
}
}

Expand Down Expand Up @@ -188,16 +195,30 @@ task ExtractTask {
~{true='--emit-pls' false='' emit_pls} \
${FILTERING_ARGS}

du -b ~{output_file} | cut -f1 > vcf_bytes.txt
du -b ~{output_file}.tbi | cut -f1 > vcf_index_bytes.txt
INTERVAL_NUMBER=$(echo ~{output_file} | grep -oEi '[0-9]+\.vcf\.gz' | cut -d'.' -f1)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is complicated and also breaks layers of abstraction by assuming a specific filename format. If you need this, I would add a new input parameter like interval_index


OUTPUT_FILE_BYTES=$(du -b ~{output_file} | cut -f1)
echo ${OUTPUT_FILE_BYTES} > vcf_bytes.txt

OUTPUT_FILE_INDEX_BYTES=$(du -b ~{output_file}.tbi | cut -f1)
echo ${OUTPUT_FILE_INDEX_BYTES} > vcf_index_bytes.txt

# Drop trailing slash if one exists
OUTPUT_GCS_DIR=$(echo ~{output_gcs_dir} | sed 's/\/$//')

if [ -n "${OUTPUT_GCS_DIR}" ]; then
gsutil cp ~{output_file} ${OUTPUT_GCS_DIR}/
gsutil cp ~{output_file}.tbi ${OUTPUT_GCS_DIR}/
OUTPUT_FILE_DEST=${OUTPUT_GCS_DIR}/~{output_file}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not for this PR, existing issue: a lot of these variable replacements should probably be quoted. This prevents potential word splitting bugs, an possibly security vulnerabilities.

OUTPUT_FILE_INDEX_DEST=${OUTPUT_GCS_DIR}/~{output_file}.tbi
else
OUTPUT_FILE_DEST=~{output_file}
OUTPUT_FILE_INDEX_DEST=~{output_file}.tbi
fi

# Parent Task will collect manifest lines and create a joined file
# Currently, the schema is `[interval_number], [output_file_location], [output_file_size_bytes], [output_file_index_location], [output_file_size_bytes]`
echo ${INTERVAL_NUMBER},${OUTPUT_FILE_DEST},${OUTPUT_FILE_BYTES},${OUTPUT_FILE_INDEX_DEST},${OUTPUT_FILE_INDEX_BYTES} >> manifest.txt
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not totally sold on this column's necessity. What are your thoughts? Is it needed in order to produce an ordered/merged manifest, i.e. for the sort -n later?

If we do need it, then I would use more generic terminology here, e.g. shard_index or output_index. I think the fact that this corresponds to a set of intervals currently is not relevant to or usable by an end consumer

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK with dropping it. I mostly added it for the sort functionality but also didn't think it would hurt to add.

>>>

# ------------------------------------------------
Expand All @@ -218,6 +239,7 @@ task ExtractTask {
Float output_vcf_bytes = read_float("vcf_bytes.txt")
File output_vcf_index = "~{output_file}.tbi"
Float output_vcf_index_bytes = read_float("vcf_index_bytes.txt")
String manifest = read_string("manifest.txt")
}
}

Expand Down Expand Up @@ -354,3 +376,37 @@ task SumBytes {
cpu: 1
}
}

task CreateManifest {

input {
Array[String] manifest_lines
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are these ordered? If so, I suppose this task could also be responsible for numbering the rows.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah. I wanted them to be ordered which is why I run it through sort -n. It's also why I have the index as the first column.. I can see if I can try something smarter by using a Map in WDL.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't find a way to make this work with the built in WDL functions.. I can package the interval_index and manifest line with their Pair object but I don't see a reasonable way to actually print the values out into the command block. Given that, I'm going to pass the index through as part of the manifest line like I was doing before but I am at least passing IN the index as part of the scatter call instead of parsing it from the filename.

String? output_gcs_dir
}

command <<<
set -e
MANIFEST_LINES_TXT=~{write_lines(manifest_lines)}
echo "interval_number, vcf_file_location, vcf_file_bytes, vcf_index_location, vcf_index_bytes" >> manifest.txt
sort -n ${MANIFEST_LINES_TXT} >> manifest.txt

# Drop trailing slash if one exists
OUTPUT_GCS_DIR=$(echo ~{output_gcs_dir} | sed 's/\/$//')

if [ -n "${OUTPUT_GCS_DIR}" ]; then
gsutil cp manifest.txt ${OUTPUT_GCS_DIR}/
fi
>>>

output {
File manifest = "manifest.txt"
}

runtime {
docker: "gcr.io/google.com/cloudsdktool/cloud-sdk:305.0.0"
memory: "3 GB"
disks: "local-disk 10 HDD"
preemptible: 3
cpu: 1
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -168,7 +168,7 @@ def get_subselect(fq_vet_table, samples, id):
execute_with_retry("create and populate vet new table", sql)
else:
execute_with_retry("populate vet new table", sql)
return
return



Expand Down