Add phylogenetic #8

Merged
merged 31 commits into from
Aug 2, 2024
Changes from 24 commits
53246bf
Move phylogenetic workflow to phylogenetic directory
j23414 Jul 9, 2024
bbb7e77
Add copy example data custom rules
j23414 Jul 9, 2024
4b3c822
Since lassa has S and L segments
j23414 Jul 9, 2024
cf59a92
Update the CI
j23414 Jul 9, 2024
ecb6aa3
Move rules for preparing sequences to its own smk file
j23414 Jul 9, 2024
1fd7d55
Move rules for constructing phylogeny to its own smk file
j23414 Jul 9, 2024
c3fa8f6
Move rules for annotating phylogeny to its own smk file
j23414 Jul 9, 2024
ee0135a
Move rule for exporting auspice json to its own smk file
j23414 Jul 9, 2024
c078718
Move config values to config file
j23414 Jul 9, 2024
05dcd7d
Update augur export v1 to v2
j23414 Jul 9, 2024
5bfd527
Move config to defaults to match pathogen-repo-guide
j23414 Jul 9, 2024
003ecfc
Add description statement
j23414 Jul 9, 2024
4d5aeec
Copy phylogenetic instructions from pathogen-repo-guide
j23414 Jul 9, 2024
d81791c
Download sequences and metadata from data.nextstrain.org
j23414 Jul 10, 2024
d7b5931
Pass curated GenBank data through the rest of pipeline
j23414 Jul 10, 2024
ee21b9f
Bypass duplicate reference strain detected
j23414 Jul 10, 2024
543de0b
Fixup: Add description statement
j23414 Jul 10, 2024
de8645d
Fixup example sequences to ID on accession
j23414 Jul 10, 2024
fa12fbd
Fixup AmbiguousRuleException
j23414 Jul 10, 2024
c5f87ae
Add rule to autogenerate colors
j23414 Jul 10, 2024
8ba2317
Display strain name on tree
j23414 Jul 10, 2024
2553ebc
Attribution
j23414 Jul 10, 2024
689800e
Add phylogenetic automation and deploy
j23414 Jul 10, 2024
f818c4b
Separate files into segment directories
j23414 Jul 29, 2024
e4d25fb
Update description to match https://nextstrain.org/lassa/s
j23414 Jul 30, 2024
ecd6ac9
Fixup: Update description to match https://nextstrain.org/lassa/s
j23414 Jul 30, 2024
3eb4a8d
Update .github/workflows/ingest-to-phylogenetic.yaml
j23414 Jul 31, 2024
7e177ea
ingest: Switch to lowercase segment names
j23414 Jul 31, 2024
072da67
phylogenetic: Switch to lowercase segment names
j23414 Jul 31, 2024
81d1cd1
Stage the phylogenetic build to get feedback from SME before making i…
j23414 Jul 31, 2024
7cde259
Since number of S and L segment sequences are both below 5k, include …
j23414 Aug 2, 2024
13 changes: 10 additions & 3 deletions .github/workflows/ci.yaml
@@ -1,9 +1,16 @@
 name: CI

 on:
-  - push
-  - pull_request
+  push:
+    branches:
+      - main
+  pull_request:
+  workflow_dispatch:
+  # Routinely check that we continue to work in the face of external changes.
+  schedule:
+    # Every day at 18:37 UTC / 10:37 Seattle (winter) / 11:37 Seattle (summer)
+    - cron: "37 18 * * *"

 jobs:
   ci:
-    uses: nextstrain/.github/.github/workflows/pathogen-repo-ci.yaml@v0
+    uses: nextstrain/.github/.github/workflows/pathogen-repo-ci.yaml@master
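
The comment on the new `schedule` trigger gives the run time in UTC and in Seattle local time. A quick way to sanity-check that arithmetic is sketched below. It assumes fixed offsets of UTC-8 (winter) and UTC-7 (summer) instead of a time zone database, and `utc_to_offset` is a hypothetical helper name that is not part of this PR.

```shell
# Convert a UTC hour and minute to local time at a fixed whole-hour offset.
# Usage: utc_to_offset HOUR MINUTE OFFSET_HOURS
# Pass plain integers without leading zeros (e.g. 8, not 08).
utc_to_offset() {
  # Add 24 before taking the remainder so negative offsets wrap correctly.
  hour=$(( ($1 + $3 + 24) % 24 ))
  printf '%02d:%02d\n' "$hour" "$2"
}

# cron: "37 18 * * *" fires at 18:37 UTC.
utc_to_offset 18 37 -8   # Seattle in winter (assumed UTC-8)
utc_to_offset 18 37 -7   # Seattle in summer (assumed UTC-7)
```

GitHub evaluates cron schedules in UTC only, so the local time of the run shifts by an hour when daylight saving time starts or ends.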
102 changes: 102 additions & 0 deletions .github/workflows/ingest-to-phylogenetic.yaml
@@ -0,0 +1,102 @@
name: Ingest to phylogenetic

defaults:
  run:
    # This is the same as GitHub Action's `bash` keyword as of 20 June 2023:
    # https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idstepsshell
    #
    # Completely spelling it out here so that GitHub can't change it out from under us
    # and we don't have to refer to the docs to know the expected behavior.
    shell: bash --noprofile --norc -eo pipefail {0}

on:
  schedule:
    # Note times are in UTC, which is 1 or 2 hours behind CET depending on daylight savings.
    #
    # Note the actual runs might be late.
    # Numerous people were confused, about that, including me:
    #  - https://github.community/t/scheduled-action-running-consistently-late/138025/11
    #  - https://github.com/github/docs/issues/3059
    #
    # Note, '*' is a special character in YAML, so you have to quote this string.
    #
    # Docs:
    #  - https://docs.github.com/en/actions/learn-github-actions/events-that-trigger-workflows#schedule
    #
    # Tool that deciphers this particular format of crontab string:
    #  - https://crontab.guru/
    #
    # Runs at 5pm UTC (1pm EDT/10am PDT) since curation by NCBI happens on the East Coast.
    # We were running into invalid zip archive errors at 9am PDT, so hoping an hour
    # delay will lower the error frequency
    - cron: '0 17 * * *'

  workflow_dispatch:
    inputs:
      ingest_image:
        description: 'Specific container image to use for ingest workflow (will override the default of "nextstrain build")'
        required: false
      phylogenetic_image:
        description: 'Specific container image to use for phylogenetic workflow (will override the default of "nextstrain build")'
        required: false

jobs:
  ingest:
    permissions:
      id-token: write
    uses: ./.github/workflows/ingest.yaml
    secrets: inherit
    with:
      image: ${{ inputs.ingest_image }}

  # Check if ingest results include new data by checking for the cache
  # of the file with the results' Metadata.sha256sum (which should have been added within upload-to-s3)
  # GitHub will remove any cache entries that have not been accessed in over 7 days,
  # so if the workflow has not been run over 7 days then it will trigger phylogenetic.
  check-new-data:
    needs: [ingest]
    runs-on: ubuntu-latest
    outputs:
      cache-hit: ${{ steps.check-cache.outputs.cache-hit }}
    steps:
      - name: Get sha256sum
        id: get-sha256sum
        env:
          AWS_DEFAULT_REGION: ${{ vars.AWS_DEFAULT_REGION }}
        run: |
          s3_urls=(
            "s3://nextstrain-data/files/workflows/lassa/metadata_all.tsv.zst"
            "s3://nextstrain-data/files/workflows/lassa/sequences_all.fasta.zst"
Contributor

These URLs need to be updated based on the current upload config

Suggested change
- "s3://nextstrain-data/files/workflows/lassa/metadata_all.tsv.zst"
- "s3://nextstrain-data/files/workflows/lassa/sequences_all.fasta.zst"
+ "s3://nextstrain-data/files/workflows/lassa/all/metadata.tsv.zst"
+ "s3://nextstrain-data/files/workflows/lassa/all/sequences.fasta.zst"

Side question, should these check the L/S files since they are the files used by the phylogenetic workflow?

Contributor Author

@j23414 j23414 Jul 31, 2024

Good question! Considering that this same workflow in dengue only checks for the 'all' serotype, I believe this approach should be sufficient? Since the 'all', 'l', and 's' files are updated concurrently, they should equally trigger the phylogenetic workflow.

However, since there is no such thing as an 'all' tree for lassa (unless we concatenated segments) and if we later decide that the all dataset is not necessary for debugging, I could see using either 'l' or 's' instead, just in case.

          )

          # Code below is modified from ingest/upload-to-s3
          # https://github.com/nextstrain/ingest/blob/c0b4c6bb5e6ccbba86374d2c09b42077768aac23/upload-to-s3#L23-L29

          no_hash=0000000000000000000000000000000000000000000000000000000000000000

          for s3_url in "${s3_urls[@]}"; do
            s3path="${s3_url#s3://}"
            bucket="${s3path%%/*}"
            key="${s3path#*/}"

            s3_hash="$(aws s3api head-object --no-sign-request --bucket "$bucket" --key "$key" --query Metadata.sha256sum --output text 2>/dev/null || echo "$no_hash")"
            echo "${s3_hash}" | tee -a ingest-output-sha256sum
          done

      - name: Check cache
        id: check-cache
        uses: actions/cache@v4
        with:
          path: ingest-output-sha256sum
          key: ingest-output-sha256sum-${{ hashFiles('ingest-output-sha256sum') }}
          lookup-only: true

  phylogenetic:
    needs: [check-new-data]
    if: ${{ needs.check-new-data.outputs.cache-hit != 'true' }}
    permissions:
      id-token: write
    uses: ./.github/workflows/phylogenetic.yaml
    secrets: inherit
    with:
      image: ${{ inputs.phylogenetic_image }}
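
The "Get sha256sum" step splits each S3 URL into a bucket and a key with shell parameter expansion, then falls back to an all-zero placeholder when the `head-object` call fails. The sketch below isolates those two pieces so they can be tried without AWS access. `split_s3_url` is a hypothetical helper name, `false` stands in for a failed `aws s3api head-object` call, and the URL is the one from the suggested change in the review thread.

```shell
# Split an S3 URL into bucket and key using parameter expansion only.
# Prints the bucket on the first line and the key on the second.
split_s3_url() {
  s3path="${1#s3://}"      # strip the leading "s3://" scheme
  bucket="${s3path%%/*}"   # longest suffix match removed: text before the first "/"
  key="${s3path#*/}"       # shortest prefix match removed: text after the first "/"
  printf '%s\n%s\n' "$bucket" "$key"
}

split_s3_url "s3://nextstrain-data/files/workflows/lassa/all/metadata.tsv.zst"

# Placeholder of 64 zeros, the same length as a SHA-256 hex digest.
no_hash="$(printf '%064d' 0)"

# `false` simulates a failed lookup (assumption for illustration), so the
# `||` branch supplies the placeholder instead of aborting the script.
s3_hash="$(false 2>/dev/null || echo "$no_hash")"
echo "$s3_hash"
```

Because a missing object yields the placeholder and not an error, the hash file changes as soon as the real object appears, which changes the cache key and lets the phylogenetic job run.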
107 changes: 107 additions & 0 deletions .github/workflows/phylogenetic.yaml
@@ -0,0 +1,107 @@
name: Phylogenetic

defaults:
  run:
    # This is the same as GitHub Action's `bash` keyword as of 20 June 2023:
    # https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idstepsshell
    #
    # Completely spelling it out here so that GitHub can't change it out from under us
    # and we don't have to refer to the docs to know the expected behavior.
    shell: bash --noprofile --norc -eo pipefail {0}

on:
  workflow_call:
    inputs:
      image:
        description: 'Specific container image to use for phylogenetic workflow (will override the default of "nextstrain build")'
        required: false
        type: string

  workflow_dispatch:
    inputs:
      image:
        description: 'Specific container image to use for phylogenetic workflow (will override the default of "nextstrain build")'
        required: false
        type: string
      trial_name:
        description: |
          Trial name for deploying builds.
          If not set, builds will overwrite existing builds at s3://nextstrain-data/lassa*
          If set, builds will be deployed to s3://nextstrain-staging/lassa_trials_<trial_name>_*
        required: false
        type: string
      sequences_url:
        description: |
          URL for the sequences.fasta.zst file
          If not provided, will use default sequences_url from phylogenetic/defaults/config.yaml
        required: false
        type: string
      metadata_url:
        description: |
          URL for the metadata.tsv.zst file
          If not provided, will use default metadata_url from phylogenetic/defaults/config.yaml
        required: false
        type: string

jobs:
  set_config_overrides:
    runs-on: ubuntu-latest
    steps:
      - id: config
        name: Set config overrides
        env:
          TRIAL_NAME: ${{ inputs.trial_name }}
          SEQUENCES_URL: ${{ inputs.sequences_url }}
          METADATA_URL: ${{ inputs.metadata_url }}
        run: |
          config=""

          if [[ "$TRIAL_NAME" ]]; then
            config+=" deploy_url='s3://nextstrain-staging/lassa_trials_"$TRIAL_NAME"_'"
          fi

          if [[ "$SEQUENCES_URL" ]]; then
            config+=" sequences_url='"$SEQUENCES_URL"'"
          fi

          if [[ "$METADATA_URL" ]]; then
            config+=" metadata_url='"$METADATA_URL"'"
          fi

          if [[ $config ]]; then
            config="--config $config"
          fi

          echo "config=$config" >> "$GITHUB_OUTPUT"
    outputs:
      config_overrides: ${{ steps.config.outputs.config }}

  phylogenetic:
    needs: [set_config_overrides]
    permissions:
      id-token: write
    uses: nextstrain/.github/.github/workflows/pathogen-repo-build.yaml@master
    secrets: inherit
    with:
      # Starting with the default docker runtime
      # We can migrate to AWS Batch when/if we need to for more resources or if
      # the job runs longer than the GH Action limit of 6 hours.
      runtime: docker
      env: |
        NEXTSTRAIN_DOCKER_IMAGE: ${{ inputs.image }}
        CONFIG_OVERRIDES: ${{ needs.set_config_overrides.outputs.config_overrides }}
      run: |
        nextstrain build \
          phylogenetic \
            deploy_all \
            --configfile build-configs/nextstrain-automation/config.yaml \
            $CONFIG_OVERRIDES
      # Specifying artifact name to differentiate ingest build outputs from
      # the phylogenetic build outputs
      artifact-name: phylogenetic-build-output
      artifact-paths: |
        phylogenetic/auspice/
        phylogenetic/results/
        phylogenetic/benchmarks/
        phylogenetic/logs/
        phylogenetic/.snakemake/log/
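
The `set_config_overrides` job turns the optional dispatch inputs into a single `--config key='value' ...` string, which is empty when no input is set. The sketch below re-expresses that logic as a function so the resulting strings can be inspected locally. It is an illustration under assumptions, not the workflow's code: `build_config_overrides` is a hypothetical name, it uses POSIX `[ -n ... ]` tests in place of the workflow's bash `[[ ... ]]` and `+=`, it emits a single space after `--config`, and the `example.org` URL is a made-up value.

```shell
# Build the override string from three optional values.
# Usage: build_config_overrides TRIAL_NAME SEQUENCES_URL METADATA_URL
# Pass an empty string for any value that is not set.
build_config_overrides() {
  trial_name="$1"
  sequences_url="$2"
  metadata_url="$3"
  config=""

  if [ -n "$trial_name" ]; then
    config="$config deploy_url='s3://nextstrain-staging/lassa_trials_${trial_name}_'"
  fi

  if [ -n "$sequences_url" ]; then
    config="$config sequences_url='${sequences_url}'"
  fi

  if [ -n "$metadata_url" ]; then
    config="$config metadata_url='${metadata_url}'"
  fi

  # Only prepend the flag when at least one override exists, so an empty
  # result adds nothing to the `nextstrain build` command line.
  if [ -n "$config" ]; then
    config="--config$config"
  fi

  printf '%s\n' "$config"
}

build_config_overrides "test" "" ""
build_config_overrides "" "" "https://example.org/metadata.tsv.zst"
build_config_overrides "" "" ""
```

Leaving the string empty when nothing is overridden keeps scheduled runs on the defaults from the config files, while a manual dispatch with `trial_name` set deploys to the staging bucket.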