Make tree for 450bp of the N gene ("N450") #20

Merged
merged 11 commits into from
Apr 1, 2024
1 change: 1 addition & 0 deletions phylogenetic/Snakefile
@@ -5,6 +5,7 @@ rule all:
        auspice_json = "auspice/measles.json",

include: "rules/prepare_sequences.smk"
include: "rules/prepare_sequences_N450.smk"
include: "rules/construct_phylogeny.smk"
include: "rules/annotate_phylogeny.smk"
include: "rules/export.smk"
7 changes: 7 additions & 0 deletions phylogenetic/defaults/config.yaml
@@ -2,13 +2,20 @@ strain_id_field: "accession"
files:
  exclude: "defaults/dropped_strains.txt"
  reference: "defaults/measles_reference.gb"
  reference_N450: "defaults/measles_reference_N450.gb"
  reference_N450_fasta: "defaults/measles_reference_N450.fasta"
  colors: "defaults/colors.tsv"
  auspice_config: "defaults/auspice_config.json"
filter:
  group_by: "country year month"
  sequences_per_group: 20
  min_date: 1950
  min_length: 5000
filter_N450:
Member

[minor, not blocking]

@joverlee521 do you have a canonical way we should structure build-specific rule parameters (e.g. filter vs filter_N450) when we want to use a single config file for both/all builds?

Contributor

I do not (yet?); the pathogen-repo-guide has been based on one build per config file.

Existing workflows use three different patterns:

  1. seasonal flu has top-level per-build configs.
This would look like:
builds:
    genome:
        filter:
            group_by: ...
            sequences_per_group: ...
            [...]
    N450:
        filter:
            group_by: ...
            sequences_per_group: ...
            [...]
  2. ncov has build configs nested within rule groupings.
This would look like:
filter:
    genome:
        group_by: ...
        sequences_per_group: ...
        [...]
    N450:
        group_by: ...
        sequences_per_group: ...
        [...]
  3. rsv nests build names within specific config parameters.
This would look like:
filter:
    group_by:
        genome: ...
        N450: ...
    sequences_per_group:
        genome: ...
        N450: ...
    [...]

[1] is the most flexible, allowing each build to define its own parameters. This makes it very easy to scan the config file for one build's parameters in a single place. However, since each param has to be defined per build, this can result in very long config files, which is why seasonal-flu has complex array-builds configs to programmatically create the configs during the workflow.

[2] is also pretty flexible, since each build can configure each parameter per rule grouping. There's less repetition of parameters, so config files won't be as long, but a single build's config is spread across the rule groupings. It is also not very clear which rule configs can be set per build and which are shared among builds.

[3] is the least flexible, as it only allows each build to change specific configs. It can also be confusing why some configs are nested while others are not.
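To make the trade-offs concrete, here is a minimal Python sketch of how a workflow might look up one build's `filter` parameter under each of the three layouts. The helper names (`param_per_build`, `param_per_rule`, `param_per_value`) are hypothetical and do not exist in any of the workflows mentioned; the fallback to a shared value in the third helper is an assumption about how a mixed nested/flat config could be read, not a description of what rsv does.

```python
# Hypothetical lookups, one per config layout discussed above.
# These names are illustrative only and are not part of any Nextstrain workflow.

def param_per_build(config, build, rule, name):
    """Pattern 1: builds -> <build> -> <rule> -> <param>."""
    return config["builds"][build][rule][name]


def param_per_rule(config, build, rule, name):
    """Pattern 2: <rule> -> <build> -> <param>."""
    return config[rule][build][name]


def param_per_value(config, build, rule, name):
    """Pattern 3: <rule> -> <param> -> <build>.

    Assumption: a parameter that is not nested by build is shared by all builds.
    """
    value = config[rule][name]
    if isinstance(value, dict):
        return value[build]
    return value


# The same settings (values taken from config.yaml in this PR) in each layout.
pattern_1 = {
    "builds": {
        "genome": {"filter": {"min_length": 5000, "min_date": 1950}},
        "N450": {"filter": {"min_length": 400, "min_date": 1950}},
    }
}
pattern_2 = {
    "filter": {
        "genome": {"min_length": 5000, "min_date": 1950},
        "N450": {"min_length": 400, "min_date": 1950},
    }
}
pattern_3 = {
    "filter": {
        "min_length": {"genome": 5000, "N450": 400},
        "min_date": 1950,  # shared, so not nested by build
    }
}

for lookup, config in [
    (param_per_build, pattern_1),
    (param_per_rule, pattern_2),
    (param_per_value, pattern_3),
]:
    print(lookup.__name__, lookup(config, "N450", "filter", "min_length"))
```

Note that only the third layout needs the `isinstance` check, which is one way to see why it can be harder to tell which parameters are build-specific.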

  group_by: "country year"
  subsample_max_sequences: 3000
  min_date: 1950
  min_length: 400
refine:
  coalescent: "opt"
  date_inference: "marginal"
2 changes: 1 addition & 1 deletion phylogenetic/rules/prepare_sequences.smk
@@ -74,7 +74,7 @@ rule align:
         sequences = "results/filtered.fasta",
         reference = config["files"]["reference"]
     output:
-        alignment = "results/aligned.fasta"
+        alignment = "results/aligned_genome.fasta"
     shell:
         """
         augur align \
58 changes: 58 additions & 0 deletions phylogenetic/rules/prepare_sequences_N450.smk
@@ -0,0 +1,58 @@
"""
This part of the workflow prepares sequences for constructing the phylogenetic tree for 450bp of the N gene.

See Augur's usage docs for these commands for more details.
"""

rule align_and_extract_N450:
    input:
        sequences = "data/sequences.fasta",
        reference = config["files"]["reference_N450_fasta"]
    output:
        sequences = "results/sequences_N450.fasta"
Member

I think it's preferable to organise builds as directories within results, e.g. "results/genome/sequences.fasta" and "results/N450/sequences.fasta". As it's an implementation detail, this change doesn't have to be made in this PR.

(This comment applies throughout the snakemake files added in this PR.)
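As a rough illustration of the suggested layout, the sketch below builds per-build output paths of the form `results/<build>/<filename>`. The `result_path` helper is hypothetical and not part of this PR; in a Snakemake rule the same structure would more likely be expressed with a wildcard in the path string.

```python
from pathlib import PurePosixPath


def result_path(build, filename, root="results"):
    """Return '<root>/<build>/<filename>' (hypothetical helper, not in the PR)."""
    return str(PurePosixPath(root) / build / filename)


# Both builds share file names; only the directory differs.
for build in ["genome", "N450"]:
    print(result_path(build, "sequences.fasta"))
```

With this layout the file names stay identical across builds, so build-specific suffixes such as `_N450` are no longer needed.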

Contributor Author

done in a9d2644

    params:
        min_length = config['filter_N450']['min_length']
    shell:
        """
        nextclade run \
           -j 1 \
           --input-ref {input.reference} \
           --output-fasta {output.sequences} \
           --min-seed-cover 0.01 \
           --min-length {params.min_length} \
           --silent \
           {input.sequences}
        """

rule filter_N450:
    """
    Filtering to
      - at most {params.subsample_max_sequences} sequences in total, grouped by {params.group_by!s}
      - excluding strains in {input.exclude}
      - minimum sequence length of {params.min_length}
      - excluding strains with missing region, country or date metadata
    """
    input:
        sequences = "results/sequences_N450.fasta",
        metadata = "data/metadata.tsv",
        exclude = config["files"]["exclude"]
    output:
        sequences = "results/aligned_N450.fasta"
    params:
        group_by = config['filter_N450']['group_by'],
        subsample_max_sequences = config["filter_N450"]["subsample_max_sequences"],
        min_date = config["filter_N450"]["min_date"],
        min_length = config['filter_N450']['min_length'],
        strain_id = config["strain_id_field"]
    shell:
        """
        augur filter \
            --sequences {input.sequences} \
            --metadata {input.metadata} \
            --metadata-id-columns {params.strain_id} \
            --exclude {input.exclude} \
            --output {output.sequences} \
            --group-by {params.group_by} \
            --subsample-max-sequences {params.subsample_max_sequences} \
            --min-date {params.min_date} \
            --min-length {params.min_length}
        """