Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make tree for 450bp of the N gene ("N450") #20

Merged
merged 11 commits into from
Apr 1, 2024
Merged

Conversation

kimandrews
Copy link
Contributor

Description of proposed changes

The goal of this PR is to create a tree using a 450bp region of the N gene ("N450") that is highly represented on NCBI for measles. Addresses github issue #13

General steps include:

  • Add a reference sequence that is 450bp of the N gene from the reference that is used for the whole-genome tree
  • Align all sequences against the N450 reference using Nextclade
  • Filter by alignment length
  • Subsample by date and geography
  • Use Snakemake wildcards to generate and export phylogeny

Many of these changes follow those made for E gene trees in the dengue repo

Related issue(s)

#13

Checklist

  • Checks pass

Reference is comprised of 450bp of the N gene from the same sample that is used for the genome tree (NCBI Accession NC_001498.1).
* Add rule to align sequences to N450 reference using nextclade
* Add rule to filter by length, date, country
@kimandrews kimandrews requested a review from a team March 22, 2024 22:22
Copy link
Contributor

@j23414 j23414 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worked on my computer:

git clone https://github.com/nextstrain/measles.git
cd measles
git checkout add-N450-tree
nextstrain build phylogenetic
nextstrain view phylogenetic/auspice

And I can see many more geolocations represented in the N450 tree.

Screenshot 2024-03-23 at 2 09 38 PM

colors: "defaults/colors.tsv"
auspice_config: "defaults/auspice_config.json"
filter:
group_by: "country year month"
sequences_per_group: 20
min_date: 1950
min_length: 5000
filter_N450:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[minor, not blocking]

@joverlee521 do you have a canonical way we should structure build-specific rule parameters (e.g. filter vs filter_N450) when we want to use a single config file for both/all builds?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not (yet?), the pathogen-repo-guide has been based on one build per config file.

Existing workflows use three different patterns

  1. seasonal flu has top level per build configs.
This would look like:
builds:
    genome:
        filter:
            group_by: ...
            sequences_per_group: ...
            [...]
    N450:
        filter:
            group_by: ...
            sequences_per_group: ...
            [...]
  1. ncov has build configs nested within rule groupings.
This would look like:
filter:
    genome:
        group_by: ...
        sequences_per_group: ...
        [...]
    N450:
        group_by: ...
        sequences_per_group: ...
        [...]
  1. rsv nests build names within specific config parameters.
This would look like:
filter:
    group_by:
        genome: ...
        N450: ...
    sequences_per_group:
        genome: ...
        N450: ...
    [...]

[1] is the most flexible, allowing each build to define its own parameters. This makes it very easy to scan the config file for one build's parameters in a single place. However, since each param has to be defined per build, this can result in very long config files, which is why seasonal-flu has complex array-builds configs to programmatically create the configs during the workflow.

[2] is also pretty flexible, where each build can configure each parameter per rule grouping. There's less repetition of parameters so config files won't be as long, but a single build's config is spread throughout the rules groupings. It is also not very clear which rule configs can be configured per build and which rule configs are shared among builds.

[3] is the least flexible, as it only allows each build to change specific configs. This can also be confusing why some configs are nested while others are not.

sequences = "data/sequences.fasta",
reference = config["files"]["reference_N450_fasta"]
output:
sequences = "results/sequences_N450.fasta"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's preferable to organise builds as directories within results, e.g. "results/genome/sequences.fasta" and "results/N450/sequences.fasta". As it's an implementation detail, this change doesn't have to be made in this PR.

(This comment applies throughout the snakemake files added in this PR.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in a9d2644

tree = "results/tree.nwk",
node_data = "results/branch_lengths.json"
tree = "results/tree_{gene}.nwk",
node_data = "results/branch_lengths_{gene}.json"
params:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The following deprecation notice is present when running refine:

        DEPRECATION WARNING. TreeTime.resolve_polytomies: You are resolving
        polytomies using the old 'greedy' mode. This is not well suited for large
        polytomies. Stochastic resolution will become the default in future
        versions. To switch now, rerun with the flag `--stochastic-resolve`. To
        keep using the greedy method in the future, run with `--greedy-resolve`

Especially for the N450 build where we will have large polytomies we should use --stochastic-resolve.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in f39ba8b

aa_muts = "results/aa_muts.json",
branch_lengths = "results/branch_lengths_{gene}.json",
nt_muts = "results/nt_muts_{gene}.json",
aa_muts = "results/aa_muts_{gene}.json",
colors = config["files"]["colors"],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest removing all the "country" colours from the defaults/colors.tsv in this PR simply because a number of countries are missing colours. Auspice will pick a better set of colours than the current situation of some colours + some greys. We can then add nicer colours in a subsequent PR, as desired.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 119fbcf

tree = "results/tree_raw.nwk",
alignment = "results/aligned.fasta",
tree = "results/tree_raw_{gene}.nwk",
alignment = "results/aligned_{gene}.fasta",
metadata = "data/metadata.tsv"
output:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The temporal signal is much better in this build (and the build Trevor showed me) than the one we looked at mid-week - do you know what changed? This gives me more confidence that the (temporal) rooting is working ok for the N450 build - it's broadly similar to the genome build - and so there's no longer a need to start in "unrooted" view if you'd prefer to go back to the more typical rectangular view.

image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lower temporal signal for the tree I showed you last week seems to be related to the subsampling method I had used. Here is the subsampling method I used for that tree:

filter_N450:
group_by: "country year month"
sequences_per_group: 2
min_date: 1950
min_length: 400

And here is the clock view for a tree I generated today using that subsampling method:

image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed default display back to rooted time-tree in 862dbaf

"type": "continuous"
},
{
"key": "author",
Copy link
Member

@jameshadfield jameshadfield Mar 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Not blocking this PR]

We used to add "author" as a color-by as a workaround so that it appeared as a filtering option in Auspice, however a couple of recent changes have rendered this pattern unnecessary. We can now use "metadata_columns" in the auspice-config JSON (or --metadata-columns) to export "author" and Auspice will automatically add it as a filtering option. This is nicer than exposing author as a coloring.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 8bea320

Copy link
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Latest changes look good to me!

I think this is good to merge and deploy the latest genome/N450 builds to nextstrain.org

@kimandrews kimandrews merged commit 9c1fea2 into main Apr 1, 2024
32 checks passed
@kimandrews kimandrews deleted the add-N450-tree branch April 1, 2024 23:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants