Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pangolin flag #593

Merged
merged 14 commits into from
Apr 12, 2021
Merged

Pangolin flag #593

merged 14 commits into from
Apr 12, 2021

Conversation

sidneymbell
Copy link
Collaborator

Draft implementation for fixing #557.

I need to spend time debugging and testing, but would love any high-level feedback on the approach if you have a minute.

cc @ttung

Arguably should be going through "combine_metadata" but I can't quite figure out how to do that properly in this context.
@sidneymbell
Copy link
Collaborator Author

sidneymbell commented Apr 1, 2021

Alright, this branch now:
A. Does not run pangolin if the config flag is not present
B. Does run pangolin if the config flag is present
C. Produces the expected output pangolineages.csv with minimal runtime increase
D. Attempts to export this output via node_data.

I have two main questions (probably for @huddlej or @jameshadfield ?) before I land this:

  1. I believe the more appropriate way of exporting the information in pangolineages.csv is by adding it to the combined_metadata.tsv. Presumably this needs to be fed into combine_metadata somehow. However, this is rather confusing to me with the new multiple inputs setup.

Would love suggestions on how to implement this metadata addendum (I can then remove the make_pangolin_node_data.py script.

Sidenote: (D) is currently broken, but given that I don't think this is the path we want to go down, putting those questions aside for now.

  1. On my local instance, I've gone ahead and added pangolin to my nextstrain.yaml via the route suggested here. This works well, at least on my machine. However, I'm not sure this is necessarily a desired addition as it adds more dependencies that some users may not want?

I'm unfamiliar with the conda: argument in snakemake. Is this a way to specify a separate env file to be used for that specific rule? If so, does this seem like a reasonable approach to separating out the dependencies?

@rneher
Copy link
Member

rneher commented Apr 2, 2021

@sidneymbell

this looks mostly good, but the node data output needs be

{
"produced_by":"pangolin",
"nodes": {
  "tax1": {"pango": "B.34.2323.454"},
  "tax2": {"pango": "X.3.3.4"},
...
}
}

Your implementation is missing the explicit attribute name.

besides, we would need to figure out how this interacts with the Pango lineage assignment we do from GISAID metadata.

Copy link
Contributor

@huddlej huddlej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sidneymbell I made some specific comments below, but maybe most are not helpful if the goal is to not have a new node data JSON.

Is the most helpful outcome here to add a "Nextstrain-approved" pangolin lineage column to the combined metadata (local + GISAID, for example) prior to subsampling? Or is the goal more specifically for PHLs to reannotate their own metadata file(s) prior to filtering, subsampling, etc.? The specific implementation could be different depending on which of those use cases you need to support.

scripts/make_pangolin_node_data.py Outdated Show resolved Hide resolved
scripts/make_pangolin_node_data.py Outdated Show resolved Hide resolved
workflow/snakemake_rules/main_workflow.smk Outdated Show resolved Hide resolved
workflow/snakemake_rules/main_workflow.smk Outdated Show resolved Hide resolved
@sidneymbell
Copy link
Collaborator Author

Thanks so much, @rneher and @huddlej!

The goal for us is to enable our partner PHLs to see up-to-date pangolin lineage assignments for both the contextual gisaid data and their local data both (1) on the tree and (2, very impt) in the "download metadata" tsv file from the auspice frontend.

Would reimplementing the make_pangolin_node_data.py to instead feed through combine_metadata accomplish both (1) and (2)? If so, I'd love guidance on where/how to insert this in the mainworkflow.smk (or elsewhere...?).

Re: these assignments vs gisaid's, I'd propose having these pangolin assignments (if run) supercede the gisaid assignments for two reasons:

  1. these are more comprehensive as they also include local samples
  2. discrepancies between the gisaid and local pangolineages should be minor; if there are any discrepancies, taking all the lineages from one set at least keeps things internally consistent

@huddlej
Copy link
Contributor

huddlej commented Apr 2, 2021

The goal for us is to enable our partner PHLs to see up-to-date pangolin lineage assignments for both the contextual gisaid data and their local data both (1) on the tree and (2, very impt) in the "download metadata" tsv file from the auspice frontend

Perfect! It looks like most (maybe all?) node data annotations in the auspice JSON are available through Auspice's metadata download. Since the current workflow mainly flows from raw metadata and sequences toward node data JSONs aggregated into an auspice JSON, the simplest way to achieve the goal is to export the new Pangolin annotations in a node data JSON like you've setup here.

You don't have to choose between existing annotations and the re-annotation. For instance, you can provide an argument like --attribute-name to the script that converts the Pangolin output to a node data JSON and use the value of that argument to set the attribute name of the lineages in the output JSON. This would allow us to annotate lineages as a separate field or in the same field which will override the existing metadata values.

Someone from @nextstrain/core can add the pangolin package to the nextstrain Conda package and Docker image in separate PRs. Before we make those changes, maybe we can get this PR working fully with the --use-conda flag? I can help test on OS X and the Hutch Linux cluster.

@sidneymbell
Copy link
Collaborator Author

Thanks for your help, @huddlej !

Recent changes:

  • Node data export works, confirmed that it shows up in the metadata.tsv file (🎉 ) under the attribute name PANGO lineages - local (suggestions welcome)
  • Tidied up old comments, etc.
  • Added pangolin's dependencies to the yaml file, tested on my local machine that the --use-conda flag produces a clean build

Opening full PR for review, looking forward to any additional suggestions you may have.

@sidneymbell sidneymbell marked this pull request as ready for review April 2, 2021 23:51
@sidneymbell sidneymbell requested a review from huddlej April 2, 2021 23:51
Copy link
Contributor

@huddlej huddlej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This worked for me as it is, @sidneymbell, but I made a couple of requests for minor Snakemake surgery and a simpler conda environment file. Let me know if you have any questions. We can also chat online (offline?), if it helps to be more synchronous.

Next, we'll have to figure out the best way to get pangolin and pangolearn in the Nextstrain Docker image, but that can be a separate PR.

scripts/make_pangolin_node_data.py Outdated Show resolved Hide resolved
workflow/envs/nextstrain.yaml Outdated Show resolved Hide resolved
workflow/snakemake_rules/main_workflow.smk Outdated Show resolved Hide resolved
workflow/snakemake_rules/main_workflow.smk Outdated Show resolved Hide resolved
workflow/snakemake_rules/main_workflow.smk Outdated Show resolved Hide resolved
workflow/snakemake_rules/main_workflow.smk Show resolved Hide resolved
workflow/snakemake_rules/main_workflow.smk Show resolved Hide resolved
workflow/snakemake_rules/main_workflow.smk Show resolved Hide resolved
@sidneymbell sidneymbell requested a review from huddlej April 12, 2021 20:23
@sidneymbell
Copy link
Collaborator Author

As always, thanks for such a detailed, thoughtful and constructive review, @huddlej ! I believe I've implemented all of the suggested changes, and this runs well on my box using the --use-conda flag. LMK if you encounter any bugs or have additional suggestions. Thanks!

Copy link
Contributor

@huddlej huddlej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, @sidneymbell. I had only two small edits:

  • conda-forge has to appear first in the channel list of the conda environment
  • the run_pangolin rule needs to write out to the log file with 2>&1 | tee {log}

After those changes, we can merge! 🎉

At the last Nextstrain meeting, we discussed adding the pangolin tools to the base Nextstrain Docker image vs. maintaining a separate image for the ncov workflow. The consensus was to add these tools to the base Docker image, so I'll make a separate issue/PR for that change once I figure out how to install pangolin without Conda. :)

workflow/envs/nextstrain.yaml Outdated Show resolved Hide resolved
@sidneymbell sidneymbell requested a review from huddlej April 12, 2021 21:55
@sidneymbell sidneymbell merged commit 2cf2133 into nextstrain:master Apr 12, 2021
@sidneymbell sidneymbell deleted the pangolin-flag branch April 12, 2021 22:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants