Add phylogenetic #8

j23414 · 2024-07-09T18:24:11Z

Description of proposed changes

Add phylogenetic workflow

Refactor workflow into a phylogenetic directory to match pathogen repo guide
Pull curated NCBI GenBank data from S3 instead of Fauna
Since original trees were built from S and L segments, add rules and configs to pull out segments from NCBI GenBank sequences using Nextclade
Update phylogenetic automation rules

The resulting phylogenetic trees are staged at:

Related issue(s)

Checklist

Checks pass

Post-merge clean up

As mentioned in nextstrain/conda-base#85 (comment), once this is merged, various downstream CIs will need to be updated:

joverlee521

Reviewed workflow bits only, will leave scientific review to others.

phylogenetic/rules/prepare_sequences_segments.smk

phylogenetic/defaults/auspice_config.json

phylogenetic/build-configs/ci/copy_example_data.smk

phylogenetic/rules/prepare_sequences_segments.smk

joverlee521 · 2024-07-19T20:07:05Z

phylogenetic/rules/export.smk

+        colors = "results/colors_{segment}.tsv"
+    shell:
+        """
+        python3 scripts/assign-colors.py \


non-blocking
Another reminder for us that there's an open issue to add this functionality to augur: nextstrain/augur#1185

phylogenetic/config/auspice_config.json

joverlee521 · 2024-07-19T20:18:02Z

phylogenetic/rules/export.smk

+        display_strain_field=config["display_strain_field"],
+    shell:
+        """
+        python3 scripts/set_final_strain_name.py \


non-blocking
Another reminder for us to get back to nextstrain/auspice#1668.

Is setting the strain name helpful for lassa? It looks like a majority of the strain names are the GenBank accession anyways.

In practice, strain names are commonly used on phylogenetic trees for two main reasons:

Communication: It's easier to discuss and refer to a named strain rather than a complex identifier.

Standardization: Using strain names encourages sample submitters to develop clear naming conventions.

For Lassa virus specifically, the strain name serves an additional important function. Since researchers can submit two separate GenBank samples per strain (one for the L segment and one for the S segment), the consistent use of strain names allows looking at tanglegrams of the segments.
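As a rough illustration of the pairing point, here is a minimal sketch of finding strains that have both an L and an S record, which is the set a tanglegram can actually connect. The record layout (dicts with `accession`, `strain`, and `segment` keys) and the example values are hypothetical, not taken from the curated metadata.

```python
from collections import defaultdict


def strains_with_both_segments(records):
    """Return {strain: {segment: accession}} for strains seen in both L and S."""
    by_strain = defaultdict(dict)
    for rec in records:
        # Lowercase the segment so "L" and "l" are treated the same.
        by_strain[rec["strain"]][rec["segment"].lower()] = rec["accession"]
    return {
        strain: segs
        for strain, segs in by_strain.items()
        if {"l", "s"}.issubset(segs)
    }


# Hypothetical records for illustration only.
records = [
    {"accession": "ACC0001", "strain": "strain-A", "segment": "L"},
    {"accession": "ACC0002", "strain": "strain-A", "segment": "S"},
    {"accession": "ACC0003", "strain": "ACC0003", "segment": "S"},
]
paired = strains_with_both_segments(records)
```

A record whose strain name is just its own accession can never pair with the other segment, which is why falling back to the accession loses the tanglegram link.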

Thanks for flagging! In my quick exploration, only 49% of the samples were getting strain names. After digging in, I rescued several more, so that percentage should now be higher.
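For anyone wanting to re-check that percentage, a minimal sketch follows. It assumes that "getting a strain name" means the `strain` value is non-empty and differs from the `accession`; that definition, the field names, and the example records are assumptions for illustration.

```python
def fraction_with_strain_name(records):
    """Fraction of records whose strain is non-empty and differs from the accession."""
    if not records:
        return 0.0
    named = sum(
        1
        for rec in records
        if rec.get("strain") and rec["strain"] != rec["accession"]
    )
    return named / len(records)


# Hypothetical records for illustration only.
example = [
    {"accession": "ACC0001", "strain": "strain-A"},
    {"accession": "ACC0002", "strain": "ACC0002"},
    {"accession": "ACC0003", "strain": ""},
    {"accession": "ACC0004", "strain": "strain-B"},
]
share = fraction_with_strain_name(example)
```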

Move lassa CI to the latest pathogen-repo-ci Depends on nextstrain/lassa#8

Following the pathogen-repo-guide
* https://github.com/nextstrain/pathogen-repo-guide/tree/e3bfb52c8155058a3d48592f4268a7382bf3e12a

Copy the "copy_example_data" custom rules from the pathogen-repo-guide
* https://github.com/nextstrain/pathogen-repo-guide/tree/e3bfb52c8155058a3d48592f4268a7382bf3e12a/phylogenetic/build-configs/ci

Part of work to update this repo to match the pathogen-repo-guide.

Augur align detects the reference strain in both the reference file and the curated dataset, and throws a duplicate-strain error: `Duplicate strains of "KM822127" detected`. Usually I bypass this using `augur align --remove-reference`, but the error still shows up. Therefore, add a postfix to the reference IDs to bypass the error.
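The renaming idea can be sketched as below. This is not the rule added in this PR, only a minimal illustration; the function name and the `_ref` postfix are hypothetical.

```python
def add_reference_postfix(fasta_text, postfix="_ref"):
    """Append `postfix` to the ID (first token) of every FASTA header line."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Split the header into the ID and the optional free-text description.
            seq_id, _, description = line[1:].partition(" ")
            line = ">" + seq_id + postfix
            if description:
                line += " " + description
        out.append(line)
    return "\n".join(out) + "\n"


renamed = add_reference_postfix(">KM822127 example description\nACGT\n")
```

With the reference ID renamed, the reference no longer collides with the curated record of the same accession during alignment.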

To match the curated sequences, fixup example sequences to ID on accession.

Since there are more countries represented than in the original lassa build, autogenerate colors for geolocations. This was copied and modified from the "colors" rule in RSV's workflow:
* https://github.com/nextstrain/rsv/blob/a1788ce2c9c4375fb5a06d1426c64c45cf90225f/workflow/snakemake_rules/export.smk#L13-L27
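To show the general shape of autogenerated colors, here is a minimal sketch that spreads hues evenly over the distinct values and emits tab-separated lines of trait, value, and hex color. It is not the logic of `scripts/assign-colors.py`; the evenly spaced hue approach, the function name, and the example countries are assumptions swapped in for illustration.

```python
import colorsys


def assign_colors(trait, values):
    """Return TSV lines (trait, value, hex color), one evenly spaced hue per value."""
    ordered = sorted(set(values))
    lines = []
    for i, value in enumerate(ordered):
        hue = i / len(ordered)
        # Fixed lightness and saturation; only the hue varies between values.
        r, g, b = colorsys.hls_to_rgb(hue, 0.5, 0.6)
        hex_color = "#{:02x}{:02x}{:02x}".format(
            round(r * 255), round(g * 255), round(b * 255)
        )
        lines.append(f"{trait}\t{value}\t{hex_color}")
    return lines


# Hypothetical input for illustration only.
color_lines = assign_colors("country", ["Nigeria", "Sierra Leone", "Nigeria"])
```

Sorting the values first keeps the assignment stable between runs as long as the set of countries does not change.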

* Capitalize L and S to match ingest
* Refactor and place intermediate files in segment directories
* Match segment capitalization in reference files and example files

genehack · 2024-07-30T16:40:07Z

phylogenetic/defaults/description.md

@@ -1,5 +1,9 @@
 We gratefully acknowledge the authors, originating and submitting laboratories of the genetic sequences and metadata for sharing their work. Please note that although data generators have generously shared data in an open fashion, that does not mean there should be free license to publish on this data. Data generators should be cited where possible and collaborations should be sought in some circumstances. Please try to avoid scooping someone else's work. Reach out if uncertain.

+This work is made possible by the open sharing of genetic data by research groups, including these groups currently collecting Lassa sequences: [Christian Happi](http://acegid.org/), [Pardis Sabeti](https://www.sabetilab.org/), [Katherine Siddle](https://www.sabetilab.org/katherine-siddle/) and colleagues, whose data was shared via [this virological.org post](http://virological.org/t/new-lassa-virus-genomes-from-nigeria-2015-2016/191). If you intend to use these sequences prior to publication, please contact them directly to coordinate.
+
+The Irrua specialist Teaching Hospital (ISTH) and Institute for Lassa Fever Research and Control (ILFRC), Irrua, Edo State, Nigeria; The Bernhard-Nocht Institute for Tropical Medicine (BNITM), Hamburg, Germany; Public Health England (PHE); African Center of Excellence for Genomics of Infectious Disease (ACEGID ), Redeemer’s University, Ede, Nigeria; Broad Institute of MIT and Harvard University (Cambridge, MA, USA). For further details, including conditions of reuse, please contact [Ephraim Epogbaini](mailto:epogbaini@yahoo.com), [Stephan Günther](http://www.who.int/blueprint/about/stephan-gunther/en/), and [Philippe Lemey](https://rega.kuleuven.be/cev/ecv/lab-members/PhilippeLemey.html). Their data was first shared via [this virological.org post](http://virological.org/t/2018-lasv-sequencing-continued/192/8), which is continually updated.


couple nits:

fix cap

remove whitespace

for the other institutions, parenthetical followups are used for institutional abbreviations, not location -- standardize entry for Harvard/The Broad to match

Many of the institutions have active websites (e.g., ISTH = https://www.isth.org.ng/); consider linking to them.

Suggested change

-The Irrua specialist Teaching Hospital (ISTH) and Institute for Lassa Fever Research and Control (ILFRC), Irrua, Edo State, Nigeria; The Bernhard-Nocht Institute for Tropical Medicine (BNITM), Hamburg, Germany; Public Health England (PHE); African Center of Excellence for Genomics of Infectious Disease (ACEGID ), Redeemer’s University, Ede, Nigeria; Broad Institute of MIT and Harvard University (Cambridge, MA, USA). For further details, including conditions of reuse, please contact [Ephraim Epogbaini](mailto:epogbaini@yahoo.com), [Stephan Günther](http://www.who.int/blueprint/about/stephan-gunther/en/), and [Philippe Lemey](https://rega.kuleuven.be/cev/ecv/lab-members/PhilippeLemey.html). Their data was first shared via [this virological.org post](http://virological.org/t/2018-lasv-sequencing-continued/192/8), which is continually updated.

+The Irrua Specialist Teaching Hospital (ISTH) and Institute for Lassa Fever Research and Control (ILFRC), Irrua, Edo State, Nigeria; The Bernhard-Nocht Institute for Tropical Medicine (BNITM), Hamburg, Germany; Public Health England (PHE); African Center of Excellence for Genomics of Infectious Disease (ACEGID), Redeemer’s University, Ede, Nigeria; Broad Institute of MIT and Harvard University, Cambridge, MA, USA. For further details, including conditions of reuse, please contact [Ephraim Epogbaini](mailto:epogbaini@yahoo.com), [Stephan Günther](http://www.who.int/blueprint/about/stephan-gunther/en/), and [Philippe Lemey](https://rega.kuleuven.be/cev/ecv/lab-members/PhilippeLemey.html). Their data was first shared via [this virological.org post](http://virological.org/t/2018-lasv-sequencing-continued/192/8), which is continually updated.

Co-authored-by: John SJ Anderson <john@genehack.org>

joverlee521 · 2024-07-30T19:19:08Z

.github/workflows/ingest-to-phylogenetic.yaml

+            "s3://nextstrain-data/files/workflows/lassa/metadata_all.tsv.zst"
+            "s3://nextstrain-data/files/workflows/lassa/sequences_all.fasta.zst"


These URLs need to be updated based on the current upload config

Suggested change

-            "s3://nextstrain-data/files/workflows/lassa/metadata_all.tsv.zst"
-            "s3://nextstrain-data/files/workflows/lassa/sequences_all.fasta.zst"
+            "s3://nextstrain-data/files/workflows/lassa/all/metadata.tsv.zst"
+            "s3://nextstrain-data/files/workflows/lassa/all/sequences.fasta.zst"

Side question, should these check the L/S files since they are the files used by the phylogenetic workflow?

Good question! Considering that this same workflow in dengue only checks for the 'all' serotype, I believe this approach should be sufficient? Since the 'all', 'l', and 's' files are updated concurrently, they should equally trigger the phylogenetic workflow.

However, since there is no such thing as an 'all' tree for lassa (unless we concatenated segments) and if we later decide that the all dataset is not necessary for debugging, I could see using either 'l' or 's' instead, just in case.
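If the check were ever switched to the segment files, the list of URLs could be built per dataset, roughly as sketched here. Only the 'all' paths are confirmed by the suggested change in this thread; that the 'l' and 's' files follow the same `<dataset>/metadata.tsv.zst` layout is an assumption, and the function name is hypothetical.

```python
BASE = "s3://nextstrain-data/files/workflows/lassa"
FILENAMES = ("metadata.tsv.zst", "sequences.fasta.zst")


def urls_to_check(datasets=("all",)):
    """Build the S3 URLs to watch for each dataset name."""
    # Assumption: every dataset uses the same layout as the confirmed 'all' paths.
    return [f"{BASE}/{dataset}/{name}" for dataset in datasets for name in FILENAMES]


default_urls = urls_to_check()
segment_urls = urls_to_check(("l", "s"))
```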

joverlee521 · 2024-07-30T20:09:45Z

phylogenetic/Snakefile

@@ -8,11 +8,14 @@ workdir: workflow.current_basedir
 # Use default configuration values. Override with Snakemake's --configfile/--config options.
 configfile: "defaults/config.yaml"

-SEGMENTS = ["l", "s"]
+segments = ["L", "S"]


Flagging that the l -> L and s -> S change will require updating the nextstrain.org manifest for lassa

Geh, good to know. I can revert L -> l in phylogenetic and back up to ingest as that may be a more consistent-with-history solution.

Co-authored-by: Jover Lee <joverlee521@gmail.com>

…t live Pushing the phylogenetic build to staging instead of production, to allow time for SMEs to review the build before making it live. Make sure to update this to the live URL once the build is approved.

…all that meet a min length requirement

j23414 · 2024-08-02T13:42:20Z

This additional commit (7cde259) updates augur filter to allow visualization of all S (694) and L (1272) segments in Auspice, following @jameshadfield's suggestion during a call.

I removed some group-by filtering to include more data while adding a minimum length filter to ensure quality. This change enables a comprehensive view of the dataset.
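The effect of a minimum length filter can be sketched as follows. This is a standalone illustration, not the `augur filter` call used in the workflow; counting only A, C, G, and T toward the length, along with the threshold and example sequences, are assumptions made for the example.

```python
def filter_by_min_length(sequences, min_length):
    """Keep sequences with at least `min_length` unambiguous bases (A, C, G, T)."""
    kept = {}
    for name, seq in sequences.items():
        upper = seq.upper()
        # Ambiguous bases such as N and gap characters do not count toward the length.
        n_valid = sum(upper.count(base) for base in "ACGT")
        if n_valid >= min_length:
            kept[name] = seq
    return kept


# Hypothetical sequences and threshold for illustration only.
sequences = {"short_with_ns": "ACGTNN", "long_enough": "acgtACGT"}
kept = filter_by_min_length(sequences, min_length=5)
```

Because no group-by subsampling is applied, every record that passes the length check stays in the build.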

j23414 force-pushed the add-phylogenetic branch from 62a768d to e63e922 Compare July 10, 2024 18:16

j23414 changed the title ~~[DO NOT MERGE] Add phylogenetic~~ Add phylogenetic Jul 10, 2024

j23414 marked this pull request as ready for review July 10, 2024 18:48

j23414 force-pushed the add-phylogenetic branch from 880e067 to 55e1a91 Compare July 15, 2024 11:51

Base automatically changed from add-ingest to master July 15, 2024 13:45

j23414 force-pushed the add-phylogenetic branch from 55e1a91 to 963be62 Compare July 18, 2024 15:59

joverlee521 mentioned this pull request Jul 18, 2024

CI: Update test-pathogen-repo-ci nextstrain/conda-base#85

Merged

1 task

joverlee521 requested changes Jul 19, 2024


joverlee521 added a commit to nextstrain/conda-base that referenced this pull request Jul 19, 2024

Update lassa CI

70aaa96

Move lassa CI to the latest pathogen-repo-ci Depends on nextstrain/lassa#8

joverlee521 mentioned this pull request Jul 19, 2024

Update lassa CI nextstrain/conda-base#87

Merged

1 task

joverlee521 added a commit to nextstrain/docker-base that referenced this pull request Jul 19, 2024

Update lassa CI

e7fa516

Move lassa CI to the latest pathogen-repo-ci Depends on nextstrain/lassa#8

joverlee521 mentioned this pull request Jul 19, 2024

Update lassa CI nextstrain/docker-base#225

Merged

1 task

joverlee521 added a commit to nextstrain/augur that referenced this pull request Jul 19, 2024

Update lassa CI

77e5789

Move lassa CI to the latest pathogen-repo-ci Depends on nextstrain/lassa#8

joverlee521 mentioned this pull request Jul 19, 2024

Update lassa CI nextstrain/augur#1554

Merged

4 tasks

joverlee521 added a commit to nextstrain/conda-base that referenced this pull request Jul 22, 2024

Update lassa CI

754bc8c

Move lassa CI to the latest pathogen-repo-ci Depends on nextstrain/lassa#8

joverlee521 added a commit to nextstrain/docker-base that referenced this pull request Jul 22, 2024

Update lassa CI

8e598b2

Move lassa CI to the latest pathogen-repo-ci Depends on nextstrain/lassa#8

joverlee521 added a commit to nextstrain/augur that referenced this pull request Jul 22, 2024

Update lassa CI

2cdd3a3

Move lassa CI to the latest pathogen-repo-ci Depends on nextstrain/lassa#8

This was referenced Jul 22, 2024

Fix missing strain names for 1351 records #11

Merged

Separate curated data by L and S segments #12

Merged

j23414 linked an issue Jul 24, 2024 that may be closed by this pull request

add phylogenetic workflow #13

Closed

j23414 added 9 commits July 29, 2024 14:13

Move phylogenetic workflow to phylogenetic directory

53246bf

Following the pathogen-repo-guide
* https://github.com/nextstrain/pathogen-repo-guide/tree/e3bfb52c8155058a3d48592f4268a7382bf3e12a

Add copy example data custom rules

bbb7e77

Copy the "copy_example_data" custom rules from the pathogen-repo-guide
* https://github.com/nextstrain/pathogen-repo-guide/tree/e3bfb52c8155058a3d48592f4268a7382bf3e12a/phylogenetic/build-configs/ci

Since lassa has S and L segments

4b3c822

Update the CI

cf59a92

Move rules for preparing sequences to its own smk file

ecb6aa3

Part of work to update this repo to match the pathogen-repo-guide.

Move rules for constructing phylogeny to its own smk file

1fd7d55

Part of work to update this repo to match the pathogen-repo-guide.

Move rules for annotating phylogeny to its own smk file

c3fa8f6

Part of work to update this repo to match the pathogen-repo-guide.

Move rule for exporting auspice json to its own smk file

ee0135a

Part of work to update this repo to match the pathogen-repo-guide.

Move config values to config file

c078718

j23414 force-pushed the add-phylogenetic branch from 9c14a80 to 39b1a28 Compare July 29, 2024 19:07

j23414 added 2 commits July 29, 2024 18:15

Add description statement

003ecfc

Copy phylogenetic instructions from pathogen-repo-guide

4d5aeec

j23414 force-pushed the add-phylogenetic branch from 39b1a28 to ea6c231 Compare July 29, 2024 23:10

j23414 added 11 commits July 30, 2024 06:44

Download sequences and metadata from data.nextstrain.org

d81791c

Pass curated GenBank data through the rest of pipeline

d7b5931

Fixup: Add description statement

543de0b

Fixup example sequences to ID on accession

de8645d

To match the curated sequences, fixup example sequences to ID on accession.

Fixup AmbiguousRuleException

fa12fbd

Display strain name on tree

8ba2317

Attribution

2553ebc

Add phylogenetic automation and deploy

689800e

Separate files into segment directories

f818c4b

* Capitalize L and S to match ingest
* Refactor and place intermediate files in segment directories
* Match segment capitalization in reference files and example files

j23414 force-pushed the add-phylogenetic branch from ea6c231 to f818c4b Compare July 30, 2024 10:45

Update description to match https://nextstrain.org/lassa/s

e4d25fb

genehack approved these changes Jul 30, 2024


Fixup: Update description to match https://nextstrain.org/lassa/s

ecd6ac9

Co-authored-by: John SJ Anderson <john@genehack.org>

joverlee521 reviewed Jul 30, 2024


j23414 and others added 5 commits July 31, 2024 16:06

Update .github/workflows/ingest-to-phylogenetic.yaml

3eb4a8d

Co-authored-by: Jover Lee <joverlee521@gmail.com>

ingest: Switch to lowercase segment names

7e177ea

phylogenetic: Switch to lowercase segment names

072da67

Since number of S and L segment sequences are both below 5k, include …

7cde259

…all that meet a min length requirement

j23414 merged commit 86cf85b into master Aug 2, 2024
4 checks passed

j23414 deleted the add-phylogenetic branch August 2, 2024 14:24

joverlee521 added a commit to nextstrain/conda-base that referenced this pull request Aug 2, 2024

Update lassa CI

51bff8f

Move lassa CI to the latest pathogen-repo-ci Depends on nextstrain/lassa#8

joverlee521 added a commit to nextstrain/augur that referenced this pull request Aug 2, 2024

Update lassa CI

7815f2d

Move lassa CI to the latest pathogen-repo-ci Depends on nextstrain/lassa#8
