Restrict contextual samples to 1 year back in 1m/2m/6m builds #1129
When subsampling in the Nextstrain GISAID profile, rather than treating early contextual samples as spanning from the origin of the pandemic to the beginning of the focal window (eg for the 6m analysis, from 2020 to 6m ago), instead use a consistent 24m of additional context. So, for 6m, this is context of 30m ago to 6m ago and focal of 6m ago to present. Additionally, reduce the amount of contextual sequences included from a 4:1 ratio of focal to context to a 10:1 ratio of focal to context.
Drop forced inclusion of the Wuhan/1 root in the Nextstrain GISAID profile and swap rooting to use "best", ie temporally optimal rooting. This allows the root to be the common ancestor of the subsampled sequences. With the changes to time-based subsampling in the previous commit, this means the "6m" analysis includes samples from the previous 30m and the TMRCA is in ~2021. This setup should be significantly more future-proof than needing to continually make new clade-specific (eg /21L/) roots as selective sweeps occur.
After working through the coloring option in PR #1132 I'm definitely more of a fan of the color update. Unless there are conflicting preferences, I'll plan to just close this PR.
I just added the "revisit sometime" label. As time accrues, leaving the older context all the way back to Wuhan will get increasingly clunky. In perhaps 6 months or a year we should implement something like this PR and include a
Description of proposed changes
Currently, we focus on a recent time window in many of our ncov analyses. For example, ncov/gisaid/north-america/6m does more intensive sampling of the previous 6 months (aiming for ~4000 "recent" samples and ~1000 "early" samples). However, as the past continues to recede (as it does) we're getting more and more early context that isn't so relevant for understanding circulating diversity. Here's the live 6m tree, for example:
Once selective sweeps occur, we can largely forget about past evolution when looking at current diversity. This had previously prompted us to make the "21L" builds that root to clade 21L / lineage BA.2.
At this point, we're getting closer and closer to wanting the same thing with a clade 24A / lineage JN.1 rooting. However, this strategy is clearly not sustainable. In 4 more years we don't want to have 4 different rootings, all of which require updating, with it being unclear to users which one they should be looking at.
This PR addresses the issue in a simple fashion, basically making ncov work more like seasonal-flu or avian-flu, where there are recent focal samples and older contextual samples, but the contextual samples only go back a year rather than many.
Here's the resulting global 6m tree:
Results from running this PR can be seen at:
etc...
This is using a 10:1 ratio of recent to early samples and going +1 year back for early samples. So for the 6m analysis, it's 0m to 6m back as recent focal samples and 6m to 18m back as early contextual samples.
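To make the window arithmetic concrete, here's a rough sketch in Python. It is not the workflow's actual implementation: `months_back`, `subsampling_plan`, and their parameters are hypothetical names for illustration, and the focal target of 4000 is just borrowed from the "~4000 recent samples" figure above as an assumption.

```python
import calendar
from datetime import date


def months_back(d: date, months: int) -> date:
    """Return the date `months` calendar months before `d`.

    The day is clamped to the last day of the target month
    (eg 31 March minus 1 month gives the last day of February).
    """
    index = d.year * 12 + (d.month - 1) - months
    year, month0 = divmod(index, 12)
    month = month0 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def subsampling_plan(today: date, focal_months: int, context_months: int = 12,
                     focal_target: int = 4000, ratio: int = 10) -> dict:
    """Hypothetical helper: focal/context date windows and sequence targets.

    focal_target and ratio are illustrative assumptions, not values read
    from the workflow config.
    """
    focal_start = months_back(today, focal_months)
    context_start = months_back(today, focal_months + context_months)
    return {
        "focal": (focal_start, today),            # eg 6m ago to present
        "context": (context_start, focal_start),  # eg 18m ago to 6m ago
        "focal_sequences": focal_target,
        "context_sequences": focal_target // ratio,  # 10:1 focal to context
    }
```

Calling `subsampling_plan(date(2024, 6, 15), focal_months=6)` gives a focal window starting 2023-12-15 and a context window starting 2022-12-15. Passing `context_months=24` gives the +2 year variant for comparison.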
I had tried a +2 year context as well, but it didn't seem to add much understanding while taking up additional screen real estate and additional color ramp. You can compare here however: global/6m
The biggest worry I see here is that people currently landing at ncov/gisaid/global/6m can see what's currently circulating and get all the context that they may need going back to the beginning of the pandemic (with well-known VOCs, etc...), and this change would remove that older context.
If we did merge this, we should make two showcase cards on the splash page to direct to 6m vs all-time to (partially) address this. Also, if we did merge this, I'd imagine deprecating the 21L builds, where we'd remove them from the automated GitHub Actions rebuild, remove them from the manifest and add redirects to go from ncov/gisaid/21L/global/6m to ncov/gisaid/global/6m.
The other approach to this same issue would be to take older clades (and conceivably Pango lineages) and make these gray while keeping the color ramp only for more recent clades. The two strategies are not mutually exclusive and we could do both or neither. I'll try to put together a separate PR for the clade colors idea. Though even if the colors update fixes things enough for the time being, I do think we'll eventually want to do something like this strategy. But it's possible this is a couple of years down the road.
In addition to code review, I'd appreciate 👍 / 👎 feedback on whether you prefer this to the current sampling strategy.
Testing
Tested locally and via GitHub Action trial builds.
Release checklist
If this pull request introduces new features, complete the following steps:
Update docs/src/reference/change_log.md in this pull request to document these changes by the date they were added.