Option to filter sites/columns in alignment based on percentage of identity #5
Hi Romain,
Thanks for using goalign and for your suggestion.
Frederic
Hi Frederic,
For the conservation score I was thinking of:
Ideally, the option to filter columns based on this score would allow setting a "conservation score" threshold for inclusion/exclusion, and would also allow specifying whether the score is calculated from the most abundant character in the column or from a specific character (-, A, N...). The application I have in mind is custom filtering of bacterial whole-genome alignments to get core variant alignments, which are commonly used to build phylogenetic trees. In this particular case people usually remove all identical columns to retain only variant sites (e.g. remove columns with a 100% conservation score) and filter out all columns that contain one or more gaps (-). With the option I suggest, one could fully control how columns are filtered to obtain the "core" variant alignment (e.g. remove invariant columns with a 100% identity score, retain positions with <1% gaps or deletions (-), remove columns with >2% ambiguous positions (N)...). There is currently no alignment processing tool that allows this level of filtering flexibility.
I hope my explanation makes sense.
Romain
Hi Romain,
I think I understand, thanks for the suggestion. I tried with the option `--char` to `goalign clean sites`.
- With `--char MAJ --cutoff 1`, it will remove sites with 100% of the most abundant character at these positions (100% invariant sites).
- With `--char MAJ --cutoff 0.9`, it will remove sites having >= 90% of the most abundant character at these positions.
- With `--char N --cutoff 0.1`, it will remove sites having >= 10% N.
You can try this test version under Linux: [goalign_amd64_linux.zip](https://github.com/evolbioinfo/goalign/files/4568782/goalign_amd64_linux.zip)
Do not hesitate to tell me what you think about that option,
Frederic
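For readers following the thread, the three invocations described above would look roughly like the sketch below. The `--char`/`--cutoff` combinations are taken from the comment; the `-i`/`-o` input/output flags and the file names are assumptions based on goalign's usual command-line interface.

```sh
# Sketch only: --char/--cutoff come from the comment above;
# the -i/-o flags and file names (aln.fasta, out.fasta) are assumptions.

# Remove sites where the most abundant character reaches 100% (invariant sites)
goalign clean sites --char MAJ --cutoff 1 -i aln.fasta -o out.fasta

# Remove sites where the most abundant character reaches >= 90%
goalign clean sites --char MAJ --cutoff 0.9 -i aln.fasta -o out.fasta

# Remove sites containing >= 10% N
goalign clean sites --char N --cutoff 0.1 -i aln.fasta -o out.fasta
```

This maps onto Romain's use case: the first command drops invariant columns, and variants of the third (with other characters and cutoffs) cover gap- or N-based thresholds.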
Hi Fred, hi Romain,
But usually one wants to *retain* sites that are rather well conserved, not
filter them out. There is a whole body of literature about alignment
trimmers, some of the most famous being GBlocks, TrimAl and BMGE. Also,
most of them work with sliding windows to smoothen the conservation signal,
so that we don't keep one individual site inside a larger
non-conserved/noisy region, etc.
So this can reach quite far. I would suggest that the simplest filtering
scheme is to filter out sites having more than a certain percentage of gaps
*or* N (if nucleotide alignment) *or* X (if a.a. alignment).
I'm not sure what you want to do with the identity of the most common
character in a site. It doesn't matter too much, but making decisions based on
the ratio of sequences having that character out of the total number of
sequences doesn't work, at least for amino acid sequences: e.g. a site
comprising 1/3 I, 1/3 L and 1/3 V is very well conserved, because
isoleucine, leucine and valine have very similar physico-chemical
properties.
Cheers,
Jean-Baka
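Jean-Baka's "simplest scheme" also maps onto the `--char`/`--cutoff` style discussed in this thread. A rough sketch, assuming the multi-character `--char` sets mentioned later in the thread (e.g. `--char AG-`) and the usual `-i`/`-o` flags; the character sets, cutoffs and file names here are illustrative assumptions:

```sh
# Sketch: remove nucleotide sites with >= 20% gaps or N.
# The "N-" character set, the cutoff and the -i/-o flags are assumptions;
# check `goalign clean sites --help` for the exact semantics.
goalign clean sites --char N- --cutoff 0.2 -i nt_alignment.fasta -o trimmed.fasta

# Same idea for an amino-acid alignment: gaps or X.
goalign clean sites --char X- --cutoff 0.2 -i aa_alignment.fasta -o trimmed.fasta
```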
I think @rguerillot and @jean-baka are both right, and there are different things one might want to do, which are problem- and data-dependent. For the case of COVID-19 we have 1000s of consensus genomes, but most are patchy with NNN runs (low depth) and some gaps.
In the COVID case with 1000s of samples, removing a column (site) that is nearly invariant but not quite is a good idea, as the singleton/outlier is more likely to be error than real signal, and if kept will produce artificially long branch lengths. For this case, as @jean-baka said, we want the opposite of
1. Do we need an
2. I like the idea of pseudo-character classes
3. And then character class
4. Support fractions and absolute values
5. Do we need an option for case insensitive
Thanks @jean-baka and @tseemann for your interesting comments. I indeed thought about alignment trimmers, I should have mentioned them in my first comment. And maybe, even just for removing sites based on the number of gaps, proper alignment trimmers could be a better option.
In the case of COVID, there are cases where we want to remove mutations that are unique in their column, because they are more likely to be sequencing errors than real mutations (especially if there are several such columns). So I was not that surprised about that need. But indeed, it could necessitate more sophisticated methods.
For your suggestions @tseemann:
Other character class examples would be:
Maybe negative classes?
I've used
You referred to entropy - I notice EASEL has operations that use that. It's also potentially useful, but for fine-grained control (and ease of reviewer understanding) a clear cutoff (3 or less, or 10%) is easier to follow.
Just chiming in for the COVID case here. I think an option to mask a singleton site (rather than mask/remove a whole column) might be useful. The reason is that as our COVID alignments grow, there's a monotonic increase in the proportion of sites with singletons. So just removing the entire site means we'll be removing more and more data as time goes on.
I added the option
@fredericlemoine I can only install tagged releases. I'll wait until you formalise it.
@roblanf I would argue singleton columns will be less, or equally, likely as we sample more widely? The virus can only change so much, and we are sequencing all PCR+ve samples we get?
Heh, good point. I guess it will be something like logistic growth in that case. In a recent alignment of ~11K sequences, 24.5K of the 29.5K sites (~80%) were constant. So there's still a pretty big target of constant sites. If we really wanted to, we could downsample by date and plot the curve...
@fredericlemoine Here I am 3 years later, working on bacteria again, and encountering similar problems that I want to solve with
A common step in bacterial phylo is to only use the "core" sites. For example, a 95% fuzzy core would only keep sites with >= 95% A,C,G,T, and allow the remaining 5% to be GAP, N, X, or IUPAC, say.
I think what I need is for
But there is no
Is there a better way to do this?
@tseemann, you may try the code after the last commit. It should work with `--char ACGT` (or any set of characters, e.g. `--char AG-`) and with `--reverse`.
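If `--char` sets and `--reverse` behave as described, the 95% "fuzzy core" from the previous comment might be expressed roughly as below. This is a sketch rather than a verified command: the `-i`/`-o` flags and file names are assumptions, and the exact direction of the `--reverse` comparison should be checked against `goalign clean sites --help`.

```sh
# Sketch of a 95% "fuzzy core" filter:
# keep sites where >= 95% of sequences have A, C, G or T,
# i.e. drop sites where non-ACGT characters (gaps, N, X, IUPAC codes) exceed 5%.
# The -i/-o flags, file names and exact --reverse semantics are assumptions.
goalign clean sites --char ACGT --cutoff 0.95 --reverse -i whole_genome.aln -o core95.aln
```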
@tseemann I just pushed a new docker image with the changes, in case you don't want to build goalign:
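For anyone who prefers not to build goalign from source, pulling and running a Docker image would look something like the sketch below. The `evolbioinfo/goalign` name follows the project's usual Docker Hub repository, but the `dev` tag, the mounted paths and the entrypoint behaviour are assumptions, since the exact image referred to above is not preserved here.

```sh
# Sketch: pull a development image and run goalign from the container.
# The "dev" tag is a guess; depending on the image's entrypoint,
# the leading "goalign" in the run command may be unnecessary.
docker pull evolbioinfo/goalign:dev
docker run --rm -v "$PWD":/data -w /data evolbioinfo/goalign:dev \
  goalign clean sites --help
```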
Thanks for developing this great alignment processing tool.
I would be very interested if you could add an option to filter sites/columns in an alignment based on conservation across all sequences in the alignment (percentage of identity).
This would be very helpful for constructing "relaxed" core variants from multiple genome alignments.
Romain