
Option to filter sites/columns in alignment based on percentage of identity #5

Open
rguerillot opened this issue Apr 30, 2020 · 15 comments


@rguerillot

Thanks for developing this great alignment processing tool.
I would be very interested if you could add an option that allows filtering sites/columns in an alignment based on conservation across all sequences in the alignment (e.g. percentage of identity).
This would be very helpful for constructing "relaxed" core variants from multiple genome alignments.

Romain

@fredericlemoine
Collaborator

Hi Romain,

Thanks for using goalign and for your suggestion.
How would you compute this conservation score? Would it be, for a given column, the number of sequences that differ from a reference? Or the number of different alleles, or the "entropy" of the column, for example?

Frederic

@rguerillot
Author

rguerillot commented May 2, 2020

Hi Frederic,

For the conservation score I was thinking of:

  • count of the most abundant character (base, amino acid, N, X, or -) in the column, divided by the total number of characters in the column.

Ideally, the option to filter columns based on this score would allow setting a "conservation score" threshold for inclusion/exclusion, and would also allow specifying whether the score is calculated from the most abundant character in the column or from a specific character (- or A or N...).
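(As a minimal Python sketch of the score described above, assuming one column is given as a string of per-sequence characters; the function name is illustrative, this is not goalign code:)

```python
from collections import Counter

def conservation_score(column):
    """Count of the most abundant character divided by the total
    number of characters in the column (gaps, N, X included)."""
    counts = Counter(column)
    return counts.most_common(1)[0][1] / len(column)

conservation_score("AAAAT")  # 4 A's out of 5 characters -> 0.8
```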

The application I have in mind is custom filtering of bacterial whole-genome alignments to get the core variant alignments commonly used to build phylogenetic trees. In this particular case, people usually remove all identical columns to retain only variant sites (e.g. remove columns with a 100% conservation score) and filter out all columns that contain one or more gaps (-).
The problem with this approach is that the number of columns remaining in the final core variant alignment gets smaller and smaller as you add more genomes, for various reasons (e.g. one genome has a deletion (-) at a site), and therefore the resolution of your phylogeny quickly becomes very poor when you have 1000s of genomes.

With the option I suggest, one could fully control how columns are filtered to obtain the "core" variant alignment (e.g. remove invariant columns with a 100% identity score, retain positions with <1% gaps or deletions (-), remove columns with >2% ambiguous positions (N)...).

There is currently no alignment processing tool that allows this level of filtering flexibility.

I hope my explanation makes sense.

Romain

@fredericlemoine
Collaborator

Hi Romain, I think I understand, thanks for the suggestion. I gave it a try by adding an option --char to goalign clean sites.

  • With --char MAJ --cutoff 1, it will remove sites with 100% of the most abundant character at these positions (100% invariant sites).
  • With --char MAJ --cutoff 0.9, it will remove sites having >= 90% of the most abundant character at these positions.
  • With --char N --cutoff 0.1, it will remove sites having >= 10% N.
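(The semantics described above could be sketched like this in Python; this is illustrative only, not goalign's actual implementation, and `keep_site` is a made-up name:)

```python
from collections import Counter

def keep_site(column, char, cutoff):
    """Return True if the site survives filtering: a site is removed
    when the frequency of `char` (or of the most abundant character,
    if char == "MAJ") is >= cutoff."""
    counts = Counter(column)
    if char == "MAJ":
        n = counts.most_common(1)[0][1]
    else:
        n = counts.get(char, 0)
    return n / len(column) < cutoff

keep_site("AAAA", "MAJ", 1.0)  # False: 100% invariant site, removed
keep_site("AATA", "MAJ", 1.0)  # True: site is variant, kept
keep_site("AANA", "N", 0.1)    # False: 25% N >= 10%, removed
```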

You can try this test version under linux:
goalign_amd64_linux.zip

Do not hesitate to tell me what you think about that option,

Frederic

@jean-baka

jean-baka commented May 3, 2020 via email

@tseemann

tseemann commented May 3, 2020

I think @rguerillot and @jean-baka are both right: there are different things one might want to do, depending on the problem and the data.

For the case of COVID-19 we have 1000s of consensus genomes, but most are patchy with NNN runs (low depth) and some gaps - (true deletion, or maybe 0 depth). Because this data is not high quality, the signal-to-noise ratio is low and false SNPs distort the phylogeny badly.

In the COVID case with 1000s of samples, removing a column (site) that is nearly invariant but not quite is a good idea, as the singleton/outlier is more likely to be error than signal, and if kept will produce artificially long branch lengths. For this case, as @jean-baka said, we want the opposite of
--char MAJ --cutoff 0.9,

1

Do we need an --inverse / -v option to invert the filtering?

2

I like the idea of pseudo-character classes like MAJ etc. It's elegant!
Maybe MAJ1 MAJ2 .. MAJn for most to least frequent?
I would like to be able to say "remove sites where MAJ2 (second most frequent allele) occurs < 3 times" (see 3 too)

3

And then character class MIN ?
And MIN1 MIN2 ... MINn for least to most frequent?
I think there are use cases for knowing the lowest freq allele too.
(MAJn is the same thing but we don't know n)
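(A minimal Python sketch of these ranked classes; `ranked_alleles` and `maj2_below` are illustrative names, not proposed goalign syntax:)

```python
from collections import Counter

def ranked_alleles(column):
    """(character, count) pairs for one column, most frequent first:
    index 0 would be MAJ (== MAJ1), index 1 MAJ2, ..., and the last
    entry MIN. Ties are broken arbitrarily here."""
    return Counter(column).most_common()

def maj2_below(column, min_count=3):
    """True if the second most frequent allele occurs < min_count
    times ('remove sites where MAJ2 occurs < 3 times')."""
    ranked = ranked_alleles(column)
    return len(ranked) > 1 and ranked[1][1] < min_count

maj2_below("AAATTG")  # True: MAJ2 is T with 2 occurrences, < 3
```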

4

Support fractions and absolute values.
Sometimes I want to discard a site if it has ONE (1) outlier.
I don't want to have to calculate the number of rows first and then convert it to a proportion.
i.e. if --cutoff is 0 to 1.0 (yes, 1.0) then treat it as a proportion;
if --cutoff is 1 (not 1.0) to INT_MAX then treat it as an absolute number.
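(The proposed cutoff interpretation could be sketched like this, using Python's int/float distinction to mirror "1 vs 1.0"; `cutoff_count` is a made-up name:)

```python
def cutoff_count(cutoff, n_rows):
    """Turn --cutoff into an absolute number of sequences:
    a float in [0, 1.0] is a proportion of the alignment depth,
    an integer >= 1 is already an absolute count."""
    if isinstance(cutoff, int) and cutoff >= 1:
        return cutoff            # absolute number of sequences
    return cutoff * n_rows       # proportion of n_rows

cutoff_count(1, 1000)    # 1: exactly one outlier sequence
cutoff_count(1.0, 1000)  # 1000.0: 100% of the sequences
```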

5

Do we need a case-insensitive option, -i?

@fredericlemoine
Collaborator

fredericlemoine commented May 3, 2020

Thanks @jean-baka and @tseemann for your interesting comments.

I indeed thought about alignment trimmers; I should have mentioned them in my first comment. And maybe, just for removing sites based on the number of gaps, proper alignment trimmers would be a better option.

In the case of COVID, there are cases where we want to remove mutations that are unique in their column, because they are more likely to be sequencing errors than real mutations (especially if there are several such columns). So I was not that surprised by that need. But indeed, it could require more sophisticated methods.

For your suggestions @tseemann:

  1. An inverse option would be interesting indeed. But I am not sure what you mean by the opposite of --char MAJ --cutoff 0.9. This option means that you remove sites where the most abundant character constitutes 90% or more of the site. Would the opposite be that you keep only such sites?
  2. Very interesting ideas about classes of characters. I will think about it.
  3. This is something I was planning to do, exactly for the case you mention of 1 occurrence.
  4. The case insensitive option would be interesting, I started thinking about it while writing the documentation.

@tseemann

tseemann commented May 3, 2020

Other character class examples would be:

  • DNA [AGTC]
  • DNAN [AGTCN]
  • IUPAC
  • any valid IUPAC code, eg. R, ie. count A + G

Maybe negative classes?

  • not DNA = --char ^DNA or similar

I've used trimal etc before but their focus seems to be on removing sequences, not columns. And trimal is very SLOW for the column operations.

You referred to entropy; I notice EASEL has operations that use it. It's also potentially useful, but for fine-grained control (and ease of reviewer understanding), a clear cut-off (3 or less, or 10%) is easier to follow.

@roblanf

roblanf commented May 3, 2020

Just chiming in for the COVID case here. I think an option to mask a singleton site (rather than mask/remove a column) might be useful.

The reason is as our COVID alignments grow, there's a monotonic increase in the proportion of sites with singletons. So just removing the entire site seems like we'll be removing more and more data as time goes on.

@fredericlemoine
Collaborator

fredericlemoine commented May 7, 2020

> Just chiming in for the COVID case here. I think an option to mask a singleton site (rather than mask/remove a column) might be useful.
>
> The reason is as our COVID alignments grow, there's a monotonic increase in the proportion of sites with singletons. So just removing the entire site seems like we'll be removing more and more data as time goes on.

I added the option --unique to goalign mask. It masks characters that are unique in their column (except gaps). If you try it (linux executable here), do not hesitate to tell me if it does not correspond to what you were suggesting.
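(A minimal Python sketch of that masking behavior for one column; illustrative only, not goalign's implementation, and the name `mask_unique` is made up:)

```python
from collections import Counter

def mask_unique(column, mask="N"):
    """Mask characters that occur exactly once in their column,
    leaving gaps ('-') untouched."""
    counts = Counter(column)
    return "".join(mask if c != "-" and counts[c] == 1 else c
                   for c in column)

mask_unique("AATA")  # "AANA": the singleton T is masked
mask_unique("AA-A")  # "AA-A": the gap is left as-is
```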

@tseemann

@fredericlemoine I can only install tagged releases. I'll wait until you formalise it.

@tseemann

@roblanf I would argue singleton columns will become less, or equally, likely as we sample more widely? The virus can only change so much, and we are sequencing all PCR+ve samples we get?

@roblanf

roblanf commented May 11, 2020

Heh, good point. I guess it will be something like logistic growth in that case.

In a recent alignment of ~11K sequences, 24.5K of the 29.5K sites (~80%) were constant. So there's still a pretty big target of constant sites. If we really wanted to, we could downsample by date and plot the curve...

@tseemann

tseemann commented Apr 15, 2023

@fredericlemoine Here I am 3 years later, working on bacteria again, and encountering similar problems that I want to solve with goalign clean sites, but I'm not sure how to.

A common step in bacterial phylo is to only use the "core" sites.
That is, sites (columns) which ONLY have A,C,G,T.
This is overly strict, so I prefer to do a "fuzzy core" which allows some % of non-A,C,G,T letters.

For example, a 95% fuzzy core would only keep sites with >= 95% A,C,G,T, and allow for 5% to be GAP, N, X, or IUPAC say.
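(The fuzzy-core criterion could be sketched like this in Python; `is_fuzzy_core` is an illustrative name, not goalign syntax:)

```python
def is_fuzzy_core(column, threshold=0.95, allowed="ACGT"):
    """Keep a site only if at least `threshold` of its characters
    are plain A/C/G/T (the '95% fuzzy core' criterion)."""
    ok = sum(c in allowed for c in column.upper())
    return ok / len(column) >= threshold

is_fuzzy_core("ACGTACGTACGTACGTACGN")  # True: 19/20 = 95% ACGT
is_fuzzy_core("ACGTNN--")              # False: only 50% ACGT
```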

I think what I need is for goalign clean sites --inverse --char ACGT --cutoff 0.05

But there is no --inverse and no way for --char to take a SET or CLASS of characters.

Is there a better way to do this?

@fredericlemoine
Collaborator

@tseemann , you may try the code after the last commit. It should work with --char ACGT (or any set of characters, e.g. --char AG-) and with --reverse.

@fredericlemoine
Collaborator

@tseemann I just pushed a new docker image with the changes, in case you don't want to build goalign:

docker pull evolbioinfo/goalign:dev


5 participants