
Option to filter sites/columns in alignment based on percentage of identity #5

Open
rguerillot opened this issue Apr 30, 2020 · 15 comments


@rguerillot

Thanks for developing this great alignment processing tool.
I would be very interested if you could add an option that allows filtering sites/columns in an alignment based on conservation across all sequences in the alignment (e.g. percentage of identity).
This would be very helpful for constructing "relaxed" core variants from multiple genome alignments.

Romain

@fredericlemoine
Collaborator

Hi Romain,

Thanks for using goalign and for your suggestion.
How would you compute this conservation score? Would it be, for a given column, the number of sequences that differ from a reference? Or the number of different alleles, or the "entropy" of the column, for example?

Frederic

@rguerillot
Author

rguerillot commented May 2, 2020

Hi Frederic,

For the conservation score I was thinking of:

  • count of the most abundant character (base, amino acid, N, X, or -) in the column, divided by the total number of characters in the column.

Ideally, the option to filter columns based on this score would allow setting a "conservation score" threshold for inclusion/exclusion, and would also allow specifying whether the score is calculated from the most abundant character in the column or from a specific character (- or A or N...).
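(As a minimal Python sketch of the score described above, assuming one column is given as a string of per-sequence characters; the function name is illustrative, this is not goalign code:)

```python
from collections import Counter

def conservation_score(column):
    """Count of the most abundant character divided by the total
    number of characters in the column (gaps, N, X included)."""
    counts = Counter(column)
    return counts.most_common(1)[0][1] / len(column)

conservation_score("AAAAT")  # 4 A's out of 5 characters -> 0.8
```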

The application I have in mind is custom filtering of bacterial whole-genome alignments to get the core variant alignments commonly used to build phylogenetic trees. In this particular case, people usually remove all identical columns to retain only variant sites (e.g. remove columns with a 100% conservation score) and filter out all columns that contain one or more gaps (-).
The problem with this approach is that the number of columns remaining in the final core variant alignment gets smaller and smaller as you add more genomes, for various reasons (e.g. one genome has a deletion (-) at a site), and therefore the resolution of your phylogeny quickly becomes very poor when you have 1000s of genomes.

With the option I suggest, one could fully control how columns are filtered to obtain the "core" variant alignment (e.g. remove invariant columns with a 100% identity score, retain positions with <1% gaps or deletions (-), remove columns with >2% ambiguous positions (N)...).

There is currently no alignment processing tool that allows this level of filtering flexibility.

I hope my explanation makes sense.

Romain

@fredericlemoine
Collaborator

Hi Romain, I think I understand, thanks for the suggestion. I gave it a try by adding an option --char to goalign clean sites.

  • With --char MAJ --cutoff 1, it will remove sites with 100% of the most abundant character at these positions (100% invariant sites).
  • With --char MAJ --cutoff 0.9, it will remove sites having >= 90% of the most abundant character at these positions.
  • With --char N --cutoff 0.1, it will remove sites having >= 10% N.
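(The semantics described above could be sketched like this in Python; this is illustrative only, not goalign's actual implementation, and `keep_site` is a made-up name:)

```python
from collections import Counter

def keep_site(column, char, cutoff):
    """Return True if the site survives filtering: a site is removed
    when the frequency of `char` (or of the most abundant character,
    if char == "MAJ") is >= cutoff."""
    counts = Counter(column)
    if char == "MAJ":
        n = counts.most_common(1)[0][1]
    else:
        n = counts.get(char, 0)
    return n / len(column) < cutoff

keep_site("AAAA", "MAJ", 1.0)  # False: 100% invariant site, removed
keep_site("AATA", "MAJ", 1.0)  # True: site is variant, kept
keep_site("AANA", "N", 0.1)    # False: 25% N >= 10%, removed
```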

You can try this test version under linux:
goalign_amd64_linux.zip

Do not hesitate to tell me what you think about that option,

Frederic

@jean-baka

jean-baka commented May 3, 2020 via email

@tseemann

tseemann commented May 3, 2020

I think @rguerillot and @jean-baka are both right: there are different things one might want to do, depending on the problem and the data.

For the case of COVID-19 we have 1000s of consensus genomes, but most are patchy with NNN runs (low depth) and some gaps - (true deletion, or maybe 0 depth). Because this data is not high quality, the signal-to-noise ratio is low and false SNPs distort the phylogeny badly.

In the COVID case with 1000s of samples, removing a column (site) that is nearly invariant but not quite is a good idea, as the singleton/outlier is more likely to be error than signal, and if kept will produce artificially long branch lengths. For this case, as @jean-baka said, we want the opposite of
--char MAJ --cutoff 0.9,

1

Do we need an --inverse / -v option to invert the filtering?

2

I like the idea of pseudo-character classes like MAJ etc. It's elegant!
Maybe MAJ1 MAJ2 .. MAJn for most to least frequent?
I would like to be able to say "remove sites where MAJ2 (second most frequent allele) occurs < 3 times" (see 3 too)

3

And then character class MIN ?
And MIN1 MIN2 ... MINn for least to most frequent?
I think there are use cases for knowing the lowest freq allele too.
(MAJn is the same thing but we don't know n)
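(A minimal Python sketch of these ranked classes; `ranked_alleles` and `maj2_below` are illustrative names, not proposed goalign syntax:)

```python
from collections import Counter

def ranked_alleles(column):
    """(character, count) pairs for one column, most frequent first:
    index 0 would be MAJ (== MAJ1), index 1 MAJ2, ..., and the last
    entry MIN. Ties are broken arbitrarily here."""
    return Counter(column).most_common()

def maj2_below(column, min_count=3):
    """True if the second most frequent allele occurs < min_count
    times ('remove sites where MAJ2 occurs < 3 times')."""
    ranked = ranked_alleles(column)
    return len(ranked) > 1 and ranked[1][1] < min_count

maj2_below("AAATTG")  # True: MAJ2 is T with 2 occurrences, < 3
```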

4

Support fractions and absolute values.
Sometimes I want to discard a site if it has ONE (1) outlier.
I don't want to have to calculate the number of rows first and then convert it to a proportion.
i.e. if --cutoff is 0 to 1.0 (yes, 1.0) then treat it as a proportion;
if --cutoff is 1 (not 1.0) to INT_MAX then treat it as an absolute number.
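(The proposed cutoff interpretation could be sketched like this, using Python's int/float distinction to mirror "1 vs 1.0"; `cutoff_count` is a made-up name:)

```python
def cutoff_count(cutoff, n_rows):
    """Turn --cutoff into an absolute number of sequences:
    a float in [0, 1.0] is a proportion of the alignment depth,
    an integer >= 1 is already an absolute count."""
    if isinstance(cutoff, int) and cutoff >= 1:
        return cutoff            # absolute number of sequences
    return cutoff * n_rows       # proportion of n_rows

cutoff_count(1, 1000)    # 1: exactly one outlier sequence
cutoff_count(1.0, 1000)  # 1000.0: 100% of the sequences
```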

5

Do we need a case-insensitive option, -i?

@fredericlemoine
Collaborator

fredericlemoine commented May 3, 2020

Thanks @jean-baka and @tseemann for your interesting comments.

I indeed thought about alignment trimmers; I should have mentioned them in my first comment. And maybe, just for removing sites based on the number of gaps, proper alignment trimmers would be a better option.

In the case of COVID, there are cases where we want to remove mutations that are unique in their column, because they are more likely to be sequencing errors than real mutations (especially if there are several such columns). So I was not that surprised by that need. But indeed, it could require more sophisticated methods.

For your suggestions @tseemann:

  1. An inverse option would be interesting indeed. But I am not sure what you mean by the opposite of --char MAJ --cutoff 0.9. This option means that you remove sites where the most abundant character constitutes 90% or more of the site. Would the opposite be that you keep only such sites?
  2. Very interesting ideas about classes of characters. I will think about it.
  3. This is something I was planning to do, exactly for the case you mention of 1 occurrence.
  4. The case insensitive option would be interesting, I started thinking about it while writing the documentation.

@tseemann

tseemann commented May 3, 2020

Other character class examples would be:

  • DNA [AGTC]
  • DNAN [AGTCN]
  • IUPAC
  • any valid IUPAC code, eg. R, ie. count A + G

Maybe negative classes?

  • not DNA = --char ^DNA or similar

I've used trimal etc before but their focus seems to be on removing sequences, not columns. And trimal is very SLOW for the column operations.

You referred to entropy; I notice EASEL has operations that use it. It's also potentially useful, but for fine-grained control (and ease of reviewer understanding), a clear cut-off (3 or less, or 10%) is easier to follow.

@roblanf

roblanf commented May 3, 2020

Just chiming in for the COVID case here. I think an option to mask a singleton site (rather than mask/remove a column) might be useful.

The reason is as our COVID alignments grow, there's a monotonic increase in the proportion of sites with singletons. So just removing the entire site seems like we'll be removing more and more data as time goes on.

@fredericlemoine
Collaborator

fredericlemoine commented May 7, 2020

> Just chiming in for the COVID case here. I think an option to mask a singleton site (rather than mask/remove a column) might be useful.
>
> The reason is as our COVID alignments grow, there's a monotonic increase in the proportion of sites with singletons. So just removing the entire site seems like we'll be removing more and more data as time goes on.

I added the option --unique to goalign mask. It masks characters that are unique in their column (except gaps). If you try it (linux executable here), do not hesitate to tell me if it does not correspond to what you were suggesting.
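(A minimal Python sketch of that masking behavior for one column; illustrative only, not goalign's implementation, and the name `mask_unique` is made up:)

```python
from collections import Counter

def mask_unique(column, mask="N"):
    """Mask characters that occur exactly once in their column,
    leaving gaps ('-') untouched."""
    counts = Counter(column)
    return "".join(mask if c != "-" and counts[c] == 1 else c
                   for c in column)

mask_unique("AATA")  # "AANA": the singleton T is masked
mask_unique("AA-A")  # "AA-A": the gap is left as-is
```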

@tseemann

@fredericlemoine I can only install tagged releases. I'll wait until you formalise it.

@tseemann

@roblanf I would argue singleton columns will become less, or equally, likely as we sample more widely? The virus can only change so much, and we are sequencing all PCR+ve samples we get?

@roblanf

roblanf commented May 11, 2020

Heh, good point. I guess it will be something like logistic growth in that case.

In a recent alignment of ~11K sequences, 24.5K of the 29.5K sites (~80%) were constant. So there's still a pretty big target of constant sites. If we really wanted to, we could downsample by date and plot the curve...

@tseemann

tseemann commented Apr 15, 2023

@fredericlemoine Here I am 3 years later, working on bacteria again, and encountering similar problems that I want to solve with goalign clean sites, but I'm not sure how to.

A common step in bacterial phylo is to only use the "core" sites.
That is, sites (columns) which ONLY have A,C,G,T.
This is overly strict, so I prefer to do a "fuzzy core" which allows some % of non-A,C,G,T letters.

For example, a 95% fuzzy core would only keep sites with >= 95% A,C,G,T, and allow for 5% to be GAP, N, X, or IUPAC say.
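(The fuzzy-core criterion could be sketched like this in Python; `is_fuzzy_core` is an illustrative name, not goalign syntax:)

```python
def is_fuzzy_core(column, threshold=0.95, allowed="ACGT"):
    """Keep a site only if at least `threshold` of its characters
    are plain A/C/G/T (the '95% fuzzy core' criterion)."""
    ok = sum(c in allowed for c in column.upper())
    return ok / len(column) >= threshold

is_fuzzy_core("ACGTACGTACGTACGTACGN")  # True: 19/20 = 95% ACGT
is_fuzzy_core("ACGTNN--")              # False: only 50% ACGT
```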

I think what I need is for goalign clean sites --inverse --char ACGT --cutoff 0.05

But there is no --inverse and no way for --char to take a SET or CLASS of characters.

Is there a better way to do this?

@fredericlemoine
Collaborator

@tseemann , you may try the code after the last commit. It should work with --char ACGT (or any set of characters, e.g. --char AG-) and with --reverse.

@fredericlemoine
Collaborator

@tseemann I just pushed a new docker image with the changes, in case you don't want to build goalign:

docker pull evolbioinfo/goalign:dev


5 participants