matches, mismatches and gaps from Bio.Align.PairwiseAlignment #3538

Closed
peterjc opened this issue Apr 29, 2021 · 16 comments · Fixed by #4221

Comments

@peterjc
Member

peterjc commented Apr 29, 2021

Example use case creating a PairwiseAlignment instance as per the tutorial:

>>> from Bio import Align
>>> aligner = Align.PairwiseAligner()
>>> seq1 = "GAACT"
>>> seq2 = "GAT"
>>> alignments = aligner.align(seq1, seq2)
>>> print(alignments.score)
3.0
>>> alignment = alignments[0]
>>> print(alignment)
GAACT
||--|
GA--T

I would like to be able to get easy access to the number of matches, mismatches, and gaps:

>>> alignment.matches
3
>>> alignment.mismatches
0
>>> alignment.gaps
2

Workaround to get these numbers from the default string representation:

>>> middle_line = str(alignment).splitlines()[1]
>>> [middle_line.count(_) for _ in "|.-"]
[3, 0, 2]

This can be done more efficiently directly from alignment.path but that is non-trivial.
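
A minimal sketch of the path-based idea, assuming the path is a sequence of (target_index, query_index) coordinate pairs like the one stored in alignment.path for a forward-strand alignment. The helper name counts_from_path is hypothetical and not part of Biopython:

```python
def counts_from_path(target, query, path):
    """Count identities, mismatches and gap positions by walking a path.

    `path` is assumed to be a sequence of (target_index, query_index)
    pairs marking where aligned blocks and gaps start and end.
    """
    identities = mismatches = gaps = 0
    for (t0, q0), (t1, q1) in zip(path, path[1:]):
        t_len, q_len = t1 - t0, q1 - q0
        if t_len == q_len:
            # Aligned block: compare the two slices letter by letter.
            for a, b in zip(target[t0:t1], query[q0:q1]):
                if a == b:
                    identities += 1
                else:
                    mismatches += 1
        else:
            # Only one sequence advances in this step, so it is a gap.
            gaps += t_len + q_len
    return identities, mismatches, gaps


# Path for the GAACT / GA--T alignment shown above.
path = ((0, 0), (2, 2), (4, 2), (5, 3))
print(counts_from_path("GAACT", "GAT", path))  # (3, 0, 2)
```

Gap steps are counted from the coordinates alone; only the aligned blocks need a letter-by-letter comparison.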

@mdehoon
Contributor

mdehoon commented Apr 29, 2021

For a quick solution, you could format the alignment as PSL:

>>> print(format(alignment, "psl"))
3	0	0	0	0	0	1	2	+	query	target	5	0	5	2	2,1,	0,2,	0,4,

See http://genome.ucsc.edu/FAQ/FAQformat#format2 for a description of the PSL format.
The first column is the number of matches, the second column is the number of mismatches, and column 6 + column 8 (bases inserted in the query plus bases inserted in the target) is the number of gap positions.
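
As a rough sketch, the three numbers can be pulled out of the PSL line by splitting on tabs. The field names follow the UCSC PSL description; the helper name psl_counts is hypothetical:

```python
# The first eight PSL fields, named as in the UCSC PSL description.
PSL_COUNT_FIELDS = (
    "matches", "misMatches", "repMatches", "nCount",
    "qNumInsert", "qBaseInsert", "tNumInsert", "tBaseInsert",
)


def psl_counts(psl_line):
    """Return (matches, mismatches, gaps) from one tab-separated PSL line."""
    values = psl_line.rstrip("\n").split("\t")[:8]
    fields = dict(zip(PSL_COUNT_FIELDS, map(int, values)))
    # Gap positions are the inserted bases on either side (columns 6 and 8).
    gaps = fields["qBaseInsert"] + fields["tBaseInsert"]
    return fields["matches"], fields["misMatches"], gaps


# The PSL line printed above, with its tab separators written out.
line = "3\t0\t0\t0\t0\t0\t1\t2\t+\tquery\ttarget\t5\t0\t5\t2\t2,1,\t0,2,\t0,4,"
print(psl_counts(line))  # (3, 0, 2)
```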

@mdehoon
Contributor

mdehoon commented Apr 29, 2021

There is one complication though: How should we count alignments to N's? As such an alignment is neither a match nor a mismatch, the PSL format counts them separately. But in general, N does not need to be the wildcard character; it could be X or ? or something else. Or there could be no wildcard character.

@mdehoon
Contributor

mdehoon commented Apr 29, 2021

And one more complication: Is a lower-case a aligned to an upper-case A a match or a mismatch?

@peterjc
Member Author

peterjc commented Apr 29, 2021

As to a versus A, my impression is they are already treated as not matching (unless perhaps something different happens when using a scoring matrix?)

I would personally assume wildcards count as a match - again does the existing scoring behaviour not set a precedent to follow? I may be using this wrong (need to tweak scoring?):

>>> from Bio import Align
>>> aligner = Align.PairwiseAligner(wildcard="N")
>>> seq1 = "GAACT"
>>> seq2 = "GNT"
>>> alignments = aligner.align(seq1, seq2)
>>> print(alignments.score)
2.0
>>> alignment = alignments[0]
>>> print(alignment)
GAAC-T
|----|
G---NT

I would expect N to match the A or C here.

Update - looking at later alignments, it does give alignments like the following, but the scoring and display suggest this is treated as a mismatch:

GAACT
|.--|
GN--T

I'm probably misusing the wildcard.

@mdehoon
Contributor

mdehoon commented Apr 29, 2021

You are using the wildcard correctly. Since alignments to the wildcard are neither a match nor a mismatch, they get a zero score.

The mismatch symbol . is shown because the alignment object does not store which character is the wildcard, and therefore there is no way to tell that the A-N alignment is an alignment to the wildcard.

@mdehoon
Contributor

mdehoon commented Apr 29, 2021

> As to a versus A, my impression is they are already treated as not matching (unless perhaps something different happens when using a scoring matrix?)

By default they are not matching. But the scoring matrix could in principle define the same score for A-X and a-X alignments, and then they are effectively matching.

@peterjc
Member Author

peterjc commented May 3, 2021

Ah, showing the wildcard matching in the pretty display would be nice but not important.

Would the suggested three properties be well defined additions to the object?

@MarkusPiotrowski
Contributor

My impression is that such properties were requested several times in the past, at least for pairwise2, and thus they would be useful additions.

However, one should keep in mind that 'match' and 'mismatch' are only well defined when looking for identity. It's more complicated when you have a scoring matrix with different degrees of similarity.

How should such cases be handled? Return these properties only when the alignment has used an identity matrix? Or raise a warning?

@mdehoon
Contributor

mdehoon commented May 3, 2021

> Would the suggested three properties be well defined additions to the object?

I would prefer a method instead of a property, so that the user can specify the wildcard character and upper/lower case handling as arguments to this method. Similar to the _format_psl method:

>>> from Bio.Align import PairwiseAligner
>>> aligner = PairwiseAligner()
>>> aligner.gap_score = -5
>>> alignments = aligner.align("AAAXGAAA", "AAAATAAA")
>>> print(alignments[0])
AAAXGAAA
|||..|||
AAAATAAA

>>> print(alignments[0].format("psl"))
6       2       0       0       0       0       0       0       +       query   8       0       8       target  8       0       8       1       8,      0,      0,

# 6 matches, 2 mismatches
>>> print(alignments[0].format("psl", wildcard='X'))
6       1       0       1       0       0       0       0       +       query   8       0       8       target  8       0       8       1       8,      0,      0,

# 6 matches, 1 mismatch, 1 match against the wildcard

> However, one should keep in mind that 'match' and 'mismatch' are only well defined when looking for identity. It's more complicated when you have a scoring matrix with different degrees of similarity.
>
> How should such cases be handled? Return these properties only when the alignment has used an identity matrix? Or raise a warning?

Once we have an alignment, it does not matter how it was generated. We can still report the number of matches and mismatches even if the alignment was generated using a scoring matrix.

@MarkusPiotrowski
Contributor

> Once we have an alignment, it does not matter how it was generated. We can still report the number of matches and mismatches even if the alignment was generated using a scoring matrix.

Yes, this is totally clear to me. I was just commenting on the "well defined" in @peterjc's question. I have always felt a bit uneasy that we don't use a 'similar' character in the match line, as is done e.g. in the BLAST output. However, the difficulties of implementing such a character (a. where does 'similarity' start? b. in pairwise2 this is not possible, because the returned result has no knowledge of the matrix used) made me hold back from such a change.
Still, in a pretty-print output the user can judge the mismatches for themselves, whereas reporting mismatches only as a number may give a wrong impression of the similarity of two sequences. That's the point I wanted to raise.

@peterjc
Member Author

peterjc commented May 3, 2021

Point taken. What if the new properties were called .identities, .mismatches and .gaps? That would be clearer to me.

@mdehoon If instead of properties this was method based, would something like .counts(...) returning these three integers be what you had in mind? The return value(s) could include variations like similar and non-similar mismatches depending on the arguments.

@mdehoon
Contributor

mdehoon commented May 3, 2021

> @mdehoon If instead of properties this was method based, would something like .counts(...) returning these three integers be what you had in mind? The return value(s) could include variations like similar and non-similar mismatches depending on the arguments.

Yes. The PSL format actually has one more integer: The number of matches against lower-case nucleotides (usually representing repeat regions). I guess the number of integers to be returned by counts(...) depends on the arguments (e.g. there is no need to return the number of matches against the wildcard if no wildcard character is defined). Perhaps we can return them as a namedtuple to be explicit about which number is which.
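
To make the namedtuple idea concrete, here is a minimal sketch that works on the two gapped strings. The function name, the field names, the wildcard and ignore_case arguments, and the fixed set of fields (rather than one that varies with the arguments) are all assumptions for illustration, not an existing Biopython API:

```python
from collections import namedtuple

AlignmentCounts = namedtuple(
    "AlignmentCounts", ["identities", "mismatches", "wildcards", "gaps"]
)


def counts(target_line, query_line, wildcard=None, ignore_case=False):
    """Count column types in two gapped strings of equal length.

    Hypothetical helper; "-" is assumed to be the gap character.
    """
    if len(target_line) != len(query_line):
        raise ValueError("aligned strings must have the same length")
    if ignore_case:
        target_line, query_line = target_line.upper(), query_line.upper()
        if wildcard is not None:
            wildcard = wildcard.upper()
    identities = mismatches = wildcards = gaps = 0
    for a, b in zip(target_line, query_line):
        if a == "-" or b == "-":
            gaps += 1
        elif wildcard is not None and wildcard in (a, b):
            # Neither a match nor a mismatch, so count it separately.
            wildcards += 1
        elif a == b:
            identities += 1
        else:
            mismatches += 1
    return AlignmentCounts(identities, mismatches, wildcards, gaps)


print(counts("AAAXGAAA", "AAAATAAA"))
print(counts("AAAXGAAA", "AAAATAAA", wildcard="X"))
```

With the AAAXGAAA example from earlier in the thread, the first call gives 6 identities and 2 mismatches, and the second gives 6 identities, 1 mismatch and 1 wildcard column, in line with the two PSL outputs.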

@mdehoon
Contributor

mdehoon commented May 3, 2021

> Yes, this is totally clear to me. I was just commenting on the "well defined" in @peterjc's question. I have always felt a bit uneasy that we don't use a 'similar' character in the match line, as is done e.g. in the BLAST output. However, the difficulties of implementing such a character (a. where does 'similarity' start? b. in pairwise2 this is not possible, because the returned result has no knowledge of the matrix used) made me hold back from such a change.
> Still, in a pretty-print output the user can judge the mismatches for themselves, whereas reporting mismatches only as a number may give a wrong impression of the similarity of two sequences. That's the point I wanted to raise.

If we introduce a counts method, we could offer the possibility to pass the substitution matrix as one of the arguments, together with a threshold score value that distinguishes matches from mismatches.

Or, if we want to be clever, we can pass a Boolean matrix:

>>> from Bio.Align import substitution_matrices
>>> m = substitution_matrices.load("blosum62")
>>> print(m)
#  Matrix made by matblas from blosum62.iij
#  * column uses minimum score
#  BLOSUM Clustered Scoring Matrix in 1/2 Bit Units
#  Blocks Database = /data/blocks_5.0/blocks.dat
#  Cluster Percentage: >= 62
#  Entropy =   0.6979, Expected =  -0.5209
     A    R    N    D    C    Q    E    G    H    I ...
A  4.0 -1.0 -2.0 -2.0  0.0 -1.0 -1.0  0.0 -2.0 -1.0 ...
R -1.0  5.0  0.0 -2.0 -3.0  1.0  0.0 -2.0  0.0 -3.0 ...
N -2.0  0.0  6.0  1.0 -3.0  0.0  0.0  0.0  1.0 -3.0 ...
D -2.0 -2.0  1.0  6.0 -3.0  0.0  2.0 -1.0 -1.0 -3.0 ...
C  0.0 -3.0 -3.0 -3.0  9.0 -3.0 -4.0 -3.0 -3.0 -1.0 ...
Q -1.0  1.0  0.0  0.0 -3.0  5.0  2.0 -2.0  0.0 -3.0 ...
...
>>> threshold = 0.5  # greater than 0.5 is match, smaller than 0.5 is mismatch
>>> b = (m > threshold)
>>> print(b)  # our Boolean matrix:
    A   R   N   D   C   Q   E   G   H   I ...
A 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
R 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ...
N 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 ...
D 0.0 0.0 1.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 ...
C 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ...
Q 0.0 1.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 ...
...
>>> b['A', 'A']
True
>>> b['A', 'R']
False
>>> b['N', 'N']
True
>>> b['N', 'D']
True
>>> b['N', 'C']
False

Then we could do

>>> match_count, mismatch_count = alignment.counts(matching=(m > 0.5))
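
A small stand-alone sketch of that idea, using a plain dict in place of the substitution matrix so it runs without Biopython. The helper name counts_with_matching and the example strings are made up for illustration; the scores are the BLOSUM62 values printed above:

```python
# A few BLOSUM62 scores taken from the matrix printed above.
scores = {
    ("A", "A"): 4.0, ("A", "R"): -1.0,
    ("N", "N"): 6.0, ("N", "D"): 1.0, ("N", "C"): -3.0,
}


def counts_with_matching(target_line, query_line, matching):
    """Return (match_count, mismatch_count), skipping gap columns.

    `matching` maps a (target_letter, query_letter) pair to True or False.
    """
    match_count = mismatch_count = 0
    for a, b in zip(target_line, query_line):
        if a == "-" or b == "-":
            continue
        if matching[a, b]:
            match_count += 1
        else:
            mismatch_count += 1
    return match_count, mismatch_count


threshold = 0.5
# Boolean lookup: True where the score exceeds the threshold.
matching = {pair: score > threshold for pair, score in scores.items()}
print(counts_with_matching("ANN-A", "ADCAR", matching))  # (2, 2)
```

Here A-A and N-D count as matches, N-C and A-R as mismatches, and the gap column is ignored.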

@MarkusPiotrowski
Contributor

Would be nice if we could also have this for the match line in the pretty print output, wouldn't it?
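
One way this could look, as a rough sketch only: the ":" symbol for similar pairs, the helper name match_line and the set-based similar argument are assumptions, not anything Biopython provides.

```python
def match_line(target_line, query_line, similar):
    """Build a middle line: "|" identical, ":" similar, "." mismatch, "-" gap.

    `similar` is a set of (letter, letter) pairs regarded as similar;
    the ":" symbol is an assumption made for this sketch.
    """
    symbols = []
    for a, b in zip(target_line, query_line):
        if a == "-" or b == "-":
            symbols.append("-")
        elif a == b:
            symbols.append("|")
        elif (a, b) in similar or (b, a) in similar:
            symbols.append(":")
        else:
            symbols.append(".")
    return "".join(symbols)


similar = {("N", "D")}  # N-D scores 1.0 in BLOSUM62, above a 0.5 threshold
print(match_line("ANN-A", "ADCAR", similar))  # |:.-.
```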

@mufernando

Wow, I just came here looking for such a method.

I like the idea of counting gaps independently in the query vs target (sensu PSL). Also I think a vs. A should be marked as mismatch, but it wouldn't hurt to give the user a flag that changes the behavior.

In terms of implementation, is it possible to do something more clever than just zipping through both aligned sequences? Happy to help.

@peterjc
Member Author

peterjc commented Jan 17, 2023

Note my original workaround no longer works due to the change in the alignment string representation in Biopython 1.80 (#4183). Was:

>>> from Bio import Align
>>> aligner = Align.PairwiseAligner()
>>> seq1 = "GAACT"
>>> seq2 = "GAT"
>>> alignments = aligner.align(seq1, seq2)
>>> print(alignments.score)
3.0
>>> alignment = alignments[0]
>>> print(alignment)
GAACT
||--|
GA--T

Now:

>>> print(alignment)
target            0 GAACT 5
                  0 ||--| 5
query             0 GA--T 3

We can easily get the two aligned sequence strings using alignment[0] and alignment[1], so these are possible workarounds:

>>> gaps = sum(1 for a, b in zip(alignment[0], alignment[1]) if a == "-" or b == "-")
>>> identities = sum(1 for a, b in zip(alignment[0], alignment[1]) if a == b and a != "-")
>>> mismatches = sum(1 for a, b in zip(alignment[0], alignment[1]) if a != b and a != "-" and b != "-")

Given the changes in Biopython 1.80, adding a .counts(...) method seems more compelling.

Tracing the .path information, as done in the Biopython 1.79 ._format_pretty(...) private method, might be more efficient than zipping over the two aligned strings.

I've also not considered how the optional arguments suggestion earlier might be implemented.
