matches, mismatches and gaps from Bio.Align.PairwiseAlignment #3538

peterjc · 2021-04-29T12:43:39Z

Example use case creating a PairwiseAlignment instance as per the tutorial:

>>> from Bio import Align
>>> aligner = Align.PairwiseAligner()
>>> seq1 = "GAACT"
>>> seq2 = "GAT"
>>> alignments = aligner.align(seq1, seq2)
>>> print(alignments.score)
3.0
>>> alignment = alignments[0]
>>> print(alignment)
GAACT
||--|
GA--T

I would like to be able to get easy access to the number of matches, mismatches, and gaps:

>>> alignment.matches
3
>>> alignment.mismatches
0
>>> alignment.gaps
2

Workaround to get these numbers from the default string representation:

>>> middle_line = str(alignment).splitlines()[1]
>>> [middle_line.count(_) for _ in "|.-"]
[3, 0, 2]

This can be done more efficiently directly from alignment.path but that is non-trivial.
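
As a rough illustration of the path-based approach, here is a minimal sketch. It assumes the path is a sequence of (target_index, query_index) pairs marking block boundaries, which is how I understand alignment.path in Biopython 1.79. The helper name counts_from_path and the hand-written path below are illustrative only, not part of Biopython:

```python
def counts_from_path(target, query, path):
    """Count identities, mismatches and gap positions by walking a path.

    `path` is assumed to be a sequence of (target_index, query_index)
    pairs; consecutive pairs delimit either an aligned block (both
    indices advance) or a gap (only one index advances).
    """
    identities = mismatches = gaps = 0
    for (t0, q0), (t1, q1) in zip(path, path[1:]):
        dt, dq = t1 - t0, q1 - q0
        if dt == 0 or dq == 0:
            # Only one sequence advances: every skipped letter is a gap column.
            gaps += dt + dq
        else:
            # Aligned block: compare the letters pairwise.
            for a, b in zip(target[t0:t1], query[q0:q1]):
                if a == b:
                    identities += 1
                else:
                    mismatches += 1
    return identities, mismatches, gaps


# Hand-written path for the GAACT / GA--T alignment shown above (an assumption).
example_path = ((0, 0), (2, 2), (4, 2), (5, 3))
print(counts_from_path("GAACT", "GAT", example_path))  # (3, 0, 2)
```

Gap segments are handled in constant time here; only aligned blocks need a letter-by-letter comparison.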

mdehoon · 2021-04-29T13:25:34Z

For a quick solution, you could format the alignment as PSL:

>>> print(format(alignment, "psl"))
3	0	0	0	0	0	1	2	+	query	target	5	0	5	2	2,1,	0,2,	0,4,

See http://genome.ucsc.edu/FAQ/FAQformat#format2 for a description of the PSL format.
The first column is the number of matches, the second column is the number of mismatches, and column 6 plus column 8 (the number of bases inserted in the query and in the target, respectively) is the number of gap positions.
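
A minimal sketch of pulling these three numbers out of a PSL line. The helper name psl_counts is hypothetical; it only reads the first eight whitespace-separated columns and relies on the column meanings described above:

```python
def psl_counts(psl_line):
    """Return (matches, mismatches, gaps) from the leading columns of a PSL line."""
    fields = psl_line.split()
    matches = int(fields[0])        # column 1: matches
    mismatches = int(fields[1])     # column 2: mismatches
    q_base_insert = int(fields[5])  # column 6: bases inserted in the query
    t_base_insert = int(fields[7])  # column 8: bases inserted in the target
    return matches, mismatches, q_base_insert + t_base_insert


# Leading columns of the PSL line printed above.
line = "3\t0\t0\t0\t0\t0\t1\t2\t+"
print(psl_counts(line))  # (3, 0, 2)
```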

mdehoon · 2021-04-29T13:31:43Z

There is one complication though: How should we count alignments to N's? As an alignment to N is neither a match nor a mismatch, the PSL format counts these separately. But in general, N does not need to be the wildcard character; it could be X or ? or something else. Or there could be no wildcard character.

mdehoon · 2021-04-29T13:34:31Z

And one more complication: Is a aligned to A a match or a mismatch?

peterjc · 2021-04-29T13:53:42Z

As to a versus A, my impression is they are already treated as not matching (unless perhaps something different happens when using a scoring matrix?)

I would personally assume wildcards count as a match - again does the existing scoring behaviour not set a precedent to follow? I may be using this wrong (need to tweak scoring?):

>>> from Bio import Align
>>> aligner = Align.PairwiseAligner(wildcard="N")
>>> seq1 = "GAACT"
>>> seq2 = "GNT"
>>> alignments = aligner.align(seq1, seq2)
>>> print(alignments.score)
2.0
>>> alignment = alignments[0]
>>> print(alignment)
GAAC-T
|----|
G---NT

I would expect N to match the A or C here.

Update - looking at later alignments, it does give alignments like the following, but the scoring and display suggest this is treated as a mismatch:

GAACT
|.--|
GN--T

I'm probably misusing the wildcard.

mdehoon · 2021-04-29T14:56:11Z

You are using the wildcard correctly. Since they are not a match and not a mismatch, they get a zero score.

The mismatch symbol . is shown because the alignment object does not store which character is the wildcard, and therefore there is no way to tell that the A-N alignment is an alignment to the wildcard.

mdehoon · 2021-04-29T14:59:22Z

As to a versus A, my impression is they are already treated as not matching (unless perhaps something different happens when using a scoring matrix?)

By default they are not matching. But the scoring matrix could in principle define the same score for A-X and a-X alignments, and then they are effectively matching.

peterjc · 2021-05-03T08:27:55Z

Ah, showing the wildcard matching in the pretty display would be nice but not important.

Would the suggested three properties be well defined additions to the object?

MarkusPiotrowski · 2021-05-03T09:08:58Z

My impression is that such properties were requested several times in the past, at least for pairwise2, and thus they would be useful additions.

However, one should keep in mind that 'match' and 'mismatch' are only well defined when looking for identity. It's more complicated when you have a scoring matrix with different degrees of similarity.

How should such cases be handled? Return these properties only when the alignment has used an identity matrix? Or raise a warning?

mdehoon · 2021-05-03T09:40:56Z

Would the suggested three properties be well defined additions to the object?

I would prefer a method instead of a property, so that the user can specify the wildcard character and upper/lower case handling as arguments to this method. Similar to the _format_psl method:

>>> from Bio.Align import PairwiseAligner
>>> aligner = PairwiseAligner()
>>> aligner.gap_score = -5
>>> alignments = aligner.align("AAAXGAAA", "AAAATAAA")
>>> print(alignments[0])
AAAXGAAA
|||..|||
AAAATAAA

>>> print(alignments[0].format("psl"))
6       2       0       0       0       0       0       0       +       query   8       0       8       target  8       0       8       1       8,      0,      0,

# 6 matches, 2 mismatches
>>> print(alignments[0].format("psl", wildcard='X'))
6       1       0       1       0       0       0       0       +       query   8       0       8       target  8       0       8       1       8,      0,      0,

# 6 matches, 1 mismatch, 1 match against the wildcard
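
To make the idea of passing the wildcard and case handling as arguments concrete, here is a minimal sketch working on two aligned rows given as plain strings. The function name count_aligned, its parameters and the returned keys are all hypothetical, not an existing Biopython API:

```python
def count_aligned(row_a, row_b, wildcard=None, ignore_case=False, gap="-"):
    """Classify each column of two equal-length aligned rows.

    Columns containing the wildcard are counted separately (neither match
    nor mismatch); with ignore_case=True, 'a' aligned to 'A' is an identity.
    """
    counts = {"identities": 0, "mismatches": 0, "wildcards": 0, "gaps": 0}
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            counts["gaps"] += 1
        elif wildcard is not None and wildcard in (a, b):
            counts["wildcards"] += 1
        elif a == b or (ignore_case and a.lower() == b.lower()):
            counts["identities"] += 1
        else:
            counts["mismatches"] += 1
    return counts


print(count_aligned("AAAXGAAA", "AAAATAAA"))
# {'identities': 6, 'mismatches': 2, 'wildcards': 0, 'gaps': 0}
print(count_aligned("AAAXGAAA", "AAAATAAA", wildcard="X"))
# {'identities': 6, 'mismatches': 1, 'wildcards': 1, 'gaps': 0}
```

Note that this sketch checks for the wildcard before checking identity, so a wildcard aligned to itself is also counted as a wildcard column; that is a design choice, not something settled in this thread.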

However, one should keep in mind that 'match' and 'mismatch' are only well defined when looking for identity. It's more complicated when you have a scoring matrix with different degrees of similarity.

How should such cases be handled? Return these properties only when the alignment has used an identity matrix? Or raise a warning?

Once we have an alignment, it does not matter how it was generated. We can still report the number of matches and mismatches even if the alignment was generated using a scoring matrix.

MarkusPiotrowski · 2021-05-03T10:03:33Z

Once we have an alignment, it does not matter how it was generated. We can still report the number of matches and mismatches even if the alignment was generated using a scoring matrix.

Yes, this is totally clear to me. I was just commenting on the "well defined" in @peterjc's question. I always felt a bit uneasy that we don't use a 'similar character' in the match line, as is done e.g. in the BLAST output. However, the difficulties of implementing such a character (a. when does 'similarity' start? b. in pairwise2 this is not possible, because the returned result has no knowledge of the matrix used) made me step back from such a change.
Still, in a pretty print output the user can judge the mismatches for themselves, whereas reporting the mismatches only as a number may give a wrong impression of the similarity of two sequences. That's the point I wanted to raise.

peterjc · 2021-05-03T12:44:53Z

Point taken. What if the new properties were called .identities, .mismatches and .gaps? That would be clearer to me.

@mdehoon If instead of properties this was method based, would something like .counts(...) returning these three integers be what you had in mind? The return value(s) could include variations like similar and non-similar mismatches depending on the arguments.

mdehoon · 2021-05-03T14:29:25Z

@mdehoon If instead of properties this was method based, would something like .counts(...) returning these three integers be what you had in mind? The return value(s) could include variations like similar and non-similar mismatches depending on the arguments.

Yes. The PSL format actually has one more integer: The number of matches against lower-case nucleotides (usually representing repeat regions). I guess the number of integers to be returned by counts(...) depends on the arguments (e.g. there is no need to return the number of matches against the wildcard if no wildcard character is defined). Perhaps we can return them as a namedtuple to be explicit about which number is which.
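
A minimal sketch of the namedtuple idea, working on two aligned rows given as plain strings. The names AlignmentCounts and counts, and the choice of fields, are illustrative assumptions rather than an agreed API:

```python
from collections import namedtuple

# Field names are hypothetical; more fields (e.g. wildcard matches) could be
# added depending on which arguments are given.
AlignmentCounts = namedtuple("AlignmentCounts", ["identities", "mismatches", "gaps"])


def counts(row_a, row_b, gap="-"):
    """Return named counts for two equal-length aligned rows."""
    identities = mismatches = gaps = 0
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            gaps += 1
        elif a == b:
            identities += 1
        else:
            mismatches += 1
    return AlignmentCounts(identities, mismatches, gaps)


result = counts("GAACT", "GA--T")
print(result)       # AlignmentCounts(identities=3, mismatches=0, gaps=2)
print(result.gaps)  # 2
```

Because a namedtuple is still a tuple, callers can unpack it positionally or access the fields by name.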

mdehoon · 2021-05-03T16:08:01Z

Yes, this is totally clear to me. I was just commenting on the "well defined" in @peterjc's question. I always felt a bit uneasy that we don't use a 'similar character' in the match line, as is done e.g. in the BLAST output. However, the difficulties of implementing such a character (a. when does 'similarity' start? b. in pairwise2 this is not possible, because the returned result has no knowledge of the matrix used) made me step back from such a change.
Still, in a pretty print output the user can judge the mismatches for themselves, whereas reporting the mismatches only as a number may give a wrong impression of the similarity of two sequences. That's the point I wanted to raise.

If we introduce a counts method, we could offer the possibility to pass the substitution matrix as one of the arguments, together with a threshold score value that distinguishes matches from mismatches.

Or, if we want to be clever, we can pass a Boolean matrix:

>>> from Bio.Align import substitution_matrices
>>> m = substitution_matrices.load("blosum62")
>>> print(m)
#  Matrix made by matblas from blosum62.iij
#  * column uses minimum score
#  BLOSUM Clustered Scoring Matrix in 1/2 Bit Units
#  Blocks Database = /data/blocks_5.0/blocks.dat
#  Cluster Percentage: >= 62
#  Entropy =   0.6979, Expected =  -0.5209
     A    R    N    D    C    Q    E    G    H    I ...
A  4.0 -1.0 -2.0 -2.0  0.0 -1.0 -1.0  0.0 -2.0 -1.0 ...
R -1.0  5.0  0.0 -2.0 -3.0  1.0  0.0 -2.0  0.0 -3.0 ...
N -2.0  0.0  6.0  1.0 -3.0  0.0  0.0  0.0  1.0 -3.0 ...
D -2.0 -2.0  1.0  6.0 -3.0  0.0  2.0 -1.0 -1.0 -3.0 ...
C  0.0 -3.0 -3.0 -3.0  9.0 -3.0 -4.0 -3.0 -3.0 -1.0 ...
Q -1.0  1.0  0.0  0.0 -3.0  5.0  2.0 -2.0  0.0 -3.0 ...
...
>>> threshold = 0.5  # greater than 0.5 is match, smaller than 0.5 is mismatch
>>> b = (m > threshold)
>>> print(b)  # our Boolean matrix:
    A   R   N   D   C   Q   E   G   H   I ...
A 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
R 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ...
N 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 ...
D 0.0 0.0 1.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 ...
C 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ...
Q 0.0 1.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 ...
...
>>> b['A', 'A']
True
>>> b['A', 'R']
False
>>> b['N', 'N']
True
>>> b['N', 'D']
True
>>> b['N', 'C']
False

Then we could do

>>> match_count, mismatch_count = alignment.counts(matching = (m > 0.5))
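
A pure-Python sketch of how such a matching argument might be consumed, without depending on Biopython. The score table is a small excerpt of the BLOSUM62 values printed above, and the names is_match and count_similar are hypothetical:

```python
# Small excerpt of the BLOSUM62 scores shown above (symmetric pairs).
SCORES = {
    ("A", "A"): 4, ("A", "R"): -1,
    ("N", "N"): 6, ("N", "D"): 1, ("N", "C"): -3,
}


def is_match(a, b, threshold=0.5):
    """Treat a pair as matching if its score exceeds the threshold."""
    score = SCORES.get((a, b), SCORES.get((b, a)))
    if score is None:
        raise KeyError(f"no score for pair {a!r}, {b!r}")
    return score > threshold


def count_similar(row_a, row_b, matching, gap="-"):
    """Count matches and mismatches using a caller-supplied predicate."""
    match_count = mismatch_count = 0
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            continue  # gap columns are neither matches nor mismatches
        if matching(a, b):
            match_count += 1
        else:
            mismatch_count += 1
    return match_count, mismatch_count


# A-R scores -1 (mismatch), N-D scores 1 (match), N-C scores -3 (mismatch).
print(count_similar("ANN", "RDC", is_match))  # (1, 2)
```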

MarkusPiotrowski · 2021-05-04T07:07:01Z

Would be nice if we could also have this for the match line in the pretty print output, wouldn't it?

mufernando · 2021-05-14T16:57:16Z

Wow, just came here looking for such a method.

I like the idea of counting gaps independently in the query vs target (sensu PSL). Also I think a vs. A should be marked as mismatch, but it wouldn't hurt to give the user a flag that changes the behavior.

In terms of implementation, is it possible to do something more clever than just zipping through both alignments? Happy to help.
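
For the per-sequence gap counts, a minimal sketch that reports the number of gap runs and the total number of gap characters in one aligned row. The helper name gap_runs is hypothetical, and this simple version counts every run, including leading and trailing gaps, which PSL-style counting may treat differently:

```python
from itertools import groupby


def gap_runs(row, gap="-"):
    """Return (number of gap runs, total gap characters) for one aligned row."""
    num_runs = 0
    total = 0
    # groupby splits the row into maximal runs of gap / non-gap characters.
    for is_gap, run in groupby(row, key=lambda ch: ch == gap):
        if is_gap:
            num_runs += 1
            total += len(list(run))
    return num_runs, total


# Gap characters in the query row correspond to letters present only in
# the target, i.e. the "1 2" pair in the PSL line earlier in this thread.
print(gap_runs("GA--T"))  # (1, 2)
print(gap_runs("GAACT"))  # (0, 0)
```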

peterjc · 2023-01-17T13:48:02Z

Note my original workaround no longer works due to the change in the alignment string representation in Biopython 1.80 (#4183). Was:

>>> from Bio import Align
>>> aligner = Align.PairwiseAligner()
>>> seq1 = "GAACT"
>>> seq2 = "GAT"
>>> alignments = aligner.align(seq1, seq2)
>>> print(alignments.score)
3.0
>>> alignment = alignments[0]
>>> print(alignment)
GAACT
||--|
GA--T

Now:

>>> print(alignment)
target            0 GAACT 5
                  0 ||--| 5
query             0 GA--T 3

We can easily get the two aligned sequence strings using alignment[0] and alignment[1], so these are possible workarounds:

>>> gaps = sum(1 for a, b in zip(alignment[0], alignment[1]) if a == "-" or b == "-")
>>> identities = sum(1 for a, b in zip(alignment[0], alignment[1]) if a == b and a != "-")
>>> mismatches = sum(1 for a, b in zip(alignment[0], alignment[1]) if a != b and a != "-" and b != "-")

Given the changes in Biopython 1.80, adding a .counts(...) method seems more compelling.

Tracing the .path information, as done in the Biopython 1.79 ._format_pretty(...) private method, might be more efficient than zipping over the two aligned strings.

I've also not considered how the optional arguments suggestion earlier might be implemented.

peterjc added the Enhancement label Apr 29, 2021

mdehoon mentioned this issue Feb 15, 2022

Use of substitution_matrices is much slower than MatrixInfo in pairwise alignment #3862

Closed

peterjc mentioned this issue Jan 24, 2023

Simple counts() method for pairwise alignments #4221

Merged

peterjc closed this as completed in #4221 Jan 25, 2023

peterjc mentioned this issue Mar 1, 2023

Human Readable PairwiseAligner String change - option for the old way? #4250

Closed
