matches, mismatches and gaps from Bio.Align.PairwiseAlignment #3538
Comments
For a quick solution, you could format the alignment as PSL:
>>> print(format(alignment, "psl"))
3 0 0 0 0 0 1 2 + query target 5 0 5 2 2,1, 0,2, 0,4,
See http://genome.ucsc.edu/FAQ/FAQformat#format2 for a description of the PSL format. |
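The leading integers of a PSL line carry the counts being asked for here. A minimal sketch of pulling them out, assuming the standard PSL column order from the UCSC description linked above; the helper name `psl_counts` is hypothetical and not part of Biopython, and the input is the line quoted in this comment:

```python
# Hypothetical helper, not part of Biopython: read the first eight PSL columns.
PSL_COUNT_FIELDS = (
    "matches", "misMatches", "repMatches", "nCount",
    "qNumInsert", "qBaseInsert", "tNumInsert", "tBaseInsert",
)

def psl_counts(psl_line):
    """Return the eight leading integer columns of one PSL line as a dict."""
    values = psl_line.split()[:len(PSL_COUNT_FIELDS)]
    return dict(zip(PSL_COUNT_FIELDS, map(int, values)))

# PSL line copied from the comment above.
counts = psl_counts("3 0 0 0 0 0 1 2 + query target 5 0 5 2 2,1, 0,2, 0,4,")
print(counts["matches"], counts["misMatches"])
print(counts["tNumInsert"], counts["tBaseInsert"])
```

Only the first eight columns are read, so the sketch does not depend on the name, size, or block columns that follow.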
There is one complication though: How should we count alignments to |
And one more complication: Is |
As to I would personally assume wildcards count as a match - again, does the existing scoring behaviour not set a precedent to follow? I may be using this wrong (need to tweak scoring?):
>>> from Bio import Align
>>> aligner = Align.PairwiseAligner(wildcard="N")
>>> seq1 = "GAACT"
>>> seq2 = "GNT"
>>> alignments = aligner.align(seq1, seq2)
>>> print(alignments.score)
2.0
>>> alignment = alignments[0]
>>> print(alignment)
GAAC-T
|----|
G---NT
I would expect N to match the A or C here. Update - looking at later alignments, it does give alignments like the following, but the scoring and display suggest this is treated as a mismatch:
I'm probably mis-using the wildcard |
You are using the wildcard correctly. Since wildcards are neither a match nor a mismatch, they get a zero score. The mismatch symbol |
By default they are not matching. But the scoring matrix could in principle define the same score for |
Ah, showing the wildcard matching in the pretty display would be nice but not important. Would the suggested three properties be well defined additions to the object? |
My impression is that such properties were requested several times in the past, at least for However, one should keep in mind that 'match' and 'mismatch' are only well defined when looking for identity. It's more complicated when you have a scoring matrix with different degrees of similarity. How should such cases be handled? Return these properties only when the alignment has used an identity matrix? Or raise a warning? |
I would prefer a method instead of a property, so that the user can specify the wildcard character and upper/lower case handling as arguments to this method. Similar to the
>>> from Bio.Align import PairwiseAligner
>>> aligner = PairwiseAligner()
>>> aligner.gap_score = -5
>>> alignments = aligner.align("AAAXGAAA", "AAAATAAA")
>>> print(alignments[0])
AAAXGAAA
|||..|||
AAAATAAA
>>> print(alignments[0].format("psl"))
6 2 0 0 0 0 0 0 + query 8 0 8 target 8 0 8 1 8, 0, 0,
# 6 matches, 2 mismatches
>>> print(alignments[0].format("psl", wildcard='X'))
6 1 0 1 0 0 0 0 + query 8 0 8 target 8 0 8 1 8, 0, 0,
# 6 matches, 1 mismatch, 1 match against the wildcard
Once we have an alignment, it does not matter how it was generated. We can still report the number of matches and mismatches even if the alignment was generated using a scoring matrix. |
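The counts quoted in the PSL comments above can be reproduced from the two aligned strings alone, which illustrates that the tally does not depend on how the alignment was scored. A minimal sketch; `count_columns` and its `wildcard` and `ignore_case` parameters are hypothetical names, not an existing Biopython API:

```python
def count_columns(target, query, wildcard=None, ignore_case=False):
    """Tally aligned columns of two equal-length gapped strings.

    Hypothetical helper: returns (matches, mismatches, wildcard_columns, gaps).
    """
    if len(target) != len(query):
        raise ValueError("aligned strings must have the same length")
    if ignore_case:
        target, query = target.upper(), query.upper()
        wildcard = wildcard.upper() if wildcard else wildcard
    matches = mismatches = wildcards = gaps = 0
    for t, q in zip(target, query):
        if t == "-" or q == "-":
            gaps += 1          # a gap on either side
        elif wildcard is not None and wildcard in (t, q):
            wildcards += 1     # neither a match nor a mismatch
        elif t == q:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, wildcards, gaps

print(count_columns("AAAXGAAA", "AAAATAAA"))                # (6, 2, 0, 0)
print(count_columns("AAAXGAAA", "AAAATAAA", wildcard="X"))  # (6, 1, 1, 0)
```

Passing the wildcard and case handling as arguments is what makes a method a better fit than a property here.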
Yes, this is totally clear to me. I was just commenting on the well defined from @peterjc's question. I always felt a bit uneasy that we don't use a 'similar character' in the match line, as it is used e.g. in the BLAST output. However, the difficulties of implementing such a character (a. when does 'similarity' start? b. in |
Point taken, What if the new properties were called @mdehoon If instead of properties this was method based, would something like |
Yes. The PSL format actually has one more integer: The number of matches against lower-case nucleotides (usually representing repeat regions). I guess the number of integers to be returned by |
If we introduce a Or, if we want to be clever, we can pass a Boolean matrix:
>>> from Bio.Align import substitution_matrices
>>> m = substitution_matrices.load("BLOSUM62")
>>> print(m)
# Matrix made by matblas from blosum62.iij
# * column uses minimum score
# BLOSUM Clustered Scoring Matrix in 1/2 Bit Units
# Blocks Database = /data/blocks_5.0/blocks.dat
# Cluster Percentage: >= 62
# Entropy = 0.6979, Expected = -0.5209
     A    R    N    D    C    Q    E    G    H    I ...
A  4.0 -1.0 -2.0 -2.0  0.0 -1.0 -1.0  0.0 -2.0 -1.0 ...
R -1.0  5.0  0.0 -2.0 -3.0  1.0  0.0 -2.0  0.0 -3.0 ...
N -2.0  0.0  6.0  1.0 -3.0  0.0  0.0  0.0  1.0 -3.0 ...
D -2.0 -2.0  1.0  6.0 -3.0  0.0  2.0 -1.0 -1.0 -3.0 ...
C  0.0 -3.0 -3.0 -3.0  9.0 -3.0 -4.0 -3.0 -3.0 -1.0 ...
Q -1.0  1.0  0.0  0.0 -3.0  5.0  2.0 -2.0  0.0 -3.0 ...
...
>>> threshold = 0.5 # greater than 0.5 is match, smaller than 0.5 is mismatch
>>> b = (m > threshold)
>>> print(b) # our Boolean matrix:
    A   R   N   D   C   Q   E   G   H   I ...
A 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
R 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ...
N 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 ...
D 0.0 0.0 1.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 ...
C 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ...
Q 0.0 1.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 ...
...
>>> b['A', 'A']
True
>>> b['A', 'R']
False
>>> b['N', 'N']
True
>>> b['N', 'D']
True
>>> b['N', 'C']
False
Then we could do
>>> match_count, mismatch_count = alignment.counts(matching=(m > 0.5)) |
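The thresholding idea can be tried out without any new API. A minimal sketch using a plain dictionary that holds a few of the BLOSUM62 scores printed above; `is_match` and `count_similar` are hypothetical names, and the proposed `alignment.counts(matching=...)` call is not used or assumed to exist here:

```python
# A few BLOSUM62 entries copied from the matrix printed above (symmetric).
SCORES = {
    ("A", "A"): 4.0, ("A", "R"): -1.0, ("N", "N"): 6.0,
    ("N", "D"): 1.0, ("N", "C"): -3.0, ("D", "D"): 6.0,
}

def is_match(a, b, threshold=0.5):
    """Treat a pair as a match when its substitution score exceeds threshold."""
    score = SCORES.get((a, b), SCORES.get((b, a)))
    if score is None:
        raise KeyError(f"no score for pair {a!r}, {b!r}")
    return score > threshold

def count_similar(target, query, threshold=0.5):
    """Count (matches, mismatches) over ungapped columns of two aligned strings."""
    matches = mismatches = 0
    for t, q in zip(target, query):
        if t == "-" or q == "-":
            continue  # gaps are neither matches nor mismatches
        if is_match(t, q, threshold):
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches

print(is_match("N", "D"), is_match("N", "C"))  # True False
print(count_similar("ANDN", "ADNC"))           # (3, 1)
```

With a threshold of 0.5, the N/D pair counts as a match even though the letters differ, which is the behaviour the Boolean matrix above encodes.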
Would be nice if we could also have this for the match line in the pretty print output, wouldn't it? |
Wow, just came here looking for such a method. I like the idea of counting gaps independently in the query vs target (sensu PSL). Also I think In terms of implementation, is it possible to do something more clever than just zipping through both alignments? Happy to help. |
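Counting gaps independently per sequence means tracking, for each side, both the number of gap runs and the number of gapped positions. A minimal sketch that works on the aligned strings; `gap_counts` is a hypothetical helper name, not a Biopython function:

```python
from itertools import groupby

def gap_counts(aligned):
    """Return (number of gap runs, total gap characters) in one aligned string."""
    runs = [len(list(group)) for char, group in groupby(aligned) if char == "-"]
    return len(runs), sum(runs)

# Aligned strings from the GAACT / GAT example in this thread.
print(gap_counts("GAACT"))  # (0, 0)
print(gap_counts("GA--T"))  # (1, 2)
```

Calling it once per aligned string keeps the two sides separate, at the cost of still scanning every column.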
Note my original workaround no longer works due to the change in the alignment string representation in Biopython 1.80 (#4183). Was:
>>> from Bio import Align
>>> aligner = Align.PairwiseAligner()
>>> seq1 = "GAACT"
>>> seq2 = "GAT"
>>> alignments = aligner.align(seq1, seq2)
>>> print(alignments.score)
3.0
>>> alignment = alignments[0]
>>> print(alignment)
GAACT
||--|
GA--T
Now:
>>> print(alignment)
target            0 GAACT 5
                  0 ||--| 5
query             0 GA--T 3
We can easily get the two aligned sequence strings using
Given the changes in Biopython 1.80, adding a This might potentially be more efficient than zipping over the two aligned strings by tracing the I've also not considered how the optional arguments suggestion earlier might be implemented. |
Example use case creating a PairwiseAlignment instance as per the tutorial:
I would like to be able to get easy access to the number of matches, mismatches, and gaps:
Workaround to get these numbers from the default string representation:
This can be done more efficiently directly from alignment.path but that is non-trivial.
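Working from the path avoids building the gapped strings at all. A minimal sketch, assuming the path is a sequence of (target index, query index) pairs marking the corners of the alignment; `counts_from_path` is a hypothetical helper, and the path below was written out by hand for the GAACT / GAT alignment rather than taken from Biopython output:

```python
def counts_from_path(target, query, path):
    """Return (matches, mismatches, gap_positions) by walking path segments."""
    matches = mismatches = gaps = 0
    for (t0, q0), (t1, q1) in zip(path, path[1:]):
        if t0 == t1:          # only the query advances: gap in the target
            gaps += q1 - q0
        elif q0 == q1:        # only the target advances: gap in the query
            gaps += t1 - t0
        else:                 # both advance: an ungapped aligned block
            for t, q in zip(target[t0:t1], query[q0:q1]):
                if t == q:
                    matches += 1
                else:
                    mismatches += 1
    return matches, mismatches, gaps

# Hand-written path for GAACT aligned to GA--T (an assumption for illustration).
path = ((0, 0), (2, 2), (4, 2), (5, 3))
print(counts_from_path("GAACT", "GAT", path))  # (3, 0, 2)
```

Gap segments are handled with a single subtraction each, so only the ungapped blocks need a per-letter comparison.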