BioPython 1.80 silently changes alignment formatting defaults #4183
This change was necessary because Biopython 1.80 contains a new set of parsers for alignment files in different formats. For example the BED and MAF formats typically contain alignments of short sequences to chromosomes. The old behavior of
If for your purpose you need the sequence with gaps, you can use
>>> from Bio.Align import PairwiseAligner
>>>
>>> aligner = PairwiseAligner()
>>> alns = aligner.align("GATTACA", "GATCA")
>>> aln = alns[0]
>>> aln[0]
'GATTACA'
>>> aln[1]
'GAT--CA'
This is covered in section 6.6.8 of the documentation. |
@mdehoon I appreciate the assistance in this matter - though I will admit I did not understand the rationale for making the change. Adding a new parser/feature should not be a reason to alter how other parts of other people's programs work. That being said, it does sound like any code that displays alignments will need a branching point that detects the version of BioPython and runs different code depending on the version. |
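The version branch mentioned in the comment above could be sketched roughly as follows. This is only an illustration: the helper names `parse_version` and `has_new_alignment_printing` are made up here, and the 1.80 cut-off is taken from the issue title. In real code the version string would come from `Bio.__version__`.

```python
import re


def parse_version(version: str) -> tuple:
    """Turn a version string such as '1.80' or '1.81.dev0' into (1, 80) or (1, 81)."""
    numbers = []
    for part in version.split("."):
        match = re.match(r"\d+", part)
        if match is None:
            break  # stop at the first non-numeric component, e.g. 'dev0'
        numbers.append(int(match.group()))
    return tuple(numbers)


def has_new_alignment_printing(version: str) -> bool:
    """True for 1.80 and later (hypothetical helper; cut-off taken from this issue)."""
    return parse_version(version) >= (1, 80)


# In a real program, pass Bio.__version__ instead of these literal strings.
for v in ("1.79", "1.80", "1.81.dev0"):
    print(v, has_new_alignment_printing(v))
```

Comparing integer tuples rather than raw strings avoids surprises with versions that do not sort alphabetically.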
Sorry, but there was no reasonable alternative other than changing the behavior of Basically, you can consider the previous behavior of |
This silent change also broke my production code; I'm happy to update my code in compliance with this, but I would have appreciated the change being specifically pointed out in the biopython changelog; it took a while to track down why my CI tests started failing! |
We can add a back-dated entry to the NEWS file, useful for people who browse it here on GitHub. Any suggestions for wording? |
Amusingly, this turns 12 lines of convoluted code into 2 simple and straightforward lines, so kudos overall: https://github.com/Paradoxdruid/pyllelic/blob/master/pyllelic/quma.py#L379-L412 In terms of language, how about:
|
Just some background on why this change was unavoidable: The Alignments generated by the pairwise aligner, the
>>> from Bio.Align import PairwiseAligner
>>> aligner = PairwiseAligner()
Global alignment:
>>> alignments = aligner.align("CCCCAAAAAAACC", "TTAAAAAAATT")
>>> alignment = alignments[0]
>>> print(alignment)
CCCC--AAAAAAACC--
------|||||||----
----TTAAAAAAA--TT
Local alignment:
>>> aligner.mode = 'local'
>>> alignments = aligner.align("CCCCAAAAAAACC", "TTAAAAAAATT")
>>> alignment = alignments[0]
>>> print(alignment)
CCCCAAAAAAACC
    |||||||
  TTAAAAAAATT
With Biopython 1.80, the alignment can also come from parsing an externally generated alignment file. This is one example from a MAF (Multiple Alignment Format) file downloaded from UCSC:
Here, 27699739 and 28862317 are the start positions of the alignment, and 158545518 and 161576975 are the chromosome sizes. The The same issue arises with other file formats now supported by |
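To make the size problem concrete, here is a rough pure-Python sketch (not Biopython's implementation) that rebuilds the gapped rows of the global example above from aligned block coordinates. The function name `gapped_rows` and the hand-written block coordinates are assumptions for illustration only.

```python
def gapped_rows(target, query, target_blocks, query_blocks):
    """Rebuild two gapped strings from paired aligned blocks.

    target_blocks and query_blocks are equal-length lists of (start, end)
    pairs; block i of the target is aligned to block i of the query.
    Unaligned stretches are written out in full, opposite gap characters.
    """
    t_row, q_row = [], []
    t_pos = q_pos = 0
    blocks = list(zip(target_blocks, query_blocks))
    # An empty sentinel block at the end flushes the trailing unaligned sequence.
    blocks.append(((len(target), len(target)), (len(query), len(query))))
    for (t_start, t_end), (q_start, q_end) in blocks:
        t_skip = target[t_pos:t_start]  # unaligned target before this block
        q_skip = query[q_pos:q_start]   # unaligned query before this block
        t_row.append(t_skip + "-" * len(q_skip))
        q_row.append("-" * len(t_skip) + q_skip)
        t_row.append(target[t_start:t_end])
        q_row.append(query[q_start:q_end])
        t_pos, q_pos = t_end, q_end
    return "".join(t_row), "".join(q_row)


# Block coordinates written by hand for the global example in this thread.
t, q = gapped_rows("CCCCAAAAAAACC", "TTAAAAAAATT", [(4, 11)], [(2, 9)])
print(t)  # CCCC--AAAAAAACC--
print(q)  # ----TTAAAAAAA--TT
```

Each row is at least as long as the full target, flanks included. With a chromosome of 158545518 bases as the target, a row built this way would be at least that long, which may be part of why printing whole sequences stopped being practical.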
See #4184 for a suggested fix to the |
Thanks for the quick turnaround; as a suggestion, I would recommend that all objects in BioPython also return a JSON model that contains all the attributes. For example for an alignment it should do:
this is what people that run this module need anyway. We always need to know both what created the alignment and what the alignment is. The traditional model where there are objects with methods, and each method needs to be looked up in docs, seems to be stuck in the last century and represents an antiquated object-oriented model. I find it incredibly tedious and boring to look up docs because it is so unnecessary! The instance already knows the score; just give it to me as data rather than me having to a) crawl through endless pages of examples and b) then write more code. The score is data; just give it up already :-) One glance at a JSON could tell me everything I wanted to know about the alignment so that I don't have to hit the docs ever. The modern world wants to work in data, and JSON so that you can get all information in one shot rather than painstakingly building and navigating obtuse objects via random accessors that someone came up with under time pressure. I have made one such model that takes a BioPython GenBank model and turns it into JSON, and it was immensely more joyous to use (not to mention you get the freedom of not having to write code to parse it). You can load up chromosome 1 of the human genome in seconds when it is stored in compressed JSON. The JSON parsing is very fast. This is just a suggestion of modernizing BioPython in general. |
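As a rough idea of what such a data-first view might look like, here is a sketch using only the standard library. `AlignmentRecord` and its field names are hypothetical and not part of Biopython, and the score is a placeholder rather than a computed value.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class AlignmentRecord:
    """Hypothetical plain-data view of a pairwise alignment (not a Biopython class)."""

    target: str   # gapped target row
    query: str    # gapped query row
    score: float  # placeholder value below, not computed by an aligner
    mode: str     # e.g. "global" or "local"


record = AlignmentRecord(target="GATTACA", query="GAT--CA", score=0.0, mode="global")

# Serialize to JSON and read it back to show the round trip.
payload = json.dumps(asdict(record), indent=2)
print(payload)
restored = AlignmentRecord(**json.loads(payload))
```

A flat record like this is easy to inspect, but it would have to be kept in sync with the object it mirrors, which is one trade-off of the proposal.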
Take the following program:
With BioPython 1.79 it prints:
With BioPython 1.80 it prints:
I find it extraordinarily counterproductive for a foundational tool to start producing different defaults with no warning. How can anyone rely on it to be the basis of other functionality? You just nuked every single program that relied on default formatting to compute anything else.
Even after consulting the docs, I cannot locate the formatting that would produce the previous behavior. How would one generate the previous output? I am looking to get the aligned sequences and trace each as single long strings.
I understand that the new format is neat, and probably took a lot of effort to implement, plus it would be something I would consider using myself - still, that does not warrant breaking all the code out there that relied on accessing long strings returned from the aligner.
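For code that needs the sequences and the trace as single long strings, one possible approach is to rebuild the trace from the two gapped rows (in 1.80 these come from indexing the alignment, as shown elsewhere in this thread). The sketch below is not guaranteed to match the 1.79 output exactly: `three_line_view` is a made-up helper, and the "." marker for mismatches is an assumption.

```python
def three_line_view(target_row, query_row):
    """Return target row, trace line and query row joined by newlines.

    The trace uses '|' for identical letters and '-' where either row has a
    gap; '.' for mismatches is an assumption, not confirmed by this thread.
    """
    if len(target_row) != len(query_row):
        raise ValueError("gapped rows must have equal length")
    marks = []
    for a, b in zip(target_row, query_row):
        if a == "-" or b == "-":
            marks.append("-")
        elif a == b:
            marks.append("|")
        else:
            marks.append(".")
    return "\n".join([target_row, "".join(marks), query_row])


# Gapped rows copied from the global alignment example in this thread.
print(three_line_view("CCCC--AAAAAAACC--", "----TTAAAAAAA--TT"))
```

Building the trace from the rows keeps downstream code independent of whatever `print(alignment)` produces in a given release.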