BioPython 1.80 silently changes alignment formatting defaults #4183

Closed
ialbert opened this issue Nov 26, 2022 · 9 comments

@ialbert

ialbert commented Nov 26, 2022

Take the following program:

from Bio.Align import PairwiseAligner

aligner = PairwiseAligner()
alns = aligner.align("GATTACA", "GATCA")
out = format(alns[0])

print(out)

With BioPython 1.79, it prints:

GATTACA
|||--||
GAT--CA

With BioPython 1.80, it prints:

target            0 GATTACA 7
                  0 |||--|| 7
query             0 GAT--CA 5

  1. I find it extraordinarily counterproductive for a foundational tool to start producing different defaults with no warning. How can anyone rely on it to be the basis of other functionality? You just nuked every single program that relied on default formatting to compute anything else.

  2. Even after consulting the docs, I cannot locate the formatting that would produce the previous behavior. How would one generate the previous output? I am looking to get the aligned sequences and trace each as single long strings.

I understand that the new format is neat, and probably took a lot of effort to implement, plus it would be something I would consider using myself - still, that does not warrant breaking all the code out there that relied on accessing long strings returned from the aligner.

@mdehoon
Contributor

mdehoon commented Nov 27, 2022

@ialbert

This change was necessary because Biopython 1.80 contains a new set of parsers for alignment files in different formats. For example, the BED and MAF formats typically contain alignments of short sequences to chromosomes. The old behavior of format would have yielded a string the size of a chromosome, so several tens or hundreds of millions of characters long. The behavior of format therefore had to change to the current default. Adding a warning may have helped, but existing programs that rely on the previous format would still break.

If for your purpose you need the sequence with gaps, you can use

>>> from Bio.Align import PairwiseAligner
>>> 
>>> aligner = PairwiseAligner()
>>> alns = aligner.align("GATTACA", "GATCA")
>>> aln = alns[0]
>>> aln[0]
'GATTACA'
>>> aln[1]
'GAT--CA'

This is covered in section 6.6.8 of the documentation.
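
For readers who need the old three-line layout back, here is a minimal sketch that rebuilds it from the two gapped rows shown above (`aln[0]` and `aln[1]`). The helper name `legacy_format` is hypothetical and not part of Biopython. The symbols are assumed from the 1.79 output in this thread (`|` for a match, `-` for a gap), plus `.` for a mismatch, which is an assumption. It only covers global alignments like the example; the space-padded local layout is not reproduced.

```python
def legacy_format(target_row: str, query_row: str) -> str:
    """Build a three-line, 1.79-style rendering from two gapped rows.

    Hypothetical helper: the rows are expected to be the gapped strings
    obtained from aln[0] and aln[1], which must have equal length.
    """
    if len(target_row) != len(query_row):
        raise ValueError("gapped rows must have equal length")
    pattern = []
    for t, q in zip(target_row, query_row):
        if t == "-" or q == "-":
            pattern.append("-")  # gap in either row
        elif t == q:
            pattern.append("|")  # identical residues
        else:
            pattern.append(".")  # mismatch (assumed legacy symbol)
    return f"{target_row}\n{''.join(pattern)}\n{query_row}\n"


# Rows copied from the interactive session above.
print(legacy_format("GATTACA", "GAT--CA"), end="")
```

Because the helper works on plain strings, it does not depend on how any particular Biopython version formats the alignment object.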

@ialbert
Author

ialbert commented Nov 27, 2022

@mdehoon I appreciate the assistance in this matter - though I will admit I did not understand the rationale for making the change. Adding a new parser/feature should not be a reason to alter how other parts of other people's programs work.

That being said, it does sound like any code that displays alignments will need a branching point that detects the BioPython version and runs different code depending on it.
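
The branching point described above could look roughly like the sketch below. The function name `uses_new_alignment_format` is hypothetical. It assumes the version is given as a string such as `"1.79"` or `"1.80"` (in a real program this would typically come from `Bio.__version__`), and that 1.80 is the first release with the new default, as reported in this issue.

```python
import re


def uses_new_alignment_format(version: str) -> bool:
    """Return True if the given Biopython version string is 1.80 or later.

    Hypothetical helper: only the leading "major.minor" part is compared,
    so suffixes such as ".dev0" are ignored.
    """
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (1, 80)


# In a real program (requires Biopython to be installed):
#     import Bio
#     new_style = uses_new_alignment_format(Bio.__version__)
for v in ("1.79", "1.80", "1.81"):
    print(v, uses_new_alignment_format(v))
```

Avoiding the branch entirely, by reading the gapped rows from the alignment instead of parsing the formatted text, is likely the more robust option.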

@mdehoon
Contributor

mdehoon commented Nov 27, 2022

> Adding a new parser/feature should not be a reason to alter how other parts of other people's programs work.

Sorry, but there was no reasonable alternative other than changing the behavior of format.

Basically, you can consider the previous behavior of format to be a bug, which was fixed in 1.80, and require 1.80 for your program.

@Paradoxdruid

This silent change also broke my production code; I'm happy to update my code in compliance with this, but I would have appreciated the change being specifically pointed out in the biopython changelog; it took a while to track down why my CI tests started failing!

@peterjc
Member

peterjc commented Nov 29, 2022

We can add a back-dated entry to the NEWS file, useful for people who browse it here on GitHub. Any suggestions for wording?

@Paradoxdruid

Amusingly, this turns 12 lines of convoluted code into 2 simple and straightforward lines, so kudos overall: https://github.com/Paradoxdruid/pyllelic/blob/master/pyllelic/quma.py#L379-L412

In terms of language, how about:

Functions read, parse, and write were added to Bio.Align to read and write Alignment objects.
String formatting and printing output of Bio.Alignment changed to support these new functions.

@mdehoon
Contributor

mdehoon commented Nov 30, 2022

Just some background on why this change was unavoidable:

The format function calls the __format__ method on the object.
By Python convention, __format__ without arguments should return the same as __str__.
The print function in Python calls __str__, which should return a human-readable string as it's printed to the screen.
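
The relationship between format, __format__, and __str__ described above can be checked with a small, self-contained sketch. The classes `Plain` and `Custom` and the `"upper"` format spec are made up for illustration and have nothing to do with Biopython's own classes.

```python
class Plain:
    """Defines only __str__; __format__ is inherited from object."""

    def __str__(self):
        return "human-readable text"


class Custom:
    """Defines its own __format__, following the convention that an
    empty format spec gives the same result as str()."""

    def __str__(self):
        return "human-readable text"

    def __format__(self, format_spec):
        if format_spec == "":
            return str(self)  # same as __str__ by convention
        if format_spec == "upper":  # made-up spec for this sketch
            return str(self).upper()
        raise ValueError(f"unsupported format spec: {format_spec!r}")


# format(obj) passes an empty spec to __format__, so it matches str(obj).
assert format(Plain()) == str(Plain())
assert format(Custom()) == str(Custom())
print(format(Custom(), "upper"))
```

This is why changing what print shows for an object also changes what format returns when no format spec is given.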

For alignments generated by the pairwise aligner, the __format__ method in Biopython 1.79 does indeed generate a human-readable string for short sequences. It shows the unaligned parts (if any) of the sequence first, then the aligned part, and then any remaining unaligned sequence:

>>> from Bio.Align import PairwiseAligner
>>> aligner = PairwiseAligner()

Global alignment:

>>> alignments = aligner.align("CCCCAAAAAAACC", "TTAAAAAAATT")
>>> alignment = alignments[0]
>>> print(alignment)
CCCC--AAAAAAACC--
------|||||||----
----TTAAAAAAA--TT

Local alignment:

>>> aligner.mode = 'local'
>>> alignments = aligner.align("CCCCAAAAAAACC", "TTAAAAAAATT")
>>> alignment = alignments[0]
>>> print(alignment)
CCCCAAAAAAACC
    |||||||
  TTAAAAAAATT

With Biopython 1.80, the alignment can also come from parsing an externally generated alignment file. This is one example from a MAF (Multiple Alignment Format) file downloaded from UCSC:

s hg16.chr7    27699739 6 + 158545518 TAAAGA
s panTro1.chr6 28862317 6 + 161576975 TAAAGA

Here, 27699739 and 28862317 are the start position of the alignment, and 158545518 and 161576975 are the chromosome sizes.

The __format__ method in Biopython 1.79 would return a string of 27699739 and 28862317 unaligned characters, then the alignment, and then 158545518-27699739 and 161576975-28862317 unaligned characters. The unaligned characters are not stored in the MAF file format. But even if they were (or if we filled in the missing sequence with ? characters), the string generated by __format__ would be 161576975 characters long, and therefore completely useless for printing to the screen.
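
The arithmetic above can be worked through explicitly. This sketch uses only the numbers from the two MAF lines quoted in this comment; the helper name `flank_lengths` is hypothetical, and it assumes the MAF start is a zero-based offset on the + strand. Unlike the rough figures in the text, it also subtracts the 6 aligned characters from the trailing flank.

```python
def flank_lengths(start: int, size: int, src_size: int) -> tuple[int, int]:
    """Return the number of unaligned characters before and after the
    aligned block (hypothetical helper, + strand only)."""
    left = start
    right = src_size - start - size
    return left, right


# (start, aligned size, chromosome size) from the MAF lines above.
rows = {
    "hg16.chr7": (27699739, 6, 158545518),
    "panTro1.chr6": (28862317, 6, 161576975),
}

for name, (start, size, src_size) in rows.items():
    left, right = flank_lengths(start, size, src_size)
    # The padded row is as long as the whole chromosome.
    assert left + size + right == src_size
    print(f"{name}: {left} before, {size} aligned, {right} after")
```

In other words, a full-length rendering of each row would be as long as the chromosome itself, for an alignment of only 6 characters.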

The same issue arises with other file formats now supported by Bio.Align, such as BED, MAF, PSL, SAM, and exonerate files. Even with protein alignments (which are relatively short), the string generated by Biopython 1.79 becomes unreadable if the protein sequence is longer than the screen width.

@mdehoon
Contributor

mdehoon commented Nov 30, 2022

See #4184 for a suggested fix to the NEWS file.

mdehoon closed this as completed Nov 30, 2022

@ialbert
Author

ialbert commented Nov 30, 2022

Thanks for the quick turnaround; as a suggestion,

I would recommend that all objects in BioPython also return a JSON model that contains all the attributes. For example for an alignment it should do:

{
 "query": "GATTACA",
 "target": "GAT--CA",
 "score": 10,
 "scoring_matrix": "nuc4.4",
 "gap_open": 10,
 ...
}

this is what people that run this module need anyway. We always need to know both what created the alignment and what the alignment is.
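
As a rough sketch of the idea, a plain dictionary can be built from the pieces of an alignment and serialized with the standard json module. The function `alignment_record` and its field names are hypothetical and not a Biopython API; the score and parameter values are the placeholder values from the example above, not results computed by an aligner.

```python
import json


def alignment_record(target: str, query: str, score: float, **parameters) -> dict:
    """Collect an alignment and the settings that produced it in one dict.

    Hypothetical helper: the field names are illustrative only.
    """
    return {
        "target": target,
        "query": query,
        "score": score,
        "parameters": parameters,
    }


# Placeholder values; a real program would read them from its aligner.
record = alignment_record(
    "GATTACA",
    "GAT--CA",
    score=10,
    scoring_matrix="nuc4.4",
    gap_open=10,
)
text = json.dumps(record, indent=2)
print(text)
```

Keeping the aligner settings under a separate `parameters` key is one possible design; it keeps the result (rows and score) apart from the configuration that produced it.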

The traditional model where there are objects with methods, and each method needs to be looked up in docs, seems to be stuck in the last century and represents an antiquated object-oriented model.

I find it incredibly tedious and boring to look up docs because it is so unnecessary! The instance already knows the score; just give it to me as data rather than me having to a) crawl through endless pages of examples and b) then write more code. The score is data; just give it up already :-) One glance at a JSON could tell me everything I wanted to know about the alignment so that I don't have to hit the docs ever.

The modern world wants to work in data, and JSON so that you can get all information in one shot rather than painstakingly building and navigating obtuse objects via random accessors that someone came up with under time pressure.

I have made one such model that takes a BioPython GenBank model and turns it into JSON, and it was immensely more joyous to use (not to mention you get the freedom of not having to write code to parse it). You can load up chromosome 1 of the human genome in seconds when it is stored in compressed JSON. The JSON parsing is very fast.

This is just a suggestion of modernizing BioPython in general.
