Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extended BAM support #129

Open
rhpvorderman opened this issue Jan 26, 2024 · 4 comments
Open

Extended BAM support #129

rhpvorderman opened this issue Jan 26, 2024 · 4 comments

Comments

@rhpvorderman
Copy link
Collaborator

As discussed in marcelm/cutadapt#757 adding BAM support to dnaio would have several advantages for cutadapt. The alternative is to defer to pysam instead, and keep dnaio at its basic level.

In order for full BAM support I think the following things need to be added:

  • full tag support
  • BAM header support
  • BAM write support
    • BAM header support
    • BGZF format support
  • paired-end support

Converting a SequenceRecord object to binary BAM bytes, is something where I see no technical difficulties. A BAM header is also not verry difficult as it is just a SAM header with a few binary quirks.

BGZF is going to be a slight hurdle as it cannot be appropriately done with xopen, so direct library calls to zlib/libdeflate/isa-l are needed. Since only streamed writing is required it is actually quite easy to write a RawWriter that creates the BGZF blocks and wrap that in an io.BufferedWriter that has a 65200 byte buffer size. That will automatically take care of the sizing and will also be pretty fast.

Tag support is really tough. There are several options here:

  • Keep BAM file format tags and store that as a hidden bytes object or memory buffer which can be manipulated by calls. Unfortunately this requires parsing the format every time an operation is performed.
  • Simply use a python dict. This is massively costly in terms of space as well as runtime.
  • Use a custom format that allows easier manipulation than file format tags while being faster to manipulate than a python dict.
    On top of that, how do you design the API for manipulating the tags?

Then there is also paired end support. It is possible to put some limits on what is supported. Name-sorted files come to mind. Enabling full support requires supporting indexes and that is also a lot of work. In that case pysam is a better option.

@rhpvorderman
Copy link
Collaborator Author

A lot of the problems with tags can be solved with an implementation using the following guidelines.

  • Tags always have a two-letter code. Hence they can be stored in two bytes each. By creating an array of only tag codes, 32 of these can fit on one CPU cache line (64 bytes). As a result this array can be searched extremely quickly. If a tag is found, the index in the array can be used for accessing a value array.
  • All tag values fit in an 8-byte sequence, except arrays and strings. Array's and strings need a pointer, for 64-bit architectures, that conveniently also fits in a 8-byte sequence.

So an implementation with a tag array and a value array (containing typecode and value) seems efficient as well as easily convertable to and from BAM format.

Removing entries is simple: simply zero the entry. Make it so that zero entries are not written. Adding entries is also simple. Check if there are 0-entries (the tag array needs to be scanned anyway to see if the tag already exists) and overwrite them. If not make the arrays a little bigger. Replacing tags is also easy: overwrite them.

This is not going to be trivial to implement, but at least it as an implementation that will work well and scale well. (Albeit using slightly more memory than the htslib implementation, which simply keeps the file format representation in memory).

It is also easily possible in this format to selectively not support manipulating the tags array with every type code. What I mean is that of course dnaio should be able to correctly parse all tag typecodes as well as write them back properly, but support for adding new tags can be limited. For example the 'Z' typecode for strings will probably the most-used one by cutadapt (if not the only one) so that can be enabled. The same goes for reading typecodes. Sure it is not great that the implementation is not complete, but it allows for spreading the work over multiple releases and support as needed.

Sorry, I had to write all this down, otherwise I can't stop thinking about it and start working on my other tasks :-).

@rhpvorderman
Copy link
Collaborator Author

One of the most costly things in terms of performance is memory allocations. So it is a good thing to do these as little as possible. Unfortunately there are some variable length items in BAM file tags such as arrays and strings. This cannot be solved without dynamic memory allocation.

However with a bit of organization this can be remedied. For the value store I originally intented the following layout:

struct TagValue {
  uint8_t type[2];
  uint8_t value[8];
}

With value being cast to a pointer to a dynamically allocated piece of memory in case of 'B' or 'Z' type tags.

It is however much more efficient to allocate only a single piece of memory for all the B and Z tags. The value part of TagValue can then be cast to a uint32_t offset and uint32_t length. This comes with the slight inefficiency of having to resize the block when adding or manipulating values. This is however preferable to allocating lots of small pieces of memory that can cause severe memory bucket fragmentation.

Another question is whether to store tags and values separate or together. Also the following struct can be used

struct Tag {
  uint8_t tag[2];
  uint8_t type[2];
  uint8_t value[8];
}

That saves allocating one array, at the cost of fast tag contains checking. I prefer keeping the fast check.

So implementation wise the following members would need to be added to the SequenceRecord struct

uint32_t number_of_tags;
uint32_t array_store_size;
uint8_t *tags;
struct TagValue *tag_values;
uint8_t *array_store;

That is 32 bytes extra to sequence record. In my opinion this is acceptable.

Then we need to think about how to get and set tags. I think the following signatures are appropriate:

get_tag(tag: str) -> Union[int, float, str, array.array]: ...
set_tag(tag:str, value: Union[int, float, str, array.array]) ->  None: ...
get_tags() -> Dict[str, Union[int, float, str, array.array]]: ...

Yes this mimicks the pysam API. I think they made a pretty good design choice there. Setting multiple tags simultaneously or worrying about value_type seems to me to be beyond the scope of dnaio. No, it does not use numpy arrays. Numpy arrays are only efficient on quite large arrays as they have tremendous overhead. As well as introducing a massive dependency to the project. For arrays I am quite conflicted on whether to use memoryview (a type that when iterated over gives the correct results!) or array. On Cython using array.array is easier than when using native C.

I guess a good solution for now is to not support get_tags and simply throw a "NotImplementError" when get_tag encounters a B tag type code.

@peterjc
Copy link

peterjc commented Jun 11, 2024

See https://github.com/biopython/biopython/blob/master/Bio/bgzf.py for a possibly useful BGZF implementation available under the 3-clause BSD license. I think that'd let you vendor that file (bundle it as it within the MIT licensed dnaio), since adding a dependency on the whole of Biopython might not be tempting.

@rhpvorderman
Copy link
Collaborator Author

Thanks! Since only streaming is supported some shortcuts can be taken. An io.BufferedWriter with a buffersize of 65200 can be used to write to a RawBgzipWriter that will simply take any writes, compresses them and enclose them in a BGZF header and footer. BGZIP streaming is already supported since that can already be handled by normal gzip readers. This would grant the required amount of BGZF support with very little extra code.

I think dnaio can provide value over pysam by providing limited support for only a few BAM features as they are used in uBAM. That allows for a much cleaner interface whilst also being a lot faster. Full BAM and BGZF support including indexing will require lots of extra code, and is not really of value for uBAM files (as they come from the sequencer). In use cases where it is needed I think using native htslib bindings such as pysam are a better option.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants