Getting the contigs of interest from the reference sequence instead of the BAM file #84

andreaswallberg · 2021-03-26T19:48:15Z

Dear @hasindu2008

This is a feature request.

Having set up my fast5 indexes and BAM file, I have realized that I sometimes want to run the analyses across specific contigs (up to 20-30 thousand in my case), rather than the whole genome.

I am aware that the program can be told to scan only a particular sequence, but there is a substantial overhead in running it on a single short contig and then restarting it for the next one.

I guess one way to accomplish what I want is to filter the BAM file so that it only contains mappings against the contigs of interest.

Another, much simpler method is to filter the reference genome so that it only contains the contigs of interest. In my case, the contigs would also be listed in a specific order, from the highest-priority sequences to the lowest. This is easy to accomplish by just updating the reference genome file.

However, it seems like the program defaults to scanning the data according to the order of contigs in the BAM file, not in the reference genome file.

Would it be possible to tweak the code so that the program gets the target contigs for analyses, and their order of analysis, from the reference file instead of the BAM file?

This would make the program very flexible and accommodate various ad-hoc subsets for analysis, if needed.

hasindu2008 · 2021-03-27T03:44:25Z

@andreaswallberg

Currently, f5c uses an hts iterator in htslib (sam_itr_queryi) that iterates through a given chr:beg-end region. When a batch of alignment records is loaded, the corresponding sequences are located in the reference genome through faidx. Doing it the other way around is possible but requires some code restructuring. An easier method that I can think of is to use sam_itr_regions in htslib, which accepts a list of regions (I have to look into it further to be sure). What would be the best way to accept the list of required regions: a BED file passed to the -w option?

I will try my best to get this feature onto the next f5c release. Thank you for the suggestion.
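For readers weighing the two input styles mentioned above, here is a minimal sketch of how a chr:beg-end region string could be turned into a BED line. It assumes the usual conventions (samtools-style region strings are 1-based and inclusive, BED intervals are 0-based and half-open); whether f5c applies exactly these conventions should be checked against its documentation. The function name `region_to_bed` is hypothetical and not part of f5c or htslib.

```python
import re

# Greedy contig match so that contig names containing ':' still parse;
# the coordinates are taken from the last ':beg-end' suffix.
REGION_RE = re.compile(r"^(?P<contig>.+):(?P<beg>[\d,]+)-(?P<end>[\d,]+)$")


def region_to_bed(region):
    """Convert 'contig:beg-end' (assumed 1-based, inclusive) to a BED line.

    Hypothetical helper for illustration; BED is assumed to be
    0-based and half-open, so only the start coordinate shifts by one.
    """
    match = REGION_RE.match(region)
    if match is None:
        raise ValueError(f"not a contig:beg-end region: {region!r}")
    beg = int(match.group("beg").replace(",", ""))
    end = int(match.group("end").replace(",", ""))
    if beg < 1 or end < beg:
        raise ValueError(f"invalid coordinates in region: {region!r}")
    return f"{match.group('contig')}\t{beg - 1}\t{end}"
```

Calling `region_to_bed` once per region and writing the results one per line gives a file in the three-column BED layout.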

hasindu2008 · 2021-05-11T02:55:44Z

Hi @andreaswallberg

I have finally managed to implement this feature in f5c. It is not yet thoroughly tested, but an experimental version is available in the branch https://github.com/hasindu2008/f5c/tree/multi_region2. Some compiled binaries are also attached. It would be great if you could give it a try and let me know whether this suits your use case.

In this version, you can provide a bed file with region windows to f5c as -w reg.bed.

f5c-v0.6-33-g063bc74-binaries.tar.gz

hasindu2008 · 2021-08-20T02:28:49Z

In the latest release of f5c, you can now provide a list of regions as a BED file (-w regions.bed). Closing this issue for now. Feel free to reopen it if you have any more thoughts. Suggestions are always appreciated :)
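For the use case in the original request (whole contigs, listed in a preferred order), a BED file could be generated from the reference's .fai index. The following is a minimal sketch, not part of f5c: the function names and the example contig names are hypothetical, and it only relies on the standard .fai layout (contig name in column 1, length in column 2, tab-separated). Whether f5c processes regions in the order they appear in the BED file is not confirmed in this thread.

```python
def parse_fai(lines):
    """Map contig name -> length from the lines of a FASTA .fai index.

    Only the first two tab-separated columns (name, length) are used.
    """
    lengths = {}
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        fields = line.rstrip("\n").split("\t")
        lengths[fields[0]] = int(fields[1])
    return lengths


def bed_lines_for_contigs(lengths, contigs):
    """Return one whole-contig BED line per requested contig.

    The output keeps the order of `contigs`, so the caller decides
    the priority. Unknown contig names raise KeyError.
    """
    bed = []
    for name in contigs:
        if name not in lengths:
            raise KeyError(f"contig not found in index: {name}")
        # BED intervals are 0-based and half-open: 0..length spans the contig.
        bed.append(f"{name}\t0\t{lengths[name]}")
    return bed


if __name__ == "__main__":
    # Hypothetical file and contig names, for illustration only.
    with open("reference.fa.fai") as handle:
        contig_lengths = parse_fai(handle)
    wanted = ["contig_42", "contig_7"]  # highest priority first
    with open("regions.bed", "w") as out:
        out.write("\n".join(bed_lines_for_contigs(contig_lengths, wanted)) + "\n")
```

The resulting regions.bed could then be passed as -w regions.bed, as described above.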

hasindu2008 closed this as completed Aug 20, 2021
