Getting the contigs of interest from the reference sequence instead of the BAM file #84

Closed
andreaswallberg opened this issue Mar 26, 2021 · 3 comments

Comments

@andreaswallberg

Dear @hasindu2008

This is a feature request.

Having set up my fast5-indexes and BAM file, I have realized that I sometimes want to run the analyses across specific contigs (up to 20-30 thousand of them in my case), rather than the whole genome.

I am aware that the program can be told to scan only a particular sequence, but there is a substantial overhead in running it across one short contig, then restarting it and running it again for the next one.

I guess one way to accomplish what I want is to filter the BAM file so that it only contains mappings against the contigs of interest.

Another, much simpler method is to filter the reference genome so that it only contains the contigs of interest. In my case, the contigs would also be in a specific order of interest, from the most prioritized sequences to the least. This is easy to accomplish by just updating the reference genome file.
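To make the "filter and reorder the reference" idea concrete, here is a minimal Python sketch. The function names and the toy contig names (`ctgA`, `ctgB`, `ctgC`) are made up for illustration and are not part of f5c. It assumes the reference fits in memory and that the contig name is the first word of each FASTA header.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each contig name (first word of the header) to its sequence lines."""
    records: Dict[str, List[str]] = {}
    current = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            current = line[1:].split()[0]
            records[current] = []
        elif current is not None and line:
            records[current].append(line)
    return records


def subset_fasta(lines: Iterable[str], wanted: List[str]) -> str:
    """Return FASTA text holding only the `wanted` contigs, in that order."""
    records = read_fasta(lines)
    out: List[str] = []
    for name in wanted:
        if name not in records:
            raise KeyError(f"contig not in reference: {name}")
        out.append(">" + name)
        out.extend(records[name])
    return "\n".join(out) + "\n"


# Hypothetical toy reference; ctgC is the highest-priority contig here.
example = ">ctgA desc\nACGT\nAC\n>ctgB\nGGGG\n>ctgC\nTTTT\n"
print(subset_fasta(example.splitlines(), ["ctgC", "ctgA"]), end="")
```

As the rest of this issue notes, a filtered reference alone does not change which contigs get scanned, or in what order, as long as the program takes both from the BAM file.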

However, it seems like the program defaults to scanning the data according to the order of contigs in the BAM file, not in the reference genome file.

Would it be possible to tweak the code so that the program gets the target contigs for analyses, and their order of analysis, from the reference file instead of the BAM file?

This would make the program very flexible and accommodate various ad-hoc subsets for analysis, if needed.

@hasindu2008
Owner

@andreaswallberg

Currently, f5c uses an hts iterator in htslib (sam_itr_queryi) that iterates through a given chr:beg-end region. When a batch of alignment records is loaded, the corresponding sequences are located in the reference genome through faidx. Doing it the other way around is possible but requires some code restructuring. An easier method that I can think of is to use sam_itr_regions in htslib, which accepts a list of regions (I have to look further into it to be sure). What would be the best way to accept the list of required regions: a bed file passed to the -w option?
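For anyone who already has a list of chr:beg-end strings and wants a bed file instead, a rough Python sketch of the conversion follows. It assumes the samtools-style convention for region strings (1-based, inclusive) and the usual BED convention (0-based start, exclusive end). `region_to_bed` is a hypothetical helper, not an f5c or htslib function.

```python
def region_to_bed(region: str) -> str:
    """Convert 'chr:beg-end' (1-based, inclusive) to a tab-separated BED row."""
    # Split on the last colon, since contig names may themselves contain colons.
    contig, sep, span = region.rpartition(":")
    if not sep or not contig:
        raise ValueError(f"expected chr:beg-end, got: {region}")
    beg_text, dash, end_text = span.partition("-")
    if not dash:
        raise ValueError(f"expected chr:beg-end, got: {region}")
    # Allow thousands separators such as 5,000,000.
    beg = int(beg_text.replace(",", ""))
    end = int(end_text.replace(",", ""))
    if beg < 1 or end < beg:
        raise ValueError(f"invalid coordinates in: {region}")
    # BED start is 0-based; the end stays the same because it is exclusive.
    return f"{contig}\t{beg - 1}\t{end}"


print(region_to_bed("chr20:5,000,000-10,000,000"))
```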

I will try my best to get this feature onto the next f5c release. Thank you for the suggestion.

@hasindu2008
Owner

Hi @andreaswallberg

Finally, I managed to implement a feature for this in f5c. This is not yet thoroughly tested, but the experimental version is available in the branch https://github.com/hasindu2008/f5c/tree/multi_region2. Also, some compiled binaries are attached herewith. It would be great if you could give it a try and let me know if this suits your use case.

In this version, you can provide a bed file with region windows to f5c as -w reg.bed.
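If whole contigs are the target, one way to produce such a bed file is from the reference's .fai index (tab-separated: name, length, offset, line bases, line width). The sketch below is only an illustration: the function names and the toy index rows are invented, and whether f5c processes the windows in the order they appear in the file is an assumption that this thread does not confirm.

```python
from typing import Dict, Iterable, List


def contig_lengths(fai_lines: Iterable[str]) -> Dict[str, int]:
    """Read contig name -> length from the first two columns of a .fai index."""
    lengths: Dict[str, int] = {}
    for line in fai_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            lengths[fields[0]] = int(fields[1])
    return lengths


def whole_contig_bed(fai_lines: Iterable[str], wanted: List[str]) -> str:
    """Return BED text with one full-length window per wanted contig, in order."""
    lengths = contig_lengths(fai_lines)
    rows: List[str] = []
    for name in wanted:
        if name not in lengths:
            raise KeyError(f"contig not in index: {name}")
        # BED intervals are 0-based with an exclusive end, so 0..length covers the contig.
        rows.append(f"{name}\t0\t{lengths[name]}")
    return "\n".join(rows) + "\n"


# Hypothetical index rows; in practice these would be read from the .fai file.
fai = [
    "ctgA\t1500\t6\t60\t61",
    "ctgB\t800\t1540\t60\t61",
    "ctgC\t2300\t2360\t60\t61",
]
print(whole_contig_bed(fai, ["ctgC", "ctgA"]), end="")
```

The printed text can be saved as reg.bed and passed with -w reg.bed as described above.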

f5c-v0.6-33-g063bc74-binaries.tar.gz

@hasindu2008
Owner

hasindu2008 commented Aug 20, 2021

In the latest release of f5c, you can now provide a list of regions as a bed file (-w regions.bed). Closing this issue for now. Feel free to reopen it if you have any more thoughts. Suggestions are always appreciated :)
