-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Getting the contigs of interest from the reference sequence instead of the BAM file #84
Comments
Currently, f5c uses an hts iterator in htslib (sam_itr_queryi) that iterates through a given chr:beg-end region. When a batch of alignment records are loaded, the corresponding sequences are located from the reference genome through faidx. Doing vice versa is possible but require some code restructuring. An easier method that I can think of is using sam_itr_regions in htslib that accepts a list of regions (have to look further into it to be sure). What would be the best way to accept the list of the required regions - a bed file to the -w option? I will try my best to get this feature onto the next f5c release. Thank you for the suggestion. |
Finally, I managed to implement a feature for this in f5c. This is not yet thoroughly tested, but the experimental version is available in the branch https://github.com/hasindu2008/f5c/tree/multi_region2. Also, some compiled binaries are attached herewith. It would be great if you could give it a try and let me know if this suits your use case. In this version, you can provide a bed file with region windows to f5c as -w reg.bed. |
In the latest release of f5c, now you can provide a list of regions as a bed file (-w regions.bed). Closing this issue for now. Feel free to reopen this issue if you have any more thoughts. Suggestions are always appreciated :) |
Dear @hasindu2008
This is a feature request.
Having set up my fast5-indexes and BAM file, I have realized that I sometimes want to run the analyses across specific contigs (up to 20-30 thousand ones in my case), rather than the whole genome.
I am aware that the program be specified to scan only a particular sequence, but there is an substantial overhead associated with running just across a short contig and then restart the program and run it again.
I guess one way to accomplish what I want is to filter the BAM file so that it only contains mappings against the contigs of interest.
Another, much simpler method is to filter the reference genome so that it only contains the contigs of interest. In my case, also in a specific order of interest, from the most prioritized sequences to the least. This is easy to accomplish by just updating the reference genome file.
However, it seems like the program defaults to scanning the data according to the order of contigs in the BAM file, not in the reference genome file.
Would it be possible to tweak the code so that the program gets the target contigs for analyses, and their order of analysis, from the reference file instead of the BAM file?
This would make the program very flexible and accommodate various ad-hoc subsets for analysis, if needed.
The text was updated successfully, but these errors were encountered: