fade missing many reads that appear to be fragmentation artifacts #27
Comments
I wanted to add that I've noticed that reads with longer stretches of soft clipping tend to be less likely to be removed. Taking a look at the code, I noticed this statement in
Are you discarding alignments with length greater than 10, or am I not understanding the context of this statement correctly? EDIT |
So this is a known issue with fade, though I would rather say it is a design flaw or really just part of the problem of finding enzymatic fragmentation artifacts.
These look like characteristic artifacts, though when looking at your screenshot I don't see the soft-clipped sequences' reverse complement in the reference sequence (unless I missed something). This likely means that the true alignment position of the artifact is outside of fade's process
Though in reality,
I doubt this is the issue as well, though to be honest I can't remember why I set this particular limitation. My guess would be
Unfortunately, the best solution right now would be to adjust the |
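For anyone following along, the check described above (looking for the reverse complement of a soft-clipped sequence near the clip site) can be sketched in plain Python. This is only an illustration under stated assumptions, not fade's implementation: the function names, the `window` parameter, and the exact-match search are all hypothetical, and a real tool would use an alignment rather than a string search.

```python
# Hypothetical sketch: NOT fade's code. Names and the window logic are assumptions.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_revcomp_in_window(clip: str, reference: str, clip_pos: int, window: int) -> int:
    """Search for the reverse complement of `clip` near `clip_pos`.

    Only the slice of `reference` within `window` bases of `clip_pos` is
    searched. Returns the 0-based reference offset of the first exact match,
    or -1 if the reverse complement does not occur inside the window.
    """
    start = max(0, clip_pos - window)
    end = min(len(reference), clip_pos + window)
    target = reverse_complement(clip).upper()
    hit = reference[start:end].upper().find(target)
    return -1 if hit == -1 else start + hit


# Toy reference: the reverse complement of "AACCG" ("CGGTT") sits at offset 5.
reference = "A" * 5 + "CGGTT" + "A" * 20
near = find_revcomp_in_window("AACCG", reference, clip_pos=8, window=10)
far = find_revcomp_in_window("AACCG", reference, clip_pos=25, window=5)
```

With the toy reference, the search succeeds when the clip site is close to the matching sequence and fails when the match lies outside the searched window, which is the situation described above for reads whose true artifact position falls outside the region examined.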
At the bottom of the screenshot, there are |
Ah I completely missed that part of your post. Yes those should be caught. Would it be possible for you to provide me with this section of your bam file so that I may do some debugging/digging? |
yes, the bam is attached to the original post |
Jeez, I missed that too. Thanks for your patience and for the detailed information, I will look into it and see what I find. |
no problem, thanks for your help! |
@charlesgregory do you have any updates on this? I'm wondering if it looks like this is fixable, or if we should be looking at other solutions. Thanks! |
With the data you provided it looks very addressable, and we are looking into parameter tuning. It's interesting that FADE performs so well on our data while you are seeing a poor recognition/fix rate, which suggests that the parameters needed differ between regions of the genome. Hopefully, a single set of more liberal search heuristics can be identified. |
Great, thanks for the update! |
So this is likely due to fade not being great at detecting whether a SAM/BAM file is name-sorted, and to how some of its algorithms work. When fade annotates artifact reads, it does so on primary reads only. The aligner (BWA) in a way detects some artifact reads itself by giving them secondary and supplementary alignments; these are the reads that look like artifacts in your screenshot. When I opened your bam file in IGV after running fade, the only reads that still appeared to be artifacts were supplementary and secondary reads; all primary artifact reads had been removed.
If it can detect name-sorting, fade will remove all reads with the same name if any of them is marked as an artifact. If it can't detect name-sorting, fade will only remove the primary reads that are marked as artifacts. It prints a warning message when it can't detect name-sorting. When I ran your data with the latest fade version after name-sorting it, fade didn't detect it as name-sorted. fade's current method of determining bam sorted-ness is not great; I have improved this and can push up the fix soon. After running with my fix, these reads no longer appeared in the final bam.
As for the
The decision for fade to annotate primary reads only was made quite a while ago and I am not sure it is the best approach anymore. I was worried about unintentionally removing supplementary reads that may be used to find INDELs. If I allow fade to operate on all reads, primary and secondary reads that are "artifacts" should be removed as well, regardless of the sorted-ness of the bam file. Then the only advantage of name-sorting before running |
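To make the two removal behaviours described above concrete, here is a minimal Python sketch. It is an assumption-laden illustration, not fade's code: the `Read` class, the `is_artifact` field, and both function names are hypothetical, and the grouping check only tests that records sharing a name are adjacent (it does not verify a true sort order). The flag values 0x100 (secondary) and 0x800 (supplementary) are the standard SAM flag bits.

```python
from dataclasses import dataclass

SECONDARY = 0x100       # SAM flag bit: secondary alignment
SUPPLEMENTARY = 0x800   # SAM flag bit: supplementary alignment


@dataclass
class Read:
    """Hypothetical stand-in for an alignment record."""
    name: str
    flag: int
    is_artifact: bool = False  # assumed annotation, set on primary reads only


def is_primary(read: Read) -> bool:
    return not (read.flag & (SECONDARY | SUPPLEMENTARY))


def is_name_grouped(reads) -> bool:
    """True if all records sharing a read name are adjacent in the input."""
    seen = set()
    previous = None
    for read in reads:
        if read.name != previous:
            if read.name in seen:
                return False  # name reappears after other names: not grouped
            seen.add(read.name)
            previous = read.name
    return True


def filter_artifacts(reads):
    """Apply the removal policy described in the comment above."""
    if is_name_grouped(reads):
        # Grouped input: drop every record whose name has an artifact mark.
        flagged = {r.name for r in reads if r.is_artifact}
        return [r for r in reads if r.name not in flagged]
    # Otherwise: only the marked primary records can be dropped.
    return [r for r in reads if not (r.is_artifact and is_primary(r))]


grouped = [Read("r1", 0, True), Read("r1", SUPPLEMENTARY), Read("r2", 0)]
ungrouped = [Read("r1", 0, True), Read("r2", 0), Read("r1", SUPPLEMENTARY)]
kept_grouped = [(r.name, r.flag) for r in filter_artifacts(grouped)]
kept_ungrouped = [(r.name, r.flag) for r in filter_artifacts(ungrouped)]
```

In the ungrouped case the supplementary record for `r1` survives, which matches the symptom reported in this thread: artifact-looking supplementary and secondary reads remain when name-sorting is not detected.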
I am seeing reads that are primary alignments that show the characteristic artifact pattern that are not being annotated as artifacts by |
That does sound strange. The mac and linux binaries from a given version should produce the same output, but I don't own a mac so I can't confirm that explicitly. I will add some integration tests to see if this is the case. |
I'm not using the mac binary, I'm running the docker image on my mac. I also just tested the linux binary on a different system (cloud workstation running |
Let me see if I can recreate what you're seeing. I can test fade on an Ubuntu 18.04.6 cloud instance with your provided test bam. |
I wasn't able to see what you are describing in my testing. Though I was able to update fade's annotate step to include supplementary and secondary reads which I think should address the original issue. I will have a new version out soon. |
I have a new version out |
No, |
I ran fade on a bam that we believe has a high level of fragmentation artifacts. I am seeing that 8% of the reads are soft-clipped, while only 0.01% of the reads are classified by fade as artifacts. When I `blat` some of the soft-clipped reads that remain after running `fade out`, I see that they show the characteristic alignment of the fragmentation artifact, with part of the read aligning to the forward strand and part to the reverse. When I look for these reads in the `fade stats-clip` output, I see that they have "artifact_status: false". Here is an IGV screenshot illustrating the issue, and here is the input bam subset to the region shown:
STD347-81.reg.bam.zip
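For reference, a soft-clipped fraction like the 8% quoted above can be estimated from CIGAR strings alone (column 6 of SAM records). The sketch below is one way to do that in plain Python; the function names and the `min_clip` threshold are assumptions of this example, not part of fade, and it says nothing about how the figures in this report were actually computed.

```python
import re

# One CIGAR operation: a length followed by an operation code from the SAM spec.
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clipped_bases(cigar: str) -> int:
    """Total number of soft-clipped (S) bases in a CIGAR string."""
    return sum(int(length) for length, op in _CIGAR_OP.findall(cigar) if op == "S")


def soft_clip_fraction(cigars, min_clip: int = 1) -> float:
    """Fraction of records with at least `min_clip` soft-clipped bases."""
    cigars = list(cigars)
    if not cigars:
        return 0.0
    clipped = sum(1 for c in cigars if soft_clipped_bases(c) >= min_clip)
    return clipped / len(cigars)


example = ["100M", "30S70M", "50M", "5S90M5S"]
fraction = soft_clip_fraction(example)
```

Raising `min_clip` restricts the count to reads with longer clipped stretches, which may be useful when comparing against the reads a tool does or does not flag.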