bam dedup (umi_tools dedup) #555
Hi @zigenLi. Would you mind posting a few examples, please? I suspect they may have different soft-clipping in the CIGAR string.
example 1: example 2: example 3:
Hello, the above are some examples I pasted. Could you please check the problem?
In example 1, the difference is soft-clipping at the end of read2. By default, UMI-tools assumes that more than 4 soft-clipped bases indicate a difference in splicing, which would also mean the two reads are not duplicates. You can turn this behaviour off with

In examples 2 & 3, the first read has more bases soft-clipped at the start, e.g.

On an unrelated note, your cell barcodes look strange, e.g. they are not nucleotide sequences. Is that what you want?
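To compare soft-clipping between two reads that look identical by eye, it can help to pull the clip lengths out of the CIGAR strings. The following is a minimal sketch, not UMI-tools' own code: the helper names `soft_clips` and `exceeds_threshold` are hypothetical, the CIGAR strings are made up, and the default threshold of 4 is taken from the comment above. It assumes the check is applied to the clipped bases at the end of the alignment as written in the CIGAR string; which end matters for reverse-strand reads is not covered here.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM spec).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clips(cigar):
    """Return (leading, trailing) soft-clipped base counts for a CIGAR string."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Hard clips (H) can sit outside soft clips, so ignore them here.
    core = [(n, op) for n, op in ops if op != "H"]
    leading = core[0][0] if core and core[0][1] == "S" else 0
    trailing = core[-1][0] if len(core) > 1 and core[-1][1] == "S" else 0
    return leading, trailing


def exceeds_threshold(cigar, threshold=4):
    """True if the trailing soft clip is longer than the threshold (assumed rule)."""
    _, trailing = soft_clips(cigar)
    return trailing > threshold


# Hypothetical CIGAR strings for illustration only.
for cigar in ("90M", "85M5S", "3S87M"):
    print(cigar, soft_clips(cigar), exceeds_threshold(cigar))
```

Two reads with the same start coordinate and UMI but CIGAR strings such as `90M` and `85M5S` would differ under this rule, which is one way reads can look like duplicates by eye yet be kept separately.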
@IanSudbery - I think we could probably do with a page/FAQ entry on the readme giving examples of why two reads may appear to be duplicates by eye, but not by UMI-tools. It took me a while to realise what was going on with example 1 above. Thoughts?
@zigenLi Just checking you are aware that while you have set a cell barcode, you've not told dedup to consider it during the deduplication? (
@TomSmithCGAT Probably wise.
Yes, the barcode sequence here is encoded. I am analyzing single-cell data and want to deduplicate the generated BAM file. The starting coordinates of reads with the same UMI are different. Should I keep reads like these?
Is the parameter "--per-cell" necessary when running the above command? What would be the impact without this parameter?
It depends on the single-cell protocol you have used. We always use some sort of positional information to deduplicate reads, but what information that is depends on how the reads were generated. If fragmentation of the cDNA happens before PCR (as in protocols like SMART-seq2), then one molecule will always give rise to reads with the same alignment position, and thus the position is informative as to whether a read is unique: reads with the same UMI but different alignment positions will have arisen from different cDNA molecules. In these cases we use UMI and (adjusted) alignment position to deduplicate reads.

However, if PCR happens before fragmentation (as in 10X), then one molecule can give rise to reads with multiple different alignment positions, and thus the precise alignment position is not informative: reads with the same UMI but different alignment positions may have arisen from the same or different cDNA molecules. The alignment positions for a molecule must, however, always be in the same transcript. Thus in these cases we use UMI and transcript/gene assignment to deduplicate reads. This mode is activated using
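The difference between the two deduplication modes described above comes down to which key is used to decide that two reads are the same molecule. Here is a minimal sketch with made-up reads and exact matching of keys; the `Read` tuple, the `dedup` function and its `per_gene` / `per_cell` arguments are hypothetical illustrations named after the options discussed in this thread, not UMI-tools' implementation.

```python
from collections import namedtuple

# Hypothetical, simplified read record: real reads carry far more information.
Read = namedtuple("Read", "name cell umi pos gene")

reads = [
    Read("r1", "cellA", "AACG", 1000, "geneX"),
    Read("r2", "cellA", "AACG", 1000, "geneX"),  # same UMI, same position as r1
    Read("r3", "cellA", "AACG", 1250, "geneX"),  # same UMI, different position
    Read("r4", "cellB", "AACG", 1000, "geneX"),  # same UMI and position, other cell
]


def dedup(reads, per_gene=False, per_cell=False):
    """Keep the first read seen for each key; return the names of kept reads."""
    seen = set()
    kept = []
    for read in reads:
        # Position-based key (fragmentation before PCR) or
        # gene-based key (PCR before fragmentation).
        key = [read.umi, read.gene if per_gene else read.pos]
        if per_cell:
            key.append(read.cell)  # the same UMI in another cell is a different molecule
        key = tuple(key)
        if key not in seen:
            seen.add(key)
            kept.append(read.name)
    return kept


print(dedup(reads))                               # ['r1', 'r3']
print(dedup(reads, per_cell=True))                # ['r1', 'r3', 'r4']
print(dedup(reads, per_gene=True))                # ['r1']
print(dedup(reads, per_gene=True, per_cell=True)) # ['r1', 'r4']
```

With the position-based key, `r3` survives because its coordinate differs from `r1`, even though the UMI is the same. Ignoring the cell barcode collapses `r4` into `r1`, which is why taking the cell into account matters for single-cell data.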
@TomSmithCGAT @IanSudbery But without the "--per-gene" parameter, even after adding "--soft-clipthreshold=100", there are still reads with the same coordinates and UMI in the deduplicated BAM file.
umi_tools dedup --extract-umi-method=tag --umi-tag=UB --cell-tag=CB \
    --method unique \
    -I input.bam -S output.bam \
    --log=LOGFILE
Hello, I used the above command to deduplicate the BAM file, but there are still reads in the deduplicated BAM file with the same coordinates and UMI.