bam dedup (umi_tools dedup) #555

HengkuanLi · 2022-08-18T08:32:49Z

umi_tools dedup --extract-umi-method=tag --umi-tag=UB --cell-tag=CB
--method unique
-I input.bam -S output.bam
--log=LOGFILE

Hello, I used the above command to deduplicate the bam file, but there are still reads in the deduplicated bam file with the same coordinates and UMI

TomSmithCGAT · 2022-08-18T08:41:00Z

Hi @zigenLi. Would you mind posting a few examples please. I suspect they may have different soft-clipping in the CIGAR string

HengkuanLi · 2022-08-18T08:50:49Z

example 1:
E100047948L1C007R01003932541 0 10 103122537 3 87M4D13M * 0 0 GAATTTTGCGAAATGTTGTGACAGATATTTGTCAGAATTATTGGAAATATGTTATGGTGGATGGCATTTTTTCATATAAAATTATATTTCTGTTGAGACA * NH:i:2 HI:i:1 AS:i:86 nM:i:1 UR:Z:GTGCCCTCCA CB:Z:045E30A27A GN:Z:B2M UB:Z:GTGCCCTCCA RE:A:E DB:Z:CELL1_N3
E100047948L1C010R02000347603 0 10 103122537 255 89M11S * 0 0 GAATTTTGCAAAATGTTGTGACAGATATTTGTCAGAATTATTGGAAATATGTTATGGTGGATGGCATTTTTTCATATAAAATTATATTTCTGTTGTGTCA * NH:i:1 HI:i:1 AS:i:87 nM:i:0 UR:Z:GTGCCCTCCA CB:Z:045E30A27A GN:Z:B2M UB:Z:GTGCCCTCCA RE:A:E DB:Z:CELL1_N3

example 2:
E100047948L1C042R02203017007 0 11 74998101 3 5S95M * 0 0 CTATTTTTTTTTTTTTTTTAATGCTGAATATTGTGATTTCTTATTTGAGATTGAAAAAGACCCACTTGAAACTGATATATGTTAGAATATATTATGTTAA * NH:i:2 HI:i:1 AS:i:93 nM:i:0 UR:Z:TCGATGCAAG CB:Z:7D43A87E2B GN:Z:SF3B6 UB:Z:TCGATGCAAG RE:A:E DB:Z:CELL1_N3
E100047948L1C023R03300929063 0 11 74998101 3 2S98M * 0 0 ATTTTTTTTTTTTTTTAATGCTGAATATTGTGATTTCTTATTTGAGATTGAAAAAGACCCACTTGAAACTGATATATGTTAGAATATATTATGTTAATAA * NH:i:2 HI:i:1 AS:i:96 nM:i:0 UR:Z:TCGATGCAAG CB:Z:7D43A87E2B GN:Z:SF3B6 UB:Z:TCGATGCAAG RE:A:E DB:Z:CELL1_N3

example 3:
E100047948L1C039R02503252502 0 2 65569394 255 13S87M * 0 0 TGGCCTCCGACTTTTCCACCCTCCTGTGCCCCCCAATGGTTTTCTTTGCAAGTGCTTTTGGAACTAAGAAGCTAGTGTCTTTGATTAACTGTTGCCTGCT * NH:i:1 HI:i:1 AS:i:81 nM:i:2 UR:Z:AAACGCCCAA CB:Z:9818DBD8C1 GN:Z:ACTR3 UB:Z:AAACGCCCAA RE:A:E DB:Z:CELL1_N3
E100047948L1C032R02101989739 0 2 65569394 255 100M * 0 0 TTCCACCCTCCTGTGCCCCCCAATGGTTTTCTTTGCAAGTGCTTTTGGAACTAAGAAGCTAGTGTCTTGGATTAACTGATGCCTGCTAGTGCTTTCTGAT * NH:i:1 HI:i:1 AS:i:98 nM:i:0 UR:Z:AAACGCCCAA CB:Z:9818DBD8C1 GN:Z:ACTR3 UB:Z:AAACGCCCAA RE:A:E DB:Z:CELL1_N3

HengkuanLi · 2022-08-18T08:52:19Z

Hi @zigenLi. Would you mind posting a few examples please. I suspect they may have different soft-clipping in the CIGAR string

Hello, the above are some examples I pasted. Could you please check the problem?

TomSmithCGAT · 2022-08-18T09:04:41Z

In example 1, the difference is soft-clipping at the end of read2. By default, UMI-tools assumes that > 4 bases soft-clipped is indicative of a difference in splicing, which would also mean the two reads are not duplicates. You can turn this behaviour off with --soft-clipthreshold=100. See docs here

In example 2 & 3, the first read has more bases soft-clipped at the starte, e.g
example2: 5S95M vs 2S98M. UMI-tools takes account of the soft-clipping to determine the alignment start position to avoid the situation where duplicate reads have different alignments because one has low quality base calls at the start of the read which are then soft-clipped by the aligner.

On an unrelated note, your cell barcodes look strange, e.g not nucleotide sequences. Is that what you want?

TomSmithCGAT · 2022-08-18T09:08:47Z

@IanSudbery - I think we could probably do with a page/FAQ entry on the readme giving examples why two reads may appear to be duplicates by eye, but not by UMI-tools. Took me a while to realise what was going on with example 1 above. Thoughts?

IanSudbery · 2022-08-18T09:12:06Z

@zigenLi Just checking you are aware that while you have set a cell barcode, you've not told dedup to consider it during the deduplication? (--per-cell).

@TomSmithCGAT Probably wise.

HengkuanLi · 2022-08-18T09:28:52Z

In example 1, the difference is soft-clipping at the end of read2. By default, UMI-tools assumes that > 4 bases soft-clipped is indicative of a difference in splicing, which would also mean the two reads are not duplicates. You can turn this behaviour off with --soft-clipthreshold=100. See docs here

In example 2 & 3, the first read has more bases soft-clipped at the starte, e.g example2: 5S95M vs 2S98M. UMI-tools takes account of the soft-clipping to determine the alignment start position to avoid the situation where duplicate reads have different alignments because one has low quality base calls at the start of the read which are then soft-clipped by the aligner.

On an unrelated note, your cell barcodes look strange, e.g not nucleotide sequences. Is that what you want?

Yes, the barcode sequence here is encoded.

I am analyzing single-cell data. I want to deduplicate the generated bam file. The starting coordinates of the reads of the same UMI are different. Should I keep this kind of reads?

HengkuanLi · 2022-08-18T09:32:27Z

@zigenLi Just checking you are aware that while you have set a cell barcode, you've not told dedup to consider it during the deduplication? (--per-cell).

@TomSmithCGAT Probably wise.

Is the parameter "--per-cell" necessary when running the above command? What would be the impact without this parameter?

IanSudbery · 2022-08-18T09:46:29Z

I am analyzing single-cell data. I want to deduplicate the generated bam file. The starting coordinates of the reads of the same UMI are different. Should I keep this kind of reads?

It depends on the single cell protocol you have used. We always use some sort of positional information to deduplicate reads, but what information that is depends on how the reads were generated. If fragmentation of the cDNA happens before PCR (as in protocols like SMART-seq2), then one molecule will always give rise to reads with the same alignment position, and thus the position is informative as to whether a read is unique - reads with the same UMI, but different read alignment positions will have arisen from different cDNA molecules. In these cases we use UMI and (adjusted) alignment position to deduplicate reads.

However, if PCR happens before fragmentation (like in 10X), then one molecule can give rise to reads with multiple different alignment positions and thus precise alignment position is not informative - reads with the same UMI, but different read alignment positions may have arisen from the same or different cDNA molecules. However, the alignment positions for a molecule must always be in the same transcript. Thus in these cases we use UMI and transcript/gene assignment to deduplicate reads. This mode is activated using --per-gene and requires assigning reads to genes before running dedup.

Is the parameter "--per-cell" necessary when running the above command? What would be the impact without this parameter?

--per-cell tells UMI-tools to use the cell barcode when deduplicating reads as well as the UMI (and which ever positional information is specified as above). This means that two reads with the same UMI and the same position (or the same transcript/gene assignment), but coming from different cells, will be treated as independent of each other. This is generally desirable in a single-cell context.

HengkuanLi · 2022-08-19T03:26:59Z

@TomSmithCGAT @IanSudbery
Thank you very much for your help, I completed the deduplication of BAM file by adding the parameter "--per-gene".

But without parameter "--per-gene", adding parameter "--soft-clipthreshold=100" still exists reads with the same coordinates and UMI in the deduplicated bam file.

TomSmithCGAT added the Documentation label Aug 18, 2022

TomSmithCGAT self-assigned this Aug 18, 2022

TomSmithCGAT mentioned this issue Mar 7, 2024

Add FAQ entry re identification of possible duplicates reads/pairs #631

Merged

TomSmithCGAT closed this as completed Mar 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bam dedup (umi_tools dedup) #555

bam dedup (umi_tools dedup) #555

HengkuanLi commented Aug 18, 2022

TomSmithCGAT commented Aug 18, 2022

HengkuanLi commented Aug 18, 2022

HengkuanLi commented Aug 18, 2022

TomSmithCGAT commented Aug 18, 2022 •

edited

Loading

TomSmithCGAT commented Aug 18, 2022 •

edited

Loading

IanSudbery commented Aug 18, 2022

HengkuanLi commented Aug 18, 2022

HengkuanLi commented Aug 18, 2022 •

edited

Loading

IanSudbery commented Aug 18, 2022

HengkuanLi commented Aug 19, 2022

bam dedup (umi_tools dedup) #555

bam dedup (umi_tools dedup) #555

Comments

HengkuanLi commented Aug 18, 2022

TomSmithCGAT commented Aug 18, 2022

HengkuanLi commented Aug 18, 2022

HengkuanLi commented Aug 18, 2022

TomSmithCGAT commented Aug 18, 2022 • edited Loading

TomSmithCGAT commented Aug 18, 2022 • edited Loading

IanSudbery commented Aug 18, 2022

HengkuanLi commented Aug 18, 2022

HengkuanLi commented Aug 18, 2022 • edited Loading

IanSudbery commented Aug 18, 2022

HengkuanLi commented Aug 19, 2022

TomSmithCGAT commented Aug 18, 2022 •

edited

Loading

TomSmithCGAT commented Aug 18, 2022 •

edited

Loading

HengkuanLi commented Aug 18, 2022 •

edited

Loading