Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bam dedup (umi_tools dedup) #555

Closed
HengkuanLi opened this issue Aug 18, 2022 · 10 comments
Closed

bam dedup (umi_tools dedup) #555

HengkuanLi opened this issue Aug 18, 2022 · 10 comments
Assignees

Comments

@HengkuanLi
Copy link

umi_tools dedup --extract-umi-method=tag --umi-tag=UB --cell-tag=CB
--method unique
-I input.bam -S output.bam
--log=LOGFILE

Hello, I used the above command to deduplicate the bam file, but there are still reads in the deduplicated bam file with the same coordinates and UMI

@TomSmithCGAT
Copy link
Member

Hi @zigenLi. Would you mind posting a few examples please. I suspect they may have different soft-clipping in the CIGAR string

@HengkuanLi
Copy link
Author

example 1:
E100047948L1C007R01003932541 0 10 103122537 3 87M4D13M * 0 0 GAATTTTGCGAAATGTTGTGACAGATATTTGTCAGAATTATTGGAAATATGTTATGGTGGATGGCATTTTTTCATATAAAATTATATTTCTGTTGAGACA * NH:i:2 HI:i:1 AS:i:86 nM:i:1 UR:Z:GTGCCCTCCA CB:Z:045E30A27A GN:Z:B2M UB:Z:GTGCCCTCCA RE:A:E DB:Z:CELL1_N3
E100047948L1C010R02000347603 0 10 103122537 255 89M11S * 0 0 GAATTTTGCAAAATGTTGTGACAGATATTTGTCAGAATTATTGGAAATATGTTATGGTGGATGGCATTTTTTCATATAAAATTATATTTCTGTTGTGTCA * NH:i:1 HI:i:1 AS:i:87 nM:i:0 UR:Z:GTGCCCTCCA CB:Z:045E30A27A GN:Z:B2M UB:Z:GTGCCCTCCA RE:A:E DB:Z:CELL1_N3

example 2:
E100047948L1C042R02203017007 0 11 74998101 3 5S95M * 0 0 CTATTTTTTTTTTTTTTTTAATGCTGAATATTGTGATTTCTTATTTGAGATTGAAAAAGACCCACTTGAAACTGATATATGTTAGAATATATTATGTTAA * NH:i:2 HI:i:1 AS:i:93 nM:i:0 UR:Z:TCGATGCAAG CB:Z:7D43A87E2B GN:Z:SF3B6 UB:Z:TCGATGCAAG RE:A:E DB:Z:CELL1_N3
E100047948L1C023R03300929063 0 11 74998101 3 2S98M * 0 0 ATTTTTTTTTTTTTTTAATGCTGAATATTGTGATTTCTTATTTGAGATTGAAAAAGACCCACTTGAAACTGATATATGTTAGAATATATTATGTTAATAA * NH:i:2 HI:i:1 AS:i:96 nM:i:0 UR:Z:TCGATGCAAG CB:Z:7D43A87E2B GN:Z:SF3B6 UB:Z:TCGATGCAAG RE:A:E DB:Z:CELL1_N3

example 3:
E100047948L1C039R02503252502 0 2 65569394 255 13S87M * 0 0 TGGCCTCCGACTTTTCCACCCTCCTGTGCCCCCCAATGGTTTTCTTTGCAAGTGCTTTTGGAACTAAGAAGCTAGTGTCTTTGATTAACTGTTGCCTGCT * NH:i:1 HI:i:1 AS:i:81 nM:i:2 UR:Z:AAACGCCCAA CB:Z:9818DBD8C1 GN:Z:ACTR3 UB:Z:AAACGCCCAA RE:A:E DB:Z:CELL1_N3
E100047948L1C032R02101989739 0 2 65569394 255 100M * 0 0 TTCCACCCTCCTGTGCCCCCCAATGGTTTTCTTTGCAAGTGCTTTTGGAACTAAGAAGCTAGTGTCTTGGATTAACTGATGCCTGCTAGTGCTTTCTGAT * NH:i:1 HI:i:1 AS:i:98 nM:i:0 UR:Z:AAACGCCCAA CB:Z:9818DBD8C1 GN:Z:ACTR3 UB:Z:AAACGCCCAA RE:A:E DB:Z:CELL1_N3

@HengkuanLi
Copy link
Author

Hi @zigenLi. Would you mind posting a few examples please. I suspect they may have different soft-clipping in the CIGAR string

Hello, the above are some examples I pasted. Could you please check the problem?

@TomSmithCGAT
Copy link
Member

TomSmithCGAT commented Aug 18, 2022

In example 1, the difference is soft-clipping at the end of read2. By default, UMI-tools assumes that > 4 bases soft-clipped is indicative of a difference in splicing, which would also mean the two reads are not duplicates. You can turn this behaviour off with --soft-clipthreshold=100. See docs here

In example 2 & 3, the first read has more bases soft-clipped at the starte, e.g
example2: 5S95M vs 2S98M. UMI-tools takes account of the soft-clipping to determine the alignment start position to avoid the situation where duplicate reads have different alignments because one has low quality base calls at the start of the read which are then soft-clipped by the aligner.

On an unrelated note, your cell barcodes look strange, e.g not nucleotide sequences. Is that what you want?

@TomSmithCGAT
Copy link
Member

TomSmithCGAT commented Aug 18, 2022

@IanSudbery - I think we could probably do with a page/FAQ entry on the readme giving examples why two reads may appear to be duplicates by eye, but not by UMI-tools. Took me a while to realise what was going on with example 1 above. Thoughts?

@IanSudbery
Copy link
Member

@zigenLi Just checking you are aware that while you have set a cell barcode, you've not told dedup to consider it during the deduplication? (--per-cell).

@TomSmithCGAT Probably wise.

@HengkuanLi
Copy link
Author

In example 1, the difference is soft-clipping at the end of read2. By default, UMI-tools assumes that > 4 bases soft-clipped is indicative of a difference in splicing, which would also mean the two reads are not duplicates. You can turn this behaviour off with --soft-clipthreshold=100. See docs here

In example 2 & 3, the first read has more bases soft-clipped at the starte, e.g example2: 5S95M vs 2S98M. UMI-tools takes account of the soft-clipping to determine the alignment start position to avoid the situation where duplicate reads have different alignments because one has low quality base calls at the start of the read which are then soft-clipped by the aligner.

On an unrelated note, your cell barcodes look strange, e.g not nucleotide sequences. Is that what you want?

Yes, the barcode sequence here is encoded.

I am analyzing single-cell data. I want to deduplicate the generated bam file. The starting coordinates of the reads of the same UMI are different. Should I keep this kind of reads?

@HengkuanLi
Copy link
Author

HengkuanLi commented Aug 18, 2022

@zigenLi Just checking you are aware that while you have set a cell barcode, you've not told dedup to consider it during the deduplication? (--per-cell).

@TomSmithCGAT Probably wise.

Is the parameter "--per-cell" necessary when running the above command? What would be the impact without this parameter?

@IanSudbery
Copy link
Member

I am analyzing single-cell data. I want to deduplicate the generated bam file. The starting coordinates of the reads of the same UMI are different. Should I keep this kind of reads?

It depends on the single cell protocol you have used. We always use some sort of positional information to deduplicate reads, but what information that is depends on how the reads were generated. If fragmentation of the cDNA happens before PCR (as in protocols like SMART-seq2), then one molecule will always give rise to reads with the same alignment position, and thus the position is informative as to whether a read is unique - reads with the same UMI, but different read alignment positions will have arisen from different cDNA molecules. In these cases we use UMI and (adjusted) alignment position to deduplicate reads.

However, if PCR happens before fragmentation (like in 10X), then one molecule can give rise to reads with multiple different alignment positions and thus precise alignment position is not informative - reads with the same UMI, but different read alignment positions may have arisen from the same or different cDNA molecules. However, the alignment positions for a molecule must always be in the same transcript. Thus in these cases we use UMI and transcript/gene assignment to deduplicate reads. This mode is activated using --per-gene and requires assigning reads to genes before running dedup.

Is the parameter "--per-cell" necessary when running the above command? What would be the impact without this parameter?

--per-cell tells UMI-tools to use the cell barcode when deduplicating reads as well as the UMI (and which ever positional information is specified as above). This means that two reads with the same UMI and the same position (or the same transcript/gene assignment), but coming from different cells, will be treated as independent of each other. This is generally desirable in a single-cell context.

@HengkuanLi
Copy link
Author

@TomSmithCGAT @IanSudbery
Thank you very much for your help, I completed the deduplication of BAM file by adding the parameter "--per-gene".

But without parameter "--per-gene", adding parameter "--soft-clipthreshold=100" still exists reads with the same coordinates and UMI in the deduplicated bam file.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants