UMI count question #254

lishuangshuang0616 · 2024-03-28T09:45:03Z

Thank you for your software.
When we analyze single-cell 10x data and provide the UMIs, the output does not include UMI counts.
How does TRUST4 use the UMI data during assembly? In barcode_report.tsv, there is only "read_fragment_count".
Does it represent the number of reads used to assemble each sequence?

mourisl · 2024-03-28T18:53:21Z

If you provide the UMIs for single-cell data, the read count is the number of UMIs supporting the corresponding CDR3.

lishuangshuang0616 · 2024-03-29T01:01:01Z

I would like to understand how the two are related. In the cdr3.out file, the sum of read_fragment_count over all consensus IDs of a cell is much larger than the cell's number of UMIs. For example, for one cell the sum of read_fragment_count is 933, while the cell has 1615 reads and 48 UMIs. So I am confused about how UMIs are allocated, and whether only reads with the same UMI are used for overlap assembly.

mourisl · 2024-03-29T01:48:38Z

The assembly step does not use the UMI information for single-cell data. After the assembly, reads from the same UMI may be mapped to different contigs in the quantification step. As a result, some UMIs will be counted more than once across contigs. But the count for a CDR3 of a specific contig from a cell is the number of unique UMIs mapped to that CDR3.
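
A minimal sketch of that counting rule, using made-up reads for a single cell (the UMI strings and contig names below are hypothetical, and this is not TRUST4's actual code). It shows why the sum over contigs can exceed the number of distinct UMIs in the cell, while any single contig's count cannot:

```python
from collections import defaultdict

# Hypothetical toy data for one cell: one (UMI, contig) pair per read.
read_assignments = [
    ("AAAC", "contig_0"),
    ("AAAC", "contig_0"),  # same UMI, same contig: counted once
    ("AAAC", "contig_1"),  # same UMI, other contig: counted again there
    ("GGTT", "contig_0"),
    ("GGTT", "contig_1"),
    ("CCGA", "contig_1"),
]


def umi_counts_per_contig(assignments):
    """Count distinct UMIs supporting each contig."""
    umis_by_contig = defaultdict(set)
    for umi, contig in assignments:
        umis_by_contig[contig].add(umi)
    return {contig: len(umis) for contig, umis in umis_by_contig.items()}


per_contig = umi_counts_per_contig(read_assignments)
cell_umis = len({umi for umi, _ in read_assignments})
summed = sum(per_contig.values())

print(per_contig)         # distinct UMIs per contig
print(cell_umis, summed)  # distinct UMIs in the cell vs. sum over contigs
```

Under this rule a single contig's count above the cell's distinct UMI total would indicate a problem, whereas a larger sum across contigs would not.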

For the screenshot, do you mean that the count 461 (red square) is above the 48 (the UMIs for a cell), so the UMI count is wrong here?

lishuangshuang0616 · 2024-03-29T02:06:47Z

Yes, why is the count much higher at this position? Is it an error, possibly stemming from my input mistake? If it's an error, I'll reanalyze it.

mourisl · 2024-03-29T02:07:48Z

What was your running command?

lishuangshuang0616 · 2024-03-29T02:23:02Z

$ fastq-extractor -t ${threads} -f ${coordinate} -o ${outdir}/tcrbcr -1 {R2.reads} --barcode {R1cb} --UMI {R1ub} --barcodeWhitelist {barcodewhitelist} --barcodeTranslate {barcodeTranslate}
$ trust4 -t ${threads} -f ${coordinate} -u ${outdir}/tcrbcr.fq --barcode ${outdir}/tcrbcr_bc.fa --UMI ${outdir}/tcrbcr_umi.fa
$ annotator -f ${coordinate} -a ${outdir}/tcrbcr_final.out -t ${threads} -o ${outdir}/tcrbcr --barcode --UMI  --readAssignment ${outdir}/tcrbcr_assign.out -r ${outdir}/tcrbcr/tcrbcr_assembled_reads.fa --airrAlignment > ${outdir}/tcrbcr_annot.fa

I split read1's cell barcode and unique molecular identifier (UMI) into two separate fastq files, and then I needed to convert the cell IDs, so I used barcodeTranslate. This shouldn't affect anything, right?

mourisl · 2024-03-29T02:32:48Z

This should be fine. One minor thing is that the "-1" for fastq-extractor should be "-u". Otherwise, it will treat the input as a paired-end data set and throw an error about an unequal number of reads.

How did you calculate that there were 48 UMIs for this cell?

lishuangshuang0616 · 2024-03-29T02:47:08Z

from collections import defaultdict

import pandas as pd
import pysam


def cell_summary(barcode_fa, umi_fa, report):
    """Write per-cell read counts and distinct-UMI counts to a CSV report."""
    read_count_dict = defaultdict(int)
    umi_dict = defaultdict(set)
    # Both files are assumed to list the same reads in the same order.
    with pysam.FastxFile(barcode_fa) as f1, \
            pysam.FastxFile(umi_fa) as f2:
        for read_1, read_2 in zip(f1, f2):
            cb = read_1.sequence
            umi = read_2.sequence
            read_count_dict[cb] += 1
            umi_dict[cb].add(umi)

    barcode_list = list(read_count_dict.keys())
    df_count = pd.DataFrame({'cell': barcode_list,
                             'read_count': [read_count_dict[i] for i in barcode_list],
                             'UMI': [len(umi_dict[i]) for i in barcode_list]})
    df_count.sort_values(by='UMI', ascending=False, inplace=True)
    df_count.to_csv(report, sep=',', index=False)

This script computes the statistics from tcrbcr_bc.fa and tcrbcr_umi.fa.

lishuangshuang0616 · 2024-03-29T02:53:09Z

Yes, it's "-u" in my pipeline. I made a mistake in the issue description.

mourisl · 2024-03-29T03:31:58Z

This looks right to me. I just checked my run with barcode+UMI and a quick peek did not find any discrepancies.

Could you please run
"grep barcode:XXXX tcrbcr_assembled_reads.fa | cut -f6 -d' ' | sort | uniq | wc -l", where XXXX is the barcode sequence of the cell with the read count of 461? This command returns the number of distinct UMIs that TRUST4 found for that cell in the assembled reads.

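For anyone who prefers a script to the grep pipeline above, here is a rough Python equivalent. It assumes the header layout seen elsewhere in this thread (">read_id -1 start end barcode:... umi:..."). The function name and the sample reads are hypothetical. Unlike a plain grep, it matches the barcode exactly, so a barcode that is a prefix of another one is not counted by accident.

```python
import io


def distinct_umis_for_barcode(fasta_handle, barcode):
    """Return (number of reads, number of distinct UMI IDs) for one barcode."""
    umis = set()
    n_reads = 0
    for line in fasta_handle:
        if not line.startswith(">"):
            continue  # skip sequence lines
        fields = line.rstrip("\n").split(" ")
        # Collect "key:value" tags such as barcode:... and umi:...
        tags = dict(f.split(":", 1) for f in fields[1:] if ":" in f)
        if tags.get("barcode") != barcode:
            continue
        n_reads += 1
        if "umi" in tags:
            umis.add(tags["umi"])
    return n_reads, len(umis)


# Hypothetical sample in the assumed header format.
sample = io.StringIO(
    ">r1 -1 58323 66418 barcode:CELL1000_N3 umi:3844\nACGT\n"
    ">r2 -1 58323 66418 barcode:CELL1000_N3 umi:2320\nACGT\n"
    ">r3 -1 56473 66418 barcode:CELL1000_N3 umi:3844\nACGT\n"
    ">r4 -1 100 200 barcode:CELL1000_N30 umi:17\nACGT\n"
)
result = distinct_umis_for_barcode(sample, "CELL1000_N3")
print(result)
```
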
lishuangshuang0616 · 2024-03-29T04:41:05Z

$grep '' tcrbcr_assembled_reads.fa |cut -d' ' -f6 |sort | uniq |wc -l
48
$grep '' tcrbcr_assembled_reads.fa |cut -d' ' -f6 |wc -l
1609

The numbers are close to the statistics from my script.

mourisl · 2024-03-29T04:45:45Z

There might be a bug in the program then. Could you please share the _assembled_reads.fa and _final.out files with me?

mourisl · 2024-03-29T05:00:43Z

How about just the reads and the final.out entries (6 lines per contig) from the cell where you found the issue? You can send them as an email attachment or as a Google Drive/Dropbox/Baidu Wangpan link. Thank you.

lishuangshuang0616 · 2024-03-29T05:02:48Z

OK, what is your email address?

mourisl · 2024-03-29T05:03:21Z

Li.Song@dartmouth.edu

mourisl · 2024-03-31T17:39:25Z

Thank you for sharing the file. I got a reasonable UMI count in the _cdr3.out file based on the files you provided:

CELL8384_N1_0 0 TRBV4-2*01  TRBD2*02  TRBJ2-7*01  TRBC2 CTGGGGCATAACGCT TACAACTTTAAAGAACAG  TGTGCCAGCAGCCCACTGGACGGGGGAGGGGGAAACGAGCAGTACTTC  1.00  11.00 100.00  1
CELL8384_N1_946 0 TRAV5*01  * TRAJ28*01 TRAC  GACAGCTCCTCCACCTAC  ATTTTTTCAAATATGGACATG TGTGCAGAGATGGCACCTGGGGCTGGGAGTTACCAACTCACTTTC 1.00  5.00  100.00  1
CELL8384_N1_12367 0 TRBV4-2*01  TRBD1*01  TRBJ2-7*01  * * * TGTGCCAGCAGCCCACTGGACGGGGGGAGGGGGAAACGAGCAGTACTTC 0.83  1.00  97.14 0
CELL8384_N1_13572 0 TRBV4-2*01,TRBV7-8*03 TRBD2*02  TRBJ2-7*01  * * * TGTGCCAGCAGCCCACTGGACGGGGGAGGGGGAAACGAGCAGTACTTC  1.00  1.00  100.00  0
CELL8384_N1_13641 0 TRBV4-2*01  TRBD1*01  TRBJ2-7*01  * * * TGTGCCAGCAGCCCACTGGACGGGGGAGGGGAAACGAGCAGTACTTC 1.00  2.00  97.06 0
CELL8384_N1_17754 0 TRBV11-2*01,TRBV11-3*01 TRBD1*01  TRBJ2-7*01  TRBC2 * * TGTGCCAGCAGCTTAGACTACAGGTTATATGGGGAGCAGTACTTC 0.83  2.00  100.00  0
CELL8384_N1_17959 0 TRBV4-2*01,TRBV4-3*01 * * * * * TGGTGCCCGGCCCGAAGTACTGCTCGTTTCCCCCTCCCCCGTCCAGTGGGC 0.00  1.00  0.00  0
CELL8384_N1_20036 0 TRAV5*01  * * * * * TCTGCAGAGACAGATGTTTATCCTTTTTATTCAATAGAACAGTGA 0.00  1.00  0.00  0
CELL8384_N1_20872 0 * * TRAJ49*01 TRAC  * * AGGGACACCGGTAACCAGTTCTATTTT 0.00  1.00  0.00  0
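
To make this kind of check repeatable, here is a small sketch that totals the count column per cell from cdr3.out rows. It rests on two assumptions that are not confirmed in this thread: that the count is the 11th whitespace-separated field (the 1.00/11.00/5.00 column above), and that the cell barcode is the consensus ID without its trailing "_<index>". The function name is hypothetical; the three sample rows are copied from the output above.

```python
from collections import defaultdict

COUNT_COLUMN = 10  # assumed 0-based position of the count field


def counts_per_cell(lines):
    """Return (sum of counts, largest single count) per cell barcode."""
    totals = defaultdict(float)
    maxima = defaultdict(float)
    for line in lines:
        fields = line.split()
        if len(fields) <= COUNT_COLUMN:
            continue  # skip blank or truncated lines
        # Assumption: "CELL8384_N1_946" -> barcode "CELL8384_N1"
        cell = fields[0].rsplit("_", 1)[0]
        count = float(fields[COUNT_COLUMN])
        totals[cell] += count
        maxima[cell] = max(maxima[cell], count)
    return dict(totals), dict(maxima)


rows = [
    "CELL8384_N1_0 0 TRBV4-2*01 TRBD2*02 TRBJ2-7*01 TRBC2 CTGGGGCATAACGCT "
    "TACAACTTTAAAGAACAG TGTGCCAGCAGCCCACTGGACGGGGGAGGGGGAAACGAGCAGTACTTC "
    "1.00 11.00 100.00 1",
    "CELL8384_N1_946 0 TRAV5*01 * TRAJ28*01 TRAC GACAGCTCCTCCACCTAC "
    "ATTTTTTCAAATATGGACATG TGTGCAGAGATGGCACCTGGGGCTGGGAGTTACCAACTCACTTTC "
    "1.00 5.00 100.00 1",
    "CELL8384_N1_12367 0 TRBV4-2*01 TRBD1*01 TRBJ2-7*01 * * * "
    "TGTGCCAGCAGCCCACTGGACGGGGGGAGGGGGAAACGAGCAGTACTTC 0.83 1.00 97.14 0",
]
totals, maxima = counts_per_cell(rows)
print(totals, maxima)
```

Comparing the largest single count with the cell's distinct UMI total (48 for the cell discussed here) is the more telling check, since the sum alone can legitimately exceed it.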

Which version of TRUST4 did you use?

lishuangshuang0616 · 2024-04-01T00:39:17Z

TRUST4 v1.0.13-r473.
I'll try the latest version to see if this problem occurs.
Thanks.

mourisl · 2024-04-01T05:25:33Z

Thank you for sharing the larger data set. I think I've found and fixed a bug that could assign a read to another barcode in the contig abundance estimation step. Could you please pull the GitHub repo again and give it a try? This is a pretty serious bug. If the fix works on your data set, I will draft a new release soon.
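
One way to look for this kind of cross-barcode assignment in your own output is sketched below. It is only an illustration and assumes two things about the file formats: each line of the read assignment file is "read_id contig_id", with the contig ID being the barcode plus "_<index>", and each assembled-read header carries a "barcode:" tag. The function name and the second sample read are hypothetical.

```python
def find_cross_barcode_assignments(assign_lines, header_lines):
    """List reads whose own barcode differs from their contig's barcode."""
    barcode_of_read = {}
    for header in header_lines:
        fields = header.lstrip(">").split()
        tags = dict(f.split(":", 1) for f in fields[1:] if ":" in f)
        if "barcode" in tags:
            barcode_of_read[fields[0]] = tags["barcode"]

    mismatches = []
    for line in assign_lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        read_id, contig_id = parts[0], parts[1]
        # Assumption: "CELL1000_N3_509" -> barcode "CELL1000_N3"
        contig_barcode = contig_id.rsplit("_", 1)[0]
        read_barcode = barcode_of_read.get(read_id)
        if read_barcode is not None and read_barcode != contig_barcode:
            mismatches.append((read_id, read_barcode, contig_id))
    return mismatches


# The second read is a made-up example of a cross-barcode assignment.
assign = [
    "E200004414L1C001R03004110063    CELL1000_N3_509",
    "read_x    CELL2000_N1_7",
]
headers = [
    ">E200004414L1C001R03004110063 -1 58323 66418 barcode:CELL1000_N3 umi:3844",
    ">read_x -1 10 20 barcode:CELL1000_N3 umi:12",
]
mismatches = find_cross_barcode_assignments(assign, headers)
print(mismatches)
```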

lishuangshuang0616 · 2024-04-01T06:02:37Z

I tested a larger dataset and obtained results for several cell IDs. Now the UMIs are working properly.
Many thanks for your help, Dr. Li.

lishuangshuang0616 · 2024-04-01T07:00:19Z

grep 'CELL1000_N3' tcrbcr_assign.out|head -n 10
E200004414L1C001R03004110063    CELL1000_N3_509
E200004414L1C002R02602089629    CELL1000_N3_509
E200004414L1C003R00300997369    CELL1000_N3_509
E200004414L1C002R03201437208    CELL1000_N3_509
E200004414L1C003R00604427789    CELL1000_N3_509
E200004414L1C002R01201985373    CELL1000_N3_509
E200004414L1C002R00603037193    CELL1000_N3_509
E200004414L1C003R02601430041    CELL1000_N3_509
E200004414L1C002R00904293832    CELL1000_N3_509
E200004414L1C002R01800009866    CELL1000_N3_509

grep 'CELL1000_N3' tcrbcr_assembled_reads.fa|head -n 10
>E200004414L1C001R03004110063 -1 58323 66418 barcode:CELL1000_N3 umi:3844
>E200004414L1C002R02602089629 -1 58323 66418 barcode:CELL1000_N3 umi:2320
>E200004414L1C003R00300997369 -1 58323 66418 barcode:CELL1000_N3 umi:3033
>E200004414L1C002R03201437208 -1 56473 66418 barcode:CELL1000_N3 umi:3941
>E200004414L1C003R00604427789 -1 56473 66418 barcode:CELL1000_N3 umi:328
>E200004414L1C002R01201985373 -1 56333 66418 barcode:CELL1000_N3 umi:3370
>E200004414L1C002R00603037193 -1 56231 66418 barcode:CELL1000_N3 umi:6312
>E200004414L1C003R02601430041 -1 55874 66418 barcode:CELL1000_N3 umi:7102
>E200004414L1C002R00904293832 -1 55874 66307 barcode:CELL1000_N3 umi:345
>E200004414L1C002R01800009866 -1 55874 66307 barcode:CELL1000_N3 umi:2299

grep -A 1 'E200004414L1C001R03004110063' /tcrbcr_umi.fa
>E200004414L1C001R03004110063
CATAACTCAG
grep -A 1 'E200004414L1C002R02602089629' tcrbcr_umi.fa
>E200004414L1C002R02602089629
CATAACTTAG
grep -A 1 'E200004414L1C003R00300997369' tcrbcr_umi.fa
>E200004414L1C003R00300997369
CATAACTCAG
grep -A 1 'E200004414L1C002R03201437208' tcrbcr_umi.fa
>E200004414L1C002R03201437208
CATAACTCAG

Why do reads with the same UMI sequence get different numeric UMI values after assembly?

mourisl · 2024-04-01T14:36:55Z

They should be consistent. Do you see those issues from the cell barcode you shared with me?

lishuangshuang0616 · 2024-04-01T15:28:54Z

I found that the issue was caused by my own mistake. I included 'missing_barcode' reads in the analysis, which caused the problem. Removing them fixes it.
Thank you.

mourisl · 2024-04-01T19:57:03Z

It's still quite strange. "E200004414L1C001R03004110063" and "E200004414L1C003R00300997369" have the same barcode and UMI, but their converted numeric UMI values are different. Their numeric UMIs should not be affected by the "missing_barcode" issue. Or do the UMIs correspond to other reads?
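
A sketch of how this consistency check could be automated: group reads by (barcode, UMI sequence) and report any group that carries more than one numeric "umi:" value. The header layout is taken from the output pasted above; the function name is hypothetical, and the UMI sequences are the ones shown for these four reads.

```python
from collections import defaultdict


def inconsistent_umis(headers, umi_seq_of_read):
    """Map (barcode, UMI sequence) to its numeric IDs when there are several."""
    ids_by_key = defaultdict(set)
    for header in headers:
        fields = header.lstrip(">").split()
        if not fields:
            continue
        tags = dict(f.split(":", 1) for f in fields[1:] if ":" in f)
        seq = umi_seq_of_read.get(fields[0])
        if seq is None or "barcode" not in tags or "umi" not in tags:
            continue
        ids_by_key[(tags["barcode"], seq)].add(tags["umi"])
    return {key: sorted(ids) for key, ids in ids_by_key.items() if len(ids) > 1}


headers = [
    ">E200004414L1C001R03004110063 -1 58323 66418 barcode:CELL1000_N3 umi:3844",
    ">E200004414L1C002R02602089629 -1 58323 66418 barcode:CELL1000_N3 umi:2320",
    ">E200004414L1C003R00300997369 -1 58323 66418 barcode:CELL1000_N3 umi:3033",
    ">E200004414L1C002R03201437208 -1 56473 66418 barcode:CELL1000_N3 umi:3941",
]
umi_seq_of_read = {
    "E200004414L1C001R03004110063": "CATAACTCAG",
    "E200004414L1C002R02602089629": "CATAACTTAG",
    "E200004414L1C003R00300997369": "CATAACTCAG",
    "E200004414L1C002R03201437208": "CATAACTCAG",
}
conflicts = inconsistent_umis(headers, umi_seq_of_read)
print(conflicts)
```

An empty result would mean every (barcode, UMI sequence) pair maps to a single numeric value, which is the expected behavior.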

mourisl · 2024-04-02T01:30:55Z

I think I've found the issue. Could you please pull the updated GitHub repo and give it a try? Please let me know whether it works when there are "missing_barcode" entries in the data. Thank you again for scrutinizing TRUST4's results.

lishuangshuang0616 · 2024-04-02T04:37:09Z

After using the new repo, the results match those obtained after removing the missing_barcode.
The information in several files also corresponds correctly.
Thank you very much for your help.

lishuangshuang0616 closed this as completed Apr 1, 2024

lishuangshuang0616 reopened this Apr 1, 2024

lishuangshuang0616 closed this as completed Apr 1, 2024

lishuangshuang0616 reopened this Apr 1, 2024

lishuangshuang0616 closed this as completed Apr 2, 2024

andreas-wilm mentioned this issue Nov 13, 2024

Question: use on demultiplexed single cell ONT data #326

Closed
