UMI count question #254

Closed
lishuangshuang0616 opened this issue Mar 28, 2024 · 25 comments

@lishuangshuang0616

Thank you for your software.
When analyzing 10x single-cell data, the output does not include UMI counts even though we provide the UMI data.
How does TRUST4 use the UMI data during the assembly process? In barcode_report.tsv, there is only "read_fragment_count".
Does it represent the number of reads used to assemble each sequence?

@mourisl
Collaborator

mourisl commented Mar 28, 2024

If you provide UMIs for single-cell data, the read count is the number of UMIs supporting the corresponding CDR3.

@lishuangshuang0616
Author

I would like to know what kind of relationship exists between them, because in the cdr3.out file, I noticed that the sum of the read_fragment_count for all consensus IDs of a cell is much larger than the number of UMIs. For example, for a certain cell, the sum of read_fragment_count is 933, but the numbers of reads and UMIs are 1615 and 48, respectively. So, I am confused about how UMIs are allocated and whether only reads with the same UMIs are used for overlapping assembly.
[image: screenshot of the cell's entries in cdr3.out]

@mourisl
Collaborator

mourisl commented Mar 29, 2024

The assembly step does not use the UMI information for single-cell data. After the assembly, reads from the same UMI may be mapped to different contigs in the quantification step. As a result, some UMIs will be overcounted across contigs. But the count for a CDR3 of a specific contig from a cell is the number of unique UMIs that mapped to that CDR3.
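
The counting rule described above can be illustrated with a minimal Python sketch. This is not TRUST4's code: the UMI and contig names are hypothetical toy data, and `umi_counts_per_contig` is an illustrative helper. It shows why the sum over contigs can exceed the number of distinct UMIs in a cell, while the count for any single contig cannot.

```python
from collections import defaultdict

# Hypothetical toy data: (umi, contig) pairs for the reads of one cell.
# A UMI appears under several contigs when its reads map to more than one.
read_assignments = [
    ("UMI_A", "contig_0"),
    ("UMI_A", "contig_0"),
    ("UMI_A", "contig_1"),
    ("UMI_B", "contig_0"),
    ("UMI_B", "contig_1"),
    ("UMI_C", "contig_1"),
]


def umi_counts_per_contig(assignments):
    """Return {contig: number of distinct UMIs with a read on that contig}."""
    umis = defaultdict(set)
    for umi, contig in assignments:
        umis[contig].add(umi)
    return {contig: len(umi_set) for contig, umi_set in umis.items()}


per_contig = umi_counts_per_contig(read_assignments)
distinct_umis_in_cell = len({umi for umi, _ in read_assignments})

print(per_contig)                # {'contig_0': 2, 'contig_1': 3}
print(sum(per_contig.values()))  # 5: UMI_A and UMI_B are counted twice
print(distinct_umis_in_cell)     # 3
```

Under this rule, a single contig's count larger than the cell's distinct UMI total (such as 461 against 48) cannot come from overcounting alone.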

For the screenshot, do you mean that the count 461 (red square) is above the 48 (the UMIs for a cell), so the UMI count is wrong here?

@lishuangshuang0616
Author

Yes, why is the count much higher at this position? Is it an error, possibly stemming from my input mistake? If it's an error, I'll reanalyze it.

@mourisl
Collaborator

mourisl commented Mar 29, 2024

What was your running command?

@lishuangshuang0616
Author

$ fastq-extractor -t ${threads} -f ${coordinate} -o ${outdir}/tcrbcr -1 {R2.reads} --barcode {R1cb} --UMI {R1ub} --barcodeWhitelist {barcodewhitelist} --barcodeTranslate {barcodeTranslate}
$ trust4 -t ${threads} -f ${coordinate} -u ${outdir}/tcrbcr.fq --barcode ${outdir}/tcrbcr_bc.fa --UMI ${outdir}/tcrbcr_umi.fa
$ annotator -f ${coordinate} -a ${outdir}/tcrbcr_final.out -t ${threads} -o ${outdir}/tcrbcr --barcode --UMI  --readAssignment ${outdir}/tcrbcr_assign.out -r ${outdir}/tcrbcr/tcrbcr_assembled_reads.fa --airrAlignment > ${outdir}/tcrbcr_annot.fa

I split read1's cell barcode and unique molecular identifier (UMI) into two separate fastq files, and then I needed to convert the cell IDs, so I used barcodeTranslate. This shouldn't affect anything, right?

@mourisl
Collaborator

mourisl commented Mar 29, 2024

This should be fine. One minor thing is that the "-1" for fastq-extractor should be "-u". Otherwise, it will think this is a paired-end data set and throw an error about an unequal number of reads.

How did you calculate that there were 48 UMIs for this cell?

@lishuangshuang0616
Author

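from collections import defaultdict

import pandas as pd
import pysam


def cell_summary(barcode_fa, umi_fa, report):
    """Write per-cell read counts and distinct-UMI counts to a CSV report."""
    read_count_dict = defaultdict(int)  # cell barcode -> number of reads
    umi_dict = defaultdict(set)         # cell barcode -> set of UMI sequences
    # Both files must list the same reads in the same order.
    with pysam.FastxFile(barcode_fa) as f1, \
            pysam.FastxFile(umi_fa) as f2:

        for read_1, read_2 in zip(f1, f2):
            cb = read_1.sequence
            umi = read_2.sequence
            read_count_dict[cb] += 1
            umi_dict[cb].add(umi)

    barcode_list = list(read_count_dict.keys())
    df_count = pd.DataFrame({'cell': barcode_list,
                             'read_count': [read_count_dict[i] for i in barcode_list],
                             'UMI': [len(umi_dict[i]) for i in barcode_list]})
    # Cells with the most distinct UMIs come first.
    df_count.sort_values(by='UMI', ascending=False, inplace=True)
    df_count.to_csv(report, sep=',', index=False)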

The script computes these statistics from tcrbcr_bc.fa and tcrbcr_umi.fa.

@lishuangshuang0616
Author

> This should be fine. One minor thing is that the "-1" for fastq-extractor should be "-u". Otherwise, it will think this is a paired-end data set and throw an error about an unequal number of reads.
>
> How did you calculate that there were 48 UMIs for this cell?

Yes, it's "-u" in my pipeline. I made a mistake in the issue description.

@mourisl
Collaborator

mourisl commented Mar 29, 2024

The script looks right to me. I just checked my own run with barcode+UMI, and a quick look did not find any discrepancies.

Could you please run
"grep barcode:XXXX tcrbcr_assembled_reads.fa | cut -f6 -d' ' | sort | uniq | wc -l", where XXXX is the barcode sequence of the cell with the read count of 461? This command returns the number of distinct UMIs found by TRUST4 in the assembled reads.

@lishuangshuang0616
Author

$grep '' tcrbcr_assembled_reads.fa |cut -d' ' -f6 |sort | uniq |wc -l
48
$grep '' tcrbcr_assembled_reads.fa |cut -d' ' -f6 |wc -l
1609

The numbers are close to the statistics from my script.
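
The grep pipeline above can also be written as a small Python helper, which makes the matching rule explicit. This is a sketch under stated assumptions: headers are taken to look like the `>read_id ... barcode:X umi:N` lines that appear in this thread, and `count_reads_and_umis` is an illustrative name, not part of TRUST4.

```python
def count_reads_and_umis(header_lines, barcode):
    """Return (number of reads, number of distinct UMIs) for one barcode.

    Assumes space-separated FASTA headers carrying 'barcode:' and 'umi:'
    fields, as in the assembled_reads.fa excerpts shown in this thread.
    """
    n_reads = 0
    umis = set()
    for line in header_lines:
        if not line.startswith(">"):
            continue  # skip sequence lines
        read_barcode = None
        read_umi = None
        # Skip the first field (the read name), which may itself contain ':'.
        for field in line.split()[1:]:
            if field.startswith("barcode:"):
                read_barcode = field[len("barcode:"):]
            elif field.startswith("umi:"):
                read_umi = field[len("umi:"):]
        # Exact match, unlike grep, which would also match longer barcodes
        # that merely start with the same characters.
        if read_barcode == barcode:
            n_reads += 1
            umis.add(read_umi)
    return n_reads, len(umis)
```

For a real run, the lines could come from `open("tcrbcr_assembled_reads.fa")`. The first returned value corresponds to the `wc -l` count without `sort | uniq`, the second to the count with it.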

@mourisl
Collaborator

mourisl commented Mar 29, 2024

There might be a bug in the program then. Could you please share the _assembled_reads.fa and _final.out files with me?

@mourisl
Collaborator

mourisl commented Mar 29, 2024

How about just the reads and final.out entries (6 lines per contig) from the cell that you found had the issue? You can send them either as an email attachment or through a Google Drive/Dropbox/Baidu Wangpan link. Thank you.

@lishuangshuang0616
Author

OK, what is your email address?

@mourisl
Collaborator

mourisl commented Mar 29, 2024

Li.Song@dartmouth.edu

@mourisl
Collaborator

mourisl commented Mar 31, 2024

Thank you for sharing the file. I got a reasonable UMI count in the _cdr3.out file based on the files you provided:

CELL8384_N1_0 0 TRBV4-2*01  TRBD2*02  TRBJ2-7*01  TRBC2 CTGGGGCATAACGCT TACAACTTTAAAGAACAG  TGTGCCAGCAGCCCACTGGACGGGGGAGGGGGAAACGAGCAGTACTTC  1.00  11.00 100.00  1
CELL8384_N1_946 0 TRAV5*01  * TRAJ28*01 TRAC  GACAGCTCCTCCACCTAC  ATTTTTTCAAATATGGACATG TGTGCAGAGATGGCACCTGGGGCTGGGAGTTACCAACTCACTTTC 1.00  5.00  100.00  1
CELL8384_N1_12367 0 TRBV4-2*01  TRBD1*01  TRBJ2-7*01  * * * TGTGCCAGCAGCCCACTGGACGGGGGGAGGGGGAAACGAGCAGTACTTC 0.83  1.00  97.14 0
CELL8384_N1_13572 0 TRBV4-2*01,TRBV7-8*03 TRBD2*02  TRBJ2-7*01  * * * TGTGCCAGCAGCCCACTGGACGGGGGAGGGGGAAACGAGCAGTACTTC  1.00  1.00  100.00  0
CELL8384_N1_13641 0 TRBV4-2*01  TRBD1*01  TRBJ2-7*01  * * * TGTGCCAGCAGCCCACTGGACGGGGGAGGGGAAACGAGCAGTACTTC 1.00  2.00  97.06 0
CELL8384_N1_17754 0 TRBV11-2*01,TRBV11-3*01 TRBD1*01  TRBJ2-7*01  TRBC2 * * TGTGCCAGCAGCTTAGACTACAGGTTATATGGGGAGCAGTACTTC 0.83  2.00  100.00  0
CELL8384_N1_17959 0 TRBV4-2*01,TRBV4-3*01 * * * * * TGGTGCCCGGCCCGAAGTACTGCTCGTTTCCCCCTCCCCCGTCCAGTGGGC 0.00  1.00  0.00  0
CELL8384_N1_20036 0 TRAV5*01  * * * * * TCTGCAGAGACAGATGTTTATCCTTTTTATTCAATAGAACAGTGA 0.00  1.00  0.00  0
CELL8384_N1_20872 0 * * TRAJ49*01 TRAC  * * AGGGACACCGGTAACCAGTTCTATTTT 0.00  1.00  0.00  0
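
To compare rows like the ones above against a cell's UMI total, the counts can be summed per cell. The sketch below rests on two assumptions that should be checked against the TRUST4 documentation: each row has 13 whitespace-separated fields with read_fragment_count as the 11th, and the cell ID is the consensus ID with its trailing `_<index>` removed. `fragment_count_per_cell` is an illustrative name.

```python
from collections import defaultdict


def fragment_count_per_cell(cdr3_lines):
    """Sum the read_fragment_count column of cdr3.out rows per cell.

    Assumption: 13 whitespace-separated fields per row, with the count in
    the 11th field; consensus IDs look like '<cell>_<index>'.
    """
    totals = defaultdict(float)
    for line in cdr3_lines:
        fields = line.split()
        if len(fields) < 13:
            continue  # skip blank or malformed rows
        cell = fields[0].rsplit("_", 1)[0]  # drop the trailing contig index
        totals[cell] += float(fields[10])
    return dict(totals)
```

Applied to the nine rows above, the per-contig counts (11, 5, 1, 1, 2, 2, 1, 1, 1) add up to 25 for CELL8384_N1, with no single contig exceeding the cell's 48 UMIs.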

Which version of TRUST4 did you use?

@lishuangshuang0616
Author

TRUST4 v1.0.13-r473.
I'll try the latest version to see if this problem occurs.
Thanks.

@mourisl
Collaborator

mourisl commented Apr 1, 2024

Thank you for sharing the larger data set. I think I've found and fixed the bug, which could assign a read to another barcode in the contig abundance estimation step. Could you please pull the GitHub repo again and give it a try? This is a pretty serious bug. If the fix works on your data set, I will draft a new release soon.
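
A read being assigned to a contig from another barcode can be detected from the output files. The sketch below is not part of TRUST4 and makes assumptions based on the excerpts in this thread: the assignment file has a read ID and a contig ID in its first two whitespace-separated columns, contig IDs have the form `<barcode>_<index>`, and the assembled-read headers carry a `barcode:` field. The function name is illustrative.

```python
def find_cross_barcode_assignments(assign_lines, header_lines):
    """Return (read_id, read_barcode, contig_id) for reads assigned to a
    contig whose ID does not start with the read's own barcode."""
    # Map each read ID to the barcode found in its FASTA header.
    read_barcode = {}
    for line in header_lines:
        if not line.startswith(">"):
            continue
        fields = line[1:].split()
        for field in fields[1:]:
            if field.startswith("barcode:"):
                read_barcode[fields[0]] = field[len("barcode:"):]

    mismatches = []
    for line in assign_lines:
        parts = line.split()
        if len(parts) < 2:
            continue
        read_id, contig_id = parts[0], parts[1]
        barcode = read_barcode.get(read_id)
        # Assumed contig ID layout: '<barcode>_<index>'.
        if barcode is not None and contig_id.rsplit("_", 1)[0] != barcode:
            mismatches.append((read_id, barcode, contig_id))
    return mismatches
```

An empty result means every assigned read sits on a contig from its own cell. Any entries point to reads affected by the kind of mix-up described above.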

@lishuangshuang0616
Author

I tested a larger dataset and obtained results for several cell IDs. Now the UMIs are working properly.
Many thanks for your help, Dr. Li.

@lishuangshuang0616
Author

grep 'CELL1000_N3' tcrbcr_assign.out|head -n 10
E200004414L1C001R03004110063    CELL1000_N3_509
E200004414L1C002R02602089629    CELL1000_N3_509
E200004414L1C003R00300997369    CELL1000_N3_509
E200004414L1C002R03201437208    CELL1000_N3_509
E200004414L1C003R00604427789    CELL1000_N3_509
E200004414L1C002R01201985373    CELL1000_N3_509
E200004414L1C002R00603037193    CELL1000_N3_509
E200004414L1C003R02601430041    CELL1000_N3_509
E200004414L1C002R00904293832    CELL1000_N3_509
E200004414L1C002R01800009866    CELL1000_N3_509
grep 'CELL1000_N3' tcrbcr_assembled_reads.fa|head -n 10
>E200004414L1C001R03004110063 -1 58323 66418 barcode:CELL1000_N3 umi:3844
>E200004414L1C002R02602089629 -1 58323 66418 barcode:CELL1000_N3 umi:2320
>E200004414L1C003R00300997369 -1 58323 66418 barcode:CELL1000_N3 umi:3033
>E200004414L1C002R03201437208 -1 56473 66418 barcode:CELL1000_N3 umi:3941
>E200004414L1C003R00604427789 -1 56473 66418 barcode:CELL1000_N3 umi:328
>E200004414L1C002R01201985373 -1 56333 66418 barcode:CELL1000_N3 umi:3370
>E200004414L1C002R00603037193 -1 56231 66418 barcode:CELL1000_N3 umi:6312
>E200004414L1C003R02601430041 -1 55874 66418 barcode:CELL1000_N3 umi:7102
>E200004414L1C002R00904293832 -1 55874 66307 barcode:CELL1000_N3 umi:345
>E200004414L1C002R01800009866 -1 55874 66307 barcode:CELL1000_N3 umi:2299
grep -A 1 'E200004414L1C001R03004110063' /tcrbcr_umi.fa
>E200004414L1C001R03004110063
CATAACTCAG
grep -A 1 'E200004414L1C002R02602089629' tcrbcr_umi.fa
>E200004414L1C002R02602089629
CATAACTTAG
grep -A 1 'E200004414L1C003R00300997369' tcrbcr_umi.fa
>E200004414L1C003R00300997369
CATAACTCAG
grep -A 1 'E200004414L1C002R03201437208' tcrbcr_umi.fa
>E200004414L1C002R03201437208
CATAACTCAG

Why do reads with the same UMI sequence get different numeric UMI values after assembly?

@mourisl
Collaborator

mourisl commented Apr 1, 2024

They should be consistent. Do you see those issues from the cell barcode you shared with me?

@lishuangshuang0616
Author

I found that the issue was caused by my own mistake. I included 'missing_barcode' in the analysis, which caused the problem. Removing it resolves the issue.
Thank you.

@mourisl
Collaborator

mourisl commented Apr 1, 2024

It's still quite strange. "E200004414L1C001R03004110063" and "E200004414L1C003R00300997369" have the same barcode and UMI, but their converted numeric UMI values are different. Their numeric UMIs should not be affected by the "missing_barcode" issue. Or do the UMIs correspond to other reads?
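
The inconsistency described here can be checked mechanically by grouping reads on (barcode, UMI sequence) and flagging groups with more than one numeric UMI. The sketch below uses the four reads quoted earlier in the thread. `inconsistent_umi_ids` is an illustrative helper, not a TRUST4 function.

```python
from collections import defaultdict


def inconsistent_umi_ids(reads):
    """Group reads by (barcode, UMI sequence) and return the groups that
    were given more than one numeric UMI value.

    reads: iterable of (read_id, barcode, umi_sequence, numeric_umi).
    """
    ids = defaultdict(set)
    for _read_id, barcode, umi_seq, umi_num in reads:
        ids[(barcode, umi_seq)].add(umi_num)
    return {key: sorted(nums) for key, nums in ids.items() if len(nums) > 1}


# UMI sequences from tcrbcr_umi.fa and numeric UMIs from
# tcrbcr_assembled_reads.fa, as posted earlier in this thread.
reads = [
    ("E200004414L1C001R03004110063", "CELL1000_N3", "CATAACTCAG", 3844),
    ("E200004414L1C002R02602089629", "CELL1000_N3", "CATAACTTAG", 2320),
    ("E200004414L1C003R00300997369", "CELL1000_N3", "CATAACTCAG", 3033),
    ("E200004414L1C002R03201437208", "CELL1000_N3", "CATAACTCAG", 3941),
]

print(inconsistent_umi_ids(reads))
# {('CELL1000_N3', 'CATAACTCAG'): [3033, 3844, 3941]}
```

Three reads share the UMI sequence CATAACTCAG within one cell yet carry three different numeric values, which is the symptom in question.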

@mourisl
Collaborator

mourisl commented Apr 2, 2024

I think I've found the issue. Could you please pull the updated GitHub repo and give it a try? Please let me know whether it works when "missing_barcode" entries are present in the data. Thank you again for scrutinizing TRUST4's results.

@lishuangshuang0616
Author

After using the new repo, the results match those obtained after removing the missing_barcode.
The information in several files also corresponds correctly.
Thank you very much for your help.
