Differences in counts compared to MIXCR results, and out-of-frame CDR3 handling #248
Comments
When barcodes are provided (more accurately, UMIs here), the count is the number of barcodes containing the clonotype. The count column is therefore more likely to correspond to the "uniqueMoleculeIdentifier" column. I think MiXCR may apply filters when a UMI has too few reads, since such UMIs may stem from sequencing errors or other sequencing artifacts (you may need to double-check their documentation). The current TRUST4 has no such filters, so seeing more molecules in TRUST4 than in MiXCR is expected.
If you want to impose a filter, I think you can first remove the barcodes with too few reads from the ${prefix}_barcode_report.tsv file to create another tsv file, e.g. filtered.tsv, using the column with comma-separated fields:
, where read_cnt is the number of reads supporting the CDR3. Then you can run the
I did not add these filters because the parameter settings were designed for scRNA-seq gene expression data, where many cells have very few reads coming from the VDJ region, so filtering those cells might be too aggressive. By the way, I have added the option "--clean" to the run-trust4 wrapper in the GitHub repo to clean up the intermediate files.
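A minimal sketch of the pre-filtering step described above. It assumes that the chain columns of the barcode report sit at zero-based positions 2 and 3, that each holds comma-separated fields with read_cnt at zero-based position 6, and that an absent chain is written as `*`. These positions, the function name, and the demo rows are illustrative assumptions rather than part of TRUST4, so check them against the header of your own file.

```python
def filter_barcode_report(lines, min_reads=2, chain_cols=(2, 3), read_cnt_idx=6):
    """Yield header lines plus rows whose best-supported chain has >= min_reads reads.

    chain_cols and read_cnt_idx are ASSUMED positions; adjust to your file.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            # Keep the header untouched.
            yield line
            continue
        cols = line.split("\t")
        best = 0.0
        for c in chain_cols:
            if c >= len(cols) or cols[c] in ("*", ""):
                continue  # no chain reported in this column
            fields = cols[c].split(",")
            try:
                best = max(best, float(fields[read_cnt_idx]))
            except (IndexError, ValueError):
                continue  # malformed field: treat as no support
        if best >= min_reads:
            yield line


# Made-up rows, only to show the behaviour.
demo = [
    "#barcode\tcell_type\tchain1\tchain2",
    "AAAC\tabT\tTRBV1,*,TRBJ1,TRBC1,TGTGCC,CA,1.00,c0,100.00,0\t*",
    "AAAG\tabT\t*\tTRAV21,*,TRAJ31,TRAC,TGTGCT,CA,5.00,c1,100.00,1",
]
kept = list(filter_barcode_report(demo, min_reads=2))
```

Writing `kept` out line by line would give the filtered.tsv mentioned above. Parsing read_cnt as a float is a defensive choice in case the value is not written as an integer.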
Thanks! On it. I'll let you know the outcome asap. Cheers!
A difference in absolute UMI counts between TRUST4 and MiXCR is to be expected, as the filtering strategies could differ. I think the frequency/fraction is more meaningful, as diversity calculations are usually based on the normalized values.
The cid_full_length column indicates whether the underlying contig is full length or not, so 0 means the corresponding contig is not full length (not from 5' V to 3' J). Your observation of almost all 0s could be due to the behavior of the --repseq option. Since TCR analysis does not need full-length assembly and VJ gene assignment is sufficient, the --repseq option discards many reads. This behavior may be changed in the next release (#241).
For the CDR3s with gaps, do you find their corresponding CDR3 nucleotide sequences in the _cdr3.out file? Could you please also share your filtered barcode_report file with me so I can look into the trust-simplerep issue? Thank you!
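To illustrate the point about normalized values, here is a small sketch that converts raw clonotype counts into fractions of the total, so that two reports with very different absolute counts can be compared on the same scale. The function name and the counts are made up for illustration and do not come from either tool's output.

```python
def to_fractions(counts):
    """Map each clonotype to its share of the total count."""
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: value / total for name, value in counts.items()}


# Hypothetical counts: report B sees ten times fewer molecules than report A,
# yet the relative abundance of each clonotype is the same.
report_a = {"clonotype_1": 900, "clonotype_2": 100}
report_b = {"clonotype_1": 90, "clonotype_2": 10}

frac_a = to_fractions(report_a)
frac_b = to_fractions(report_b)
```

Any filtering should be applied before normalizing, since removing entries changes the total.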
So for the CDR3s with gaps, the associated nucleotide sequences in the MIXCR report indeed appear in the
See for example the
And I can indeed see it multiple times in the
Let me send you my
Thank you for sharing the files. By default, TRUST4 suppresses out-of-frame CDR3s, as they might create false-positive T or B cells in single-cell data. I have modified trust-barcoderep.pl to keep those entries for UMI-based TCR-seq data. As for the filtered barcode file, it seems quotes were added to the fields, possibly by the csv export function. I have added an option "--filterBarcoderepReadCnt" to trust-simplerep.pl to filter out barcodes/UMIs with read support below the specified value. So you can directly obtain the filtered UMI count with:
to exclude UMIs with fewer than 2 supporting CDR3 reads. Please let me know how these work.
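For context, a CDR3 is commonly called out-of-frame when its nucleotide length is not a multiple of three. The sketch below shows only that length check. It is not TRUST4's implementation, whose criterion may involve more than length, and the function name is hypothetical.

```python
def is_out_of_frame(cdr3_nt: str) -> bool:
    """Return True when the CDR3 nucleotide length is not divisible by 3."""
    return len(cdr3_nt) % 3 != 0


# Short made-up sequences: 6 nt is in frame, 7 nt is not.
in_frame_example = is_out_of_frame("TGTGCC")
out_of_frame_example = is_out_of_frame("TGTGCCA")
```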
Hi @dcarbajo, thank you for the detailed exploration of TRUST4's output. I'm planning to release a new version to incorporate the recent updates. Have you found any other issues? I can try to look into them before creating the new version. Thank you!
Thanks for your help! I think so far it all works well on my side. Looking forward to the new version!
Hello again! This is a follow-up to issue #247, thank you so much for your insights there!
So I managed to run TRUST4 on my SMARTer data with the following command (which still took far too long, about half a day per sample):
But now that I compare one sample's results with those previously obtained with MIXCR for the same sample, I observe some discrepancies that I was hoping you could help me understand.
At a glance, what strikes me most is the number of clonotype entries in the MIXCR report compared to the TRUST4 one. While the MIXCR file has 4320 lines, the TRUST4 one has 82024, though after filtering to TRA entries only I come down to 27637 lines (6423 without singleton clonotypes, i.e. those with count=1, so still over 2000 more clonotypes found).
Then the counts seem quite different; see for example the top clonotype (TRAV-21 / TRAJ31) with a count of 361655 in MIXCR:
The same clonotype in the TRUST4 report, although still at the top, has a count of just 7290, about 50 times lower:
So there are a lot more clonotypes in the TRUST4 report compared to the MIXCR one, but I wanted to see which clonotypes found by MIXCR were not recovered by TRUST4.
What I observed is that most of these cases contain a CDR3 sequence with gap/s in MIXCR, which might be due to an out-of-frame CDR3. All these cases are one line in the MIXCR output, but several lines in the TRUST4 one...
I extracted the "V" and "J" from these clonotypes with gaps in MIXCR, and subsetted both outputs for a few examples. Check the example below:
while the subset is just one line in the MIXCR output:
it becomes several different lines in the TRUST4 output:
Strangely, all these CDR3 sequences are quite different, and there are some that aren't real ones (like FEASIRDENIIF above), which concerns me a bit. Most of these entries belong to singleton clonotypes, but not all (the top 4 lines have count>1).
I was wondering how to interpret this, and whether there is some aggregation or filtering that I should do downstream of TRUST4 to make the results more comprehensible (and comparable to those I previously obtained with MIXCR).
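For reference, the kind of V/J subsetting I describe above can be sketched as follows: group report rows by V and J gene (ignoring the allele suffix), count how many lines fall into each pair, and sum their counts, optionally dropping singletons first. The row layout, the helper names, and the example values are assumptions made up for illustration, not actual output from either tool.

```python
from collections import defaultdict


def strip_allele(gene):
    """Drop an allele suffix such as '*01' from a gene name."""
    return gene.split("*")[0]


def summarize_by_vj(rows, min_count=1):
    """Group rows by (V, J) and report the number of lines and the summed count."""
    summary = defaultdict(lambda: {"n_clonotypes": 0, "total_count": 0})
    for row in rows:
        if row["count"] < min_count:
            continue  # e.g. min_count=2 drops singleton clonotypes
        key = (strip_allele(row["V"]), strip_allele(row["J"]))
        summary[key]["n_clonotypes"] += 1
        summary[key]["total_count"] += row["count"]
    return dict(summary)


# Made-up rows standing in for parsed report lines.
rows = [
    {"count": 7, "V": "TRAV21*01", "J": "TRAJ31*01", "CDR3aa": "CAVF"},
    {"count": 1, "V": "TRAV21*01", "J": "TRAJ31*01", "CDR3aa": "CAAF"},
    {"count": 3, "V": "TRAV12-1*01", "J": "TRAJ31*01", "CDR3aa": "CALF"},
]

all_pairs = summarize_by_vj(rows)
no_singletons = summarize_by_vj(rows, min_count=2)
```

Collapsing by V/J discards the CDR3 sequence itself, so it only helps to set several TRUST4 lines beside one MIXCR line; it does not by itself decide whether those lines are the same clonotype.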
Many thanks again!