Best way to run TRUST4 on SMARTer data #247
Comments
Thanks for the super prompt response! Yes, my fastq files are per donor sample, with all the cells from that sample in one file; the idea is to reconstruct the TCR repertoire in that sample (in this case 38M reads from all the cells in that sample). Is it also possible to do the assembly by CDR3?
Does your read file carry any per-cell information, or is it essentially bulk RNA-seq for each donor sample?
Yes, each fastq file is bulk for each donor sample.
Is the data bulk RNA-seq or targeted TCR-seq? If it is bulk RNA-seq, it shouldn't be this slow. It's possible, though, that your data is T cell sorted, so there are many TCR reads. In that case, you can add the option "--repseq" to accelerate the procedure. If your data is TCR-seq, is there any UMI sequence in your data?
Hi! Thanks again for the help. Sorry, I overlooked the "--repseq" option; I am trying it asap.
If you know the range of the UMI, you can regard it as a "barcode" and use "--barcodeLevel molecule" to run TRUST4 in the TCR-seq UMI mode. More details are in the https://github.com/liulab-dfci/TRUST4?tab=readme-ov-file#umi section. Essentially, you pass the read file containing the UMI to the --barcode option and use the --readFormat option to specify the range on the read that corresponds to the UMI. TRUST4 will then handle sequencing error correction and select the best assembly for each UMI. With these options, TRUST4 should be fast, and you don't need the "--repseq" option for acceleration unless it is still too slow.
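For readers scripting this over many samples, the options named in the comment above can be assembled as in the following minimal Python sketch. The helper name `build_umi_command` and the sample file names are hypothetical. Only the flags themselves (-f, --ref, -1, -2, --barcode, --readFormat, --barcodeLevel, -o) come from this thread, and the readFormat value is the one suggested in a later comment, so it has to be adapted to your kit.

```python
import shlex


def build_umi_command(read1, read2, umi_file, read_format, out_prefix,
                      coord_fa="hg38_bcrtcr.fa", ref_fa="human_IMGT+C.fa"):
    """Assemble a run-trust4 argument list for UMI-level ("molecule") processing.

    Hypothetical helper: it only strings together options mentioned in this thread.
    """
    return [
        "run-trust4",
        "-f", coord_fa,
        "--ref", ref_fa,
        "-1", read1,
        "-2", read2,
        "--barcode", umi_file,          # read file that contains the UMI
        "--readFormat", read_format,    # where the UMI and the usable read sit
        "--barcodeLevel", "molecule",   # treat the "barcode" as a UMI
        "-o", out_prefix,
    ]


# Assumed file names; the UMI is taken to be at the start of read 2.
cmd = build_umi_command(
    read1="sample_R1.fastq",
    read2="sample_R2.fastq",
    umi_file="sample_R2.fastq",
    read_format="bc:0:11,r2:20:-1,r1:28:-1",
    out_prefix="sample_umi",
)
print(shlex.join(cmd))  # pass `cmd` to subprocess.run(cmd, check=True) to execute it
```

Building the command as a list rather than a single string avoids shell-quoting problems when looping over the 58 samples mentioned in the question.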
Great! Thanks for the info. So the diagram for this SMARTer sequencing looks like this: so we have the UMI in the first 12bp of the Based on that, I guess that my final TRUST4 call should look like the following (correct me if I am wrong):
Many thanks again!
Almost, it would be:
Depending on your kit, r2 may be r2:20:-1 if we don't include the 8bp GTAC and extra 4bp. I think you may also want to remove the first 28 bp from r1, as they may be primers. Therefore, a conservative readFormat option could be "--readFormat bc:0:11,r2:20:-1,r1:28:-1".
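To see what such a readFormat string selects, here is a small Python sketch that mimics the slicing on a synthetic read. It assumes each field is name:start:end with 0-based positions, an inclusive end, and -1 meaning "to the end of the read", as confirmed later in this thread. The functions `parse_read_format` and `extract` are illustrative helpers, not TRUST4 code, and the sequences are made up.

```python
def parse_read_format(spec):
    """Parse 'name:start:end' fields into {name: (start, end)}."""
    fields = {}
    for item in spec.split(","):
        name, start, end = item.split(":")
        fields[name] = (int(start), int(end))
    return fields


def extract(seq, start, end):
    """Return seq[start..end] with an inclusive end; end == -1 means read end."""
    return seq[start:] if end == -1 else seq[start:end + 1]


fmt = parse_read_format("bc:0:11,r2:20:-1,r1:28:-1")

# Synthetic read 2: 12 bp UMI, 8 bp of spacer to skip, then the usable sequence.
umi = "ACGTACGTACGT"
spacer = "NNNNNNNN"
insert = "TGTGCCAGCAGC"
read2 = umi + spacer + insert

print(extract(read2, *fmt["bc"]))  # the UMI (positions 0..11)
print(extract(read2, *fmt["r2"]))  # the usable part of read 2 (position 20 onward)
```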
Great! I am going to try that out! I do a little bit of pre-processing with Skewer, so I will first check how the exact numbers should go, but it looks good! Actually, when I set up the TRUST4 pipeline, I probably wouldn't even need to run Skewer first, right? Can I just send to TRUST4 the raw
Right, I don't think you need to run Skewer. TRUST4 internally trims the adapters by detecting read-through events.
Thanks for all the help!
Quick question: is there a parameter with the
It's not supported yet, so you may need to write your own script to remove the intermediate files. I will implement this feature in the next release.
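A cleanup script of the kind suggested above might look like the following Python sketch. It rests on two assumptions that are not confirmed in this thread: that output files are named with the -o prefix followed by an underscore, and that the files worth keeping are the report.tsv, airr.tsv, and annot.fa outputs mentioned in the original question. The function name `remove_intermediate_files` is hypothetical, and it defaults to a dry run so nothing is deleted until the listing has been checked.

```python
from pathlib import Path

# Assumption: these are the final outputs to keep; adjust to your needs.
KEEP_SUFFIXES = ("_report.tsv", "_airr.tsv", "_annot.fa")


def remove_intermediate_files(out_dir, prefix, keep_suffixes=KEEP_SUFFIXES,
                              dry_run=True):
    """List (and, unless dry_run, delete) '<prefix>_*' files that are not kept."""
    selected = []
    for path in sorted(Path(out_dir).glob(prefix + "_*")):
        if path.is_file() and not path.name.endswith(tuple(keep_suffixes)):
            selected.append(path)
            if not dry_run:
                path.unlink()
    return selected


if __name__ == "__main__":
    # Dry run in the current directory for the "test1" prefix from the question.
    for candidate in remove_intermediate_files(".", "test1"):
        print("would remove", candidate)
```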
Great to know! Thanks.
The option to remove intermediate files has been added and is described in thread #248, so I'll close this issue for now.
Hello @mourisl. Adding to this thread with a similar question. We run the Takara SMART-Seq® Human TCR kit (with UMIs) to obtain TCR data. The way I process the data is the following:
Can you confirm this is the right approach? Also, I tested the same data without doing the UMI correction with the following: With UMI correction I get a total count of ~5k (the sum of column A of the report file), while without it I get ~220k. Thanks
@marcoco90 . For the second command, you should still use the --barcode and --readFormat options; TRUST4 will then treat the barcode as the UMI, as specified by the "--barcodeLevel" option. Without the --barcode option, the data is processed as a regular bulk data set and the abundance (~220k) is the read count, so it is much greater than 5k.
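The gap between the two totals follows from what is being counted. The toy Python example below uses synthetic (CDR3, UMI) pairs, one per read, to contrast a per-read abundance with a per-molecule abundance. It is a deliberate simplification and not TRUST4's actual procedure, which also corrects sequencing errors and selects the best assembly for each UMI.

```python
from collections import Counter, defaultdict

# Synthetic data: each tuple is one read, annotated with its CDR3 and its UMI.
reads = (
    [("CASSLGQETQYF", "AAACCCGGGTTT")] * 4
    + [("CASSLGQETQYF", "TTTGGGCCCAAA")] * 3
    + [("CASSPDRGNTEAFF", "ACGTACGTACGT")] * 5
)


def read_counts(reads):
    """Abundance as in bulk mode: every read supporting a CDR3 counts once."""
    return dict(Counter(cdr3 for cdr3, _ in reads))


def umi_counts(reads):
    """Abundance at the molecule level: distinct UMIs per CDR3."""
    umis = defaultdict(set)
    for cdr3, umi in reads:
        umis[cdr3].add(umi)
    return {cdr3: len(s) for cdr3, s in umis.items()}


print(read_counts(reads))  # {'CASSLGQETQYF': 7, 'CASSPDRGNTEAFF': 5}
print(umi_counts(reads))   # {'CASSLGQETQYF': 2, 'CASSPDRGNTEAFF': 1}
```

Twelve reads collapse to three molecules here, which is the same kind of reduction as going from ~220k to ~5k.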
Thanks @mourisl for the response and support on TRUST4. So can you confirm that the first approach is the right one in terms of order and parameters (steps 1 and 2)? Thanks
@marcoco90 Yes, the first approach is the right one. Actually, you don't need to explicitly trim reads as an additional step; you can use --readFormat to ignore those regions. An example of the command would be:
Just want to confirm: do the first 70bp of read1 and read2 contain no actual read sequence? For example, if read2 starts from the middle of the left chunk, the 70bp may contain regions with UMI.
@mourisl I confirmed with the vendor that the UMIs are the first 12 bp of the fastq read and that a 7 bp linker sequence follows before the actual sequence. Thanks for the support.
@marcoco90 Small adjustment to the readFormat option: --readFormat bc:0:11,r2:19:-1
@marcoco90 thanks. I guess that because of the indexing, the first nucleotide is at position 0, correct?
@marcoco90 Yes.
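The arithmetic behind bc:0:11,r2:19:-1 can be spelled out in a few lines of Python. With 0-based positions, a 12 bp UMI covers positions 0 to 11, the 7 bp linker covers 12 to 18, and the actual sequence starts at 12 + 7 = 19. The helper `umi_read_format` is a hypothetical convenience function, and it assumes the UMI sits at the very start of read 2.

```python
def umi_read_format(umi_len, linker_len, umi_read="r2"):
    """Build a readFormat string for a UMI followed by a linker at the read start."""
    bc_end = umi_len - 1                 # last UMI position (0-based, inclusive)
    insert_start = umi_len + linker_len  # first position after UMI and linker
    return f"bc:0:{bc_end},{umi_read}:{insert_start}:-1"


print(umi_read_format(12, 7))  # bc:0:11,r2:19:-1
```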
@mourisl In --readFormat I am not specifying the read structure for R1 (--readFormat 0:11,r2:19:-1). Is that necessary? Also, what is the difference in output between running the following commands with raw reads as input? I do not expect major differences, correct? Thanks for your extensive support
Should the first 70bp of read1 be removed? If so, you can add r1:70:-1 to the --readFormat option. The hg38_bcrtcr.fa file contains the UTR regions of the genes, so using it can help obtain longer contigs. For CDR3 analysis, the difference should be small.
Hello, I am interested in starting to use TRUST4, and was wondering what the best approach is to run it on SMARTer data, in terms of parameters, pre-processing, etc.
I ran a subset of 100K paired-end reads from one of my samples, using the fastq files sub1.fastq and sub2.fastq as inputs.
I noticed you have a wrapper for SMART-Seq data, and wondered whether this would be suitable for my case.
To try things out, I ran TRUST4 in 2 ways:
"Default method":
run-trust4 -f hg38_bcrtcr.fa --ref human_IMGT+C.fa -1 sub1.fastq -2 sub2.fastq -o test1
"SMART-Seq wrapper":
perl trust-smartseq.pl -1 sub1_list.txt -2 sub2_list.txt -f hg38_bcrtcr.fa --ref human_IMGT+C.fa -o test2
(where the txt files just list the locations of sub1.fastq and sub2.fastq)
While the "default method" produces several outputs (in fasta, tsv, and .out format), the SMART-Seq wrapper only produces the report.tsv, the airr.tsv, and the annot.fa.
What concerns me more is that the "default" report retrieves several TCRs while the SMART-Seq wrapper only shows the top 2, and the count numbers differ.
I would appreciate it if you could help me understand what the wrapper does under the hood, why the results are so different, and mainly what the best way to use TRUST4 on SMARTer data would be.
On a side note, could you explain what the consensus id full length means and why it is almost always 0 and sometimes 1?
Additionally, I am running this on a subset (of 100K reads) of one sample, but the sample itself is ~38M reads. I am running the full sample with the "default method", but it has been running for 2 days and is still going (and I have 58 samples). What would be the best way to speed things up, if possible?
Many thanks!