Best way to run TRUST4 on SMARTer data #247

dcarbajo · 2024-02-01T04:01:31Z

Hello, I am interested in starting to use TRUST4, and was wondering what the best approach is to run it on SMARTer data, in terms of parameters, pre-processing, etc.

I ran a subset of 100K paired-end reads from one of my samples, using the fastq files sub1.fastq and sub2.fastq as inputs.

I noticed you have a wrapper for SMART-Seq data, and wondered whether this would be suitable for my case.

To try things out, I ran TRUST4 in 2 ways:

"Default method": run-trust4 -f hg38_bcrtcr.fa --ref human_IMGT+C.fa -1 sub1.fastq -2 sub2.fastq -o test1

"SMART-Seq wrapper": perl trust-smartseq.pl -1 sub1_list.txt -2 sub2_list.txt -f hg38_bcrtcr.fa --ref human_IMGT+C.fa -o test2
(where the txt files just list the location of sub1.fastq and sub2.fastq)

While the "default method" produces several outputs (in fasta, tsv, and .out format), the SMART-Seq wrapper only produces the report.tsv, the airr.tsv, and the annot.fa.

What concerns me more is that the "default" report retrieves several TCRs, whereas the SMART-Seq wrapper only shows the top 2, and the count numbers differ.

I would appreciate it if you could help me understand what the wrapper does under the hood, why the results are so different, and mainly what the best way to use TRUST4 on SMARTer data would be.

On a side note, could you explain what the consensus id full length means and why it is almost always 0 and sometimes 1?

Additionally, I am running this on a subset (of 100K reads) of one sample, but the sample itself is ~38M reads. I am running it with the "default method", but it has been going for 2 days and is still running (and I have 58 samples). What would be the best way to speed things up, if possible?

Many thanks!


mourisl · 2024-02-01T05:05:56Z

For SMART-seq-like data, we expect only one pair of chains per cell, so after running regular TRUST4 internally the wrapper selects the pair with the highest abundance as the representative (this can be changed through the --representative option). The extra TCRs are likely to be the other non-functional chain, sequencing artifacts, or assembly artifacts.
The other files, like _final.out, are intermediate files, so I did not keep them in the smartseq wrapper.
Since the number of BCR/TCR reads is likely to be high in SMART-seq data and there is no class switch recombination within a cell, there is no need to extend the contigs with mate-pair information. You can see the "--skipMateExtension" option in the smartseq wrapper. This produces different assembly results from the default TRUST4 and may affect the abundance estimation.
cid_full_length indicates whether the corresponding contig (cid) is full length or not. Full length means the contig spans from the 5' end of the V gene to the 3' end of the J gene.
Does SMARTer seq put all the cells' data into one fastq file, or does each cell have its own fastq file? Do you mean you have 38M reads for one cell?
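
To illustrate the idea of keeping only the most abundant chain of each type, here is a minimal Python sketch. It is not the wrapper's actual code: the function name, the simplified (count, V gene, CDR3) tuples, and the rule of reading the chain type from the first three letters of the V gene name are all assumptions made for illustration.

```python
def pick_representatives(rows):
    """Return the most abundant record for each chain type.

    rows: iterable of (count, v_gene, cdr3) tuples, a hypothetical and
    simplified stand-in for the lines of a TRUST4 report.
    """
    best = {}
    for count, v_gene, cdr3 in rows:
        chain = v_gene[:3]  # e.g. "TRA" or "TRB"; assumes IMGT-style gene names
        if chain not in best or count > best[chain][0]:
            best[chain] = (count, v_gene, cdr3)
    return best


# Made-up example records: two beta chains and two alpha chains.
rows = [
    (120, "TRBV7-9*01", "CASSL"),
    (15, "TRBV5-1*01", "CASSF"),
    (80, "TRAV12-1*01", "CAVN"),
    (3, "TRAV1-2*01", "CAVR"),
]
representatives = pick_representatives(rows)
```

With these made-up records, only the 120-count beta chain and the 80-count alpha chain are kept, which is consistent with the wrapper report showing just the top 2 entries.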

dcarbajo · 2024-02-01T06:13:10Z

Thanks for the super prompt response! Yes, so my fastq files are per donor sample, with all the cells from that sample into one file; the idea is to reconstruct the TCR repertoire in that sample (in this case 38M reads from all the cells in that sample).

Is it possible as well to do the assembly by CDR3?

mourisl · 2024-02-01T06:27:48Z

Does your read file have some information about the cell information? Or essentially it is bulk RNA-seq for each donor sample?

dcarbajo · 2024-02-01T06:43:54Z

Yes, each fastq file is bulk for each donor sample

mourisl · 2024-02-01T15:40:36Z

Is each cell a bulk RNA-seq or targeted TCR-seq? If it is bulk RNA-seq, it shouldn't be this slow. Though it's possible your data is T cell sorted, so there are many TCR reads? In this case, you can add the option "--repseq" to accelerate the procedure.

If your data is TCR-seq, is there any UMI sequence in your data?

dcarbajo · 2024-02-02T02:35:45Z

Hi! Thanks again for the help, sorry I overlooked the "--repseq" option, I am trying it asap.
To confirm, my data comes from the SMARTer Human TCR a/b Profiling Kit v2 so it is bulk TCR-seq with UMIs. How shall I deal with the UMIs in this case? Cause I would still need a correct frequency estimation, if possible.
Many thanks again!

mourisl · 2024-02-02T02:40:54Z

If you know the range of the UMI, you can treat it as a "barcode" and use "--barcodeLevel molecule" to run TRUST4 in the TCR-seq UMI mode. More details are in the https://github.com/liulab-dfci/TRUST4?tab=readme-ov-file#umi section. Essentially, you pass the read file containing the UMI to the --barcode option and use the --readFormat option to specify the range on the read that corresponds to the UMI. TRUST4 will then handle sequencing error correction and select the best assembly for each UMI. With these options, TRUST4 should be fast, and you don't need the "--repseq" option for acceleration unless it is still too slow.

dcarbajo · 2024-02-02T03:13:19Z

Great! Thanks for the info. So the diagram for this SMARTer sequencing looks like this:

so we have the UMI in the first 12bp of the reads_2.fastq file.

Based on that, I guess that my final TRUST4 call should look like the following (correct me if I am wrong):

run-trust4 --barcodeLevel molecule \
           -f hg38_bcrtcr.fa \
           --ref human_IMGT+C.fa \
           -1 read_1.fastq \
           -2 read_2.fastq \
           --barcode read_2.fastq \
           --readFormat um:0:12 \
           -o TRUST4

Many thanks again!

mourisl · 2024-02-02T04:09:49Z

Almost, it would be:

run-trust4 --barcodeLevel molecule \
           -f hg38_bcrtcr.fa \
           --ref human_IMGT+C.fa \
           -1 read_1.fastq \
           -2 read_2.fastq \
           --barcode read_2.fastq \
           --readFormat bc:0:11,r2:12:-1 \
           -o TRUST4

Depending on your kit, r2 may be r2:20:-1 if we don't include the 8bp GTAC and extra 4bp. I think you may also want to remove the first 28 bp from r1, as they may be primers. Therefore, a conservative readFormat option could be "--readFormat bc:0:11,r2:20:-1,r1:28:-1".
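
As a rough illustration of what ranges such as bc:0:11 and r2:20:-1 select, the Python sketch below applies them to a made-up read. It assumes, as the ranges in this thread suggest, that positions are 0-based, that both ends are included, and that an end of -1 means "to the end of the read". The function name and the example sequence are hypothetical, and this is not TRUST4's own parsing code.

```python
def slice_segment(read, start, end):
    """Return read[start..end] with both ends included.

    An end of -1 is taken to mean "up to the end of the read".
    """
    if end == -1:
        return read[start:]
    return read[start:end + 1]


# Hypothetical read 2: 12 bp UMI, then 8 bp of linker/extra bases, then the insert.
read2 = "ACGTACGTACGT" + "GTACGGGG" + "TTTTCCCCAAAA"

umi = slice_segment(read2, 0, 11)      # bc:0:11 -> the first 12 bases
insert = slice_segment(read2, 20, -1)  # r2:20:-1 -> everything from index 20 on
```

Here umi holds the 12-base UMI and insert holds only the trailing sequence, because the 8 bases in between (indices 12 to 19) fall outside both ranges.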

dcarbajo · 2024-02-02T04:15:51Z

Great! I am going to try that out!

I do a little bit of pre-processing with Skewer, so I will check first how the exact numbers should go, but looks good!

Actually, when I set up the TRUST4 pipeline, I probably wouldn't even need to run Skewer first right?

Can I just send to TRUST4 the raw .fq.gz files and specify the "--readFormat" option accordingly without any prior adaptor trimming then?

mourisl · 2024-02-02T06:23:19Z

Right, I don't think you need to run Skewer. TRUST4 internally will trim the adapters by detecting read-through events.

dcarbajo · 2024-02-02T06:43:38Z

Thanks for all the help!

dcarbajo · 2024-02-19T03:36:19Z

Quick question: is there a parameter with the run-trust4 call above that allows me to only produce the main outputs and not all the intermediate files (like what the smartseq wrapper does)?
Otherwise I have to remove them on the fly, cause I run out of storage space very fast. Thanks!

mourisl · 2024-02-19T04:17:42Z

It's not supported yet. So you may need to write your own script to remove the intermediate files. I will implement this feature in the next release.
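
A minimal Python sketch of such a cleanup script is shown below. It assumes that all outputs of a run are named with the -o prefix followed by an underscore, and that the files worth keeping are the three that the SMART-Seq wrapper retains (report.tsv, airr.tsv, annot.fa). The function name and the suffix list are illustrative assumptions, so check them against your own output directory before deleting anything.

```python
from pathlib import Path

# Suffixes of the files to keep. This list mirrors the outputs the thread says
# the SMART-Seq wrapper retains; it is an assumption, so adjust it as needed.
KEEP_SUFFIXES = ("_report.tsv", "_airr.tsv", "_annot.fa")


def remove_intermediates(out_dir, prefix, dry_run=True):
    """Delete files named <prefix>_* in out_dir unless they end with a kept suffix.

    Returns the sorted names of the files selected for deletion. With
    dry_run=True (the default) nothing is actually deleted.
    """
    doomed = sorted(
        p for p in Path(out_dir).glob(prefix + "_*")
        if p.is_file() and not p.name.endswith(KEEP_SUFFIXES)
    )
    if not dry_run:
        for p in doomed:
            p.unlink()
    return [p.name for p in doomed]
```

Calling remove_intermediates(".", "TRUST4") first lists what would be removed; passing dry_run=False then performs the deletion.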

dcarbajo · 2024-02-19T04:19:45Z

great to know! thanks

mourisl · 2024-03-12T17:16:31Z

The feature of removing intermediate files has been added and is mentioned in thread #248, so I'll close this issue for now.

marcoco90 · 2024-10-01T18:58:19Z

Hello @mourisl.

Adding to this thread with a similar question.

We run the Takara SMART-Seq® Human TCR (with UMIs) to have TCR data.
The read conformation is the following:

The way I process the data is the following:

1. Trim the first 70 bases of R1 and R2.
2. Run TRUST4 as follows for UMI correction:

run-trust4 --barcodeLevel molecule \
           -f hg38_bcrtcr.fa \
           --ref human_IMGT+C.fa \
           -1 read_1.fastq \
           -2 read_2.fastq \
           --barcode read_2.fastq \
           --readFormat bc:0:12,r2:17:-1 \
           -o TRUST4

Can you confirm this is the right approach?

Also, I tested the same data without UMI correction, using the following:

run-trust4 --barcodeLevel molecule \
           -f hg38_bcrtcr.fa \
           --ref human_IMGT+C.fa \
           -1 read_1.fastq \
           -2 read_2.fastq \
           -o TRUST4

With UMI correction I get a total count of ~5k (the sum of column A of the report file), while without it I get ~220k.
I assume the UMI correction is getting rid of the duplicates.
But is a drop of this size within the range you would expect?

Thanks

mourisl · 2024-10-02T04:32:49Z

@marcoco90 For the second command, you should still use the --barcode and --readFormat options; TRUST4 will then treat the barcode as the UMI, as the "--barcodeLevel" option specifies. Without the --barcode option, the data is processed as a regular bulk data set and the abundance (~220k) is the read count, so it is much greater than 5k.
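
For reference, the totals being compared (the sum of the first column of the report file) can be computed with a short Python sketch like the one below. It assumes the report is tab-separated, that its header line starts with "#", and that the first column holds the count. The function name is hypothetical and the example rows are made up.

```python
import csv
import io


def total_abundance(report_text):
    """Sum the first column of a tab-separated report, skipping '#' header lines."""
    total = 0.0
    for row in csv.reader(io.StringIO(report_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and the header
        total += float(row[0])
    return total


# Made-up three-column excerpt; a real report has more columns.
example = "#count\tfrequency\tCDR3nt\n3\t0.6\tTGTGCC\n2\t0.4\tTGTGCA\n"
example_total = total_abundance(example)
```

To apply it to a real file, read the file contents into a string first, for example with open(path).read(). In UMI mode the resulting total reflects molecules, whereas in bulk mode it reflects reads, as described above.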

marcoco90 · 2024-10-02T19:28:55Z

Thanks @mourisl for the response and support on TRUST4
I understand now the difference in count with and without UMI.

So can you confirm that the first approach is the right one in terms of order and parameters (steps 1 and 2) ?

Thanks

mourisl · 2024-10-02T20:30:47Z

@marcoco90 Yes, the first approach is the right one. Actually, you don't need to explicitly trim reads as an additional step; you can use --readFormat to ignore those regions. An example of the command would be:

# Note: the original bc:0:12 should be bc:0:11, as the range is 0-based and closed on both ends.
run-trust4 --barcodeLevel molecule \
           -f hg38_bcrtcr.fa \
           --ref human_IMGT+C.fa \
           -1 read_1.fastq \
           -2 read_2.fastq \
           --barcode read_2.fastq \
           --readFormat bc:70:71,r2:76:-1,r1:70:-1 \
           -o TRUST4

Just want to confirm, the first 70bp of read1 or read2 do not contain actual read sequence? For example, if read2 starts from the middle of the left chunk, the 70bp may contain regions with UMI.

marcoco90 · 2024-10-03T16:35:23Z

@mourisl I confirmed with the vendor that the UMIs are the first 12 bp of the fastq and that there is a 7 bp linker sequence after them, before the actual sequence.
So we will run the following command on NON-TRIMMED reads:

run-trust4 --barcodeLevel molecule \
           -f hg38_bcrtcr.fa \
           --ref human_IMGT+C.fa \
           -1 read_1.fastq \
           -2 read_2.fastq \
           --barcode read_2.fastq \
           --readFormat bc:0:12,r2:20:-1 \
           -o TRUST4

Thanks for the support.

mourisl · 2024-10-03T17:33:02Z

@marcoco90 Small adjustment to the readFormat option: --readFormat bc:0:11,r2:19:-1

marcoco90 · 2024-10-03T20:02:36Z

@mourisl thanks. I guess that is because the indexing starts at 0, so the first nucleotide is at position 0, correct?

mourisl · 2024-10-04T03:52:40Z

@marcoco90 Yes.
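
The off-by-one adjustment above can be written as a small Python helper that converts segment lengths into a --readFormat value. This is a sketch under the conventions stated in this thread (0-based ranges that are closed on both ends, with the UMI at the start of read 2); the function and parameter names are made up for illustration.

```python
def read_format(umi_len, linker_len, r1_skip=0):
    """Build a --readFormat value from segment lengths on read 2.

    Ranges are 0-based and inclusive, so a UMI of length n spans 0..n-1 and
    the actual sequence starts at index umi_len + linker_len.
    """
    parts = [f"bc:0:{umi_len - 1}", f"r2:{umi_len + linker_len}:-1"]
    if r1_skip:
        # Optionally skip leading bases of read 1 (e.g. primers).
        parts.append(f"r1:{r1_skip}:-1")
    return ",".join(parts)


# 12 bp UMI followed by a 7 bp linker, as confirmed with the vendor above.
fmt = read_format(12, 7)
```

For a 12 bp UMI and a 7 bp linker this yields bc:0:11,r2:19:-1, and read_format(12, 8, r1_skip=28) reproduces the more conservative bc:0:11,r2:20:-1,r1:28:-1 suggested earlier in the thread.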

marcoco90 · 2024-10-23T21:23:02Z

@mourisl In --readFormat I am not specifying the read structure for R1 (--readFormat bc:0:11,r2:19:-1). Is that necessary?

Also, what is the difference in output between running the following commands with raw reads as input?
1) trust4 --barcodeLevel molecule -f $Genome/hg38_bcrtcr.fa --ref $Genome/human_IMGT+C.fa -1 $read1 -2 $read2 --barcode $read2 --readFormat bc:0:11,r2:19:-1 -o $samid -t 10 (supposedly the format for BAM file as input)
2) run-trust4 --barcodeLevel molecule -f $Genome/human_IMGT.fa -1 $read1 -2 $read2 --barcode $read2 --readFormat bc:0:11,r2:19:-1 -o $samid -t 10 (the format for raw reads as input).

I do not expect major differences, correct?

Thanks for your extensive support

mourisl · 2024-10-24T03:25:46Z

Should the first 70bp of read1 be removed? If so, you can add r1:70:-1 to the --readFormat option.

The hg38_bcrtcr.fa contains UTR regions of the genes, so using it can help obtain longer contigs. For CDR3 analysis, the difference should be small.

dcarbajo mentioned this issue Feb 20, 2024

Differences in counts compared to MIXCR results, and out-of-frame CDR3 handling #248

Open

mourisl closed this as completed Mar 12, 2024
