Best way to run TRUST4 on SMARTer data #247

Closed
dcarbajo opened this issue Feb 1, 2024 · 26 comments

@dcarbajo

dcarbajo commented Feb 1, 2024

Hello, I am interested in starting to use TRUST4 and was wondering what the best approach is for running it on SMARTer data, in terms of parameters, pre-processing, etc.

I ran a subset of 100K paired-end reads from one of my samples, using fastq files named sub1.fastq and sub2.fastq as inputs.

I noticed you have a wrapper for SMART-Seq data, and wondered whether this would be suitable for my case.

To try things out, I ran TRUST4 in 2 ways:

"Default method": run-trust4 -f hg38_bcrtcr.fa --ref human_IMGT+C.fa -1 sub1.fastq -2 sub2.fastq -o test1

"SMART-Seq wrapper": perl trust-smartseq.pl -1 sub1_list.txt -2 sub2_list.txt -f hg38_bcrtcr.fa --ref human_IMGT+C.fa -o test2
(where the txt files just list the location of sub1.fastq and sub2.fastq)

While the "default method" produces several outputs (in fasta, tsv, and .out format), the SMART-Seq wrapper only produces the report.tsv, the airr.tsv, and the annot.fa.

What concerns me more is that the "default" report retrieves several TCRs while the SMART-Seq wrapper only shows the top 2, and the count numbers differ.

I would appreciate it if you could help me understand what the wrapper does under the hood, why the results are so different, and mainly what the best way to use TRUST4 on SMARTer data would be.

On a side note, could you explain what the consensus id full length means and why it is almost always 0 and sometimes 1?

Additionally, I am running this on a subset (of 100K reads) of one sample, but the sample itself is ~38M reads. I am running the full sample with the "default method", but it has been going for 2 days and is still running (and I have 58 samples). What would be the best way to speed things up, if possible?

Many thanks!

@mourisl
Collaborator

mourisl commented Feb 1, 2024

  1. For SMART-seq-like data, we expect only one pair of chains per cell, so the wrapper runs regular TRUST4 internally and then selects the pair with the highest abundance as the representative (this can be changed through the --representative option). The extra TCRs are likely to be the other non-functional chain, sequencing artifacts, or assembly artifacts.
  2. The other files, like _final.out, are intermediate files, so I did not keep them in the smartseq wrapper.
  3. Since the number of BCR/TCR reads is likely to be high in SMART-seq data and there is no class switch recombination within a cell, there is no need to extend the contigs with mate-pair information. You can see the "--skipMateExtension" option in the smartseq wrapper. This produces assembly results that differ from default TRUST4, and may affect the abundance estimation.
  4. cid_full_length indicates whether the corresponding contig (cid) is full length or not. Full length means it spans from the 5' end of the V gene to the 3' end of the J gene.
  5. Does SMARTer seq put all the cells data into one fastq file, or each cell has its own fastq file? Do you mean you have 38M reads for one cell?
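
As an illustration of point 4, the report rows whose contig is flagged as full length can be pulled out with a small awk filter. This is only a sketch: `full_length_rows` is a hypothetical helper, and it assumes the report is tab-separated with a header on its first line that contains a column literally named cid_full_length.

```shell
# Sketch: print the header plus every report row whose cid_full_length is 1.
# Assumption: tab-separated report, header on line 1. The column is found by
# name, so no fixed column index is hard-coded.
full_length_rows() {
    awk -F '\t' '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == "cid_full_length") col = i
            print; next
        }
        col && $col == 1
    ' "$1"
}
```

For example, `full_length_rows test1_report.tsv` would print the header followed by the full-length rows, assuming the default run above wrote its report under that name.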

@dcarbajo
Author

dcarbajo commented Feb 1, 2024

Thanks for the super prompt response! Yes, so my fastq files are per donor sample, with all the cells from that sample into one file; the idea is to reconstruct the TCR repertoire in that sample (in this case 38M reads from all the cells in that sample).

Is it possible as well to do the assembly by CDR3?

@mourisl
Collaborator

mourisl commented Feb 1, 2024

Does your read file carry any information about the cell of origin? Or is it essentially bulk RNA-seq for each donor sample?

@dcarbajo
Author

dcarbajo commented Feb 1, 2024

Yes, each fastq file is bulk for each donor sample

@mourisl
Collaborator

mourisl commented Feb 1, 2024

Is each sample bulk RNA-seq or targeted TCR-seq? If it is bulk RNA-seq, it shouldn't be this slow. Though it's possible your data is T cell sorted, so there are many TCR reads? In this case, you can add the option "--repseq" to accelerate the procedure.

If your data is TCR-seq, is there any UMI sequence in your data?

@dcarbajo
Author

dcarbajo commented Feb 2, 2024

Hi! Thanks again for the help, sorry I overlooked the "--repseq" option, I will try it asap.
To confirm, my data comes from the SMARTer Human TCR a/b Profiling Kit v2, so it is bulk TCR-seq with UMIs. How should I deal with the UMIs in this case? I would still need a correct frequency estimation, if possible.
Many thanks again!

@mourisl
Collaborator

mourisl commented Feb 2, 2024

If you know the range of the UMI, you can regard it as a "barcode" and use "--barcodeLevel molecule" to run TRUST4 in the TCR-seq UMI mode. More details are in the https://github.com/liulab-dfci/TRUST4?tab=readme-ov-file#umi section. Essentially, you pass the read file containing the UMI to the --barcode option and use the --readFormat option to specify the range on the read that corresponds to the UMI. TRUST4 then handles sequencing error correction and selects the best assembly for each UMI. With these options TRUST4 should be fast, and you don't need the "--repseq" option for acceleration unless it is still too slow.

@dcarbajo
Author

dcarbajo commented Feb 2, 2024

Great! Thanks for the info. So the diagram for this SMARTer sequencing looks like this:

[Figure: SMARTer-Human-TCRv2-dark, read structure diagram]

so we have the UMI in the first 12bp of the reads_2.fastq file.

Based on that, I guess that my final TRUST4 call should look like the following (correct me if I am wrong):

run-trust4 --barcodeLevel molecule \
           -f hg38_bcrtcr.fa \
           --ref human_IMGT+C.fa \
           -1 read_1.fastq \
           -2 read_2.fastq \
           --barcode read_2.fastq \
           --readFormat um:0:12 \
           -o TRUST4

Many thanks again!

@mourisl
Collaborator

mourisl commented Feb 2, 2024

Almost, it would be:

run-trust4 --barcodeLevel molecule \
           -f hg38_bcrtcr.fa \
           --ref human_IMGT+C.fa \
           -1 read_1.fastq \
           -2 read_2.fastq \
           --barcode read_2.fastq \
           --readFormat bc:0:11,r2:12:-1 \
           -o TRUST4

Depending on your kit, r2 may be r2:20:-1 if we don't include the 8bp GTAC and extra 4bp. I think you may also want to remove the first 28 bp from r1, as they may be primers. Therefore, a conservative readFormat option could be "--readFormat bc:0:11,r2:20:-1,r1:28:-1".
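
Before settling on where r2 should start, it can help to look at what actually follows the UMI in read 2. The sketch below tabulates the most frequent subsequences at a given position of a FASTQ file; `top_prefixes` is a hypothetical helper, and it assumes standard 4-line FASTQ records (plain or gzip-compressed). Note that awk's `substr` is 1-based, unlike the 0-based ranges of --readFormat.

```shell
# Sketch: list the 10 most common subsequences at a fixed position in a FASTQ.
# Assumptions: 4-line FASTQ records; files ending in .gz are gzip-compressed.
top_prefixes() {
    fastq=$1
    start=$2   # 1-based start position (awk substr convention)
    len=$3     # number of bases to extract
    case "$fastq" in
        *.gz) gzip -dc "$fastq" ;;
        *)    cat "$fastq" ;;
    esac |
        awk -v s="$start" -v n="$len" 'NR % 4 == 2 { print substr($0, s, n) }' |
        sort | uniq -c | sort -rn | head -n 10
}
```

For instance, `top_prefixes read_2.fastq 13 8` would show the 8 bases right after a 12 bp UMI (0-based positions 12 to 19). If a constant linker sits there, one sequence should dominate the counts.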

@dcarbajo
Author

dcarbajo commented Feb 2, 2024

Great! I am going to try that out!

I do a little bit of pre-processing with Skewer, so I will check first how the exact numbers should go, but looks good!

Actually, when I set up the TRUST4 pipeline, I probably wouldn't even need to run Skewer first, right?

Can I just pass the raw .fq.gz files to TRUST4 and specify the "--readFormat" option accordingly, without any prior adapter trimming?

@mourisl
Collaborator

mourisl commented Feb 2, 2024

Right, I don't think you need to run Skewer. TRUST4 internally will trim the adapters by detecting read-through events.

@dcarbajo
Author

dcarbajo commented Feb 2, 2024

Thanks for all the help!

@dcarbajo
Author

Quick question: is there a parameter for the run-trust4 call above that lets me produce only the main outputs and not all the intermediate files (like what the smartseq wrapper does)?
Otherwise I have to remove them on the fly, because I run out of storage space very fast. Thanks!

@mourisl
Collaborator

mourisl commented Feb 19, 2024

It's not supported yet. So you may need to write your own script to remove the intermediate files. I will implement this feature in the next release.
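
A cleanup script of that kind might look like the sketch below. It keeps only the three files named earlier in this thread as the main outputs (report.tsv, airr.tsv, annot.fa) and deletes every other regular file sharing the output prefix. `clean_trust4_outputs` is a hypothetical helper, and the `<prefix>_<suffix>` file naming is an assumption that should be checked against a real run before anything is deleted.

```shell
# Sketch: remove all files for one output prefix except the main outputs.
# Assumption: TRUST4 names its files "<prefix>_<suffix>"; verify on a real
# run first, and extend the keep list with any other files you need.
clean_trust4_outputs() {
    prefix=$1   # the value passed to run-trust4 -o, e.g. TRUST4
    for f in "${prefix}"_*; do
        [ -f "$f" ] || continue      # skip directories and unmatched globs
        case "$f" in
            "${prefix}_report.tsv"|"${prefix}_airr.tsv"|"${prefix}_annot.fa")
                ;;                   # main outputs: keep
            *)
                rm -f -- "$f" ;;     # anything else: treat as intermediate
        esac
    done
}
```

Calling `clean_trust4_outputs TRUST4` after each sample finishes would free the space before the next run starts. Be careful with prefixes that are themselves prefixes of other samples' names, since the glob would match those files too.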

@dcarbajo
Author

great to know! thanks

@mourisl
Collaborator

mourisl commented Mar 12, 2024

The feature for removing intermediate files has been added and is mentioned in thread #248. So I'll close this issue for now.

@mourisl mourisl closed this as completed Mar 12, 2024

@marcoco90

Hello @mourisl.

Adding to this thread with a similar question.

We run the Takara SMART-Seq® Human TCR (with UMIs) to obtain TCR data.
The read structure is as follows:
[Figure: read structure diagram]

The way I process the data is as follows:

  1. Trim the first 70 bases of R1 and R2
  2. Run TRUST4 as follows for UMI correction:
    run-trust4 --barcodeLevel molecule \
    -f hg38_bcrtcr.fa \
    --ref human_IMGT+C.fa \
    -1 read_1.fastq \
    -2 read_2.fastq \
    --barcode read_2.fastq \
    --readFormat bc:0:12,r2:17:-1 \
    -o TRUST4

Can you confirm this is the right approach?

Also, I tested the same data without the UMI correction, using the following:
run-trust4 --barcodeLevel molecule \
-f hg38_bcrtcr.fa \
--ref human_IMGT+C.fa \
-1 read_1.fastq \
-2 read_2.fastq \
-o TRUST4

With UMI correction I get a total count of ~5k (sum of column A of the report file), while without it I get ~220k.
I assume the UMI correction is getting rid of the duplicates.
But is a drop of this magnitude expected?

Thanks

@mourisl
Collaborator

mourisl commented Oct 2, 2024

@marcoco90 For the second command, you should still use the --barcode and --readFormat options; TRUST4 will then regard the barcode as the UMI, as the "--barcodeLevel" option specifies. Without the --barcode option, the data is processed as a regular bulk data set and the abundance (~220k) is the read count, so it is much greater than 5k.
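
The totals being compared here are sums over the first column of the report file. A minimal sketch for reproducing them is shown below. `sum_report_counts` is a hypothetical helper, and it assumes a tab-separated report whose header line starts with '#'.

```shell
# Sketch: sum the first column (the count) of a TRUST4 report.
# Assumption: tab-separated file; lines starting with '#' are header lines.
sum_report_counts() {
    awk -F '\t' '!/^#/ && NF > 0 { total += $1 } END { print total + 0 }' "$1"
}
```

Running it on the reports from both runs gives the two numbers side by side. Per the explanation above, the total from a run without --barcode is a read count, so the two totals are not expected to match.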

@marcoco90

Thanks @mourisl for the response and support on TRUST4.
I now understand the difference in count with and without UMI.

So can you confirm that the first approach is the right one in terms of order and parameters (steps 1 and 2)?

Thanks

@mourisl
Collaborator

mourisl commented Oct 2, 2024

@marcoco90 Yes, the first approach is the right one. Actually, you don't need to explicitly trim reads as an additional step; you can use --readFormat to ignore those regions. An example of the command would be:

run-trust4 --barcodeLevel molecule \
-f hg38_bcrtcr.fa \
--ref human_IMGT+C.fa \
-1 read_1.fastq \
-2 read_2.fastq \
--barcode read_2.fastq \
--readFormat bc:70:71,r2:76:-1,r1:70:-1 \
-o TRUST4
# Note: the original bc:0:12 should be bc:0:11, as the range is 0-based and closed on both ends.

Just to confirm: do the first 70bp of read1 and read2 really contain no actual read sequence? For example, if read2 starts from the middle of the left chunk, the 70bp may contain regions with the UMI.

@marcoco90

@mourisl I confirmed with the vendor that the UMI is the first 12 bp of the fastq reads, followed by a 7 bp linker sequence before the actual sequence.
So we will run the following command on NON-TRIMMED reads:
run-trust4 --barcodeLevel molecule \
-f hg38_bcrtcr.fa \
--ref human_IMGT+C.fa \
-1 read_1.fastq \
-2 read_2.fastq \
--barcode read_2.fastq \
--readFormat bc:0:12,r2:20:-1 \
-o TRUST4

Thanks for the support.

@mourisl
Collaborator

mourisl commented Oct 3, 2024

@marcoco90 Small adjustment to the readFormat option: --readFormat bc:0:11,r2:19:-1

@marcoco90

@mourisl thanks. I guess that is because the indexing starts at 0, so the first nucleotide is at position 0, correct?

@mourisl
Collaborator

mourisl commented Oct 4, 2024

@marcoco90 Yes.
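
The off-by-one correction above follows directly from the ranges being 0-based and closed on both ends: a 12 bp UMI occupies positions 0 to 11, and with a 7 bp linker after it the usable part of read 2 starts at position 19. As a rough sketch of that arithmetic, the hypothetical helper below builds the option value from segment lengths. It assumes the UMI starts at the very first base of read 2 and that the linker follows it directly.

```shell
# Sketch: build a --readFormat value from segment lengths.
# Assumptions: the UMI starts at base 0 of read 2 and the linker follows it
# immediately; ranges are 0-based and closed on both ends.
build_read_format() {
    umi_len=$1        # e.g. 12 bp UMI
    linker_len=$2     # e.g. 7 bp linker after the UMI
    r1_skip=${3:-0}   # optional: bases to skip at the start of read 1

    bc_end=$((umi_len - 1))              # last UMI base, inclusive
    r2_start=$((umi_len + linker_len))   # first base kept from read 2

    fmt="bc:0:${bc_end},r2:${r2_start}:-1"
    if [ "$r1_skip" -gt 0 ]; then
        fmt="${fmt},r1:${r1_skip}:-1"
    fi
    printf '%s\n' "$fmt"
}
```

`build_read_format 12 7` yields the value given in the correction above, and `build_read_format 12 8 28` reproduces the conservative setting suggested earlier in the thread for the other kit.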

@marcoco90

@mourisl In --readFormat I am not specifying the read structure for R1 (--readFormat bc:0:11,r2:19:-1). Is that necessary?

Also, what is the difference in output between running the following commands with raw reads as input?
1) trust4 --barcodeLevel molecule -f $Genome/hg38_bcrtcr.fa --ref $Genome/human_IMGT+C.fa -1 $read1 -2 $read2 --barcode $read2 --readFormat bc:0:11,r2:19:-1 -o $samid -t 10 (supposedly the format for BAM file as input)
2) run-trust4 --barcodeLevel molecule -f $Genome/human_IMGT.fa -1 $read1 -2 $read2 --barcode $read2 --readFormat bc:0:11,r2:19:-1 -o $samid -t 10 (the format for raw reads as input).

I do not expect major differences, correct?

Thanks for your extensive support

@mourisl
Collaborator

mourisl commented Oct 24, 2024

Should the first 70bp of read1 be removed? If so, you can add r1:70:-1 to the --readFormat option.

The hg38_bcrtcr.fa contains UTR regions of the genes, so using it can help obtain longer contigs. For CDR3 analysis, the difference should be small.
