Question about output #141

Open
Chuang1118 opened this issue Jul 7, 2022 · 8 comments

Comments

@Chuang1118

Hello Trust4,

Thanks for such a flexible tool.
My team developed a custom 5' spatial BCR method, and I am now at the preprocessing stage.
Each spot is like a mini-bulk sample (5-20 cells) with an additional spatial barcode. In my FASTQ files, Read 2 is 279 nt and Read 1 is 28 nt.

Below is the command I ran with TRUST4 (the version from "Output all the required fields for AIRR format", #138):

nohup singularity exec -B /mnt/DOSI:/mnt/DOSI /path/ccbr_trust4_1.0.7b.sif \
run-trust4 -f /opt2/TRUST4/hg38_bcrtcr.fa -t 20 --ref /opt2/TRUST4/human_IMGT+C.fa -u /path/A_102_0023_10xVisium_IgHseq_FR2_S11_L001_R2_001.fastq.gz --barcode /path/A_102_0023_10xVisium_IgHseq_FR2_S11_L001_R1_001.fastq.gz --barcodeRange 0 15 + --barcodeWhitelist /path/barcodes.txt --UMI /path/A_102_0023_10xVisium_IgHseq_FR2_S11_L001_R1_001.fastq.gz --umiRange 16 27 + -o A_102_0023_FR2 --od out_FR2 --repseq &

output : [Wed Jul 6 21:26:41 2022] TRUST4 finishes.

A_102_0023_FR2_airr_align.tsv
A_102_0023_FR2_airr.tsv
A_102_0023_FR2_annot.fa
A_102_0023_FR2_assembled_reads.fa
A_102_0023_FR2_barcode_airr.tsv
A_102_0023_FR2_barcode_report.tsv
A_102_0023_FR2_cdr3.out
A_102_0023_FR2_final.out
A_102_0023_FR2_raw.out
A_102_0023_FR2_report.tsv
A_102_0023_FR2_toassemble_bc.fa
A_102_0023_FR2_toassemble.fq
A_102_0023_FR2_toassemble_umi.fa

Questions:

1/ As I understand it, *_barcode_airr.tsv is derived from *_barcode_report.tsv and *_airr.tsv is derived from *_report.tsv; *_barcode_airr.tsv is the single-cell output and *_airr.tsv is the bulk output.
For each barcode, I assumed TRUST4 picks the chain with the highest UMI count from *_airr.tsv and writes it to *_barcode_airr.tsv. Why, then, do some sequence_id values appear in *_barcode_airr.tsv but not in *_airr.tsv?

cat A_102_0023_FR2_barcode_airr.tsv | grep GCGGCTCTGACGTACC_203 | wc -l
1
cat A_102_0023_FR2_airr.tsv | grep GCGGCTCTGACGTACC_203 | wc -l
0

Is A_102_0023_FR2_barcode_airr.tsv independent of A_102_0023_FR2_airr.tsv? In other words, I expected A_102_0023_FR2_report.tsv to be the sum of chain1 + secondary_chain1 in A_102_0023_FR2_barcode_report.tsv. Have I misunderstood?

2/ In my case, Read 2 is 279 nt long and starts at IGH FR2. Is the stage 1 assembly necessary? Is it possible to have TRUST4 skip stage 1 and go on directly from stage 0 (read extraction)?

3/ I used the --UMI option. Where is the UMI information in the output?
Is UMI == consensus_count == read_fragment_count == read_cnt?

4/ In my situation, should I start from the "secondary" chains in A_102_0023_FR2_barcode_report.tsv or from A_102_0023_FR2_airr.tsv?
How can I extract the top n chains and rank them with trust-airr.pl if I use A_102_0023_FR2_barcode_report.tsv as input?

Thanks for your reply,

Chuang

@mourisl
Collaborator

mourisl commented Jul 7, 2022

Thank you for testing TRUST4.

  1. Your interpretation of the relations among the files is correct. The A_102_0023_FR2_airr.tsv file is for bulk data and is based on the *_report.tsv file. In this output, TRUST4 coalesces the entries with the same CDR3 and VJC genes into one entry. Therefore, this VDJ recombination might be represented by another assembled contig in the bulk setting, so "GCGGCTCTGACGTACC_203" is hidden.

A_102_0023_FR2_barcode_airr.tsv is kind of independent from the A_102_0023_FR2_airr.tsv file. The abundance in *_report.tsv and *_airr.tsv (bulk setting) is the sum of the cells bearing the same VJC genes, not the UMI count. The CDR3s in the "barcode" output file should be present in the "bulk" output file, which is chain1+secondary_chain1.
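
To illustrate why a sequence_id can disappear from the bulk file, here is a minimal Python sketch of coalescing rows that share the same V, J, C calls and CDR3. It is not TRUST4's code. The column names follow the AIRR convention (`sequence_id`, `v_call`, `j_call`, `c_call`, `junction`), the toy rows are made up apart from the ID mentioned above, and keeping the first row of each group as the representative is an assumption.

```python
import csv
import io
from collections import defaultdict

# Toy AIRR-style rows. GCGGCTCTGACGTACC_203 is the ID from this thread;
# every other value is made up for illustration.
TOY_AIRR = (
    "sequence_id\tv_call\tj_call\tc_call\tjunction\n"
    "AAACCCGGGTTTACGT_7\tIGHV3-23*01\tIGHJ4*02\tIGHG1\tTGTGCGAAAGATTGG\n"
    "GCGGCTCTGACGTACC_203\tIGHV3-23*01\tIGHJ4*02\tIGHG1\tTGTGCGAAAGATTGG\n"
    "TTTGGGCCCAAATGCA_12\tIGHV1-69*01\tIGHJ6*02\tIGHM\tTGTGCGAGAGGGTGG\n"
)


def coalesce(rows):
    """Group sequence IDs by (v_call, j_call, c_call, junction)."""
    groups = defaultdict(list)
    for row in rows:
        key = (row["v_call"], row["j_call"], row["c_call"], row["junction"])
        groups[key].append(row["sequence_id"])
    return groups


rows = list(csv.DictReader(io.StringIO(TOY_AIRR), delimiter="\t"))
groups = coalesce(rows)

# Assumption: the first ID in each group is kept, the rest are "hidden".
hidden = {ids[0]: ids[1:] for ids in groups.values()}
for representative, others in hidden.items():
    print(representative, "hides", others)
```

In this toy input, the second row shares its genes and junction with the first, so only one of the two IDs would survive in a coalesced output.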

Hope this helps.

  2. Do you mean you just want to run the "annotator" to get the CDR3s from read 2? I think this is doable, but it may need some customization. The "assembly" step helps put highly similar sequences into one contig, which makes the downstream representation much cleaner and also serves as error correction.

  3. The UMI information is used in the abundance column in the "barcode" output. So yes, umi==consensus_count.

  4. In your case, the "A_102_0023_FR2_barcode_report.tsv" file is a better start.
    trust-airr.pl only selects the primary chain in each cell. One workaround might be to create multiple entries for each barcode based on A_102_0023_FR2_barcode_report.tsv and rename the barcodes accordingly, for example GCGGCTCTGACGTACC_203_0 for the primary chain and GCGGCTCTGACGTACC_203_1 for the secondary chain. After expanding the barcode_report file, you can feed it into the trust-airr.pl script to get the full AIRR file. I can write a script for this.
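
The renaming workaround in point 4 could look roughly like the Python sketch below. It rests on several assumptions that should be checked against a real file: that each barcode_report line has the tab-separated columns barcode, cell_type, chain1, chain2, secondary_chain1, secondary_chain2; that `*` marks a missing chain; and that multiple secondary chains are joined with `;`. The function name `expand_row` and the toy chain strings are hypothetical.

```python
MISSING = "*"  # assumed placeholder for an absent chain


def expand_row(cols):
    """Turn one barcode_report row into one row per chain.

    Assumed column layout:
    barcode, cell_type, chain1, chain2, secondary_chain1, secondary_chain2
    """
    barcode, cell_type, chain1, chain2, sec1, sec2 = cols[:6]
    # The primary chains keep their slots and get the "_0" suffix.
    out = [[barcode + "_0", cell_type, chain1, chain2, MISSING, MISSING]]
    suffix = 1
    # Each secondary chain is promoted to the matching primary slot
    # (index 2 for chain1, index 3 for chain2) under a new barcode name.
    for slot, field in ((2, sec1), (3, sec2)):
        if field in (MISSING, ""):
            continue
        for entry in field.split(";"):  # assumed separator
            row = [f"{barcode}_{suffix}", cell_type] + [MISSING] * 4
            row[slot] = entry
            out.append(row)
            suffix += 1
    return out


# Toy row: the chain strings are shortened stand-ins, not real TRUST4 output.
toy = ["GCGGCTCTGACGTACC", "B", "heavyA", "*", "heavyB;heavyC", "*"]
expanded = expand_row(toy)
for row in expanded:
    print("\t".join(row))
```

Each expanded row then looks like an ordinary single-chain barcode entry, which is the form the workaround wants to feed to trust-airr.pl.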

@Chuang1118
Author

Thank you very much for this explanation!

I am now doing some post-checks, for example using A_102_0023_FR2_barcode_airr.tsv to see whether the spatial pattern of the BCRs matches the plasma cell zone.
Even though we made an effort to amplify IGH, about 30% of the cDNA remains as noise that is not V, J, or C genes, so the candidate read extraction stage is necessary. I also saw that the output of the de novo assembly stage, A_102_0023_FR2_assembled_reads.fa, is tagged with barcodes and UMIs.
I could modify trust-airr.pl myself, but it would take me some time. If the script would not take much of your time, I would appreciate your help getting the full AIRR file. I have another "barcoding mini-bulk" project that needs this script.
In the meantime I will finish a first version as you suggested.

Thanks again,
Chuang

@mourisl
Collaborator

mourisl commented Jul 7, 2022

I just uploaded a script "barcoderep-expand.py" to create barcode entries for secondary chains in the "scripts" folder. You can run it as:
"python3 barcoderep-expand.py -b XXX_barcode_report.tsv > expanded_barcode_report.tsv". Then you can feed this into the trust-airr.pl script with other parameters to create the comprehensive AIRR output. Note that "barcoderep-expand.py" will add a suffix "_0", "_1"... for the barcode id.

By default, this script filters out low-abundance chains (the default threshold is 0.1), and you can adjust that with the option "--frac". Will this help?
I also need to modify the script that generates the barcode_report file, so that it can also output the secondary TCRs in a cell whose primary chain is "B".

@mourisl
Collaborator

mourisl commented Jul 8, 2022

Just fixed a few bugs. I think the scripts should work fine now.

@Chuang1118
Author

Hello mourisl,

The --frac option is useful for me; I set frac to 0 to get the complete output.
The barcoderep-expand.py and trust-airr.pl scripts work, except that you forgot to comment out line 71 in barcoderep-expand.py:

 67         # Output the primary chain
 68         outputCols = cols[:]
 69         outputCols[0] = barcode + "_0"
 70         print("\t".join(outputCols))
 71         print(len(cols))                        --------> # print(len(cols))
 72         # Expand the secondary chain
 73         secondaryEntry = cols[3 + chain]

Question:
In expanded_barcode_report.tsv, I saw some barcodes whose cdr3aa is marked "out_of_frame". These barcodes get productive == F in expanded_barcode_airr.tsv, is that right?

cat A_102_0023_FR2_barcode_airr.tsv | cut -f4 | sort | uniq -c
      1 productive
   2065 T
cat expanded_barcode_airr_frac0.tsv | cut -f4 | sort | uniq -c
  32035 F
      1 productive
   2516 T

With all this effort I only gained fewer than 500 additional BCRs.
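
The `1 productive` line in both tallies above is the header row, which `cut -f4` passes through like any other line. A minimal Python sketch of the same tally that skips the header is shown here; it assumes a tab-separated AIRR file with a `productive` column, and the toy rows and the function name `count_productive` are made up for illustration.

```python
import csv
import io
from collections import Counter


def count_productive(handle):
    """Tally the values of the 'productive' column, excluding the header."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["productive"] for row in reader)


# Toy stand-in for an AIRR file; with real data, pass an open file instead.
TOY = (
    "sequence_id\tsequence\trev_comp\tproductive\n"
    "spotA_0\tACGT\tF\tT\n"
    "spotA_1\tACGT\tF\tF\n"
    "spotB_0\tACGT\tF\tT\n"
)

counts = count_productive(io.StringIO(TOY))
for value, n in sorted(counts.items()):
    print(n, value)
```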

Thanks for your help,
Chuang

@mourisl
Collaborator

mourisl commented Jul 8, 2022

Thank you for noticing this issue. I have removed the debugging print line from the github repo.
For the "productive" column, yes: if a chain is out of frame, has a stop codon, or has "N" in the nucleotide sequence, it will be marked as "F".
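
As a rough illustration of those three conditions, the Python sketch below checks a nucleotide string for "N", for a length that is not a multiple of three, and for an in-frame stop codon. It is not TRUST4's implementation; applying the check to a CDR3 nucleotide string, and the function name `is_productive`, are assumptions made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_productive(cdr3_nt):
    """Return False if the sequence has an N, is out of frame, or has a stop codon."""
    seq = cdr3_nt.upper()
    if "N" in seq:
        return False  # ambiguous base
    if len(seq) % 3 != 0:
        return False  # out of frame
    codons = (seq[i:i + 3] for i in range(0, len(seq), 3))
    return not any(codon in STOP_CODONS for codon in codons)


# Made-up test sequences, one per condition.
examples = {
    "TGTGCGAGATGG": "in frame, no stop",
    "TGTGCGAGATG": "length not a multiple of 3",
    "TGTTAAAGATGG": "contains the stop codon TAA",
    "TGTGCNAGATGG": "contains N",
}
for seq, note in examples.items():
    print(seq, "T" if is_productive(seq) else "F", note)
```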

One possible reason you did not get many more BCRs is that B cells from the same location may bear the same BCR as a result of proliferation.

@Chuang1118
Author

@mourisl
I observed that the non-productive barcodes have no CDR3. Maybe 279 nt starting from FR2 is sometimes not enough, although I sometimes get the constant region. I have another version starting from FR1, where 279 nt is not enough to reach the CDR3.
I think the main BCRs we captured come from plasma cells (PC) or memory B cells in a quiescent state.
Good news: I observed high UMI counts at the plasma cell locations and low UMI counts at the TLS locations.

It is 6:30 PM here; have a good weekend.
Chuang

@mourisl
Collaborator

mourisl commented Jul 8, 2022

The primary chain fields in the barcode_report file only have the productive chains. So if none of the chains in a barcode is productive, this barcode will be skipped in the report file.
