10X whitelist issue #74
Comments
Could you please show me the full command you used? Thanks. |
Hi, sure thing. This was for chemistry v3, but I have tried it for v1 and v2 and the same thing happens. Each time I adjusted the ranges to fit the chemistry. |
Could you please show me the full on-screen output from TRUST4? I need to make sure that the crash happened in the fastq-extractor program. Thanks. |
Sorry, our cluster blocks copying and pasting the output, and I have to keep the study IDs of the samples confidential. But yes, it occurs just as the fastq-extractor starts. It prints "system" followed by the input command as above, and then "failed: 11 at run-trust4 line 48". Is that enough? If not, I can email you a screen capture. |
Just want to make sure, the 3M-february-2018.txt file is compressed in the cellranger package, so have you decompressed it before running TRUST4? Did you see the line " Start to extract candidate reads from read files." output on the screen? |
Hi, yes I took it from a local installation of cellranger and uploaded it on our cluster within a custom directory that I then supply the absolute path to. The reason was because our cell ranger is a module installation that cannot be edited by end-users. But the file is the decompressed txt. What I see is:
|
This is very strange. Could you please show me the first few lines for XXX_1.fastq.gz, XXX_2.fastq.gz and 3M-february-2018.txt files? |
it is just the absolute paths to the locations: for our system its I would be surprised if that is an issue, as TRUST4 runs to completion when I do not include the whitelist. It extracts the barcodes, UMIs and reads fine and generates the outputs. It seems to be something to do with the whitelist matching and correction. The only other possibility I can think of is that you call directly on shared libraries or files located in /bin, which we do not have access to, or that you call cellranger under the hood, but I didn't see that in the script. |
Sorry, I mean the first few lines of the content in those files, so I can check whether the lengths, especially in the whitelist file, match the files on our server. Thanks. |
Oh right. sorry. R1 R2 whitelist: |
Sorry, can I add a request to the same issue thread? Is it possible to index annot.fa against report.tsv or cdr3.out, or to add read counts to the sequence names in annot.fa? I am trying to re-annotate the annot.fa assembly contigs with IgBLAST, but there is no way to map them back to the read counts: since each annot.fa entry is a consensus contig, the read fragment count is lost. This would be a really helpful feature, since I predominantly use Immcantation as my analysis workflow to get mutation counts etc., and in its current state I cannot do that the way I can with MiXCR. In MiXCR, indexing is possible because they provide the "targetsequence" (consensus assembly) in their report file. Thanks very much for the consideration. |
The first column in the cdr3.out file is the contig id in the annot.fa file, which can be used for index. Or do you mean other way of indexing? For the whitelist issue, it finished without any error on our data even when testing with different settings. I may need some more time to test. In our previous experiment, the barcode correction marginally improved the results. |
Thanks for the update on the whitelist issue and the ongoing troubleshooting. For the indexing, I have noticed that there are multiple duplicated assemble ids, e.g. assemble0 occurs up to 20 times for me. When I look closer, this is the case in the report as well, and it seems to be because TRUST4 allows for SHM, so multiple cids can be assigned to different sequence contigs. The cdr3.out file nicely contains a consensus_id index, but this is not present in annot.fa, and therefore we cannot match the exact sequence in annot.fa to the report or cdr3.out file. Also, the cdr3.out file oddly has read counts that look like averages (76.08, for example), whereas the report has whole numbers. |
The sequence contig in the annot.fa file is the consensus and encodes/compresses highly similar sequences, such as those arising from SHMs as you mentioned. Therefore, in the _cdr3.out file, TRUST4 also outputs the minor CDR3s encoded in the consensus. In other words, the CDR3s all come from the sequence in annot.fa, and they are distinguished by the second column, "index_within_consensus", in cdr3.out. For CDR3s from the same contig/consensus, I think you can pick the one with the highest abundance as the representative. As for the decimal abundances: when TRUST4 decodes the CDR3s encoded in the consensus, a read that only partially overlaps the CDR3 region can support more than one CDR3 type. Therefore, TRUST4 applies an EM algorithm to quantify the CDR3s from the same consensus. Hope these explanations help. |
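The suggestion above, keeping the most abundant CDR3 per contig, could be sketched roughly as follows. This is a minimal Python sketch and not part of TRUST4: the function name `pick_representatives` and the sample rows are made up for illustration, and the position of the abundance column is an assumption passed in as a parameter. The thread only confirms that the first column of cdr3.out is the contig id and the second is index_within_consensus.

```python
from typing import Dict, Iterable, List


def pick_representatives(
    lines: Iterable[str],
    id_col: int = 0,      # contig id column (first column, per the thread)
    count_col: int = 10,  # ASSUMPTION: abundance column; check your file
) -> Dict[str, List[str]]:
    """Return, for each contig id, the row with the highest abundance."""
    best: Dict[str, List[str]] = {}
    for line in lines:
        row = line.rstrip("\n").split("\t")
        if len(row) <= max(id_col, count_col):
            continue  # skip short or empty rows
        try:
            count = float(row[count_col])  # abundances may be decimals (EM)
        except ValueError:
            continue  # skip header-like rows with a non-numeric count
        contig = row[id_col]
        if contig not in best or count > float(best[contig][count_col]):
            best[contig] = row
    return best


# Toy rows (invented): contig id, index_within_consensus, abundance.
toy = [
    "assemble0\t0\t76.08",
    "assemble0\t1\t12.50",
    "assemble1\t0\t3.00",
]
representatives = pick_representatives(toy, count_col=2)
```

Note that this discards the minor CDR3s, which is exactly the information loss raised later in the thread, so it only suits analyses that need one sequence per contig.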
Thanks, that makes sense. However, I remain a little confused how trust4 works in this case:
Lastly, I would prefer to index all annot.fa contigs against those in the cdr3.out or report.tsv files. The trouble with picking the most abundant one is that the nucleotide sequences are not identical, so for diversity analysis you lose information if you pick only by consensus or abundance for use in other tools. It seems this could be easily resolved by adding an index_within_consensus ID to the annot.fa file? |
Yes, you can regard the consensus assembly being the most abundant clonotype. I think I can write a simple script to break up consensuses in annot.fa file with information from _cdr3.out file, so they have a one-to-one mapping. |
That would be great. Thank you. Alternatively, if the fully assembled contig could be added to the report.tsv and cdr3.out files, that would work too, as a user could then generate a FASTA file directly from the contig. Re: how TRUST4 recovers sequences: in a B cell receptor I would consider any variation in V, junction or J to be potentially relevant and to warrant assembly into a full contig, so long as the read fragments are of good quality and the overlaps are of high confidence. This is how BraCeR and MiXCR handle their contigs. I understand the logic from the methods section of TRUST4 that, from a computational efficiency perspective, the consensus system is faster. However, from an analysis perspective, grouping into a single assembly loses a lot of information and granularity about how diverse an individual clonal family is. Many BCR analyses specifically look at the phylogeny of individual clones of interest to understand clonal selection/affinity maturation. Would it be possible to generate variants of a consensus sequence by knitting variant reads into their own individual contigs? E.g. if a read has 2 mutations in the CDR1 region but otherwise aligns exactly to the consensus, annot.fa would gain an additional contig that is the consensus except for those 2 nucleotides in CDR1. |
If there are enough variations, TRUST4 will break them up into different contigs. The full contigs are for the highly expressed receptors. |
Thanks for explaining, that makes sense now. In that case the only desired addition, if possible, would be to break up annot.fa to match cdr3.out and report.tsv or to have consensus full length sequences as an additional column in the report.tsv. That would help a lot with indexing back to the read counts for all within consensus ids |
I just added a Perl script, AddSequenceToCDR3File.pl, to the scripts folder in the GitHub repo. You can run it as: Please let me know whether it works. |
Hello,
Here I changed every home directory to '~' to remove my name from the paths. Also, sorry for the long data folder names. I noticed that the example code in the README uses --barcodeWhiteList, whereas the parameter is --barcodeWhitelist, with a lower-case "l" in "list". Does this make any difference? Please tell me more about this. Thank you. |
That is a typo in README. It should be --barcodeWhitelist. I will fix that right away. Thank you! |
I tried several different data sets on our server and could not reproduce this whitelist error. What system were you using? Was it macOS? |
For me, I am using an Xshell terminal to connect to a server, and the system is CentOS Linux release 7.5.1804 (Core). Is this information helpful? |
Our cluster structure is a little complex, but each node runs a version of Linux. Just to sanity check, are there any hardcoded parts in the script that require the whitelist to be in a cellranger library? I ask because I stored the whitelists in a project-specific directory. I had a look and it didn't appear to be the case, but I thought I would check. |
Hi, I have the same problem as 'Nusob888'.
and the error is like this.
Those fasta files came from 10X, and some information is here. R1 :
R2 :
whitelist came from cellranger :
Thanks for reading. |
@ttab963 I think the reason for your case might be different. There are two issues with your command:
|
Thanks for your reply.
|
@ttab963 Your whitelist is from the cellranger translation folder. It contains two columns, possibly for the translation of multiome data. TRUST4 assumes each row contains one barcode, so you should use the barcode file in the path /mnt/S3/data2/workbench/Users/ktkim/program/cellranger4.0.0/lib/python/cellranger/barcodes/ instead of the one in the translation folder. (You still need to unzip the whitelist there.) |
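A quick sanity check along these lines can catch a two-column whitelist before a run. The following is a minimal Python sketch, not something shipped with TRUST4: the function name `check_whitelist` is hypothetical, and the mixed-length check is an extra assumption motivated by the earlier request to compare barcode lengths. The only rule taken from the thread is that each row should contain exactly one barcode.

```python
from collections import Counter
from typing import Iterable, List


def check_whitelist(lines: Iterable[str]) -> List[str]:
    """Return a list of human-readable problems; empty means none found."""
    problems: List[str] = []
    lengths: Counter = Counter()
    for number, line in enumerate(lines, start=1):
        fields = line.split()  # splits on tabs and spaces
        if not fields:
            continue  # ignore blank lines
        if len(fields) != 1:
            problems.append(
                f"line {number}: {len(fields)} columns, expected 1"
            )
            continue
        lengths[len(fields[0])] += 1
    if len(lengths) > 1:
        # ASSUMPTION: barcodes in one whitelist share a single length.
        problems.append(f"barcodes have mixed lengths: {sorted(lengths)}")
    return problems
```

To use it on a real file, pass the open file handle, for example `check_whitelist(open(path))` with `path` pointing at the decompressed whitelist.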
Sorry for bothering you. I realized my data was actually built with the 10x 5' library, so I changed the whitelist to Cellranger process with those data has any problems so data is not corrupted. Thanks again, @mourisl. |
Hello, I just wanted to mention that I am getting this same error when running TRUST4 with the whitelist ( Thanks! |
I have the exact same error at the fastq-extractor when running with my own whitelist, (when removed it works fine): Here's my input: my barcode_whitelist_24.txt:
I'm using it on a remote server running CentOS Linux, version 7. Thanks! |
@rawlings-lab Which TRUST4 version are you using? This issue should be fixed with a recent update. |
My fault on that. I installed through conda and it looks like it installed 1.05 and I hadn't noticed. Got it installed through github and the error is gone. Much appreciated for the quick response. |
Hi @Nusob888 @mourisl @jmostrom013 @rawlings-lab ,
It seems that the issue has nothing to do with the white list, as I was getting the error even on bulk-RNAseq datasets. The issue was that TRUST4 was not creating the output directory specified in
|
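If the output directory is indeed not being created automatically, creating it before launching TRUST4 is a simple workaround. The snippet below is a minimal Python sketch under that assumption. The helper name `ensure_output_dir` and the example path are hypothetical and do not come from TRUST4.

```python
from pathlib import Path


def ensure_output_dir(path: str) -> Path:
    """Create the directory (and any missing parents) if it does not exist."""
    out = Path(path).expanduser()  # allow '~' in the path
    out.mkdir(parents=True, exist_ok=True)  # no error if already present
    return out


# Hypothetical usage: call this with your own output location before
# starting the run, e.g. ensure_output_dir("~/trust4_results/sample1").
```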
Hi,
Sorry for spamming the issues. I am having an error come up whenever I try to add a whitelist.
essentially the run report gives me no additional explanation other than:
failed: 11 at run-trust4 line 48
I have tried to look over the script but can't really identify why it should fail there.
If I remove the barcodeWhitelist parameter, it carries on as normal. I have also triple checked the whitelists and can match them to the dataset fine.
Is it possible to check if this is a reproducible error? or something specific to my run?
It occurs on all datasets I have tried it on.
Thanks
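One way to read the number in "failed: 11" is as a raw process wait status. The following is a minimal Python sketch that decodes such a value, assuming the wrapper script reports the raw status of the child process in the usual POSIX layout (signal number in the low 7 bits, core-dump flag in the next bit, exit code in the high byte). The thread does not confirm this, and the function name `decode_wait_status` is hypothetical.

```python
import signal


def decode_wait_status(status: int) -> str:
    """Describe a raw 16-bit wait status in words."""
    sig = status & 0x7F          # low 7 bits: terminating signal, if any
    core = bool(status & 0x80)   # next bit: core-dump flag
    code = status >> 8           # high byte: exit code
    if sig:
        try:
            name = signal.Signals(sig).name
        except ValueError:
            name = f"signal {sig}"  # unknown signal number on this platform
        return f"killed by {name}" + (" (core dumped)" if core else "")
    return f"exited with code {code}"
```

Under that assumption, a status of 11 would decode as termination by signal 11 (a segmentation fault on Linux) rather than as an ordinary exit code.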