apply for reference data #1
Comments
The CRAMs that were uploaded to EGA are not aligned - they're just a space efficient form of unaligned read storage. You should convert the CRAMs to FASTQs, and then you're free to align them as you see fit. This is the command I'd use locally to perform this task for a given
|
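The exact command is not preserved in this copy of the thread. As a rough illustration only, and not the maintainer's command, a CRAM-to-FASTQ conversion with samtools could be assembled as below. The file names, the `sample` prefix and the thread count are hypothetical, and the sketch assumes the CRAM stores paired reads that should end up in separate R1/R2 files.

```python
import shlex


def cram_to_fastq_commands(cram_path, sample, threads=4):
    """Build two samtools invocations for an unaligned, paired-read CRAM.

    All output names are derived from the hypothetical `sample` prefix.
    """
    # Group reads by name so mates sit next to each other.
    # -O writes to stdout, -u skips compression of the intermediate stream.
    collate = [
        "samtools", "collate", "-O", "-u", "-@", str(threads),
        cram_path, f"{sample}.collate_tmp",
    ]
    # Split the name-grouped stream into R1/R2 FASTQ files.
    # -n keeps read names unchanged; unpaired/other reads are discarded here.
    fastq = [
        "samtools", "fastq", "-@", str(threads), "-n",
        "-1", f"{sample}_R1.fastq.gz",
        "-2", f"{sample}_R2.fastq.gz",
        "-0", "/dev/null", "-s", "/dev/null",
        "-",
    ]
    return collate, fastq


def as_shell_pipeline(collate, fastq):
    """Render the two commands as one shell pipeline string."""
    return " | ".join(shlex.join(cmd) for cmd in (collate, fastq))


collate_cmd, fastq_cmd = cram_to_fastq_commands("sample.cram", "sample")
print(as_shell_pipeline(collate_cmd, fastq_cmd))
```

The sketch only prints the pipeline so it can be inspected before running it; whether the mate layout matches what a downstream aligner expects should be checked on one file first.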
Thanks for your kind and quick reply! I will try following your suggestion. Best regards! |
Hi, ktpolanski Wei Liu |
We passed the following file as the |
Thanks for your quick response and for showing me the command. But I am a little confused. For example, I know that C1-PBMC is a mix of four donors: NP16, NP41, NP15, and NP20. Each donor corresponds to an oligonucleotide sequence. However, there is no donor ID information. What do the first two columns, "id" and "name", represent ("AB_" + gene symbol)? Thanks! |
I'd like to direct you to "Demultiplexing and doublet removal of PBMC samples" in the manuscript. |
Hi, ktpolanski |
Hi, ktpolanski |
Hi, ktpolanski! Thank you! |
The known genotypes are not necessary if you plan your experiments accordingly. Let's assume for simplicity's sake that you've got two donors, A and B. You have one sample that's donor A only, one sample that's donor B only, and then some number of samples that are a mix of the two. You possess sufficient information to correctly identify the donors without additional genotyping. This is how this was handled for this study. @RikLindeboom is going to get back to you with details. |
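To make the two-donor reasoning concrete, here is a minimal sketch of one way such a match could be done. It is an illustration under stated assumptions, not the procedure used in the study: genotypes are simplified to alternate-allele counts (0, 1, 2) per site, and the site names, donor labels and cluster IDs are all made up.

```python
def concordance(profile_a, profile_b):
    """Fraction of sites called in both profiles where the genotypes agree."""
    shared = [site for site in profile_a if site in profile_b]
    if not shared:
        return 0.0
    agree = sum(profile_a[site] == profile_b[site] for site in shared)
    return agree / len(shared)


def match_clusters(mixed_clusters, single_donor_profiles):
    """Assign each cluster from a mixed sample to the most concordant donor."""
    return {
        cluster: max(
            single_donor_profiles,
            key=lambda donor: concordance(genotype, single_donor_profiles[donor]),
        )
        for cluster, genotype in mixed_clusters.items()
    }


# Hypothetical profiles from the donor-A-only and donor-B-only samples.
single = {
    "A": {"chr1:100": 0, "chr1:200": 2, "chr2:50": 1, "chr3:10": 0},
    "B": {"chr1:100": 2, "chr1:200": 0, "chr2:50": 1, "chr3:10": 1},
}
# Hypothetical cluster genotypes recovered from a mixed A+B sample.
mixed = {
    "0": {"chr1:100": 2, "chr1:200": 0, "chr2:50": 1},
    "1": {"chr1:100": 0, "chr1:200": 2, "chr3:10": 0},
}

print(match_clusters(mixed, single))
```

With real data the calls are noisy and many sites are missing in any given cluster, so a threshold on the concordance and on the number of shared sites would be sensible before trusting an assignment.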
Hi, ktpolanski! Thank you! |
Yeah, the only element of uncertainty when talking with Rik was about the nasal data. In that case we're good and no need for Rik to pop down here :) |
Hi, ktpolanski! |
Hi Wei, Apologies for the late reply. I've been investigating if we can update the EGA submission with genotypes. You're right that for some samples we have matched nasal samples, and in these cases you can try to match those genotypes with souporcell cluster genotypes. Unfortunately, we don't have matched nasal / tracheal samples for all PBMCs, so this won't work for all samples. While we figure out how to share the genotype vcfs that we have generated, I think the easiest solution for you would be to do a semi-supervised souporcell analysis. After you have rerun souporcell, you can compare the output with the ID labels that we provide in the freely available h5ad file on our data portal ( https://www.covid19cellatlas.org/ ). This should match quite nicely and you can then just assign the best overlapping id to each souporcell cluster. Btw, when running souporcell please note that it doesn't always work perfectly, and you might have to tinker around with parameters to get a good deconvolution of the genotypes. We found that in some cases it's required to run souporcell with one more cluster than expected, as it appeared that sometimes noise within one sample 'overshadowed' other real genotypes. Hope this helps for now, and I'll get in touch once our legal and data wrangling teams have advised on sharing the genotypes. Many thanks and with best wishes, Rik |
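The "assign the best overlapping id to each souporcell cluster" step can be sketched as follows. This is a minimal illustration, not code from the study: it assumes you have already loaded two barcode-keyed mappings yourself, one from your souporcell run and one from the donor ID labels in the published h5ad, and the barcodes below are invented.

```python
from collections import Counter, defaultdict


def overlap_table(cluster_by_barcode, donor_by_barcode):
    """Count, per souporcell cluster, how many barcodes carry each donor ID."""
    table = defaultdict(Counter)
    for barcode, cluster in cluster_by_barcode.items():
        donor = donor_by_barcode.get(barcode)
        if donor is not None:  # ignore barcodes absent from the published labels
            table[cluster][donor] += 1
    return table


def assign_donors(cluster_by_barcode, donor_by_barcode):
    """Map each cluster to the donor ID it overlaps with most often."""
    table = overlap_table(cluster_by_barcode, donor_by_barcode)
    return {cluster: counts.most_common(1)[0][0] for cluster, counts in table.items()}


# Toy inputs: cluster labels from a souporcell rerun and published donor IDs.
clusters = {
    "AAAC-1": "0", "AAAG-1": "0", "ACGT-1": "0",
    "AACC-1": "1", "AACG-1": "1", "AAGT-1": "1",
}
donors = {
    "AAAC-1": "NP16", "AAAG-1": "NP16",
    "AACC-1": "NP41", "AACG-1": "NP41", "AAGT-1": "NP16",
}

print(assign_donors(clusters, donors))
```

Inspecting the full overlap table rather than only the winning label is worthwhile: a cluster split evenly between two IDs, or an extra cluster with few labelled barcodes, would point to the noisy deconvolution cases mentioned above.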
Hi Rik G. H. Lindeboom, Yours! |
Hi Rik @RikLindeboom, Hopping on to this thread because we are in a similar situation to Wei above: we are interested only in the GEX and V(D)J data of healthy PBMCs, which were pooled samples. I think that with your pointers on the semi-supervised souporcell analysis we can run it. However, I wanted to ask whether you are allowed to share the genotype VCFs that you generated. Many thanks! |
Hi all, Thanks for bringing this up and bearing with us. This is just to confirm that we have now sent a VCF for deconvolution to EGA, so it should be available through EGA soon. With best wishes, |
Hi,
I recently obtained access to dataset EGAD00001007718 from EGA, but all the data are in CRAM format. To convert the data to BAM format, I need the reference genome.
I followed the instructions in the Methods section and downloaded the human and virus genomes, but the conversion failed. It seems the genomes are not the correct ones.
I downloaded the human genome file from http://ftp.ensembl.org/pub/release-93/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz. Is this the genome version you used in the paper? If not, could you please point me to the correct genome links?
As for the virus genome, I downloaded these genomes, but the conversion failed. Could you share the virus genome with me directly?
Could you please help me get the correct genome so that I can convert the CRAM files to BAM files? I look forward to your help. Thanks!
Wei Liu
liuwei3@sysucc.org.cn