guppy v6 #64

Closed
aafshinfard opened this issue Jul 27, 2022 · 17 comments

Comments

@aafshinfard

Are there any plans to release Guppy >= v6 basecalls of the reads?
Thanks.

@skoren
Member

skoren commented Aug 1, 2022

No immediate plans since we're not actively working on CHM13 and we've not found much benefit going to guppy 6+ with our hybrid assembly method.

@aafshinfard
Author

Thanks for the response @skoren

@hasindu2008

Given that I recently downloaded the whole raw signal dataset, I am planning to re-basecall it with Guppy 6. If it succeeds (I am not sure how long it will take) and if your AWS storage can host more data, @skoren, I can pass the results on so that they can be shared there.

@aafshinfard
Author

@hasindu2008 That would be awesome!

@hasindu2008

@aafshinfard I have recently converted all the raw data to BLOW5 format and basecalled it using the Guppy 6.1.3 hac model. Given the large size of the files, I am not sure how I could share them. Any suggestions?

@aafshinfard
Author

@hasindu2008 Nice to hear you did it. How large are the files?

@aafshinfard
Author

@hasindu2008 Would be nice if the T2T team can host this (@skoren), but another option would be Zenodo. I heard they support up to 50GB and even more in special cases...
https://www.youtube.com/watch?v=S1qK_TA52e4&t=251s

@arangrhie
Collaborator

@aafshinfard how big is the total file size?

@aafshinfard
Author

@arangrhie, I opened the issue and @hasindu2008 kindly did the job; waiting for them to respond about the size of the dataset.

@hasindu2008

@arangrhie @aafshinfard

The gzipped basecalled FASTQ files are relatively small, and I think they can be easily hosted:
288G hg2_merged_pass.fastq.gz
39G hg2_merged_fail.fastq.gz

The raw signal data converted to BLOW5 are 3.4 TB. I had to convert those 5 TB+ compressed FAST5 tarballs to BLOW5; otherwise, basecalling from FAST5 would have taken a few weeks. It would be useful for the future if those BLOW5 files could be hosted, to allow direct basecalling from S3 storage mounted locally as well as partial download of certain genomic regions when necessary (see #63). In my opinion, compressed FAST5 tarballs for a dataset this large are not easily accessible, and that diminishes the value of a useful dataset like this.
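To illustrate the partial-retrieval idea, here is a minimal sketch. It assumes the reads have already been aligned to a reference and the alignments saved in PAF format (for example by minimap2); the helper name `region_read_ids` and all file names are hypothetical and not from this thread. The helper lists the IDs of reads whose alignments overlap a region, and that list could then be handed to `slow5tools get` to extract only those records from an indexed BLOW5 file.

```shell
# region_read_ids <alignments.paf> <chrom> <start> <end>
# Prints the unique IDs of reads with an alignment overlapping the
# 0-based, half-open interval [start, end) on <chrom>.
# PAF columns used: 1 = read ID, 6 = target name, 8 = target start, 9 = target end.
region_read_ids() {
    awk -F '\t' -v chr="$2" -v s="$3" -v e="$4" \
        '$6 == chr && $8 < e && $9 > s { print $1 }' "$1" | sort -u
}

# Possible usage (hypothetical file names, not run here):
#   region_read_ids alignments.paf chr1 1000000 2000000 > ids.txt
#   slow5tools get reads.blow5 --list ids.txt -o region.blow5
```

Only the signal records for the listed reads are read from the BLOW5 file, which is what would make this workable against remotely mounted storage.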

@hasindu2008

@aafshinfard You may download the merged Guppy 6 basecalls for the whole dataset here:

https://slow5test.s3.amazonaws.com/tmp/chm13_merged_pass.fastq.gz
https://slow5test.s3.amazonaws.com/tmp/chm13_merged_fail.fastq.gz

Note that this is not free S3 storage like the one used for hosting CHM13, so I would be grateful if you could let me know once you have downloaded the files so that I can delete them. Otherwise, AWS keeps on charging.
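Since these files are large and the copy is temporary, it may be worth confirming that a download is complete before reporting back. A minimal sketch, assuming `curl` and `gzip` are available; the helper name `verify_gz` is hypothetical:

```shell
# verify_gz <file.gz>
# Tests the gzip stream end to end; a truncated or corrupted download fails.
verify_gz() {
    if gzip -t "$1" 2>/dev/null; then
        echo "OK: $1"
    else
        echo "CORRUPT or incomplete: $1" >&2
        return 1
    fi
}

# Possible usage (not run here). `curl -C -` resumes an interrupted
# transfer instead of starting again from zero:
#   curl -C - -O https://slow5test.s3.amazonaws.com/tmp/chm13_merged_pass.fastq.gz
#   verify_gz chm13_merged_pass.fastq.gz
```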

@skoren and the CHM13 maintainers: feel free to copy these files into your free S3 storage if you think they will be useful to anyone in the future.

Software and versions used for the basecalling are explained below:
Nanopore raw signal data were downloaded, extracted and then converted to BLOW5 format using slow5tools. They were then basecalled using buttery-eel with Guppy 6.3.7 in high-accuracy mode. A Q-score of 7 was used as the pass/fail cut-off.
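The exact extraction and conversion commands are not given in this thread. As a rough sketch only, the steps could look like the following, assuming gzip-compressed tarballs and using the `slow5tools f2s` and `slow5tools merge` subcommands. All paths and the `DRY_RUN` switch are hypothetical; by default the script only prints the commands it would run.

```shell
# Hypothetical paths; override them from the environment as needed.
TARBALL="${TARBALL:-fast5_batch.tar.gz}"
FAST5_DIR="${FAST5_DIR:-fast5}"
BLOW5_DIR="${BLOW5_DIR:-blow5_parts}"
MERGED="${MERGED:-merged.blow5}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

convert_to_blow5() {
    run mkdir -p "$FAST5_DIR" &&
    run tar -xzf "$TARBALL" -C "$FAST5_DIR" &&          # extract the FAST5 files
    run slow5tools f2s "$FAST5_DIR" -d "$BLOW5_DIR" &&  # FAST5 -> one BLOW5 per input
    run slow5tools merge "$BLOW5_DIR" -o "$MERGED"      # combine into a single BLOW5
}

convert_to_blow5
```

Setting `DRY_RUN=0` would execute the steps for real; for a multi-terabyte dataset the tarballs would presumably be processed in batches.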

Base-calling commands:

# Basecall GridION data
buttery-eel -i min_grid.blow5 --guppy_bin /install/ont-guppy-6.3.7/bin/ --config dna_r9.4.1_450bps_hac.cfg -x cuda:all -q 7 -o reads_min_grid.fastq --port 5555 --use_tcp

# Basecall PromethION data
buttery-eel -i prom.blow5 --guppy_bin /install/ont-guppy-6.3.7/bin/ --config dna_r9.4.1_450bps_hac_prom.cfg -x cuda:all -q 7 -o reads_prom.fastq --port 5556 --use_tcp
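The two commands differ only in input file, config, output file and port, so they can be wrapped in one helper. The sketch below uses only the flags quoted above. The `basecall` and `run` helpers and the `DRY_RUN` switch are hypothetical additions, and the default is to print the commands rather than start a GPU job.

```shell
GUPPY_BIN="${GUPPY_BIN:-/install/ont-guppy-6.3.7/bin/}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# basecall <input.blow5> <config.cfg> <output.fastq> <port>
basecall() {
    run buttery-eel -i "$1" --guppy_bin "$GUPPY_BIN" --config "$2" \
        -x cuda:all -q 7 -o "$3" --port "$4" --use_tcp
}

# GridION data
basecall min_grid.blow5 dna_r9.4.1_450bps_hac.cfg reads_min_grid.fastq 5555
# PromethION data
basecall prom.blow5 dna_r9.4.1_450bps_hac_prom.cfg reads_prom.fastq 5556
```

Each run keeps its own port, as in the original commands, which would matter if the two jobs were started at the same time.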

@aafshinfard
Author

@hasindu2008 Awesome, thank you so much!

@aafshinfard
Author

@hasindu2008 Just started downloading; should be done tonight. Will confirm after it has finished. Thanks again.

@aafshinfard
Author

@hasindu2008 Just confirming that my download was completed. Thank you so much for your help.

@hasindu2008

@aafshinfard
No problem, glad to help. If this becomes useful in your work, please consider citing BLOW5, which allowed us to do this basecalling on a very small budget; it would otherwise have cost a fortune.

@aafshinfard
Author

Sure thing, thank you @hasindu2008

@skoren
Member

skoren commented Jun 11, 2024

Thanks for contributing these, and sorry this dropped off my radar. I have now added a link to the NCBI-hosted files for both.

@skoren skoren closed this as completed Jun 11, 2024