Juno have two methods to generate reads fastq.
- Download the real fastq submitted to NCBI SRA from the contributors
- Simulate the "fake" fastq
If you want to develope genomic tools but has no real data, juno can generate the read fastq for your testing.
Juno is also available as a public online resource: https://juno.hlin.tw
- Linux
- Python >= 3.6
https://pypi.org/project/bio-juno/
pip install bio-juno
git clone https://github.com/hunglin59638/juno.git
cd juno
python3 setup.py install
/data
is the entry point
docker pull hunglin59638/juno:1.0.0
docker run --rm -v /your/mount/point:/data juno sra -a $RUN_ACCESSION -o /data/outdir
juno -h
usage: juno [-h] SUBCOMMAND ...
Juno: read data generator
optional arguments:
-h, --help show this help message and exit
subcommands:
subcommands
SUBCOMMAND
sra Download reads from SRA database
simulate Simulating reads by reference genome
juno sra -a SRR19400588 -o /path/to/directory
There are two way to simulate read fastq
- Input your genome fasta
juno simulate -r /your/genome/fasta -o /path/to/directory --compressed --depth 200
- Input RefSeq assembly accession and its genome will be downloaded from NCBI
juno simulate -a GCF_002004995.1 -o /path/to/directory --compressed --depth 200
Tips: depth is greater than 200x is the better parameter for bacteria
juno simulate --update
juno
The server port is 8051
http://localhost:8051
from juno.data import Assembly
assembly = Assembly()
assembly.update_assembly()
df = assembly.dataframe
df.head()
assembly_accession bioproject biosample wgs_master refseq_category taxid species_taxid organism_name infraspecific_name isolate version_status assembly_level release_type genome_rep seq_rel_date asm_name submitter gbrs_paired_asm
0 GCF_000001215.4 PRJNA164 SAMN02803731 reference genome 7227 7227 Drosophila melanogaster latest Chromosome Major Full 2014/08/01 Release 6 plus ISO1 MT The FlyBase Consortium/Berkeley Drosophila Genome Project/Celera Genomics GCA_000001215.4
1 GCF_000001405.40 PRJNA168 reference genome 9606 9606 Homo sapiens latest Chromosome Patch Full 2022/02/03 GRCh38.p14 Genome Reference Consortium GCA_000001405.29
2 GCF_000001635.27 PRJNA169 reference genome 10090 10090 Mus musculus latest Chromosome Major Full 2020/06/24 GRCm39 Genome Reference Consortium GCA_000001635.9
3 GCF_000001735.4 PRJNA116 SAMN03081427 reference genome 3702 3702 Arabidopsis thaliana ecotype=Columbia latest Chromosome Minor Full 2018/03/15 TAIR10.1 The Arabidopsis Information Resource (TAIR) GCA_000001735.2
4 GCF_000001905.1 PRJNA70973 SAMN02953622 AAGU00000000.3 representative genome 9785 9785 Loxodonta africana ISIS603380 latest Scaffold Major Full 2009/07/15 Loxafr3.0 Broad Institute GCA_000001905.1
from juno.data import Assembly
assembly = Assembly()
genome_path = assembly.download("GCF_002004995.1", "/your/output/directory")
from juno.simulator import Simulator
sm = Simulator()
sm.generate(reference="/your/genome/fasta/path", out_dir="/your/output/directory", depth=200.0, platform="nanopore", compressed=True)
tips: The genome fasta is supported by gz format.
- pysradb: A Python package to query next-generation sequencing metadata and data from NCBI Sequence Read Archive (https://f1000research.com/articles/8-532/v1)
- PBSIM2: a simulator for long-read sequencers with a novel generative model of quality scores (https://doi.org/10.1093/bioinformatics/btaa835)