LRTK-SIM is written in Python 3 and requires six packages: sys, multiprocessing, numpy, os, gzip, and collections. All except numpy are part of the Python standard library.
gen_fasta.py generates two haploid fasta files by inserting variants from a vcf file into the reference genome sequence.
usage: python gen_fasta.py -v sample.vcf -r ref.fasta -p newref -o ./work
-v --vcf, path to the input vcf file (compressed or uncompressed)
-r --reference, path to the reference sequence (compressed or uncompressed)
-p --prefix, prefix of new reference sequence
-o --out, the path to output
-h --help, help info
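As a rough illustration of what inserting SNVs into a reference involves, the sketch below builds two haplotype sequences from one reference sequence. It is not the code of gen_fasta.py: the function name apply_snvs, the tuple layout of the variants, and the assumption that genotypes are phased (e.g. "0|1") are all hypothetical.

```python
def apply_snvs(ref_seq, snvs):
    """Return two haplotype sequences built from ref_seq.

    snvs: iterable of (pos, ref, alt, gt) tuples, with a 1-based pos as
    in VCF and a phased genotype string such as "0|1".
    This layout is an assumption made for the example.
    """
    hap1 = list(ref_seq)
    hap2 = list(ref_seq)
    for pos, ref, alt, gt in snvs:
        idx = pos - 1  # VCF positions are 1-based
        if ref_seq[idx].upper() != ref.upper():
            raise ValueError("REF allele mismatch at position %d" % pos)
        allele1, allele2 = gt.split("|")
        if allele1 == "1":
            hap1[idx] = alt
        if allele2 == "1":
            hap2[idx] = alt
    return "".join(hap1), "".join(hap2)


# One heterozygous SNV (haplotype 2 only) and one homozygous SNV.
hap1, hap2 = apply_snvs("ACGTACGT", [(2, "C", "T", "0|1"), (5, "A", "G", "1|1")])
```

Checking the REF allele against the reference before substituting catches a vcf that was called against a different reference build.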
python LRTK-SIM.py <path to configuration files>
Multiple libraries can be simulated simultaneously; the parameters for each library should be written in its corresponding configuration file.
E.g. python LRTK-SIM.py ./diploid_config
The diploid_config folder includes two config files for two libraries with different parameters: config1.txt (for library 1) and config2.txt (for library 2). The simulated fastq files are written to the folders lib_config1 and lib_config2, respectively.
Two examples of config file (config1.txt and config2.txt) are prepared in the diploid_config folder.
line2 and line3: Path_Fastahap1 and Path_Fastahap2, the two haploid reference sequences. LRTK-SIM accepts one or two fasta files, for haploid or diploid simulation respectively. The diploid reference sequences can be generated by gen_fasta.py, which inserts variants into the reference genome (only SNVs in this version). For haploid simulation, remove Path_Fastahap2=XXX in line3 and set Hap=1 in line33.
line5: processors, the maximum number of CPUs allowed
line7: CF, coverage of long DNA fragments
line9: CR, coverage of short reads for each fragment
line11: N_FP, average number of fragments for each droplet
line13: Mu_F, average length of long DNA fragments (Kb)
line15: SR, length of short reads (bp)
line21: Error_rate, sequencing error rate
line27: Mu_IS, average insert size for short reads (bp)
line29: Std_IS, standard deviation of insert size for short reads (bp)
line33: Hap, Haploid (Hap=1) or Diploid (Hap=2)
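To make the parameters above more concrete, here is a minimal sketch that reads key=value settings and derives rough fragment and read counts from them. It is not LRTK-SIM's own parser. The helper names parse_config and fragment_budget are hypothetical, the skipping of '#' comment lines is an assumption about the file layout, and all values in the example are invented. The arithmetic assumes coverage means total bases divided by target length and that reads are paired-end; LRTK-SIM may compute these quantities differently.

```python
def parse_config(text):
    """Parse `key=value` lines into a dict of strings.

    Assumption: lines that are blank, start with '#', or contain no '='
    carry no parameter and are skipped.
    """
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        params[key.strip()] = value.strip()
    return params


def fragment_budget(params, genome_length_bp):
    """Rough counts implied by CF, CR, Mu_F and SR (paired-end reads)."""
    cf = float(params["CF"])
    cr = float(params["CR"])
    frag_len_bp = float(params["Mu_F"]) * 1000  # Mu_F is given in Kb
    read_len_bp = float(params["SR"])
    # Fragments needed to cover the genome CF times.
    n_fragments = cf * genome_length_bp / frag_len_bp
    # Read pairs needed to cover one fragment CR times.
    pairs_per_fragment = cr * frag_len_bp / (2 * read_len_bp)
    return int(round(n_fragments)), int(round(pairs_per_fragment))


# Hypothetical haploid configuration; every value is for illustration only.
example = """\
# haploid run: Path_Fastahap2 removed and Hap set to 1
Path_Fastahap1=XXX
processors=4
CF=20
CR=0.2
Mu_F=50
SR=150
Hap=1
"""
params = parse_config(example)
n_fragments, pairs_per_fragment = fragment_budget(params, 1000000)
```

Under these assumptions the overall short-read coverage of the genome is roughly CF multiplied by CR.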
Unlike diploid assemblies, the species abundances in metagenomic sequencing may differ significantly. LRTK-SIM can use species abundances in the simulation; they should be provided in a flat file with two columns: 1. sequence name (must be identical to the name of the reference sequence) 2. abundance (summing to 1 over all species). One example is given in the meta_config folder.
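Before running a metagenomic simulation it can help to check the abundance file against the two rules above. The sketch below is one way to do that and is not part of LRTK-SIM. The function name read_abundances is hypothetical, and whitespace-separated columns are an assumption, since the delimiter is not specified here.

```python
import math


def read_abundances(lines, reference_names=None):
    """Read `name abundance` rows and check the two stated constraints.

    lines: iterable of text lines (for example an open file).
    reference_names: optional collection of sequence names from the
    reference fasta; every abundance entry must match one of them.
    """
    abundances = {}
    for line in lines:
        fields = line.split()  # assumed whitespace-separated columns
        if not fields:
            continue
        if len(fields) != 2:
            raise ValueError("expected two columns, got: %r" % line)
        name, value = fields
        abundances[name] = float(value)
    total = sum(abundances.values())
    if not math.isclose(total, 1.0, abs_tol=1e-6):
        raise ValueError("abundances sum to %g, expected 1" % total)
    if reference_names is not None:
        unknown = sorted(set(abundances) - set(reference_names))
        if unknown:
            raise ValueError("names not in reference: %s" % ", ".join(unknown))
    return abundances


# Hypothetical species names and abundances.
table = read_abundances(["speciesA 0.7", "speciesB 0.3"],
                        reference_names=["speciesA", "speciesB"])
```

The sum is compared with a small tolerance because decimal fractions such as 0.7 and 0.3 do not add up to exactly 1 in floating point.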
LRTK-SIM generates one folder for each library, named lib_X (X is the name of the configuration file), in the same path as the config files. Two gzipped fastq files are generated, named X_S1_L001_R1_001.fastq.gz and X_S1_L001_R2_001.fastq.gz, for the forward and reverse reads. The simulated data have been tested and accepted by the official 10X pipelines Long Ranger and Supernova.
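For scripting around the simulator, the naming scheme above can be written as a small helper. This is a sketch based only on the description here: the function name expected_outputs is hypothetical, and taking X to be the config file name without its extension is inferred from the lib_config1 example.

```python
import os


def expected_outputs(config_path):
    """Return (library folder, R1 path, R2 path) for one config file."""
    config_dir = os.path.dirname(config_path)
    # X: config file name without extension (inferred from lib_config1).
    stem = os.path.splitext(os.path.basename(config_path))[0]
    lib_dir = os.path.join(config_dir, "lib_" + stem)
    r1 = os.path.join(lib_dir, stem + "_S1_L001_R1_001.fastq.gz")
    r2 = os.path.join(lib_dir, stem + "_S1_L001_R2_001.fastq.gz")
    return lib_dir, r1, r2


lib_dir, r1, r2 = expected_outputs(os.path.join("diploid_config", "config1.txt"))
```

A downstream script could use these paths to check that both fastq files exist before passing them to another tool.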
Contact: zhanglu2@stanford.edu