LRTK-SIM: a linked-read simulator for the 10X Chromium system.

Prerequisites

LRTK-SIM is written in Python 3 and uses six modules: sys, multiprocessing, numpy, os, gzip, collections. All of these except numpy are part of the Python standard library, so only numpy has to be installed separately.

Auxiliary program

gen_fasta.py generates two haploid fasta files by inserting the variants from a vcf file into a reference genome sequence.

usage: python gen_fasta.py -v sample.vcf -r ref.fasta -p newref -o ./work

-v --vcf, path to the input vcf file (compressed or uncompressed)

-r --reference, path to the reference sequence (compressed or uncompressed)

-p --prefix, prefix of the new reference sequences

-o --out, output directory

-h --help, help info
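The idea behind gen_fasta.py can be illustrated with a minimal sketch. This is not the script's actual code: the function name `apply_snvs`, the tuple layout, and the use of phased genotypes such as "0|1" are assumptions made for illustration only.

```python
def apply_snvs(reference, snvs):
    """Return (hap1, hap2) after substituting SNV alleles into `reference`.

    `snvs` is a list of (pos, ref, alt, genotype) tuples, with a 1-based
    position and a phased genotype string such as "0|1" (an assumption;
    the README does not say how gen_fasta.py treats genotypes).
    """
    hap1, hap2 = list(reference), list(reference)
    for pos, ref, alt, genotype in snvs:
        idx = pos - 1  # VCF positions are 1-based
        if reference[idx].upper() != ref.upper():
            raise ValueError(f"REF allele mismatch at position {pos}")
        allele1, allele2 = genotype.split("|")
        if allele1 == "1":
            hap1[idx] = alt
        if allele2 == "1":
            hap2[idx] = alt
    return "".join(hap1), "".join(hap2)


# Toy example: one heterozygous and one homozygous SNV.
hap1, hap2 = apply_snvs(
    "ACGTACGT",
    [(2, "C", "T", "0|1"), (5, "A", "G", "1|1")],
)
print(hap1)  # ACGTGCGT
print(hap2)  # ATGTGCGT
```

The heterozygous SNV changes only the second haplotype, while the homozygous one changes both, which is why two fasta files are needed for a diploid simulation.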

Basic usage

python LRTK-SIM.py <path to configuration files>

Multiple libraries can be simulated simultaneously; the parameters for each library should be written in its corresponding configuration file.

E.g. python LRTK-SIM.py ./diploid_config

The diploid_config folder includes two config files for two libraries with different parameters:

config1.txt (for library 1) and config2.txt (for library 2)

The simulated fastq files are written to the folders lib_config1 and lib_config2, respectively.

Important parameters in config file

Two example config files (config1.txt and config2.txt) are provided in the diploid_config folder.

line2 and line3: Path_Fastahap1 and Path_Fastahap2, the two haploid reference sequences. LRTK-SIM accepts one or two fasta files, for haploid or diploid simulation respectively. The diploid reference sequences can be generated by gen_fasta.py, which inserts variants into the reference genome (only SNVs in this version). For haploid simulation, remove Path_Fastahap2=XXX in line3 and set Hap=1 in line33.

line5: processors, the maximum number of CPUs allowed

line7: CF, coverage of long DNA fragments

line9: CR, coverage of short reads for each fragment

line11: N_FP, average number of fragments for each droplet

line13: Mu_F, average length of long DNA fragments (kb)

line15: SR, length of short reads (bp)

line21: Error_rate, sequencing error rate

line27: Mu_IS, mean insert size for short reads (bp)

line29: Std_IS, standard deviation of insert size for short reads (bp)

line33: Hap, Haploid (Hap=1) or Diploid (Hap=2)
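As a rough illustration of how the haploid/diploid settings fit together, the sketch below reads `key=value` lines (the form suggested by `Path_Fastahap2=XXX` and `Hap=1` above) and checks that Hap agrees with the fasta paths. The helper names, the handling of `#` comment lines, and the example file names and values are hypothetical; the real config files may be laid out differently.

```python
def parse_config(text):
    """Collect key=value pairs, skipping blank lines and other text."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        # Assumption: lines without "=" or starting with "#" are comments.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        params[key.strip()] = value.strip()
    return params


def check_ploidy(params):
    """Return Hap as an int after checking it against the fasta paths."""
    hap = int(params["Hap"])
    if hap not in (1, 2):
        raise ValueError("Hap must be 1 (haploid) or 2 (diploid)")
    if "Path_Fastahap1" not in params:
        raise ValueError("Path_Fastahap1 is required")
    if hap == 2 and "Path_Fastahap2" not in params:
        raise ValueError("Hap=2 requires Path_Fastahap2")
    return hap


# Hypothetical config content; paths and values are placeholders.
example = """
# haplotype references
Path_Fastahap1=./work/hap1.fasta
Path_Fastahap2=./work/hap2.fasta
processors=4
Hap=2
"""

params = parse_config(example)
print(check_ploidy(params))  # 2
```

Running a check like this before a long simulation catches the common mistake of setting Hap=2 while supplying only one fasta file.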

Simulation for metagenomics

Unlike diploid assemblies, the species abundances in metagenomic sequencing may differ significantly. LRTK-SIM can use species abundances in the simulation; they should be provided in a flat file with two columns: 1. sequence name (must be identical to the name of the reference sequence) 2. abundance (summing to 1 across all species). An example is given in the meta_config folder.
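The two requirements on the abundance file (names matching the reference, abundances summing to 1) can be checked ahead of time. The sketch below is an illustration only: it assumes whitespace-separated columns, and the function name `read_abundances` and the species names are made up.

```python
import math


def read_abundances(text, reference_names):
    """Parse 'name abundance' lines and validate them."""
    abundances = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        fields = line.split()  # assumption: whitespace-separated columns
        if len(fields) != 2:
            raise ValueError(f"line {line_no}: expected two columns")
        name, value = fields[0], float(fields[1])
        if name not in reference_names:
            raise ValueError(f"line {line_no}: {name!r} not in reference")
        abundances[name] = value
    total = sum(abundances.values())
    # Allow for floating-point rounding when checking the sum.
    if not math.isclose(total, 1.0, abs_tol=1e-6):
        raise ValueError(f"abundances sum to {total}, expected 1")
    return abundances


# Hypothetical abundance table for three sequences.
table = read_abundances(
    "speciesA 0.7\nspeciesB 0.2\nspeciesC 0.1\n",
    reference_names={"speciesA", "speciesB", "speciesC"},
)
print(sorted(table))  # ['speciesA', 'speciesB', 'speciesC']
```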

Output

LRTK-SIM generates one folder per library, named lib_X (X is the name of the configuration file), in the same directory as the config files. Each folder contains two gzipped fastq files, X_S1_L001_R1_001.fastq.gz and X_S1_L001_R2_001.fastq.gz, holding the forward and reverse reads. The simulated data have been tested with and accepted by the official 10X pipelines Long Ranger and Supernova.
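For scripting downstream steps, the naming scheme above can be turned into paths. The sketch assumes that X is the config file name without its extension (as in config1.txt and lib_config1); `expected_outputs` is a hypothetical helper, not part of LRTK-SIM.

```python
from pathlib import Path


def expected_outputs(config_path):
    """Return the (R1, R2) fastq paths expected for one config file."""
    config_path = Path(config_path)
    name = config_path.stem  # "config1" for "config1.txt" (assumption)
    lib_dir = config_path.parent / f"lib_{name}"
    r1 = lib_dir / f"{name}_S1_L001_R1_001.fastq.gz"
    r2 = lib_dir / f"{name}_S1_L001_R2_001.fastq.gz"
    return r1, r2


r1, r2 = expected_outputs("diploid_config/config1.txt")
print(r1.name)  # config1_S1_L001_R1_001.fastq.gz
print(r2.name)  # config1_S1_L001_R2_001.fastq.gz
```

A wrapper script could call `r1.exists()` and `r2.exists()` after the simulation to confirm that both files were written before passing them on.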

Contact: zhanglu2@stanford.edu