LRTK-SIM: Linked read simulator for 10X Chromium System.

Prerequisites

LRTK-SIM is written in Python 3 and uses six modules: sys, multiprocessing, numpy, os, gzip, and collections. Of these, only numpy has to be installed separately; the others are part of the Python standard library.

Auxiliary program

gen_fasta.py generates two haploid fasta files by inserting the variants from a vcf file into a reference genome sequence.

usage: python gen_fasta.py -v sample.vcf -r ref.fasta -p newref -o ./work

-v --vcf, the input path of compressed or uncompressed vcf file

-r --reference, the path of compressed or uncompressed reference sequence

-p --prefix, prefix of new reference sequence

-o --out, the path to output

-h --help, help info
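The idea behind gen_fasta.py can be illustrated with a minimal sketch. This is not the script's actual code: the function name apply_snvs, the tuple layout of the variants, and the assumption that genotypes are phased ("0|1" style) are all hypothetical choices made for illustration.

```python
def apply_snvs(reference, snvs):
    """Build two haplotype sequences from one reference sequence.

    `snvs` is a list of (pos, ref, alt, genotype) tuples. `pos` is 1-based,
    as in VCF, and `genotype` is assumed to be phased, e.g. "0|1".
    """
    hap1 = list(reference)
    hap2 = list(reference)
    for pos, ref, alt, genotype in snvs:
        idx = pos - 1  # convert the 1-based VCF position to a 0-based index
        if reference[idx].upper() != ref.upper():
            raise ValueError(f"REF allele does not match reference at position {pos}")
        allele1, allele2 = genotype.split("|")
        if allele1 == "1":
            hap1[idx] = alt  # first haplotype carries the alternate allele
        if allele2 == "1":
            hap2[idx] = alt  # second haplotype carries the alternate allele
    return "".join(hap1), "".join(hap2)


# Toy input: one heterozygous SNV at position 2, one homozygous SNV at position 5.
hap1, hap2 = apply_snvs("ACGTACGT", [(2, "C", "T", "0|1"), (5, "A", "G", "1|1")])
print(hap1)
print(hap2)
```

A heterozygous variant ends up on only one haplotype, while a homozygous variant is written to both. How gen_fasta.py treats unphased genotypes is not covered by this sketch.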

Basic usage

python LRTK-SIM.py <path to configuration files>

Multiple libraries can be simulated simultaneously; the parameters for each library should be written in its corresponding configuration file.

E.g. python LRTK-SIM.py ./diploid_config

The diploid_config folder includes two config files for two libraries with different parameters:

config1.txt (for library 1) and config2.txt (for library 2)

The simulated fastq files are written to folders named lib_config1 and lib_config2, respectively.

Important parameters in config file

Two examples of config file (config1.txt and config2.txt) are prepared in the diploid_config folder.

line2 and line3: Path_Fastahap1 and Path_Fastahap2, the two haploid reference sequences. LRTK-SIM accepts one or two fasta files, for haploid or diploid simulation respectively. The diploid reference sequences can be generated by gen_fasta.py, which inserts variants into the reference genome (only SNVs in this version). For haploid simulation, remove Path_Fastahap2=XXX in line3 and set Hap=1 in line33.

line5: processors, the maximum number of CPUs allowed

line7: CF, coverage of long DNA fragments

line9: CR, coverage of short reads for each fragment

line11: N_FP, average number of fragments for each droplet

line13: Mu_F, average length for long DNA fragment (Kb)

line15: SR, length of short reads (bp)

line21: Error_rate, sequencing error rate

line27: Mu_IS, mean insert size for short reads (bp)

line29: Std_IS, standard deviation of insert size for short reads (bp)

line33: Hap, Haploid (Hap=1) or Diploid (Hap=2)
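Before launching a long simulation it can be useful to check that Hap agrees with the fasta paths. The sketch below assumes that the config file holds one key=value pair per line and that lines starting with # are comments; only the key=value form is suggested by the examples above, the comment syntax is a guess. The helper names and the example values are hypothetical.

```python
def parse_config(text):
    """Collect key=value pairs, skipping blank lines and '#' comments (assumed syntax)."""
    params = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        params[key.strip()] = value.strip()
    return params


def ploidy_problems(params):
    """Return a list of inconsistencies between Hap and the fasta paths."""
    problems = []
    hap = params.get("Hap")
    if hap not in ("1", "2"):
        problems.append("Hap must be 1 (haploid) or 2 (diploid)")
    if "Path_Fastahap1" not in params:
        problems.append("Path_Fastahap1 is missing")
    if hap == "2" and "Path_Fastahap2" not in params:
        problems.append("Hap=2 needs Path_Fastahap2")
    return problems


# Hypothetical values, for illustration only.
example = """\
# diploid library
Path_Fastahap1=./work/hap1.fasta
Path_Fastahap2=./work/hap2.fasta
processors=4
Hap=2
"""
params = parse_config(example)
problems = ploidy_problems(params)
print(problems)
```

Values are kept as strings here; a real check would also convert numeric parameters such as CF, CR, and Error_rate and verify their ranges.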

Simulation for metagenomics

Unlike diploid assemblies, metagenomic samples can contain species at very different abundances. LRTK-SIM can take species abundances into account during simulation; provide them in a flat file with two columns: 1. sequence name (must be identical to the name of the reference sequence) 2. abundance (the values must sum to 1 across all species). An example is given in the meta_config folder.
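The two requirements on the abundance file (matching names, abundances summing to 1) can be checked up front. The following is a minimal sketch that assumes the two columns are separated by whitespace and that the file has no header line; neither detail is stated above. The function name and the species names are hypothetical.

```python
import math


def read_abundance(text, reference_names):
    """Parse 'name abundance' lines and validate them against the reference names."""
    abundance = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        abundance[fields[0]] = float(fields[1])
    unknown = set(abundance) - set(reference_names)
    if unknown:
        raise ValueError(f"names not found in reference: {sorted(unknown)}")
    total = sum(abundance.values())
    if not math.isclose(total, 1.0, abs_tol=1e-6):
        raise ValueError(f"abundances sum to {total}, expected 1")
    return abundance


# Hypothetical three-species community.
table = "speciesA 0.6\nspeciesB 0.3\nspeciesC 0.1\n"
abundance = read_abundance(table, ["speciesA", "speciesB", "speciesC"])
print(abundance)
```

The sum is compared with a small tolerance because decimal fractions such as 0.1 are not exactly representable as floating-point numbers.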

Output

LRTK-SIM generates one folder per library, named lib_X (where X is the name of the configuration file), in the same directory as the config files. Each folder contains two gzipped fastq files, named X_S1_L001_R1_001.fastq.gz and X_S1_L001_R2_001.fastq.gz, which hold the forward and reverse reads. The simulated data have been tested with, and accepted by, the official 10X pipelines Long Ranger and Supernova.
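The naming scheme can be turned into a small helper that predicts where the output of a given config file will appear. This sketch assumes that X is the config file name without its extension (as in config1.txt and lib_config1); the function name expected_outputs is hypothetical.

```python
from pathlib import Path


def expected_outputs(config_path):
    """Return the two fastq paths expected for one config file."""
    config_path = Path(config_path)
    name = config_path.stem  # "config1" for "config1.txt" (assumed meaning of X)
    lib_dir = config_path.parent / f"lib_{name}"
    return [lib_dir / f"{name}_S1_L001_R{read}_001.fastq.gz" for read in (1, 2)]


for path in expected_outputs("diploid_config/config1.txt"):
    print(path.as_posix())
```

Such a helper could be used after a run to confirm that both read files exist before passing them to a downstream pipeline.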

Contact: zhanglu2@stanford.edu
