The CracTools-SimCT software (or simply SimCT) is a complete and modular workflow to simulate RNA-Seq data from a reference genome and known transcript annotations.
SimCT works in three steps :
- The first step introduces a set of variants in the reference genome. SimCT then generates a haploid mutated genome in FASTA format and a set of GTF annotations whose coordinates are converted for this modified reference.
- We then process the modified reference files using FluxSimulator, that offers good performance in generating a random expression profile and simulating reads by reproducing a complete ”in silico” RNA-Seq protocol.
- In the final step, the reads and corresponding alignments produced by FluxSimulator are sent to post-processing where alignment coordinates are converted to the coordinates of the original reference genome. The errors are then extracted from read sequence (encoded in lowercase by FluxSimulator) and a new FASTQ file is produced with alignments and errors encoded in the read name.
- Perl5 distribution
- cpanm (CPAN minus)
- FluxSimulator Make sure that the binary "flux-simulator" is available in the $PATH to be executed by SimCT.
- CracTools-core perl package. It will be automaically installed by cpanm along with all other CPAN dependancies.
This is the simpliest way to install simCT.
- Go to the release page of the github projet.
- Download the latest tarball (tar.gz) release :
wget https://github.com/jaudoux/simct/releases/download/$VERSION/CracTools-SimCT-$VERSION.tar.gz
- Install the package with cpanm :
cpanm [-l local_dir] CracTools-SimCT.tar.gz
If you do not have admin rights, you can use the option -l
to specify cpanm a local directory to install simCT.
To install SimCT from the sources, you will need Dist::Zilla software, which is a Perl package manager.
- Clone SimCT repository :
git clone https://github.com/jaudoux/simct.git
- Build and install :
dzil install --install-command 'cpanm [-l LOCAL_INSTALLATION_DIRECTORY].'
(ommit the-l
option if you want to install simCT to the system).
Basic 1M Paired-end reads simulation:
simCT -g genome_dir -a annotations.gtf --nb-reads 1000000
40M reads with 100 fusion genes:
simCT -g genome_dir -a annotations.gtf --nb-reads 40000000 --nb-fusions 100
40M reads with 5% of mutations taken from a VCF files:
simCT -g genome_dir -a annotations.gtf --nb-reads 40000000 --vcf-file cosmic.vcf --vcf-ration 0.05
In order to simulate RNA-Seq datasets with SimCT, all you need is a reference genome as a set a FASTA files for each chromosome and a GTF file containing annotations of transcripts. Then run simCT as follow by indicationg the reference genome directory (where the FASTA are located), the GTF file, and the output directory :
simCT -g reference_genome/ -a annotations.gtf [-o my_simulation]
Many options are available, run simCT --help
for more documentation.
The mandatory input files for simCT are a directory containaing the reference genome as a set of FASTA files for each chromosome and a GTF file containing the annotations.
SimCT expect the genome to be a directory that contains a FASTA file for each chromosome of the genome. For example :
├── genome_dir
│ ├── chr1.fa
│ ├── chr2.fa
│ ├── chr3.fa
If you have a genome in multi-FASTA format, you can use this splitFasta.pl perl script] to split it in multiple FASTA files.
# Create a directory for your genome_dir
mkdir genome_dir && cd genome_dir
# Invoke the splitFasta script
./splitFasta.pl genome.fa
SimCT should work with all GTF files (Ensembl, UCSC, GENCODE, ...), since we use only generic fields (chr, start, end, strand, gene_id, transcript_id) and only exon features.
The output directory of a simCT simulation should be structured :
├── gene-counts.tsv.gz
├── transcript-counts.tsv.gz
├── mutations.vcf.gz
├── reads_1.fastq.gz
├── reads_2.fastq.gz
├── splices.bed.gz
├── chimeras.tsv.gz
├── info.txt
├── FluxSimulator/
├── simulated_genome/
│ ├── chr{N}.fa
│ ├── annotations.gtf
│ ├── fusion{N}.fa
The following table describes the content of each of these files/directories.
File name | Description |
---|---|
gene-counts.tsv.gz | A tabulated file with two colomns : feature and truth . Feature old the gene_id from the GTF add truth the number of read mapped to this gene. |
transcript-counts.tsv.gz | A tabulated file with two colomns : feature and truth . Feature old the transcript_id from the GTF add truth the number of read mapped to this transcript. |
mutations.vcf.gz | A VCF containing the inserted mutations that are supported by 1 read at least. The ID column contains the IDs of the reads supporting the mutation (separated with a colon). |
splices.bed.gz | A 6 column bed file containing the splice junctions supported by at least 1 read. The name column contains the IDs of the reads supporting the splice junction (separated with a colon). |
chimeras.gz | A tabulated file (with no header line) containing the chimeric junctions. The file contains 7 columns. The first 6 column describes the breakpoint : chr1 , pos1 , strand1 , chr2 , pos2 , strand2 . The last column contains the IDs of the reads supporting the chimeric junction (separated with a colon). |
reads_{1,2}.fastq.gz | The FASTQ file(s) holding the generated reads. Read names holds the truth for read alignments to the reference genome. Read the 'read name encoding'' section for more informations. |
info.txt | A text file describing the number of element for each feature (read, SNV, Splice, Fusion, SNV, Indel) generated in the simulations. |
FluxSimulator/ | The directory contains output files generated by FluxSimulator. |
simulated_genome/ | The directory contains the fasta sequences of the simulated genome as well as the annotations that have been liftover this genome. If Fusions have been generated, each fusion will be printed in a separated fasta file. |
Option name | Default value | Description |
---|---|---|
-o,--output-dir | simCT_simulation |
Location of the outout directory |
-g,--genome-dir | mandatory | xReference genome directory (with chromosomes splited in individual FASTA files) |
-a,--annotations | mandatory | Annotations file in GTF format |
-s,--substitution-rate | 0.001 | Rate a wich substitutions are randomly inserted in the reference genome |
-i,--insertion-rate | 0.0005 | Rate a wich insertions are randomly inserted in the reference genome |
-d,--deletion-rate | 0.0005 | Rate a wich deletions are randomly inserted in the reference genome |
-f,--nb-fusions | 0 | Number of fusions introduced in the simulated genome |
--vcf-file | NA | A VCF file that contains mutations |
--vcf-ratio | NA | Ratio of mutations that taken from the VCF source instead of random(default: 0.8) |
--flux-par | NA | Flux parameter file (prioritary over the following option) |
--single-end | NA | Single-end protocol |
--nb-molecules | 5000000 | Number of moleclules in the sample |
--nb-reads | 1000000 | Number of reads sequenced |
--reads-length | 100 | Length of sequenced reads |
--fragment-length | 300 | Mean fragment length |
--fragment-sd | 75 | Standard deviation of fragment length |
--disable-error-encoding | NA | Remove error encoding from read names |
--uniq-ids | NA | Read names are identical for both pairs |
SimCT encodes the read alignement and error position inside the read name. This way, it is very easy to verify an alignment from a SAM entry.
The read format is defined as : read_id:(chr,(-)pos,cigar(;)?)+,base64(err_pos?)+
The read name is composed of three components:
- the read id from 0 to
nb_reads - 1
- a list of SAM-like read alignments (chr, starting position and cigar chain)
- the positions of sequencing errors (if any) encoded in base64.
For paired-end reads, a single read name is generated for both reads by using a single read\_id
, concatenating the alignments and merging the positions of sequencing errors.
Example of a read name: 0:1,12345,20M2I3X30M;2:-6789:20M:CABgAAAKAAEAABgCAAAABABAQAAgAAACDD
The encoded error list (CABgAAAKAAEAABgCAAAABABAQAAgAAACDD
) is decoded as as follow : 1,12,23,43,45,62,78,89,91,120,132,148,167,187,192,193,198,199
Jérôme Audoux - jaudoux@cpan.org Nicolas Philippe - nphilippe@cpan.org Mikaël Salson - mikael.salson@univ-lille1.fr