pIRS (profile based Illumina pair-end Reads Simulator)
Contents
========
1. Introduction
2. Program framework
3. Usage
4. Examples
5. Output file format
6. Notes
1 Introduction
==============
pIRS is a program for simulating paired-end reads from a reference genome. It
is optimized for simulating reads similar to those generated from the Illumina
platform.
See `INSTALL' for installation instructions.
There are two subcommands: `pirs simulate' and `pirs diploid'. See section 3
for more details, or run `pirs simulate -h' or `pirs diploid -h'.
2 Program framework
===================
2.1 Profile Generator
Six tools are supplied. SOAP2 or BWA, together with soap.coverage (http://soap.genomics.org.cn/soapaligner.html),
is required. The file getprofile.sh.example shows the full process.
2.1.1 GC%-Depth Profile Stat.
a). Run soap and soap.coverage to get the .depth single file(s). The files may be gzip-compressed.
b). Run gc_coverage_bias on all the .depth single files. You will get a GC-depth stat file binned by 1 GC% and other files.
c). Run gc_coverage_bias_plot on the GC-depth stat file. You will get a PNG plot and a .gc file binned by 5 GC%.
d). Manually check the .gc file for any abnormal levels caused by low depth in certain GC% windows.
2.1.2 Base-Calling Profile Stat:
a). Run soap or bwa to get .{soap,single} or .sam file(s).
b). Run error_matrix_calculator on those file(s). You will get *.{count,ratio}.matrix .
c). You can use error_matrix_merger to merge several .{count,ratio}.matrix files.
However, it is up to you to make sure that the read lengths match.
2.1.3 InDel Profile Stat:
a). Choose samples with no polymorphic indels, such as the coliphage samples shipped with Illumina sequencers.
b). Run bwa to get .sam/.bam file.
c). Run indelstat_sam_bam to get the profile.
2.1.4 Insert size & mapping ratio stat:
a). Run soap or bwa to get .{soap,single} or .sam file(s).
b). Run alignment_stator (*).
(*) alignment_stator cannot yet compute the mapping ratio for SAM files.
2.2 Simulator
Two commands:
pirs diploid: generates a diploid genome sequence. It reads the input genome sequence,
simulates SNPs, indels, and SVs (structural variation) on it, and then outputs the
resulting genome sequence.
pirs simulate: simulates Illumina data and outputs paired-end read files.
Note:
a) If you only want to simulate the diploid genome sequence, "pirs diploid" is enough.
b) If you want to simulate sequencing data from a haploid genome, all you need is "pirs simulate".
c) If you want to simulate sequencing data from a diploid genome, first run "pirs diploid"
to get the second copy of the genome sequence, and then run "pirs simulate" with both the
original genome sequence and the "pirs diploid" output as input.
3 Usage
=======
pirs <command> [option]
diploid generate diploid genome.
simulate simulate Illumina reads.
3.1 pirs diploid
Usage: pirs diploid [OPTIONS...] REFERENCE
Simulate a diploid genome by creating a copy of a haploid genome with
heterozygosity introduced. REFERENCE specifies a FASTA file containing
the reference genome. It may be compressed (gzip). It may contain multiple
sequences (scaffolds or chromosomes), each marked with a separate FASTA tag
line. The introduced heterozygosity takes the form of SNPs, indels, and
large-scale structural variation (insertions, deletions and inversions).
If REFERENCE is '-', the reference sequence is read from stdin, but it must be
uncompressed.
The probabilities of SNPs, indels, and large-scale structural variation can be
specified with the '-s', '-d', and '-v' options, respectively. You can also
set the ratio of transitions to transversions (for SNPs) with the '-R' option.
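As an illustration of what the '-R' ratio means, here is a minimal Python sketch. It is not taken from the pIRS source. It assumes that RATIO is the probability of a transition divided by the total probability of a transversion, so a SNP is a transition with probability RATIO / (RATIO + 1). The names TRANSITION, TRANSVERSIONS, and mutate_base are hypothetical.

```python
import random

# The single transition partner and the two transversion partners of each base.
TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}
TRANSVERSIONS = {"A": "CT", "G": "CT", "C": "AG", "T": "AG"}


def mutate_base(base, ti_tv_ratio, rng):
    """Pick a SNP allele for `base` given a transition:transversion ratio.

    Assumption: ti_tv_ratio = P(transition) / P(any transversion).
    """
    p_transition = ti_tv_ratio / (ti_tv_ratio + 1.0)
    if rng.random() < p_transition:
        return TRANSITION[base]
    # The two possible transversions are treated as equally likely here.
    return rng.choice(TRANSVERSIONS[base])


rng = random.Random(0)  # fixed seed so the sketch is repeatable
alleles = [mutate_base("A", 2.0, rng) for _ in range(10000)]
print("fraction of transitions:", alleles.count("G") / len(alleles))
```

With the default ratio of 2.0, about two thirds of the simulated SNPs would be transitions under this reading.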
Indels are split evenly between insertions and deletions. The length
distribution of the indels is as follows and is derived from panda
re-sequencing data:
1bp 64.82%
2bp 17.17%
3bp 7.20%
4bp 7.29%
5bp 2.18%
6bp 1.34%
Large-scale structural variation is split evenly among large-scale insertions,
deletions, and inversions. By default, the length distribution of these
large-scale features is as follows:
100bp 70%
200bp 20%
500bp 7%
1000bp 2%
2000bp 1%
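The two length tables above can be read as discrete sampling distributions. The following Python sketch shows one way to draw lengths from them. It is an illustration, not the pIRS implementation; the names INDEL_LENGTHS, SV_LENGTHS, and sample_length are hypothetical, and only the lengths and percentages come from the tables.

```python
import random

# Lengths (bp) and percentages copied from the tables above.
INDEL_LENGTHS = {1: 64.82, 2: 17.17, 3: 7.20, 4: 7.29, 5: 2.18, 6: 1.34}
SV_LENGTHS = {100: 70, 200: 20, 500: 7, 1000: 2, 2000: 1}


def sample_length(table, rng):
    """Draw one length from a {length: percent} table."""
    lengths = list(table)
    weights = list(table.values())
    return rng.choices(lengths, weights=weights, k=1)[0]


rng = random.Random(42)  # fixed seed so the sketch is repeatable
indel_lens = [sample_length(INDEL_LENGTHS, rng) for _ in range(10000)]
print("fraction of 1bp indels:", indel_lens.count(1) / len(indel_lens))
print("one SV length:", sample_length(SV_LENGTHS, rng))
```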
`pirs diploid' does not use multiple threads, even if pIRS was built with
multithreading enabled.
OPTIONS:
-s, --snp-rate=RATE A floating-point number in the interval [0, 1] that
specifies the heterozygous SNP rate. Default: 0.001
-d, --indel-rate=RATE A floating-point number in the interval [0, 1] that
specifies the heterozygous indel rate.
Default: 0.0001
-v, --sv-rate=RATE A floating-point number in the interval [0, 1] that
specifies the large-scale structural variation
(insertion, deletion, inversion) rate in the diploid
genome. Default: 0.000001
-R, --transition-to-transversion-ratio=RATIO
In a SNP, a transition is when a purine or pyrimidine
is changed to one of the same (A <=> G, C <=> T)
while a transversion is when a purine is changed
into a pyrimidine or vice-versa. This option
specifies a floating-point number RATIO that gives
the ratio of the transition probability to the
transversion probability for simulated SNPs.
Default: 2.0
-o, --output-prefix=PREFIX
Use PREFIX as the prefix of the output file and logs.
Default: "pirs_diploid"
-O, --output-file=FILE
Use FILE as the name of the output file. Use '-'
for standard output; this also moves the
informational messages from stdout to stderr.
-c, --output-file-type=TYPE
The string "text" or "gzip" to specify the type of
the output FASTA file containing the diploid copy
of the genome, as well as the log files.
Default: "text"
-n, --no-logs Do not write the log files.
-S, --random-seed=SEED Use SEED as the random seed. Default:
time(NULL) * getpid()
-q, --quiet Do not print informational messages.
-h, --help Show this help and exit.
-V, --version Show version information and exit.
EXAMPLE:
./pirs diploid ref_sequence.fa -s 0.001 -d 0.0001 -v 0.000001 \
    -o ref_sequence >pirs.out 2>pirs.err
3.2 pirs simulate
Usage: ./pirs simulate [OPTION]... REFERENCE.FASTA...
pIRS is a program for simulating paired-end reads from a genome. It is
optimized for simulating reads from the Illumina platform. The input to
pIRS is any number of reference sequences. Typically you would just provide
one FASTA file containing your reference sequence, but you may provide two
if you have generated a diploid sequence with `pirs diploid', or if you have
chromosomes split up into multiple FASTA files. The output of pIRS is two
FASTQ files containing the simulated paired-end reads, as well as several log
files.
Input sequences are assumed to be haploid. If you instead want to simulate
reads from a diploid genome, you must give the --diploid option so that
the diploidy is taken into account when computing coverage. If you do
not do this, you will get twice as many reads as you wanted.
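To see why the --diploid option matters, consider the usual definition of coverage: total sequenced bases divided by genome length. The Python sketch below applies that definition; it is an assumption about the arithmetic, not code from pIRS, and the function name read_pairs_needed is hypothetical. The exact rounding that pIRS uses may differ.

```python
def read_pairs_needed(coverage, genome_len, read_len, diploid=False):
    """Read pairs to draw from ONE reference for a target coverage.

    Assumes coverage = (pairs * 2 * read_len) / genome_len.
    With diploid=True the coverage is split between the two references.
    """
    if diploid:
        coverage = coverage / 2.0
    return int(coverage * genome_len / (2 * read_len))


# 5x coverage of a 1 Mbp reference with 100 bp reads.
print(read_pairs_needed(5, 1_000_000, 100))                # haploid input
print(read_pairs_needed(5, 1_000_000, 100, diploid=True))  # per reference
```

In the diploid case each of the two references receives half of the pairs, so the combined coverage of the genome still matches the requested value.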
pIRS simulates a normally distributed insert (fragment) length using the
Box-Muller method. Usually you want the insert length standard deviation to
be 1/20 or 1/10 of the insert length mean (see the -m and -v options).
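For readers unfamiliar with the Box-Muller method, here is a minimal Python sketch of drawing insert lengths with it. This is an illustration only: the function names box_muller and sample_insert_len are hypothetical, rounding to the nearest integer is an assumption, and the sketch does not cover how pIRS treats inserts shorter than the read length.

```python
import math
import random


def box_muller(rng):
    """Return one standard-normal deviate using the Box-Muller transform."""
    u1 = 1.0 - rng.random()  # in (0, 1], so log(u1) is always defined
    u2 = rng.random()
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)


def sample_insert_len(mean, sd, rng):
    """Draw one insert length; rounding to an integer is an assumption."""
    return int(round(mean + sd * box_muller(rng)))


rng = random.Random(1)  # fixed seed so the sketch is repeatable
# Mean 180 (the -m default) and a standard deviation of 10% of the mean.
lengths = [sample_insert_len(180, 18, rng) for _ in range(20000)]
print("sample mean:", sum(lengths) / len(lengths))
```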
This program also simulates Illumina sequencing errors, quality scores, and
GC bias based on empirical distribution profiles. Users may use the default
profiles in this package, which were generated from large sets of real
sequencing data, or they may generate their own profiles.
OPTIONS:
-l LEN, --read-len=LEN
Generate reads having a length of LEN. Default: 100
-x VAL, --coverage=VAL
Set the average sequencing coverage (sometimes called depth).
It may be either a floating-point number or an integer.
-m LEN, --insert-len-mean=LEN
Generate inserts (fragments) having an average length of LEN.
Default: 180
-v LEN, --insert-len-sd=LEN
Set the standard deviation of the insert (fragment) length.
Default: 10% of insert length mean.
-j, --jumping, --cyclicize
Make the paired-end reads face away from each other, as
in a jumping library. Default: the reads face towards each
other.
-d, --diploid
This option asserts that reads are being simulated from a
diploid genome. It causes the program to abort if there
are not exactly two reference sequences; in addition, the
coverage is divided in half, since the two reference
sequences are in reality the same genome. This option
is not required to simulate diploid reads, but if you
omit it you must halve the coverage value yourself.
-B FILE, --base-calling-profile=FILE, --subst-error-profile=FILE
Use FILE as the base-calling profile. This profile will be
used to simulate substitution errors. Default:
PREFIX/share/pirs/Base-Calling_Profiles/humNew.PE100.matrix.gz
-I FILE, --indel-error-profile=FILE, --indel-profile=FILE
Use FILE as the indel-error profile. This profile will be
used to simulate insertions and deletions in the reads that
are artifacts of the sequencing process. Default:
PREFIX/share/pirs/InDel_Profiles/phixv2.InDel.matrix
-G FILE, --gc-bias-profile=FILE, --gc-content-bias-profile=FILE
Use FILE as the GC content bias profile. This profile will
adjust the read coverage based on the GC content of
fragments. Defaults:
PREFIX/share/pirs/GC-depth_Profiles/humNew.gcdep_100.dat,
PREFIX/share/pirs/GC-depth_Profiles/humNew.gcdep_150.dat,
PREFIX/share/pirs/GC-depth_Profiles/humNew.gcdep_200.dat,
depending on the mean insert length.
-e RATE, --error-rate=RATE, --subst-error-rate=RATE
Set the substitution error rate. The base-calling profile
will still be used, but the average frequency of errors will
be changed to RATE. Set to 0 to disable substitution errors
completely. In that case, the base-calling profile will not
be used. Default: default error rate of base-calling
profile.
Note: since pIRS parameterizes the error rate by
several parameters, it is very difficult to determine exactly
what needs to be done to make the error rate be a given
value. We try to adjust the probabilities of getting each
quality score in order to accommodate the user-supplied error
rate. However, depending on your input sequences, the actual
error rate simulated by pIRS could be off by 20% or more.
Please check the informational output to see the final error
rate that was actually simulated.
-A ALGO, --substitution-error-algorithm=ALGO, --subst-error-algo=ALGO
Set the algorithm used for simulating substitution errors.
It may be set to the string "dist" or "qtrans". The
default is "qtrans".
The "dist" algorithm looks up the substitution error rate
for each base pair based on the current cycle and the true
base. This lookup produces a quality score and a called base
that may or may not be the same as the true base. In the
base-calling profile, the matrix we use is marked as the
[DistMatrix].
The "qtrans" algorithm is a Markov-chain model based on the
previous quality score and current cycle. The next quality
score is looked up with a certain probability based on these
parameters. The matrix used for this is marked as
[QTransMatrix] in the base-calling profile. Then the
DistMatrix is used to find a called base for the quality score.
The DistMatrix is also used to call the base in the first
cycle.
-M MODE, --mask=MODE, --eamss=MODE
Use the EAMSS algorithm for masking read quality. MODE may be
the string "quality" or "lowercase". The EAMSS algorithm
identifies low-quality, GC-rich regions near the ends of reads.
"quality" mode will change the quality scores on these
regions to (2 + quality_shift), while "lowercase" mode
will change the base pairs to lower case, but not change
the quality values. Default: Do not use EAMSS.
-Q VAL, --quality-shift=VAL, --phred-offset=VAL
Set the ASCII shift of the quality value (usually 64 or 33 for
Illumina data). Default: 33
--no-quality-values
--fasta
Do not simulate quality values. The simulated reads will be
written as a FASTA file rather than a FASTQ file.
Substitution errors may still be simulated; if you do not want
to simulate any substitution errors, provide --error-rate=0 or
--no-substitution-errors.
--no-subst-errors
--no-substitution-errors
Do not simulate substitution errors. Equivalent to
--error-rate=0.
--no-indels
--no-indel-errors
Do not simulate indels. The indel error profile will not be
used.
--no-gc-bias
--no-gc-content-bias
Do not simulate GC bias. The GC bias profile will not be
used.
-o PREFIX, --output-prefix=PREFIX
Use PREFIX as the prefix of the output files. Default:
"pirs_reads"
-c TYPE, --output-file-type=TYPE
The string "text" or "gzip" to specify the type of
the output FASTQ files containing the simulated reads
of the genome, as well as the log files. Default: "text"
-z, --compress
Equivalent to -c gzip.
-n, --no-logs, --no-log-files
Do not write the log files.
-S SEED, --random-seed=SEED
Use SEED as the random seed. Default:
time(NULL) * getpid(). Note: If pIRS was not compiled with
--disable-threads, each thread actually uses its own random
number generator that is seeded by this base seed added to
the thread number; also, if you need pIRS's output to be
exactly reproducible, you must specify the random seed as well
as use only 1 simulator thread (--threads=1, or configure
with --disable-threads, or run on system with 4 or fewer
processors).
-t NUM_THREADS, --threads=NUM_THREADS
Use NUM_THREADS threads to simulate reads. This option is
not available if pIRS was compiled with the --disable-threads
option. Default: number of processors minus 2 if writing
uncompressed output, or number of processors minus 3 if
writing compressed output, or 1 if there are not this many
processors.
-q, --quiet Do not print informational messages.
-h, --help Show this help and exit.
-V, --version Show version information and exit.
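The -Q and -M options above both act on the quality string of each read. The Python sketch below illustrates the encoding and the two masking modes. It is not pIRS code: the function names are hypothetical, the error-probability formula is the standard Phred definition, and apply_mask assumes the masked region is the last mask_len bases of the read. It does not implement the EAMSS detection step itself.

```python
def encode_quality(scores, shift=33):
    """Turn Phred scores into a FASTQ quality string using an ASCII shift."""
    return "".join(chr(q + shift) for q in scores)


def decode_quality(qual, shift=33):
    """Turn a FASTQ quality string back into Phred scores."""
    return [ord(c) - shift for c in qual]


def error_probability(q):
    """Standard Phred definition: Q = -10 * log10(P)."""
    return 10 ** (-q / 10)


def apply_mask(seq, qual, mask_len, mode, shift=33):
    """Apply an already-computed end mask of mask_len bases.

    Assumption: the mask covers the last mask_len bases of the read.
    """
    mask_len = max(0, min(mask_len, len(seq)))
    keep = len(seq) - mask_len
    if mode == "quality":
        # Masked positions get quality (2 + quality_shift), bases unchanged.
        return seq, qual[:keep] + chr(2 + shift) * mask_len
    if mode == "lowercase":
        # Masked bases are lower-cased, quality values unchanged.
        return seq[:keep] + seq[keep:].lower(), qual
    raise ValueError("mode must be 'quality' or 'lowercase'")


print(encode_quality([2, 20, 40]))
print(decode_quality("h", shift=64))
print(apply_mask("ACGT", "IIII", 2, "quality"))
```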
4 Examples
==========
4.1 Simulating a diploid genome sequence.
Example command line:
pirs diploid Human_ref.fa -s 0.001 -R 2 -d 0.00001 -v 0.000001 -o Human >simulate_seq.o 2>simulate_seq.e
Output files:
a) Human.snp.indel.invertion.fa: the simulated copy of the genome sequence, with heterozygosity introduced.
b) Human_indel.lst: indel information list.
c) Human_snp.lst: SNP information list.
d) Human_invertion.lst: inversion information list.
e) simulate_seq.o, simulate_seq.e: the program's captured standard output and standard error.
4.2 Simulate Illumina paired-end reads from a haploid genome.
Example command line:
pirs simulate Human_ref.fa -m 170 -l 90 -x 5 -v 10 -o Human >simulate_170.o 2>simulate_170.e
Output files:
a) Human_90_170_1.fq, Human_90_170_2.fq: the paired-end read files.
b) Human_90_170.error_rate.distr: the error distribution file.
c) Human_90_170.insert_len.distr: the insert length distribution file.
d) Human_90_170.read.info: information about every simulated read.
e) simulate_170.o, simulate_170.e: the program's captured standard output and standard error.
4.3 Simulate Illumina paired-end reads from a diploid genome.
Example command line:
pirs simulate --diploid Human_ref.fa Human.snp.indel.invertion.fa.gz -m 800 -l 70 -x 5 -v 10 \
    -o Human >simulate_800.o 2>simulate_800.e
Output files:
a) Human_70_800_1.fq, Human_70_800_2.fq: the paired-end read files.
b) Human_70_800.error_rate.distr: the error distribution file.
c) Human_70_800.insertsize.distr: the insert size distribution file.
d) Human_70_800.read.info.gz: information about every simulated read.
e) simulate_800.e, simulate_800.o: the program's captured standard error and standard output.
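The read file names in the examples above follow the pattern PREFIX_READLEN_INSERTLEN_N. As a small aid for scripts that post-process pIRS output, the Python sketch below rebuilds the two FASTQ names from the command-line values. The helper name fastq_names is hypothetical, and the sketch covers only the read files, not the log files.

```python
def fastq_names(prefix="pirs_reads", read_len=100, insert_mean=180, gzip=False):
    """Return the expected names of the read 1 and read 2 FASTQ files."""
    suffix = ".fq.gz" if gzip else ".fq"
    stem = f"{prefix}_{read_len}_{insert_mean}"
    return (f"{stem}_1{suffix}", f"{stem}_2{suffix}")


# Values from example 4.2: -o Human -l 90 -m 170
print(fastq_names("Human", 90, 170))
```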
5 Output file format
====================
a) *.fq/*.fq.gz
FASTQ files containing the simulated reads. The files are given names
similar to PREFIX_70_800_1.fq, where PREFIX is the prefix provided by the -o
option (default: "pirs_reads"), 70 is the read length, 800 is the mean
insert length, and 1 means the file for read 1 of the read pairs. These
files will be written as GZIP files with the .gz suffix if you provide the
`-c gzip' option.
@read_800_21/1
ACGGAAAAGTTACGCTATCGCATGCGTGTAAGAACACTGCTCCTACGCCCATTTTATCGATGGCGCCCAG
+
egcggdggfgfgggggfeggggYbcgegfgggbggg^e]egfegggfbSeggdggegg`^eJgggcbEeb
@read_800_22/1
CACGGGGGGACTTTATTTAATGAGCGGCTGTAACTTGGTCCGTCGTTTGAGAGGGGACACCTCATATGAT
+
gggggegcgeggggcgfcgc_gf_ggfefcgVgegcfcdgf`geggdd[ge`ggafeggggdgdgee^gg
b) *.read.info
Information about each simulated read.
Column 1: read ID.
Column 2: name of the reference file; this is mainly provided so that you
can tell which chromosome set a read came from if you simulate reads
from a diploid genome produced with the `pirs diploid' command.
Column 3: FASTA/FASTQ sequence ID of the contig/scaffold/chromosome.
Column 4: position (1-based) in the chromosome.
Column 5: "+" forward direction; "-" reverse direction.
Column 6: real insert size.
Column 7: length of the read end masked by the EAMSS algorithm.
Column 8: read position (1-based) of each substitution error, with the raw base -> error base.
Column 9: read position (1-based) of each insertion error, with the inserted base.
Column 10: read position (1-based) of each deletion error, with the deleted base.
c) *_snp.lst
For example:
I 3 3 G A
I 45342 45355 C T
I 104775 104680 C T
.....
Column 1: chromosome sequence ID.
Column 2: position(1-based) of SNP in reference chromosome.
Column 3: position(1-based) of SNP in simulated diploid chromosome.
Column 4: base of reference chromosome sequence.
Column 5: the base of SNP.
d) *_indel.lst
For example:
I - 3 3 1 C
IV + 5 5 2 AC
.....
Column 1: chromosome sequence ID.
Column 2: "-" Deletion; "+" Insertion.
Column 3: position(1-based) of InDel in the reference chromosome sequence.
For deletions, it is the position of the first deleted base. For
insertions, it is the position of the reference base corresponding
to the base directly before the insertion.
Column 4: position(1-based) of InDel in the diploid sequence. For
deletions, it is the position of the diploid sequence base
corresponding to the reference base directly before the deletion.
For insertions, it is the position of the first inserted base.
Column 5: length of InDel (always positive).
Column 6: sequence of the InDel.
I - 3 3 1 C
Ref: a t G c a // 3 is the position of the G base in the chromosome sequence.
: a t G - a // the G base is followed by a deletion of the "C" base.
IV + 5 5 2 AC
Ref: t t g c A - - g t t // 5 is the position of the A base in the chromosome sequence.
: t t g c A A C g t t // the A base is followed by an insertion of the "AC" bases.
e) *_inversion.lst
For example:
I 50191 50195 100
I 948984 948903 200
Column 1: chromosome sequence ID.
Column 2: position (1-based) of the beginning of the inversion in the reference
chromosome sequence.
Column 3: position (1-based) of the beginning of the inversion in the diploid
chromosome sequence.
Column 4: length of the inversion.
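As an illustration of reading the list files described above, the Python sketch below parses one line of a *_snp.lst file into named fields. It assumes the five columns are separated by whitespace, which matches the example lines shown but is not stated explicitly; the names SnpRecord and parse_snp_line are hypothetical.

```python
from typing import NamedTuple


class SnpRecord(NamedTuple):
    chrom: str        # column 1: chromosome sequence ID
    ref_pos: int      # column 2: 1-based position in the reference
    diploid_pos: int  # column 3: 1-based position in the diploid copy
    ref_base: str     # column 4: base in the reference
    snp_base: str     # column 5: base of the SNP


def parse_snp_line(line):
    """Parse one *_snp.lst line (assumed to be whitespace-separated)."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 columns, got {len(fields)}: {line!r}")
    chrom, ref_pos, diploid_pos, ref_base, snp_base = fields
    return SnpRecord(chrom, int(ref_pos), int(diploid_pos), ref_base, snp_base)


record = parse_snp_line("I 45342 45355 C T")
print(record.chrom, record.ref_pos, record.ref_base, "->", record.snp_base)
```

The same approach should carry over to the indel and inversion lists by changing the field names and the expected column count.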
6 Notes
=======
a) pIRS does not simulate reads containing the "N" character. If your input reference
contains "N" characters, reads generated in those regions will be
discarded.
b) When running a simulation without the --no-subst-errors option, the maximum
length of the simulated reads depends on the number of cycles recorded in
the base-calling profile. The user must set the read length to no more
than half the number of cycles recorded in the base-calling profile.
c) When masking quality values, the program uses the same EAMSS algorithm
from CASAVA v1.8.0. This is only done if the --eamss option is supplied.
d) The program parses one chromosome at a time. Reads are evenly distributed
among the chromosomes.
e) To re-do a simulation and get the exact same results, you must set the
random seed with the -S parameter. In addition, you must use the
single-threaded version of pIRS, or else use only one simulator thread
(see the --threads option).
f) pIRS will shift quality values when the substitution-error rate set by the
user differs from that in the profile. The quality scores in the output
data range from 2 to the maximum score in the profile. You can find the actual
substitution-error rate of the output data in the file *.error_rate.distr.
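Note e) follows from the way the -S option seeds each simulator thread with the base seed plus the thread number. The Python sketch below illustrates that scheme only. It uses Python's own generator rather than the one in pIRS, it assumes thread numbers start at 0, and the name make_thread_rngs is hypothetical.

```python
import random


def make_thread_rngs(base_seed, num_threads):
    """One generator per simulator thread, seeded with base_seed + thread number."""
    return [random.Random(base_seed + t) for t in range(num_threads)]


# Same base seed and same thread count: every thread repeats its stream.
first = [rng.random() for rng in make_thread_rngs(12345, 2)]
again = [rng.random() for rng in make_thread_rngs(12345, 2)]
print("streams repeat:", first == again)
```

Each per-thread stream is repeatable, but which reads a given thread produces depends on how work is shared between threads, which is why a single simulator thread is needed for exactly reproducible output.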
For updates and support, please refer to http://code.google.com/p/pirs/ .