[![Edwards
Lab](https://img.shields.io/badge/Bioinformatics-EdwardsLab-03A9F4)](https://edwards.sdsu.edu/research)
[![DOI](https://www.zenodo.org/badge/60999054.svg)](https://www.zenodo.org/badge/latestdoi/60999054)
[![License:
MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
![GitHub language
count](https://img.shields.io/github/languages/count/linsalrob/PhiSpy)
[![Build
Status](https://travis-ci.org/linsalrob/PhiSpy.svg?branch=master&label=Travis%20Build)](https://travis-ci.org/linsalrob/PhiSpy)
[![PyPi](https://img.shields.io/pypi/pyversions/phispy.svg?style=flat-square&label=PyPi%20Versions)](https://pypi.org/project/PhiSpy/)
[![BioConda
Install](https://img.shields.io/conda/dn/bioconda/phispy.svg?style=flat-square&label=BioConda%20install)](https://anaconda.org/bioconda/phispy)
[![Downloads](https://img.shields.io/github/downloads/linsalrob/PhiSpy/total?style=flat-square)](https://github.com/linsalrob/PhiSpy/releases)
What is PhiSpy?
===============
PhiSpy identifies prophages in Bacterial (and probably Archaeal)
genomes. Given an annotated genome it will use several approaches to
identify the most likely prophage regions.
Initial versions of PhiSpy were written by
Sajia Akhter (sajia\@stanford.edu) [Edwards Bioinformatics
Lab](http://edwards.sdsu.edu/research/)
Improvements, bug fixes, and other changes were made by
Katelyn McNair [Edwards Bioinformatics
Lab](http://edwards.sdsu.edu/research/) and Przemyslaw Decewicz [DEMB at
the University of Warsaw](http://ddlemb.com/)
Installation
============
Conda
-----
The easiest way to install for all users is to use `bioconda`.
``` {.bash}
conda install -c bioconda phispy
```
PIP
---
Installing with `pip` requires a C++ compiler and the Python header
files. You should be able to install them, and then PhiSpy, like this:
``` {.bash}
sudo apt install -y build-essential python3-dev python3-pip
python3 -m pip install --user PhiSpy
```
This will install `PhiSpy.py` in `~/.local/bin` which should be in your
`$PATH` but might not be (see
[this](https://bugs.launchpad.net/ubuntu/+source/bash/+bug/1588562)
detailed discussion). See the tips and tricks below for a solution to
this.
Advanced Users
--------------
For advanced users, you can clone the git repository and use that
(though `pip` is the recommended install method).
``` {.bash}
git clone https://github.com/linsalrob/PhiSpy.git
cd PhiSpy
python3 setup.py install --user --record installed_files.txt
```
Note that we recommend using `--record` to save a list of all the files
that were installed by `PhiSpy`. If you ever want to uninstall it, or to
remove everything to reinstall e.g. from `pip`, you can simply use the
contents of that file:

    cat installed_files.txt | xargs rm -f

If you have root and you want to install globally, you can change the
setup command.
``` {.bash}
git clone https://github.com/linsalrob/PhiSpy.git
cd PhiSpy
python3 setup.py install
```
For ease of use, you may wish to add the location of `PhiSpy.py` to your
`$PATH`.
Software Requirements
---------------------
PhiSpy requires the following programs to be installed on the system.
Most of these are likely already on your system or will be installed
using the mechanisms above.
1. `Python` - version 3.4 or later
2. `Biopython` - version 1.58 or later
3. `gcc` - GNU project C and C++ compiler - version 4.4.1 or later
4. The `Python.h` header file. This is included in `python3-dev` that
is available on most systems.
Testing PhiSpy.py
=================
Download the [Streptococcus pyogenes M1
genome](https://raw.githubusercontent.com/linsalrob/PhiSpy/master/tests/Streptococcus_pyogenes_M1_GAS.gb)
``` {.bash}
curl -Lo Streptococcus_pyogenes_M1_GAS.gb https://bit.ly/37qFArb
PhiSpy.py -o Streptococcus.phages Streptococcus_pyogenes_M1_GAS.gb
```
or to run it with the `Streptococcus` training set:
``` {.bash}
PhiSpy.py -o Streptococcus.phages -t data/trainSet_160490.61.txt Streptococcus_pyogenes_M1_GAS.gb
```
This uses the `GenBank` format file for *Streptococcus pyogenes* M1 GAS
that we provide in the [tests/](tests/) directory, and we use the
training set for *S. pyogenes* M1 GAS that we have pre-calculated. This
quickly identifies the four prophages in this genome, runs the repeat
finder on all of them, and outputs the answers.
You will find the output files from this query in the directory given
with `-o` (`Streptococcus.phages` in the commands above).
Download more testing data
--------------------------
You can also download all the genomes in [tests/](tests). These are not
installed with PhiSpy if you use pip/conda, but will be if you clone the
repository. Please note that these are stored on [git
lfs](https://git-lfs.github.com/), and so if you notice an error that
the files are small and don't ungzip, you may need to (i) install
`git lfs` and (ii) use `git lfs fetch` to update this data.
Running PhiSpy.py
=================
The simplest command is:
``` {.bash}
PhiSpy.py genbank_file -o output_directory
```
where:

- `genbank_file`: the input DNA sequence file in GenBank format.
- `output_directory`: the directory where the final output files will
  be created.

If you have a new genome, we recommend annotating it using the [RAST
server](http://rast.nmpdr.org/rast.cgi) or
[PROKKA](https://github.com/tseemann/prokka). RAST has a server that
allows you to upload and download the genome (and can handle lots of
genomes), while PROKKA is stand-alone software.
### phage\_genes
By default, `PhiSpy.py` uses *strict* mode, where we look for two or
more genes that are likely to be a phage in each prophage region. If you
increase the value of `--phage_genes` that will reduce the number of
prophages that are predicted. Conversely, if you reduce this, or set it
to `0` we will overcall mobile elements.
When `--phage_genes` is set to `0`, `PhiSpy.py` will identify other
mobile elements like plasmids, integrons, and pathogenicity islands.
Somewhat unexpectedly, it will also identify the ribosomal RNA operons
as likely being mobile since they are unlike the host's backbone!
### color
If you add the `--color` flag, we will color the CDS based on their
function. The colors are primarily used in
[artemis](https://sanger-pathogens.github.io/Artemis/) for visualizing
phage regions.
### file name prefixes
By default the outputs from `PhiSpy.py` have standard names. If you
supply a file name prefix it will be prepended to all the files so that
you can run `PhiSpy.py` on multiple genomes and have the outputs in the
same directory without overwriting each other.
### gzip support
`PhiSpy.py` natively supports both reading and writing files in `gzip`
format. If you provide a `gzipped` input file, we will write a `gzipped`
output file.
### HMM Searches
To also consider the signal from an HMM profile search:
``` {.bash}
PhiSpy.py genbank_file -o output_directory --phmms hmm_db --threads 4 --color
```
where:

- `hmm_db`: the reference HMM profile database to search with the
  genome-encoded proteins (at the moment)

Training sets were searched with [pVOG
database](http://dmk-brain.ecn.uiowa.edu/pVOGs) HMM profiles:
[AllvogHMMprofiles.tar.gz](http://dmk-brain.ecn.uiowa.edu/pVOGs/downloads/All/AllvogHMMprofiles.tar.gz).
To use it:
``` {.bash}
wget http://dmk-brain.ecn.uiowa.edu/pVOGs/downloads/All/AllvogHMMprofiles.tar.gz
tar -zxvf AllvogHMMprofiles.tar.gz
cat AllvogHMMprofiles/* > pVOGs.hmm
```
Then use `pVOGs.hmm` as `hmm_db`.
Because this adds an extra step before the regular PhiSpy processing,
the input `genbank_file` is updated and saved in `output_directory`.
When the `--color` flag is used, an additional `/color` qualifier is
added to the updated GenBank file so that you can easily distinguish
proteins with hits to `hmm_db` while viewing the file in
[Artemis](https://www.sanger.ac.uk/science/tools/artemis).

When running PhiSpy again on the same input data with the `--phmms`
option, you can skip the search step with the `--skip_search` flag.

Another database that may be of interest is the
[VOGdb](http://vogdb.org/) database. You can download all their VOGs,
and then press them into a compiled format for `hmmer`:
``` {.bash}
curl -LO http://fileshare.csb.univie.ac.at/vog/latest/vog.hmm.tar.gz
mkdir vog
tar -C vog -xf vog.hmm.tar.gz
cat vog/* > VOGs.hmms
hmmpress VOGs.hmms
```
### Metrics
We use several different metrics to predict regions that are prophages,
and there are some optional metrics you can add. The default set of
metrics are:
- `orf_length_med`: median ORF length
- `shannon_slope`: the slope of Shannon's diversity of *k*-mers across
the window under consideration. You can also expand this with the
`--expand_slope` option.
- `at_skew`: the normalized AT skew across the window under
consideration
- `gc_skew`: the normalized GC skew across the window under
consideration
- `max_direction`: The maximum number of genes in the same direction
You can specify each of these options with the `--metrics` flag, for
example:

    PhiSpy.py --metrics shannon_slope

or

    PhiSpy.py --metrics gc_skew

If you wish to specify more than one metric, you can either use one
`--metrics` flag and list your options, e.g.

    PhiSpy.py --metrics shannon_slope gc_skew

or provide each one, e.g.:

    PhiSpy.py --metrics shannon_slope --metrics gc_skew

The default is all of these, and so omitting the `--metrics` flag is
equivalent to

    PhiSpy.py --metrics orf_length_med shannon_slope at_skew gc_skew max_direction

The choice(s) you provide are recorded in the log file.
You can also add a few other options
- `phmms`: The [phmm](#HMM-Searches) search results
- `phage_genes`: The number of genes that must be annotated as phage
in the region
- `nonprophage_genegaps` : The maximum number of non-phage genes
between two phage-like regions that will enable them to be merged
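
To make the sequence-composition metrics above more concrete, here is a
minimal Python sketch of the textbook definitions of Shannon diversity
of *k*-mers and of GC/AT skew for a single window. This is not PhiSpy's
implementation: the function names and the default `k` are assumptions
made for illustration, and PhiSpy's normalization and slope calculation
are not reproduced here.

```python
import math
from collections import Counter


def shannon_kmer_diversity(seq, k=4):
    """Shannon diversity (in bits) of the k-mers in one window.

    Hypothetical helper: k=4 is an arbitrary illustrative default.
    """
    seq = seq.upper()
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0
    total = len(kmers)
    counts = Counter(kmers)
    # H = -sum(p * log2(p)) over the observed k-mers
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def skew(seq, first, second):
    """Generic skew: (first - second) / (first + second)."""
    seq = seq.upper()
    a, b = seq.count(first), seq.count(second)
    return (a - b) / (a + b) if (a + b) else 0.0


def gc_skew(seq):
    """Unnormalized GC skew of one window."""
    return skew(seq, "G", "C")


def at_skew(seq):
    """Unnormalized AT skew of one window."""
    return skew(seq, "A", "T")
```

A window made of a single repeated *k*-mer has a diversity of zero,
while a window in which every *k*-mer is different has the maximum
diversity for its length.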
Help
====
For the help menu use the `-h` option:
``` {.bash}
python PhiSpy.py -h
```
Output Files
============
`PhiSpy` has the option of creating multiple output files with the
prophage data:
1. **prophage\_coordinates.tsv** (code: 1)
This is the coordinates of each prophage identified in the genome, and
their *att* sites (if found) in tab separated text format.
The columns of the file are:

- 1. Prophage number
- 2. The contig upon which the prophage resides
- 3. The start location of the prophage
- 4. The stop location of the prophage

If we can detect the *att* sites, the additional columns are:

- 11. start of *attL*;
- 12. end of *attL*;
- 13. start of *attR*;
- 14. end of *attR*;
- 15. sequence of *attL*;
- 16. sequence of *attR*;
- 17. The explanation of why this *att* site was chosen for this
  prophage.

2. **GenBank format output** (code: 2)
We provide a duplicate GenBank record that is the same as the input
record, but we have inserted the prophage information, including *att*
sites into the record.
If the original GenBank file was provided in `gzip` format this file
will also be created in gzip format.
3. **prophage and bacterial sequences** (code: 4)
`PhiSpy` can automatically separate the DNA sequences into prophage and
bacterial components. If this output is chosen, we generate both fasta
and GenBank format outputs:

- *GenBank files*: Two files are made, one for the bacteria and one for
  the phages. Each contains the appropriate fragments of the genome
  annotated as in the original.
- *fasta files*: Two files are made, the first contains the entire
  genome, but the prophage regions have been masked with `N`s. We
  explicitly chose this format for a few reasons: (i) it is trivial to
  convert this format into separate contigs without the Ns but it is
  more complex to go from separate contigs back to a single joined
  contig; (ii) when read mapping against the genome, understanding that
  reads map either side of a prophage may be important; (iii) when
  looking at insertion points this allows you to visualize where the
  prophage was lying.

4. **prophage\_information.tsv** (code: 8)
This is a tab separated file, and is the key file to assess prophages in
genomes (see [assessing predictions](#assessing-predictions), below).
The file contains all the genes of the genome, one per line. The tenth
column represents the status of a gene. If this column is 0 then we
consider this a bacterial gene. If it is non-zero it is probably a phage
gene, and the higher the score the more likely we believe it is a phage
gene. This is the raw data that we use to identify the prophages in your
genome.

This file has 16 columns:

- 1. The id of each gene;
- 2. function: function of the gene (or `product` from a GenBank file);
- 3. contig;
- 4. start: start location of the gene;
- 5. stop: end location of the gene;
- 6. position: a sequential number of the gene (starting at 1);
- 7. rank: rank of each gene provided by random forest;
- 8. my\_status: status of each gene based on random forest;
- 9. pp: classification of each gene based on their function;
- 10. Final\_status: the status of each gene. For prophages, this column
  has the number of the prophage as listed in `prophage.tbl`. If the
  column contains a 0 we believe that it is a bacterial gene. Otherwise
  we believe that it is possibly a phage gene.

If we can detect the *att* sites, the additional columns are:

- 11. start of *attL*;
- 12. end of *attL*;
- 13. start of *attR*;
- 14. end of *attR*;
- 15. sequence of *attL*;
- 16. sequence of *attR*;

5. **prophage.tsv** (code: 16)
This is a simpler version of the *prophage\_coordinates.tsv* file that
only has prophage number, contig, start, and stop.
6. **GFF3 format** (code: 32)
This is the prophage information suitable for insertion into a
[GFF3](https://m.ensembl.org/info/website/upload/gff3.html) file. This
is a legacy file format and, since GFF3 is no longer widely supported,
it only has the prophage coordinates. Please post an issue on GitHub
if more complete GFF3 files are required.
7. **prophage.tbl** (code: 64)
This file has two columns separated by tabs \[prophage\_number,
location\]. This is also a legacy file that is not generated by
default. The prophage number is a sequential number of the prophage
(starting at 1), and the location, in the format contig\_start\_stop,
encompasses the prophage.
8. **test data** (code: 128)
This file has the data used in the random forest. The columns are:

- Identifier
- Median ORF length
- Shannon slope
- Adjusted AT skew
- Adjusted GC skew
- The maximum number of ORFs in the same direction
- PHMM matches
- Status

The numbers are averaged across a window of size specified by
`--window_size`
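
Because contig names can themselves contain underscores (for example
`NC_002737`), the `contig_start_stop` location used in `prophage.tbl`
is easiest to split from the right. Here is a minimal Python sketch,
assuming the last two underscore-separated fields are always the
integer start and stop. `parse_location` is a hypothetical helper, not
part of PhiSpy, and the example string in its docstring is constructed
from the example data in this README rather than copied from a real
output file.

```python
def parse_location(location):
    """Split a 'contig_start_stop' string into (contig, start, stop).

    Hypothetical helper. Example input (constructed for illustration):
    'NC_002737_529631_569288'.
    """
    # rsplit from the right so underscores inside the contig name survive
    contig, start, stop = location.rsplit("_", 2)
    return contig, int(start), int(stop)
```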
Choosing which output files are created.
----------------------------------------
We have provided the option (`--output_choice`) to choose which output
files are created. Each file above has a code associated with it, and to
include that file add up the codes:

  Code   File
  ------ ------------------------------------------------------
  1      prophage\_coordinates.tsv
  2      GenBank format output
  4      prophage and bacterial sequences
  8      prophage\_information.tsv
  16     prophage.tsv
  32     GFF3 format output of just the prophages
  64     prophage.tbl
  128    test data used in the random forest
  256    GFF3 format output for the annotated genomic contigs

So for example, if you want to get `GenBank format output` (2) and
`prophage_information.tsv` (8), then enter an `--output_choice` of 10.
The default is 3: you will get both the `prophage_coordinates.tsv` and
`GenBank format output` files.
*Note:* Choice `32` will only output the prophages themselves in GFF3
format. In contrast, choice `256` outputs annotated genomes. This is
probably the best choice to bring the genome into Artemis as it will
handle multiple contigs correctly.
If you want *all* files output, use `--output_choice 512`.
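
The arithmetic can be sketched in Python. This is only an illustration,
assuming the codes combine as independent bit flags (which "add up the
codes" suggests); `output_choice` and `describe_choice` are hypothetical
helpers, not PhiSpy functions, and the special value 512 is not
modelled.

```python
# Codes copied from the table above
OUTPUT_CODES = {
    1: "prophage_coordinates.tsv",
    2: "GenBank format output",
    4: "prophage and bacterial sequences",
    8: "prophage_information.tsv",
    16: "prophage.tsv",
    32: "GFF3 format output of just the prophages",
    64: "prophage.tbl",
    128: "test data used in the random forest",
    256: "GFF3 format output for the annotated genomic contigs",
}


def output_choice(*codes):
    """Add up the requested codes to get a value for --output_choice."""
    unknown = [c for c in codes if c not in OUTPUT_CODES]
    if unknown:
        raise ValueError(f"Unknown output code(s): {unknown}")
    # set() so that repeating a code does not count it twice
    return sum(set(codes))


def describe_choice(choice):
    """List the files selected by a summed --output_choice value."""
    return [name for code, name in OUTPUT_CODES.items() if choice & code]
```

For example, `output_choice(2, 8)` gives 10, and `describe_choice(3)`
lists the two default files.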
Example Data
============
- *Streptococcus pyogenes* M1 GAS which has a single genome contig.
The genome contains four prophages.
To analyze this data, you can use:

    PhiSpy.py -o output_directory -t data/trainSet_160490.61.txt tests/Streptococcus_pyogenes_M1_GAS.gb.gz

And you should get a prophage table that has this information (for
example, take a look at `output_directory/prophage.tbl`).

  Prophage number   Contig       Start     Stop
  ----------------- ------------ --------- ---------
  pp\_1             NC\_002737   529631    569288
  pp\_2             NC\_002737   778642    820599
  pp\_3             NC\_002737   1192630   1222549
  pp\_4             NC\_002737   1775862   1782822

Assessing predictions
=====================
As with any software, it is critical that you assess the output from
`phispy` to see if it actually makes sense! We start by ensuring we have
the `prophage_information.tsv` file output (this is not output by
default, and requires adding 8 to the `--output_choice` flag).
That is a tab-separated text file that you can import into Microsoft
Excel, LibreOffice Calc, Google Sheets, or your favorite spreadsheet
viewing program.
There are a few columns that you should pay attention to:

- *position* (the 6<sup>th</sup> column) is the position of the gene in
  the genome. If you sort by this column you will always return the
  genome to the original order.
- *Final status* (the 10<sup>th</sup> column) is whether this region is
  predicted to be a prophage or not. The number is the prophage number.
  If the entry is 0 it is not a prophage.
- *my status* and *pp* (the 8<sup>th</sup> and 9<sup>th</sup> columns)
  are interim indicators about whether this gene is potentially part of
  a phage.

We recommend:

1. Freeze the first row of the spreadsheet so you can see the column
   headers.
2. Sort the spreadsheet by the *my status* column and color any row red
   where the value in this column is greater than 0.
3. Sort the spreadsheet by the *final status* column and color those
   rows identified as a prophage green.
4. Sort the spreadsheet by the *position* column.

Now all the prophages are colored green, while all the potential
prophage genes that are not included as part of a prophage are colored
red. You can easily review those non-prophage regions and determine
whether *you* think they should be included in prophages. Note that in
most cases you can adjust the `phispy` parameters to include regions you
think are prophages.
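
If you prefer a script to a spreadsheet, the same review can be
sketched in Python. This is a rough illustration rather than part of
PhiSpy: it assumes the file starts with a single header row, that the
columns are in the order described for `prophage_information.tsv`, and
that both status columns hold numbers. `unassigned_phage_like_genes` is
a hypothetical name.

```python
import csv

MY_STATUS_COL = 7      # 8th column: my_status (0-based index)
FINAL_STATUS_COL = 9   # 10th column: Final_status (0-based index)


def unassigned_phage_like_genes(handle):
    """Return rows flagged by my_status but not placed in any prophage.

    These correspond to the rows colored red (and not green) in the
    spreadsheet recipe.
    """
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the assumed header row
    hits = []
    for row in reader:
        if len(row) <= FINAL_STATUS_COL:
            continue  # ignore short or blank lines
        flagged = float(row[MY_STATUS_COL]) > 0
        in_prophage = float(row[FINAL_STATUS_COL]) != 0
        if flagged and not in_prophage:
            hits.append(row)
    return hits
```

Pass it an open file handle, for example the result of
`open("output_directory/prophage_information.tsv")`, and inspect the
returned rows in genome order.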
**Note:** Ensure that while you are reviewing the results, you pay
particular attention to the *contig* column. In partial genomes, contig
breaks are very often located in prophages. This is usually because
prophages often contain sequences that are repeated around the genome.
We have an [open issue](https://github.com/linsalrob/PhiSpy/issues/33)
to try and resolve this in a meaningful way.
Interactive PhiSpy
==================
We have created a [jupyter
notebook](https://github.com/linsalrob/PhiSpy/blob/master/jupyter_notebooks/PhiSpy.ipynb)
example where you can run `PhiSpy` to test the effect of the different
parameters on your prophage predictions. Change the name of the genbank
file to point to your genome, and change the values in `parameters` and
see how the prophage predictions vary!
Tips, Tricks, and Errors
========================
If you are feeling lazy, you actually only need to use
`sudo apt install -y python3-pip; python3 -m pip install phispy` since
python3-pip requires `build-essential` and `python3-dev`!
If you try `PhiSpy.py -v` and get an error like this:
``` {.bash}
$ PhiSpy.py -v
-bash: PhiSpy.py: command not found
```
Then you can either use the full path:
``` {.bash}
~/.local/bin/PhiSpy.py -v
```
or add that location to your `$PATH`:
``` {.bash}
echo "export PATH=\$HOME/.local/bin:\$PATH" >> ~/.bashrc
source ~/.bashrc
PhiSpy.py -v
```
Exit (error) codes
==================
We use a few different error codes to signify things that we could not
compute. So far, we have:

  ----------------------------------------------------------------------------------------------
  Exit Code               Meaning                 Suggested solution
  ----------------------- ----------------------- ----------------------------------------------
  2                       No input file provided  We need a file to work with!

  3                       No output directory     We need somewhere to write the results to!
                          provided

  10                      No training sets        This should be in the default install. Please
                          available               check your installation

  11                      The specific training   Check the argument passed to the
                          set is not available    `--training_set` parameter

  13                      No kmers file found     This should be in the default install. Please
                                                  check your installation

  20                      IO Error                There was an error reading your input file.

  25                      Non nucleotide base     Check for a non-standard base in your sequence
                          found

  26                      An ORF with no bases    This is probably a really short ORF and should
                                                  be deleted.

  30                      No contigs              We filter contigs by length, and so try
                                                  adjusting the `--min_contig_size` parameter,
                                                  though the default is 5,000 bp and you will
                                                  need some adjacent genes!

  40                      No ORFs in your genbank Please annotate your genome, e.g. using
                          file                    [RAST](http://rast.nmpdr.org/) or
                                                  [PROKKA](https://github.com/tseemann/prokka)

  41                      Less than 100 ORFs are  Please annotate your genome, e.g. using
                          in your annotated       [RAST](http://rast.nmpdr.org/) or
                          genome. This is not     [PROKKA](https://github.com/tseemann/prokka)
                          enough to find a
                          prophage
  ----------------------------------------------------------------------------------------------

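
When PhiSpy is run from a pipeline, the exit codes listed above can be
turned back into readable messages. Here is a minimal Python sketch:
the dictionary is transcribed from the table, `explain_exit_code` is a
hypothetical helper, and treating 0 as success is the usual shell
convention rather than something stated in this README.

```python
# Meanings transcribed from the exit code table
EXIT_CODES = {
    2: "No input file provided",
    3: "No output directory provided",
    10: "No training sets available",
    11: "The specific training set is not available",
    13: "No kmers file found",
    20: "IO Error",
    25: "Non nucleotide base found",
    26: "An ORF with no bases",
    30: "No contigs",
    40: "No ORFs in your genbank file",
    41: "Less than 100 ORFs are in your annotated genome",
}


def explain_exit_code(code):
    """Translate an exit code into the meaning given in the table."""
    if code == 0:
        return "Success"  # assumption: conventional meaning of exit code 0
    return EXIT_CODES.get(code, f"Unknown exit code {code}")
```

One way to use it is to pass the `returncode` attribute of the object
returned by `subprocess.run` to `explain_exit_code`.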
Making your own training sets
=============================
If the reference datasets lack close relatives of the bacteria you are
interested in, you can make your own training sets by providing
at least a single genome in which you indicate prophage proteins. This
is done by adding a new qualifier to the GenBank annotation of each CDS
feature within a prophage region: `/is_phage="1"`. This allows PhiSpy to
distinguish the signal from bacterial/phage regions and make a training
set to use afterwards during classification with the random forest
algorithm.

We provide a script, `mark_prophage_features.py`, to automate that
process. It updates GenBank files based on PhiSpy's
prophage\_predictions.tsv file format or a user's tab-delimited table
with the following information in columns for each prophage region:

1. path to GenBank file
2. replicon id
3. prophage start coordinate
4. prophage end coordinate

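
A user table with those four columns can be written with Python's `csv`
module. This is a minimal sketch under two assumptions: that the table
has no header row, and that one prophage region goes on each line.
`write_prophage_regions` is a hypothetical helper, and the file name
used in the comment is a placeholder.

```python
import csv


def write_prophage_regions(regions, handle):
    """Write (genbank_path, replicon_id, start, end) tuples as TSV.

    Assumes no header row and one prophage region per line.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for genbank_path, replicon_id, start, end in regions:
        writer.writerow([genbank_path, replicon_id, int(start), int(end)])


# Example call (placeholder file name, coordinates from the example data):
# with open("regions.tsv", "w", newline="") as out:
#     write_prophage_regions(
#         [("my_genome.gb", "NC_002737", 529631, 569288)], out
#     )
```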
To make training sets out of your files, use the `make_training_sets.py`
script. It allows you to update/extend PhiSpy's default training sets or
overwrite them with just your data.

`make_training_sets.py` prepares all required input files, i.e. it makes
phage/bacteria-specific kmer sets based on `/is_phage="1"` qualifiers,
reads information about taxonomy (if requested for grouping with
`--use_taxonomy`), calls PhiSpy in training mode and prepares the
training sets.
``` {.bash}
make_training_sets.py -d input_directory -g groups_file --use_taxonomy -k kmer_size -t kmers_type --phmms hmm_db --threads num_threads --retrain
```
where:

- `input_directory`: a directory where all GenBank files for training
  are stored. Note that the provided path will be added to the file
  names in `groups_file`.
- `groups_file`: a file mapping GenBank file names (with extension) to
  the name of the group they will form; each file can be assigned to
  more than one group. Take a look at how the reference data grouping
  file was constructed at `test_genbank_files/groups.txt`.
- `use_taxonomy`: this option creates groups of training sets based on
  the taxonomy within the analyzed GenBank files. If taxonomy
  information is missing, the genome is assigned to the *Bacteria*
  group.
- `kmer_size`: the size of the kmers that will be produced. By default
  it's 12. If changed, remember to also change that parameter while
  running PhiSpy with the produced training sets.
- `kmers_type`: the type of generated kmers. By default 'all' means
  generating kmers by 1 nt. If changed, remember to also change that
  parameter while running PhiSpy with the produced training sets.

Besides the flags that allow training with the phmm signal, there are
also `--retrain` and `--absolute_retrain` flags. Each of them triggers a
complete reanalysis of the input files, but they were added for
different reasons. The first should be used whenever any file previously
used for training has changed, e.g. more/fewer phage proteins were
marked with `/is_phage="1"`, as it triggers preparation of new kmers
files. The second additionally ignores the `trainingGenome_list.txt`
file and therefore allows you to omit PhiSpy's default reference
genomes. The same will happen when `trainingGenome_list.txt` is missing
from PhiSpy's installation directory.

All files created while training, i.e. phage/bacteria kmers and the
testSet for each GenBank file, are stored in the
`PhiSpyModules/data/testSets/` directory in PhiSpy's installation
directory. This saves a bit of time when adding new genomes and
retraining.
Preparing GenBank files
-----------------------
- it is recommended to mark prophage proteins even from prophage
remnants/disrupted regions composed of a few proteins with
`/is_phage="1"` to minimize the loss of good signal, kmers in
particular,
- don't use too many genomes (e.g. 100) as you may end up with a
small set of phage-specific kmers,
- try to pick several genomes with different prophages to increase the
diversity.