##########################
# Other Scripts #
##########################
Other Scripts
Created By: Justin Miller, Brandon Pickett, Perry Ridge
Email: jmiller@byu.edu
##########################
combineOrthoGroups.py
The purpose of this program is to combine the pairwise orthologs reported in multiple JustOrthologs
output files into ortholog groups. This script requires one-to-one orthology, meaning that a group
cannot contain more than one gene from the same species. It also requires that species names be
consistent across the input files (e.g., "Homo sapiens" and "humans" are treated as different species).
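The grouping idea can be sketched as follows. This is only an illustration, not the code in combineOrthoGroups.py: the function name combine_pairs, the ((species, gene), (species, gene)) input layout, and the choice to drop any group that contains two genes from one species are all assumptions made for this sketch. It uses a union-find structure so that pairs sharing a gene end up in the same group, and it runs under Python 2.7 or 3.

```python
def combine_pairs(pairs):
    """Merge pairwise orthologs into groups (hypothetical helper).

    `pairs` is an iterable of ((species_a, gene_a), (species_b, gene_b))
    tuples. This layout is an assumption, not the JustOrthologs file format.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        # Walk up to the root, shortening the path along the way.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    # Link the two genes of every pair into the same set.
    for left, right in pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_left] = root_right

    # Collect the members of each set.
    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), []).append(node)

    result = []
    for members in groups.values():
        species = [name for name, _ in members]
        # One-to-one orthology: keep a group only if every species occurs once.
        # Dropping the other groups is a choice made for this sketch.
        if len(species) == len(set(species)):
            result.append(sorted(members))
    return sorted(result)
```

Because species names are compared as plain strings, ("Homo sapiens", "A1") and ("humans", "A1") would be treated as genes from two different species, which is why consistent naming matters.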
REQUIREMENTS:
combineOrthoGroups.py is written in Python version 2.7 and requires the following libraries:
1. sys
2. argparse
These dependencies should already be satisfied if you have run any of the other programs.
Input Files:
This program takes as input either paths to JustOrthologs output files or the path to a
directory containing JustOrthologs output files.
An optional path to an output file can be provided.
Run python combineOrthoGroups.py -h for usage arguments.
EXAMPLE USAGE:
python combineOrthoGroups.py -id smallTest/testCombineOrthoGroups/ -o output
##########################
The following scripts are included in the wrapper and can be used together in the single-step process
described in README_WRAPPER.
##########################
gff3_parser.py
The purpose of this script is to parse a gff3 file and an accompanying reference fasta file.
This script will extract all CDS regions from the fasta file and combine them into a single
record, with each CDS region followed by an asterisk ('*') to signify the end of the exon.
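A minimal sketch of the record layout described above is shown here. It is not taken from gff3_parser.py: the function name join_cds and the (start, end) input layout are assumptions for illustration, and only forward-strand regions are handled (reverse-complementing is left out). GFF3 coordinates are 1-based and inclusive, which is why the slice starts at start - 1.

```python
def join_cds(sequence, cds_regions):
    """Concatenate CDS regions of `sequence`, each followed by '*'.

    `cds_regions` holds (start, end) pairs in GFF3 coordinates
    (1-based, inclusive). Hypothetical helper; forward strand only.
    """
    pieces = []
    for start, end in sorted(cds_regions):
        # Convert the 1-based inclusive range to a Python slice.
        pieces.append(sequence[start - 1:end] + "*")
    return "".join(pieces)
```

For example, join_cds("AAACCCGGGTTT", [(7, 9), (1, 3)]) gives "AAA*GGG*": the regions are emitted in coordinate order and each one ends with an asterisk.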
The FASTA header lines must be formatted in one of two ways:
1. >NC_######
OR
2. >gi|####|ref|NC_#####
where NC_#### is the identifier used in the gff3 file.
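The two header layouts can be reduced to the gff3 identifier roughly as follows. This is a sketch under assumptions, not the parsing code used by gff3_parser.py: the function name reference_id is hypothetical, and the identifiers in the usage note are made-up placeholders.

```python
def reference_id(header):
    """Return the sequence identifier from a FASTA header line.

    Handles the two layouts listed above:
      >NC_######               -> first whitespace-delimited token
      >gi|####|ref|NC_#####    -> fourth '|'-separated field
    Hypothetical helper for illustration only.
    """
    line = header.strip()
    if line.startswith(">"):
        line = line[1:]
    fields = line.split("|")
    if len(fields) >= 4 and fields[0] == "gi" and fields[2] == "ref":
        token = fields[3]
    else:
        token = line
    parts = token.split()
    # An empty header yields an empty identifier rather than an error.
    return parts[0] if parts else ""
```

Both reference_id(">NC_000001 example chromosome") and reference_id(">gi|12345|ref|NC_000001| example chromosome") return "NC_000001", which can then be matched against the first column of the gff3 file.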
The current reference genome and gff3 file can be found here: https://www.ncbi.nlm.nih.gov/genome/guide/human/
REQUIREMENTS:
gff3_parser uses Python version 2.7
Python libraries used include:
1. sys
2. argparse
3. gzip
All three are part of the Python 2.7 standard library, so no separate installation is needed.
Input Files:
This program takes as input a path to a gff3 file, a path to a fasta file, and an optional path to an
output file.
EXAMPLE USAGE:
python gff3_parser.py -g smallTest/wrapperTest/small_human.gff3 -f smallTest/wrapperTest/small_human.fasta.gz -o output.fasta
python gff3_parser.py -g smallTest/wrapperTest/small_human.gff3 -f smallTest/wrapperTest/small_human.fasta.gz > output.fasta
##########################
getNoException.py
The purpose of this script is to eliminate any annotated exceptions
from a fasta file. Often, gff3 files have gene annotations for partial
proteins or known error sequences. To eliminate the effects of error
sequences on our analysis, these sequences are eliminated. Furthermore,
multiple isoforms of the same gene are often annotated. This script removes
all but the longest isoform.
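The isoform step can be sketched as below. This is an illustration only, not the logic of getNoException.py: the function name longest_isoforms and the (gene_id, header, sequence) record layout are assumptions, since how the script identifies the gene for each record is not described here. Length is measured on the raw sequence string, and the first record wins a tie.

```python
def longest_isoforms(records):
    """Keep only the longest sequence for each gene (hypothetical helper).

    `records` is an iterable of (gene_id, header, sequence) tuples.
    Returns a dict mapping gene_id to (header, sequence).
    """
    best = {}
    for gene_id, header, sequence in records:
        current = best.get(gene_id)
        # Replace the stored isoform only when the new one is strictly longer.
        if current is None or len(sequence) > len(current[1]):
            best[gene_id] = (header, sequence)
    return best
```

Filtering of annotated exceptions (partial proteins, known error sequences) is not shown because it depends on how those annotations appear in the headers.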
REQUIREMENTS:
getNoException uses Python version 2.7
Python libraries used include:
1. sys
2. argparse
3. gzip
All three are part of the Python 2.7 standard library, so no separate installation is needed.
Input Files:
This program takes as input a path to an input fasta file and an optional path to an
output file.
EXAMPLE USAGE:
python getNoException.py -i smallTest/testNoException/test.fasta -o output.fasta
python getNoException.py -i smallTest/testNoException/test.fasta > output.fasta
##########################
sortFastaBySeqLen.sh
The purpose of this script is to sort a fasta file by the number of
CDS regions in each gene. The input fasta file requires the end of each CDS
region to be marked with an asterisk ("*"). The provided script, gff3_parser, can
create a fasta file formatted correctly for sortFastaBySeqLen.sh.
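The sorting criterion can be illustrated in Python, although the script itself is a bash script. The sketch below is not a port of sortFastaBySeqLen.sh: the function names read_fasta and sort_by_cds_count are hypothetical, and the sort direction is exposed as a parameter because the order used by the script is not stated here. Each record's CDS count is taken to be the number of asterisks in its sequence.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def sort_by_cds_count(records, descending=True):
    """Sort (header, sequence) pairs by the number of '*' markers.

    The default direction is an assumption made for this sketch.
    Records with equal counts keep their original relative order.
    """
    return sorted(records, key=lambda rec: rec[1].count("*"), reverse=descending)
```

A file could be processed with sort_by_cds_count(read_fasta(open(path))), where path points to a fasta file produced by gff3_parser.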
REQUIREMENTS:
sortFastaBySeqLen is a bash shell script that should be run on Linux or macOS.
If you are using macOS, make sure that gsed is installed:
brew install gnu-sed
Input Files:
This program takes as input a path to an input fasta file and an optional path to an
output file. It can also read the input file from standard input.
EXAMPLE USAGE:
bash sortFastaBySeqLen.sh smallTest/testNoException/test.fasta output.fasta
bash sortFastaBySeqLen.sh smallTest/testNoException/test.fasta > output.fasta
##########################
Thank you, and happy researching!!