Bowtie: an Ultrafast, Lightweight Short Read Aligner
Bowtie Getting Started Guide
============================
Download and extract the appropriate Bowtie binary release from
http://bowtie-bio.sf.net into a fresh directory. Change to that
directory.
Performing alignments
---------------------
The Bowtie source and binary packages come with a pre-built index of
the E. coli genome, and a set of 1,000 35-bp reads simulated from that
genome. To use Bowtie to align those reads, issue the following
command. If you get an error message "command not found", try adding
a "./" before the "bowtie".
    bowtie e_coli reads/e_coli_1000.fq
The first argument to bowtie is the basename of the index for the
genome to be searched. The second argument is the name of a FASTQ file
containing the reads.
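As a quick sanity check on the input before aligning, the number of reads in a FASTQ file can be estimated from its line count. The helper below is only a sketch: the function name count_fastq_reads is made up for this guide, and it assumes the common four-lines-per-record FASTQ layout with no wrapped sequence lines.

```shell
# Hypothetical helper, not part of Bowtie.
# Count reads in a FASTQ file, assuming each record is exactly four lines
# (header, sequence, '+' separator, qualities).
count_fastq_reads() {
    local lines
    lines=$(wc -l < "$1")
    echo $(( lines / 4 ))
}
```

If the bundled read file matches the description above, running count_fastq_reads reads/e_coli_1000.fq should print 1000.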
Depending on your computer, the run might take anywhere from a few
seconds to about a minute. You will see bowtie print many lines of
output. Each line is an alignment for a read. The name of the aligned
read appears in the leftmost column. The final line should say
"Reported 699 alignments to 1 output stream(s)" or something similar.
Next, issue this command:
    bowtie -t e_coli reads/e_coli_1000.fq e_coli.map
This run calculates the same alignments as the previous run, but the
alignments are written to e_coli.map (the final argument) rather than
to the screen. Also, the -t option instructs Bowtie to print timing
statistics. The output should look something like this:
    Time loading forward index: 00:00:00
    Time loading mirror index: 00:00:00
    Seeded quality full-index search: 00:00:00
    # reads processed: 1000
    # reads with at least one reported alignment: 699 (69.90%)
    # reads that failed to align: 301 (30.10%)
    Reported 699 alignments to 1 output stream(s)
    Time searching: 00:00:00
    Overall time: 00:00:00
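The percentages in this summary can be cross-checked against the alignment file itself. The following is a rough sketch rather than anything shipped with Bowtie: the function names are invented, and count_aligned_reads assumes Bowtie's default tab-delimited output with the read name in the first column.

```shell
# Hypothetical helpers, not part of Bowtie.

# Print the percentage of reads with a reported alignment, to two decimals.
# $1 = number of reads aligned, $2 = number of reads processed.
percent_aligned() {
    awk -v a="$1" -v n="$2" 'BEGIN { printf "%.2f\n", 100 * a / n }'
}

# Count distinct read names in an alignment file, assuming tab-delimited
# records with the read name in column 1.
count_aligned_reads() {
    cut -f1 "$1" | sort -u | wc -l | tr -d ' '
}
```

For the figures shown above, percent_aligned 699 1000 prints 69.90, and count_aligned_reads e_coli.map should agree with the "reads with at least one reported alignment" line.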
Installing a pre-built index
----------------------------
Download the pre-built S. cerevisiae genome package from the Bowtie
FTP site:
    ftp://ftp.cbcb.umd.edu/pub/data/bowtie_indexes/s_cerevisiae.ebwt.zip
All pre-built indexes are packaged as .zip archives, and the S.
cerevisiae archive is named s_cerevisiae.ebwt.zip. When it has
finished downloading, extract the archive into the Bowtie 'indexes'
subdirectory using your preferred unzip tool. The index is now
installed.
To test that the index is properly installed, issue this command from
the Bowtie install directory:
    bowtie -c s_cerevisiae ATTGTAGTTCGAGTAAGTAATGTGGGTTTG
This command searches the S. cerevisiae index with a single read. The
-c argument instructs Bowtie to obtain read sequences directly from
the command line rather than from a file. If the index is installed
properly, this command should print a single alignment and then exit.
If you would rather install pre-built indexes somewhere other than the
'indexes' subdirectory of the Bowtie install directory, simply set the
BOWTIE_INDEXES environment variable to point to your preferred
directory and extract indexes there instead.
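The choice between the two locations can be captured in a small helper. This is a sketch of the rule as described in this guide, not Bowtie's own lookup code; the function name bowtie_index_dir and the example paths are assumptions.

```shell
# Hypothetical helper, not part of Bowtie.
# Echo the directory where pre-built indexes should be extracted:
# $BOWTIE_INDEXES if it is set and non-empty, otherwise the 'indexes'
# subdirectory of the install directory.
# $1 = Bowtie install directory (defaults to the current directory).
bowtie_index_dir() {
    local install_dir
    install_dir="${1:-.}"
    echo "${BOWTIE_INDEXES:-$install_dir/indexes}"
}
```

One possible use, run from the Bowtie install directory, is: unzip s_cerevisiae.ebwt.zip -d "$(bowtie_index_dir)"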
Building a new index
--------------------
The pre-built E. coli index included with Bowtie is built from the
sequence for strain 536, known to cause urinary tract infections. We
will create a new index from the sequence of E. coli strain O157:H7, a
strain known to cause food poisoning. Download and decompress the
sequence file from:
    ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Escherichia_coli/all_assembly_versions/GCF_000513035.1_E._coli_O157/GCF_000513035.1_E._coli_O157_genomic.fna.gz
Once it has been downloaded and decompressed, move it to the Bowtie
install directory and issue this command:
    bowtie-build GCF_000513035.1_E._coli_O157_genomic.fna e_coli_O157_H7
The command should finish quickly, and print several lines of status
messages. When the command has completed, note that the current
directory contains four new files named e_coli_O157_H7.1.ebwt,
e_coli_O157_H7.2.ebwt, e_coli_O157_H7.rev.1.ebwt, and
e_coli_O157_H7.rev.2.ebwt. These files constitute the index. Move
these files to the 'indexes' subdirectory to install the index.
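Before or after moving the files, it can be useful to confirm that all four are present. The check below is a minimal sketch; the function name index_is_complete is hypothetical, and the four suffixes are taken from the file names listed above.

```shell
# Hypothetical helper, not part of Bowtie.
# Return 0 if all four index files for a basename exist in a directory,
# otherwise report the first missing file on stderr and return 1.
# $1 = index basename (e.g. e_coli_O157_H7), $2 = directory (default: .)
index_is_complete() {
    local base dir suffix
    base="$1"
    dir="${2:-.}"
    for suffix in 1.ebwt 2.ebwt rev.1.ebwt rev.2.ebwt; do
        if [ ! -f "$dir/$base.$suffix" ]; then
            echo "missing: $dir/$base.$suffix" >&2
            return 1
        fi
    done
    return 0
}
```

For example, index_is_complete e_coli_O157_H7 indexes should succeed once the files have been moved.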
To test that the index is properly installed, issue this command:
    bowtie -c e_coli_O157_H7 GAACCGTATTCACCCGCCATCCCCATGCCG
If the index is installed properly, this command should print a single
alignment and then exit.
Finding variations with SAMtools
--------------------------------
SAMtools (http://samtools.sf.net) is a suite of tools for storing,
manipulating, and analyzing alignments such as those output by Bowtie.
SAMtools understands alignments in either of two complementary
formats: the human-readable SAM format, or the binary BAM format.
Because Bowtie can output SAM (using the -S/--sam option), and SAM
can be converted to BAM using SAMtools, Bowtie users can make full use
of the analyses implemented in SAMtools, or in any other tools
supporting SAM or BAM.
We will use SAMtools to find SNPs in a set of simulated reads included
with Bowtie. The reads cover the first 10,000 bases of the pre-built
E. coli genome and contain 10 SNPs throughout. First, we run 'bowtie'
to align the reads, being sure to specify the -S option. We also
specify an output file that we will use as input for the next step
(though pipes can be used to accomplish the same thing without the
intermediate file):
    bowtie -S e_coli reads/e_coli_10000snp.fq ec_snp.sam
Next, we convert the SAM file to BAM in preparation for sorting. We
assume that SAMtools is installed and that the samtools binary is
accessible in the PATH.
    samtools view -bS -o ec_snp.bam ec_snp.sam
Next, we sort the BAM file, in preparation for SNP calling:
    samtools sort ec_snp.bam ec_snp.sorted
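As noted earlier, pipes can replace the intermediate files. The sketch below chains the align, convert, and sort steps into one function. It assumes the same legacy SAMtools command syntax used in this guide, where "-" stands for standard input and 'samtools sort' takes an output prefix; the function name align_to_sorted_bam is made up for illustration.

```shell
# Hypothetical wrapper, not part of Bowtie or SAMtools.
# Align reads and produce a sorted BAM without intermediate SAM/BAM files.
# $1 = index basename, $2 = FASTQ file, $3 = output prefix
# (the legacy 'samtools sort' syntax appends ".bam" to the prefix).
align_to_sorted_bam() {
    bowtie -S "$1" "$2" \
        | samtools view -bS - \
        | samtools sort - "$3"
}
```

Calling align_to_sorted_bam e_coli reads/e_coli_10000snp.fq ec_snp.sorted is intended to be equivalent to the three separate commands above.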
We now have a sorted BAM file called ec_snp.sorted.bam. Sorted BAM is
a useful format because the alignments are both compressed, which is
convenient for long-term storage, and sorted, which is convenient for
variant discovery. Finally, we call variants from the sorted BAM:
    samtools pileup -cv -f genomes/NC_008253.fna ec_snp.sorted.bam
For this sample data, the 'samtools pileup' command should print
records for 10 distinct SNPs, the first being at position 541 in the
reference.
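To check the number of SNPs without reading the output by eye, the pileup records can be saved to a file and the distinct sites counted. This is only a sketch: the function name count_variant_sites and the file name ec_snp.pileup are assumptions, and it relies on the pileup output being tab-delimited with the sequence name in column 1 and the position in column 2.

```shell
# Hypothetical helper, not part of SAMtools.
# Count distinct (sequence, position) pairs in saved pileup output,
# assuming tab-delimited records with the sequence name in column 1
# and the position in column 2.
count_variant_sites() {
    cut -f1,2 "$1" | sort -u | wc -l | tr -d ' '
}
```

After redirecting the pileup command's output to ec_snp.pileup, running count_variant_sites ec_snp.pileup should print 10 for this sample data if the results match the description above.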
See the SAMtools web site for details on how to use these and other
tools in the SAMtools suite: http://samtools.sf.net/.