================================
Questions asked by the attendees
================================
General/unclassified
--------------------
Identify the correct next-gen sequencing platform!
"it's complicated"
How do I assemble, call SNPs, compare genetic elements?
see list for assemblers; not covered but interesting (mapping vs
graph-based like SGA, Cortex, ??); MUMmer & Mauve
Guided vs mapped vs de novo assemblies
- nobody likes guided assemblies, apparently.
Can you salvage a run without enough data?
- define "salvage"
- you can use a small amount of data to develop a strategy, but you
still need more data
Compare male/female expression levels in D. suzukii.
- not covered
Should assembly workflows for complex metagenomes with a large
fraction of eukaryotes use metagenome tools or euk tools? Or, perhaps
which steps in a workflow for complex metagenomes with a large
fraction of eukaryotes should use euk tools and which steps should use
metagenomic tools?
- different assemblers but read QC, etc. is fairly similar
- different expectations
- see tables
What and how much should cores support? Standard protocols, basic
software install, hands-on consultation, ???
- it's a spectrum
- your core may have specialties in terms of sample prep
- mate pair libraries need experienced cores
Technology choice
-----------------
At what read length should we be moving towards OLC assemblers
instead of De Bruijn?
- no obvious answer; further investigation needed
- probably 500 bp to 1 kb (JTS)
What are the best tools (assemblers) and workflows for eukaryote
assemblies? Do they differ according to predicted genome size (80 Mb
nematodes vs. 1 Gb eukaryotes)?
- see table
How different are the workflows aimed at metagenome (gene contigs)
vs. whole-genome assembly?
- different assemblers may be useful; scale may be different; in table.
- note that contamination is increasingly endemic
- tools before and after assembly are very similar
Develop a scalable, cloud-based data analysis and visualization pipeline
- sorta, khmer-protocols etc.
- visualization is still lacking, although there are IGV and the Ray browser
Metagenomes and other mixtures
------------------------------
How to deal with "mixed assemblies" (e.g. an endosymbiont)? Is it
better to filter reads or contigs? Does the presence of other
sequences compromise assembly?
- blobology shows you the problems
- different coverage levels may mislead single-genome assemblers
- better to filter contigs because of specificity, if you get the
contigs :)
- if there is lots of contamination, SPAdes/diginorm/binning/etc. may be
useful
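As a rough illustration of filtering contigs rather than reads, here is a minimal sketch that flags contigs whose GC content or coverage falls outside an expected window. This is not the blobology tool itself (which also leans on taxonomic annotation); the `Contig` class, `split_contigs` function, and the thresholds are all hypothetical, and per-contig coverage is assumed to have been computed elsewhere.

```python
from dataclasses import dataclass


@dataclass
class Contig:
    name: str
    sequence: str
    coverage: float  # mean read depth; assumed to be computed elsewhere


def gc_fraction(sequence: str) -> float:
    """Fraction of G/C among unambiguous bases (0.0 if there are none)."""
    seq = sequence.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def split_contigs(contigs, gc_range, min_coverage):
    """Partition contigs into (kept, flagged).

    A contig is kept when its GC fraction lies inside gc_range and its
    coverage is at least min_coverage. Both thresholds are illustrative
    and need to be chosen per data set.
    """
    low, high = gc_range
    kept, flagged = [], []
    for contig in contigs:
        gc = gc_fraction(contig.sequence)
        if low <= gc <= high and contig.coverage >= min_coverage:
            kept.append(contig)
        else:
            flagged.append(contig)
    return kept, flagged
```

Flagged contigs are candidates for a closer look (e.g. a BLAST search), not automatic deletions; a symbiont with host-like GC and coverage would pass straight through this filter.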
How to assemble genomes from "messy" samples (non-clean tissue,
data more like a metagenome)? How can we separate out potential
symbionts and/or secondarily abundant microbes?
- see above
How does metagenomic assembly not collapse information from
homologs between strains/species?
- very situation/data specific
What can one expect to find based on coverage depth?
- preqc (and eventually khmer) may help diagnose low coverage,
high errors
- Mihai Pop argues (cite?) for doing a little exploratory
sequencing, then designing the next round of libraries based on
repeat content
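For a back-of-the-envelope expectation before any sequencing, the usual idealized calculation is coverage = reads x read length / genome size, with the fraction of bases left uncovered approximated by exp(-coverage). The sketch below assumes uniformly random reads, which real libraries are not, so treat the numbers as optimistic; the function names are made up for this example.

```python
import math


def expected_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Mean depth under uniform random sampling: C = N * L / G."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * read_length / genome_size


def fraction_uncovered(coverage: float) -> float:
    """Idealized probability that a given base is hit by no read: exp(-C)."""
    return math.exp(-coverage)


if __name__ == "__main__":
    # Hypothetical run: 1M reads of 100 bp on a 5 Mb genome.
    depth = expected_coverage(1_000_000, 100, 5_000_000)
    print(f"expected depth: {depth:.1f}x")
    print(f"expected uncovered fraction: {fraction_uncovered(depth):.2e}")
```

GC bias, repeats, and contamination all pull real coverage away from this model, which is what tools like preqc help diagnose.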
How to deal with contamination?
- see above.
- CONCOCT etc.
- kitomics; is there a simple way to remove kit fingerprint? (no.)
- this is a need.
- do your negative controls thoroughly, folks
Remove endosymbiont (e.g. Wolbachia) reads from fly and butterfly
data.
- see above. no standard protocol.
From shotgun data:
- alpha-diversity
* PhyloSift or other??
- beta-diversity
* PhyloSift or other
- Key taxa at multiple taxonomic levels
* PhyloSift or other
- Functional genes for a variety of pathways
* PhyloSift cures all (WebMGA is also good)
* MG-RAST pretends
* HUMAnN
Genome assembly
---------------
How closely related are my insects?
- probably, very. This is really a genotyping question...?
- can do with k-mer distributions (JTS?)
- but this is complicated and not necessarily easy
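One simple way to get a feel for the k-mer idea is a Jaccard similarity between the k-mer sets of two samples. To be clear, this swaps in set overlap for the "k-mer distributions" mentioned above (it ignores abundance entirely), and it is a toy sketch: `kmer_set` and `jaccard` are illustrative names, and real read sets need error filtering and far more memory-efficient counting.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_set(sequence: str, k: int) -> set:
    """Canonical k-mers of a sequence, skipping any k-mer with non-ACGT bases."""
    seq = sequence.upper()
    result = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            result.add(canonical(kmer))
    return result


def jaccard(a: set, b: set) -> float:
    """|a & b| / |a | b|; returns 0.0 when both sets are empty (a convention chosen here)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)
```

With raw reads, sequencing errors inflate the union and push the similarity down, which is part of why this is "complicated and not necessarily easy".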
Will inbreeding be useful or necessary?
- useful, but not necessary (just more $$ sequencing)
Sequence individuals to chromosomes, or at least to meaningful
synteny maps (inversions)?
- see table of sequencing approaches;
- optical maps, linkage maps, PacBio
- How to assemble Eukaryotes??
* see table
- Hybrid needed? What prep?
* see table
- Is WGA (whole-genome amplification) OK for mate-pair libraries?
* try to avoid
- Make references? (Should we make strain/haplotype-specific references)
* "ancestral graph reconstruction" theory
* tools needed to make a reference graph
- Iteratively improve reference genome with new sequences?
* can be done; no strong recommendation. Depends on the sequence
data you have. (PBJelly with PacBio, GAA for merging assemblies,
etc.) Talk to Titus because he has funding to do this on chicken.
- Make population-specific "type" references?
* see above, 'make references'
- What exactly is "finishing"?
* No one wanted to answer, no one does it.
* No such thing as a "finished" genome
- Bias when doing mapping vs de novo?
* There is allele-specific mapping bias. (For RNAseq, Wittkopp
reference)
Assembly outcomes, metrics & evaluation
---------------------------------------
Do we care about the difference between a 150 contig and a 100
contig assembly? What about 10 or 1000 contigs?
- depends on your question
- will be limited by your data, requires careful planning; see
15-17 commandments
- How large is the genome?
* preqc can tell you this, khmer can estimate badly
- What do I need to do to improve genome quality?
* more or less the topic of this entire workshop
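On the genome size question, the textbook approximation is: total number of non-error k-mers divided by the k-mer depth at the main coverage peak. The sketch below is not what preqc or khmer actually do; it assumes a k-mer histogram (multiplicity -> number of distinct k-mers) with a low-multiplicity error tail followed by one clear peak, and `estimate_genome_size` is a hypothetical helper. Heterozygosity and repeats break these assumptions quickly.

```python
def estimate_genome_size(histogram: dict) -> float:
    """Rough genome size from a k-mer histogram.

    histogram maps multiplicity -> number of distinct k-mers seen that
    many times. K-mers at or below the first valley are treated as
    errors and ignored; the rest are summed and divided by the peak depth.
    """
    depths = sorted(histogram)
    valley = None
    # Walk up from low multiplicity until counts start rising again.
    for prev, cur in zip(depths, depths[1:]):
        if histogram[cur] > histogram[prev]:
            valley = prev
            break
    if valley is None:
        raise ValueError("no rise after the error tail; cannot locate a coverage peak")
    solid = [d for d in depths if d > valley]
    peak = max(solid, key=lambda d: histogram[d])
    total_kmers = sum(d * histogram[d] for d in solid)
    return total_kmers / peak


if __name__ == "__main__":
    # Made-up histogram: error tail at 1-3x, main peak at 20x.
    toy = {1: 1000, 2: 100, 3: 10, 18: 50, 19: 200, 20: 500, 21: 200, 22: 50}
    print(estimate_genome_size(toy))
```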
Can we still use bad quality data?
* all data is bad quality data, so, yes
* limits your conclusions. It's called Science, folks.
* if data is randomly bad, OK; if it's unknown systematic error, then more data is worse.
Quality/completion metrics.
* see "finishing", above.
* reads mapped back; marker genes; CEGMA; REAPR; FRCbam; visual
inspection; bridgemapper
What quality genome can I expect?
* preqc can help tell you this; don't do wheat, soil, or fish.
What's a good way to evaluate completeness?
- Mapping of reads. High percentage is good, and tells you if there
is room for improvement.
- "treat your assembly as a hypothesis and see how well your reads
match"
- bring in orthogonal evidence (see Lex's list)
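To make "mapping of reads" concrete, here is a small sketch that computes the fraction of primary alignments that are mapped, directly from SAM text. It relies only on the SAM FLAG bits (0x4 unmapped, 0x100 secondary, 0x800 supplementary); the `mapping_rate` function is illustrative, and in practice something like `samtools flagstat` reports the same kind of number.

```python
def mapping_rate(sam_lines) -> float:
    """Fraction of primary SAM records that are mapped.

    Header lines (starting with '@') and blank lines are skipped.
    Secondary (0x100) and supplementary (0x800) records are ignored so
    that each read or mate is counted once.
    """
    total = 0
    mapped = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])
        if flag & 0x900:
            continue
        total += 1
        if not flag & 0x4:
            mapped += 1
    if total == 0:
        raise ValueError("no primary alignment records found")
    return mapped / total
```

Usage would be along the lines of `with open("reads_vs_assembly.sam") as handle: print(mapping_rate(handle))`, where the file name is a placeholder. A high rate only says the reads are represented somewhere in the assembly; it does not say they are placed correctly, hence the orthogonal evidence.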
Sample prep
-----------
How bad is Nextera, really (insertion bias and dual-mode
insert size), compared to TruSeq?
* you just need to be aware that Nextera is messy in different ways
* we want Nick to write a blog post
* Nextera is cheaper and easier, but has insertion bias and requires PCR
* high-GC organisms do worse with Nextera, and with Illumina overall
How much is it worth striving for PCR-free library preps to avoid
amplification bias? and/or going for single-molecule sequencing?
* depends on details: GC, your question, amount of DNA, and whether
you want to attach gold molecules to each sample
Predicting the future
---------------------
It is hard to predict things, especially the future.
- Niels Bohr
predictions:
- @ryneches: by Dec 31, 2015, Illumina reads will be longer than
Sanger reads ever were. (> 1kb)
- @arturgreensward: by Dec 31, 2015, PacBio will be the only thing
used to sequence bacterial genomes
- @pathogenomenick: by Dec 31, 2014, there will be two Nanopore
platforms usefully available
- OH: by Dec 31, 2013, Mick will ask for an embargo
on new assemblers
- @lexnederbragt: 2014-2015, diploid-aware assemblers will emerge
- @pathogenomenick: by end of 2015, graph-based variant calling will
be default
- @ryneches: by 2016, computation will be considerably more expensive
than sequencing
- @ctitusbrown: by 2016, the field will have solved the expensive
computation problem and > 50% of non-biomedical computational
analysis will be a commodity service (although it may not be
a particularly good commodity service)