<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang xml:lang>
<head>
<meta charset="utf-8" />
<meta name="generator" content="pandoc" />
<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes" />
<meta name="author" content="Jorrit H. Poelen" />
<meta name="author" content="Jenn Yost" />
<meta name="author" content="Katelin Pearson" />
<meta name="dcterms.date" content="2023-09-11" />
<title>Building a Digital Extended Specimen One Association at a Time: What Does It Take to Extend OBI Herbarium Records with their Associated GenBank Sequences?</title>
<style>
html {
color: #1a1a1a;
background-color: #fdfdfd;
}
body {
margin: 0 auto;
max-width: 36em;
padding-left: 50px;
padding-right: 50px;
padding-top: 50px;
padding-bottom: 50px;
hyphens: auto;
overflow-wrap: break-word;
text-rendering: optimizeLegibility;
font-kerning: normal;
}
@media (max-width: 600px) {
body {
font-size: 0.9em;
padding: 12px;
}
h1 {
font-size: 1.8em;
}
}
@media print {
html {
background-color: white;
}
body {
background-color: transparent;
color: black;
font-size: 12pt;
}
p, h2, h3 {
orphans: 3;
widows: 3;
}
h2, h3, h4 {
page-break-after: avoid;
}
}
p {
margin: 1em 0;
}
a {
color: #1a1a1a;
}
a:visited {
color: #1a1a1a;
}
img {
max-width: 100%;
}
svg {
height: auto;
max-width: 100%;
}
h1, h2, h3, h4, h5, h6 {
margin-top: 1.4em;
}
h5, h6 {
font-size: 1em;
font-style: italic;
}
h6 {
font-weight: normal;
}
ol, ul {
padding-left: 1.7em;
margin-top: 1em;
}
li > ol, li > ul {
margin-top: 0;
}
blockquote {
margin: 1em 0 1em 1.7em;
padding-left: 1em;
border-left: 2px solid #e6e6e6;
color: #606060;
}
div.abstract {
margin: 2em 2em 2em 2em;
text-align: left;
font-size: 85%;
}
div.abstract-title {
font-weight: bold;
text-align: center;
padding: 0;
margin-bottom: 0.5em;
}
code {
font-family: Menlo, Monaco, Consolas, 'Lucida Console', monospace;
font-size: 85%;
margin: 0;
hyphens: manual;
}
pre {
margin: 1em 0;
overflow: auto;
}
pre code {
padding: 0;
overflow: visible;
overflow-wrap: normal;
}
.sourceCode {
background-color: transparent;
overflow: visible;
}
hr {
background-color: #1a1a1a;
border: none;
height: 1px;
margin: 1em 0;
}
table {
margin: 1em 0;
border-collapse: collapse;
width: 100%;
overflow-x: auto;
display: block;
font-variant-numeric: lining-nums tabular-nums;
}
table caption {
margin-bottom: 0.75em;
}
tbody {
margin-top: 0.5em;
border-top: 1px solid #1a1a1a;
border-bottom: 1px solid #1a1a1a;
}
th {
border-top: 1px solid #1a1a1a;
padding: 0.25em 0.5em 0.25em 0.5em;
}
td {
padding: 0.125em 0.5em 0.25em 0.5em;
}
header {
margin-bottom: 4em;
text-align: center;
}
#TOC li {
list-style: none;
}
#TOC ul {
padding-left: 1.3em;
}
#TOC > ul {
padding-left: 0;
}
#TOC a:not(:hover) {
text-decoration: none;
}
code{white-space: pre-wrap;}
span.smallcaps{font-variant: small-caps;}
div.columns{display: flex; gap: min(4vw, 1.5em);}
div.column{flex: auto; overflow-x: auto;}
div.hanging-indent{margin-left: 1.5em; text-indent: -1.5em;}
ul.task-list[class]{list-style: none;}
ul.task-list li input[type="checkbox"] {
font-size: inherit;
width: 0.8em;
margin: 0 0.8em 0.2em -1.6em;
vertical-align: middle;
}
.display.math{display: block; text-align: center; margin: 0.5rem auto;}
div.csl-bib-body { }
div.csl-entry {
clear: both;
}
.hanging-indent div.csl-entry {
margin-left:2em;
text-indent:-2em;
}
div.csl-left-margin {
min-width:2em;
float:left;
}
div.csl-right-inline {
margin-left:2em;
padding-left:1em;
}
div.csl-indent {
margin-left: 2em;
} </style>
<!--[if lt IE 9]>
<script src="//cdnjs.cloudflare.com/ajax/libs/html5shiv/3.7.3/html5shiv-printshiv.min.js"></script>
<![endif]-->
</head>
<body>
<header id="title-block-header">
<h1 class="title">Building a Digital Extended Specimen One Association
at a Time: What Does It Take to Extend OBI Herbarium Records with their
Associated GenBank Sequences?</h1>
<p class="author">Jorrit H. Poelen</p>
<p class="author">Jenn Yost</p>
<p class="author">Katelin Pearson</p>
<p class="date">2023-09-11</p>
<div class="abstract">
<div class="abstract-title">Abstract</div>
<p>Specimens from Natural History Collections are physical repositories
of genetic information. Genetic sequences extracted from specimens are
stored in genetic sequence databases like the openly accessible GenBank
at NCBI, the DNA Data Bank of Japan, or the European Nucleotide Archive
(ENA). While researchers and collection managers make efforts to
associate (or link) Natural History Collection records with their
derived genetic accession records, extra work is needed to make these
associations explicit. We describe how a collaboration between a
biodiversity informatics expert and collection managers of the
Hoover/OBI Herbarium at Cal Poly, San Luis Obispo, CA was forged with the
aim of extending OBI specimen records to include their associated GenBank
records. In addition, we quantify the costs of creating these specimen
extensions, and discuss the socio-economic capacity needed to repeat
this digital specimen extension process for the hundreds of millions of
specimen records available globally today.</p>
</div>
</header>
<nav id="TOC" role="doc-toc">
<ul>
<li><a href="#bibliography" id="toc-bibliography">References</a></li>
</ul>
</nav>
<p>Hello again Lindsay, Jenn, Katie,</p>
<p>Thinking about our Extended Specimen Workshop / OBI-GenBank
collaboration kept my poor little brain quite active today and
yesterday. In fact, I had trouble sleeping because of it.</p>
<p>As I am stewing on an abstract for your 2023-09-11 Symbiota Support
Hub session, I reflected on the effort it took for us to nurture our
collaboration:</p>
<ol type="1">
<li><p>time investment to organize the extended specimen workshop at
Digital Data 2023 in Tempe, AZ</p></li>
<li><p>time investment to package NCBI GenBank and OBI Herbarium digital
resources <span class="citation" data-cites="Poelen_2023_a Poelen_2023_b">(Poelen, Pearson, and Yost
2023; Poelen 2023)</span></p></li>
<li><p>time investment to build a custom (off-line enabled) workflow
based on (archived) digital resources of known origin</p></li>
<li><p>time investment to archive the NCBI GenBank Plant flat files at
ASU’s BioKIC via Globus, facilitated by Greg Post and Nico Franz</p></li>
<li><p>time investment by Katie and Jorrit to collaborate on a shared
Google Sheet to propose (Jorrit) and verify (Katie) GenBank&lt;&gt;OBI
association claims</p></li>
<li><p>time investment by Katie to populate OBI’s Symbiota records with
associated GenBank sequences</p></li>
<li><p>transfer of a symbolic reward (a tub of TJs Ginger Cookies) by
Symbiota developer Ed Gilbert at the July 2023 workshop on imagining a
Biological Action Center. This workshop itself required a 3-day time
investment on my part.</p></li>
</ol>
<p>and now . . .</p>
<p>(pending)</p>
<ol start="8" type="1">
<li>More time investment (mostly by Jorrit) to publicize a novel
workflow to discover GenBank associations in existing natural history
collections as published through DwC-A.</li>
</ol>
<p>In an effort to do a little cost/benefit analysis, I made a quick
back-of-the-napkin calculation of the (socio-)economic aspects of our
experiment: I spent about 16 hours of work (I kept track of my time;
this excludes writing this text) and got a tub of 78 TJs Ginger Cookies
(13 servings at 6 cookies a serving). (Thank you!) Noting that the
resale value of TJs Ginger Cookies is probably about $1 or less, I favor
using cookies as a unit instead of dollars. So, converting that to an
hourly rate gives: 78 cookies / 16h ~ 5 cookies / hour. Another way to
quantify the “value” of our method is to estimate the number of cookies
gained per created specimen-GenBank association (as measured from
https://cch2.org/portal/content/dwca/OBI_DwC-A.zip with signature
hash://sha256/cd9de973510975dac3394952bba9c486a482762b3beab05ecb678037b99ab85b
as seen on 2023-07-19T14:46:11.145Z):</p>
<p>78 cookies / 25 GenBank associations ~ 3 cookies / OBI-GenBank
association</p>
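<p>As an aside on the signature notation used above: a
hash://sha256/... identifier can be derived from the bytes of a file.
The sketch below is a minimal illustration, assuming the signature is
the plain SHA-256 digest of the archive's bytes. The function name
<code>content_signature</code> is made up for this example and is not
part of the workflow described here.</p>

```python
import hashlib
import io


def content_signature(stream, chunk_size=65536):
    """Return a hash://sha256/... signature for the bytes read from stream."""
    digest = hashlib.sha256()
    # Read in chunks so that large archives need not fit in memory.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return "hash://sha256/" + digest.hexdigest()


if __name__ == "__main__":
    # Stand-in bytes; a real run would open the downloaded zip in "rb" mode.
    print(content_signature(io.BytesIO(b"stand-in for OBI_DwC-A.zip")))
```

<p>If the assumption holds, re-running this on a fresh download and
comparing the result with the recorded signature would show whether the
archive changed since it was last seen.</p>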
<p>Assuming that someone is willing to work for 3 cookies per
association, and assuming that OBI is a representative collection in that
about 0.03% of its specimens have GenBank associations (25 detected
GenBank associations among 94,031 OBI preserved specimens), and estimating
about 200 million digitized preserved specimens (GBIF claims 225M,
iDigBio claims 138M as of 2023-08-10), you’d have to buy 200M * 0.03%
= 60k associations * 3 cookies / association = 180k cookies, or about
2300 tubs of TJs Ginger Cookies.</p>
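<p>The back-of-the-napkin numbers above can be re-traced with a few
lines of Python. This is only a sketch using the figures quoted in the
text; the function name <code>cookie_budget</code> and the constant
names are made up for illustration, and the 200 million specimen count
remains a rough estimate.</p>

```python
# Figures quoted in the text; the global specimen count is a rough estimate.
COOKIES_PER_TUB = 13 * 6        # 13 servings at 6 cookies a serving
HOURS_WORKED = 16
ASSOCIATIONS_FOUND = 25
OBI_SPECIMENS = 94_031
GLOBAL_SPECIMENS = 200_000_000


def cookie_budget(rate_percent=0.03, cookies_per_association=3):
    """Scale the OBI experiment up using the rounded figures from the text."""
    associations = GLOBAL_SPECIMENS * rate_percent / 100
    cookies = associations * cookies_per_association
    return {
        "associations": associations,
        "cookies": cookies,
        "tubs": cookies / COOKIES_PER_TUB,
    }


if __name__ == "__main__":
    print("cookies per hour:", COOKIES_PER_TUB / HOURS_WORKED)
    print("cookies per association:", COOKIES_PER_TUB / ASSOCIATIONS_FOUND)
    print("observed rate (%):", 100 * ASSOCIATIONS_FOUND / OBI_SPECIMENS)
    for name, value in cookie_budget().items():
        print(name, round(value))
```

<p>Before rounding, the hourly rate is 4.875 cookies and the cost per
association is 3.12 cookies. The tub count comes out near 2308, in line
with the "about 2300" given above.</p>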
<p>The case for making our method efficiently produce/distribute the
number of cookies needed per GenBank association:</p>
<ol type="1">
<li>monitor the availability of specimen-GenBank links in DwC-A and
GenBank (link out) over time</li>
<li>estimate time needed to discover new, and maintain existing,
specimen-GenBank links</li>
<li>produce re-usable methods, reducing development time</li>
<li>advertise the method to avoid rework</li>
<li>improve methods when needed through available means (standardization,
“smart” algorithms, efficient semi-automated link suggestion/curation
workflows)</li>
<li>express the value of having specimen-GenBank links readily/openly
available</li>
</ol>
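<p>To make the first item in the list above a bit more concrete: one
simple way to monitor link availability is to count how many occurrence
records carry a value in the Darwin Core term
<code>associatedSequences</code>. The sketch below assumes the
occurrence table of a DwC-A has been read into dictionary rows and that
links are recorded in that column. The function name
<code>count_sequence_links</code> and the sample rows are made up for
illustration.</p>

```python
import csv
import io


def count_sequence_links(rows, field="associatedSequences"):
    """Return (records_with_links, total_records) for an iterable of dict rows."""
    linked = 0
    total = 0
    for row in rows:
        total += 1
        # Treat missing or whitespace-only values as "no link recorded".
        if (row.get(field) or "").strip():
            linked += 1
    return linked, total


if __name__ == "__main__":
    # Tiny made-up sample standing in for an occurrence table from a DwC-A.
    sample = io.StringIO(
        "catalogNumber,associatedSequences\n"
        "EXAMPLE-1,https://www.ncbi.nlm.nih.gov/nuccore/EXAMPLE\n"
        "EXAMPLE-2,\n"
    )
    linked, total = count_sequence_links(csv.DictReader(sample))
    print(f"{linked} of {total} records carry a sequence link")
```

<p>Running such a count against successive versions of the same archive
would give a rough time series of link availability. It does not verify
that the recorded links are correct, which still needs a curator.</p>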
<p>So, now my open questions are:</p>
<p><strong>Q1. How can we estimate the methods needed to support the
discovery and maintenance of Specimen-GenBank record links such that they
can be sustained by those valuing the availability of these
links?</strong></p>
<p>This is taking into account that one person cannot possibly eat 180k
cookies before they go stale, even if someone is willing (and able) to
source the 2300 cookie tubs needed to discover the specimen-GenBank
claims. So, distribution of the cookies should be factored into the
development of the Specimen&lt;&gt;GenBank association discovery and
recording method.</p>
<p>As it stands, the work that Jorrit has done to showcase a method to
extend digital representations of preserved specimens with their GenBank
associations is not yet valued in terms of “real” monetary units, which
means that his work is, economically speaking, valueless.</p>
<p><strong>Q2. What is the value of Jorrit’s work so far? And who should
compensate him? What is 16 hours of my time worth to you? A $10 tub of
cookies?</strong></p>
<p>And, perhaps more importantly,</p>
<p><strong>Q3. Do the current funding mechanisms allow for rapid
development of ideas to address immediate needs in the biodiversity
informatics community?</strong></p>
<p>I’d like very much to contribute to your support hub event, and
before doing so, I’d like to discuss the thoughts above. The reason for
taking on this prototype challenge was to gather some evidence to
support the idea that current funding mechanisms and project management
are insufficient to build something as dynamic and complex as the
digital extended specimen.</p>
<p>It makes me think of a proverb I first encountered as a 6-year-old,
and found inspirational only later in life:</p>
<blockquote>
<p>“Everybody wants to go back to nature, but nobody wants to walk.”</p>
</blockquote>
<p>which can be reworded in terms of the digital extended specimen:</p>
<blockquote>
<p>“Everybody wants to have the digital extended specimen, but someone
else has to build it.”</p>
</blockquote>
<p>This may be a bit of an extreme statement in the context of our
collaboration, especially because I know that you’ve invested plenty of
time in extending existing digital specimens with their GenBank sequences
and beyond.</p>
<p>In short, I’d like to have more discussion around effective
collaborations that help nurture, and sustain, the socio-economic
aspects of the digital extended specimen. Without these productive
collaborations, the massive amount of work needed to better integrate
the biodiversity / biology and informatics disciplines will continue to
be as it is now: sporadic and mostly based on volunteer work. A similar,
but broader, argument can be made for creating good working conditions to
promote quality research <span class="citation" data-cites="Rahal_2023">(Rahal et al. 2023)</span>.</p>
<p>Curious to hear your thoughts,</p>
<p>thx, -jorrit</p>
<h1 class="unnumbered" id="bibliography">References</h1>
<div id="refs" class="references csl-bib-body hanging-indent" role="list">
<div id="ref-Poelen_2023_b" class="csl-entry" role="listitem">
Poelen, Jorrit H. 2023. <span>“GenBank PLN (Plantae, Fungi, Algae)
Sequence Index in TSV, CSV, JSONL Formats
hash://sha256/bc7368469e50020ce8ae27b9d6a9a869e0b9a2a0a9b5480c69ce6751fa4b870e
hash://md5/f6f78f64e3b3ff06adc3229badbd578b.”</span> Zenodo. <a href="https://doi.org/10.5281/zenodo.8117720">https://doi.org/10.5281/zenodo.8117720</a>.
</div>
<div id="ref-Poelen_2023_a" class="csl-entry" role="listitem">
Poelen, Jorrit H., Katelin Pearson, and Jenn Yost. 2023.
<span>“Extending OBI Herbarium Records to Include Associated NCBI
GenBank Sequences.
hash://sha256/be5605e58d2644baedcb160604080d9f02ce528064b7fbb13a5b556dd55cfeb6.”</span>
GitHub. <a href="https://github.com/jhpoelen/obi-genbank">https://github.com/jhpoelen/obi-genbank</a>.
</div>
<div id="ref-Rahal_2023" class="csl-entry" role="listitem">
Rahal, Rima-Maria, Susann Fiedler, Adeyemi Adetula, Ronnie P.-A.
Berntsson, Ulrich Dirnagl, Gordon B. Feld, Christian J. Fiebach, et al.
2023. <span>“Quality Research Needs Good Working Conditions.”</span>
<em>Nature Human Behaviour</em> 7 (2): 164–67. <a href="https://doi.org/10.1038/s41562-022-01508-2">https://doi.org/10.1038/s41562-022-01508-2</a>.
</div>
</div>
</body>
</html>