-
Notifications
You must be signed in to change notification settings - Fork 3
/
hands-on.qmd
330 lines (230 loc) · 11 KB
/
hands-on.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
---
title: "Hands-on DuckDB & dplyr"
execute:
warning: false
---
## The dataset
ARCTIC SHOREBIRD DEMOGRAPHICS NETWORK [https://doi.org/10.18739/A2222R68W](https://doi.org/10.18739/A2222R68W){target="_blank"}
Data set hosted by the NSF Arctic Data Center (<https://arcticdata.io>)
Field data on shorebird ecology and environmental conditions were collected from 1993-2014 at 16 field sites in Alaska, Canada, and Russia.
![Shorebird, copyright NYT](https://static01.nyt.com/images/2017/09/10/nyregion/10NATURE1/10NATURE1-superJumbo.jpg?quality=75&auto=webp){width=60%}
Data were not collected in every year at all sites. Studies of the population ecology of these birds included nest-monitoring to determine timing of reproduction and reproductive success; live capture of birds to collect blood samples, feathers, and fecal samples for investigations of population structure and pathogens; banding of birds to determine annual survival rates; resighting of color-banded birds to determine space use and site fidelity; and use of light-sensitive geolocators to investigate migratory movements. Data on climatic conditions, prey abundance, and predators were also collected. Environmental data included weather stations that recorded daily climatic conditions, surveys of seasonal snowmelt, weekly sampling of terrestrial and aquatic invertebrates that are prey of shorebirds, live trapping of small mammals (alternate prey for shorebird predators), and daily counts of potential predators (jaegers, falcons, foxes). Detailed field methods for each year are available in the ASDN_protocol_201X.pdf files. All research was conducted under permits from relevant federal, state and university authorities.
See `01_ASDN_Readme.txt` provided in the `data` folder for full metadata information about this data set.
## Analyzing the bird dataset using csv files (raw data)
Loading the necessary packages. DuckDB has its own R package that is mostly a wrapper around dbplyr and DBI.
```{r}
#| message: false
library(tidyverse)
```
Import the csv files with the bird species information:
```{r}
# Import the species
species_csv <- read_csv("data/species.csv")
glimpse(species_csv)
```
Let's explore what is in the `Relevance` attribute/column:
```{r}
species_csv %>%
group_by(Relevance) %>%
summarize(num_species = n())
```
We are interested in the `Study species` because according to the metadata, they are the species that are included in the data sets for banding, resighting, and/or nest monitoring.
Let us extract the species and sort them in alphabetical order:
```{r}
# list of the bird species included in the study
species_study <- species_csv %>%
filter(Relevance=="Study species") %>%
select(Scientific_name, Code) %>%
arrange(Scientific_name)
species_study
```
### Average egg volume
:::{.callout-tip}
## Analysis
***We would like to know what is the average egg size for each of those bird species. How would we do that?***
:::
We will need more information that what we have in our species table. Actually we will need to also retrieve information from the nests and eggs monitoring table.
An egg is in a nest, and a nest is associated with a species
```{r}
# information about the nests
nests_csv <- read_csv("data/ASDN_Bird_nests.csv")
# information about the
eggs_csv <- read_csv("data/ASDN_Bird_eggs.csv")
```
How do we join those tables?
```{r}
glimpse(eggs_csv)
```
`Nest_Id` seems like promising as a foreign key!!
```{r}
glimpse(nests_csv)
```
`Species` is probably the field we will use to join nest to the species
OK let's do it:
First, we need to compute the average of the volume of an egg. We can use the following formula:
$Volume=\frac{\Pi}6W^2L$
Where W is the width and L the length of the egg
We can use mutate to do so:
```{r}
eggs_volume_df <- eggs_csv %>%
mutate(egg_volume = pi/6*Width^2*Length)
```
Now let's join this information to the nest table, and average by species
```{r}
species_egg_volume_avg <- left_join(eggs_volume_df, nests_csv, by="Nest_ID") %>%
group_by(Species) %>%
summarise(egg_volume_avg = mean(egg_volume, na.rm = TRUE)) %>%
arrange(desc(egg_volume_avg)) %>%
drop_na()
species_egg_volume_avg
```
Ideally we would like the scientific names...
```{r}
species_egg_vol_avg <- species_study %>%
inner_join(species_egg_volume_avg, by = join_by(Code == Species))
species_egg_vol_avg
```
## Let's connect to our first database
```{r}
#| message: false
library(dbplyr) # to query databases in a tidyverse style manner
library(DBI) # to connect to databases
# install.packages("duckdb") # install this package to get duckDB API
library(duckdb) # Specific to duckDB
```
### Load the bird database
This database has been built from the csv files we just analyzed, so the data should be very similar - note we did not say identical more on this in the last section:
```{r}
conn <- dbConnect(duckdb::duckdb(), dbdir = "./data/bird_database.duckdb")
```
List all the tables present in the database:
```{r}
dbListTables(conn)
```
Let's have a look at the Species table
```{r}
species_db <- tbl(conn, "Species")
species_db
```
You can filter the data and select columns:
```{r}
species_db %>%
filter(Relevance=="Study species") %>%
select(Scientific_name) %>%
arrange(Scientific_name) %>%
head(3)
```
:::{.callout-note}
## Note
_Note that those are **not** data frames but tables. What `dbplyr` is actually doing behind the scenes is translating all those dplyr operations into SQL, sending the SQL code to query the database, retrieving results, etc._
:::
#### How can I get a "real" data frame?
You add `collect()` to your query.
```{r}
species_db %>%
filter(Relevance=="Study species") %>%
select(Scientific_name) %>%
arrange(Scientific_name) %>%
head(3) %>%
collect()
```
Note it means the full query is going to be ran and save in your R environment. This might slow things down, so you generally want to collect on the smallest data frame you can.
#### How can you see the SQL query?
Adding `show_query()` at the end of your code block will let you see the SQL code that has been used to query the database.
```{r}
# Add show_query() to the end to see what SQL it is sending!
species_db %>%
filter(Relevance=="Study species") %>%
select(Scientific_name) %>%
arrange(Scientific_name) %>%
head(3) %>%
show_query()
```
This is a great way to start getting familiar with the SQL syntax, because although you can do a lot with `dbplyr` you can not do everything that SQL can do. So at some point you might want to start using SQL directly.
Here is how you could run the query using the SQL code directly:
```{r}
# query the database using SQL
dbGetQuery(conn, "SELECT Scientific_name FROM Species WHERE (Relevance = 'Study species') ORDER BY Scientific_name LIMIT 3")
```
You can do pretty much anything with these quasi-tables, including grouping, summarization, joins, etc.
Let's count how many species there are per Relevance categories:
```{r}
species_db %>%
group_by(Relevance) %>%
summarize(num_species = n())
```
Does that code looks familiar? But this time, here is really the query that was used to retrieve this information:
```{r}
species_db %>%
group_by(Relevance) %>%
summarize(num_species = n()) %>%
show_query()
```
### Average egg volume analysis
Let's reproduce the egg volume analysis we just did. We can calculate the average bird eggs volume per species directly on the database:
```{r}
# loading all the necessary tables
eggs_db <- tbl(conn, "Bird_eggs")
nests_db <- tbl(conn, "Bird_nests")
```
Compute the volume using the same code as previously!! Yes, you can use mutate to create new columns on the tables object
```{r}
# Compute the egg volume
eggs_volume_db <- eggs_db %>%
mutate(egg_volume = pi/6*Width^2*Length)
```
:::{.callout-caution}
_Limitation: no way to add or update data in the database, `dbplyr` is view only. If you want to add or update data, you'll need to use the `DBI` package functions._
:::
Now let's join this information to the nest table, and average by species
```{r}
# Join the egg and nest tables to compute average
species_egg_volume_avg_db <- left_join(nests_db, eggs_volume_db, by="Nest_ID") %>%
group_by(Species) %>%
summarise(egg_volume_avg = mean(egg_volume, na.rm = TRUE)) %>%
arrange(desc(egg_volume_avg)) %>%
collect() %>%
drop_na()
species_egg_volume_avg_db
```
What does this SQL query looks like?
```{r}
species_egg_volume_avg_db <- left_join(eggs_volume_db, nests_db, by="Nest_ID") %>%
group_by(Species) %>%
summarise(egg_volume_avg = mean(egg_volume, na.rm = TRUE)) %>%
arrange(desc(egg_volume_avg)) %>%
show_query()
```
:::{.callout-note}
## Question
***Why does the SQL query include the volume computation?***
:::
### Disconnecting from the database
Before we close our session, it is good practice to disconnect from the database first
```{r}
DBI::dbDisconnect(conn, shutdown = TRUE)
```
## How did we create this database
You might be wondering how we created this database from our csv files. Most databases provide functions to import data from csv and other types of files. It is also possible to load data into the database programmatically from within R, one row at a time, using insert statements, but it is more common to load data from csv files. Note that since there is little data modeling within a csv file (the data does not have to be normalized or tidy), and no data type or value constraints can be enforced, a lot things can go wrong. Putting data in a database is thus a great opportunity to implement QA/QC and help you keep your data clean and tidy moving forward as new data are collected.
To look at one example, below is the SQL code that was used to create the `Bird_eggs` table:
```{sql eval=FALSE}
CREATE TABLE Bird_eggs (
Book_page VARCHAR,
Year INTEGER NOT NULL CHECK (Year BETWEEN 1950 AND 2015),
Site VARCHAR NOT NULL,
FOREIGN KEY (Site) REFERENCES Site (Code),
Nest_ID VARCHAR NOT NULL,
FOREIGN KEY (Nest_ID) REFERENCES Bird_nests (Nest_ID),
Egg_num INTEGER NOT NULL CHECK (Egg_num BETWEEN 1 AND 20),
Length FLOAT NOT NULL CHECK (Length > 0 AND Length < 100),
Width FLOAT NOT NULL CHECK (Width > 0 AND Width < 100),
PRIMARY KEY (Nest_ID, Egg_num)
);
COPY Bird_eggs FROM 'ASDN_Bird_eggs.csv' (header TRUE);
```
DuckDB's `COPY` SQL command reads a csv file into a database table. Had we not already created the table in the previous statement, DuckDB would have created it automatically and guessed at column names and data types. But by explicitly declaring the table, we are able to add more characterization to the data. Notable in the above:
- `NOT NULL` indicates that missing values are not allowed.
- Constraints (e.g., `Egg_num BETWEEN 1 and 20`) express our expectations about the data.
- A `FOREIGN KEY` declares that a value must refer to an existing value in another table, i.e., it must be a reference.
- A `PRIMARY KEY` identifies a quantity that should be unique within each row, and that serves as a row identifier.
Understand that a table declaration serves as more than documentation; the database actually enforces constraints.