forked from hadley/adv-r
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathData-structures.Rmd
551 lines (378 loc) · 22.9 KB
/
Data-structures.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
# Data structures {#data-structures}
This chapter summarises the most important data structures in base R. You've probably used many (if not all) of them before, but you may not have thought deeply about how they are interrelated. In this brief overview, I won't discuss individual types in depth. Instead, I'll show you how they fit together as a whole. If you need more details, you can find them in R's documentation.
R's base data structures can be organised by their dimensionality (1d, 2d, or nd) and whether they're homogeneous (all contents must be of the same type) or heterogeneous (the contents can be of different types). This gives rise to the five data types most often used in data analysis:
| | Homogeneous | Heterogeneous |
|----|---------------|---------------|
| 1d | Atomic vector | List |
| 2d | Matrix | Data frame |
| nd | Array | |
Almost all other objects are built upon these foundations. In [the OO field guide](#oo) you'll see how more complicated objects are built of these simple pieces. Note that R has no 0-dimensional, or scalar types. Individual numbers or strings, which you might think would be scalars, are actually vectors of length one.
Given an object, the best way to understand what data structures it's composed of is to use `str()`. `str()` is short for structure and it gives a compact, human readable description of any R data structure. \indexc{str()}
##### Quiz
Take this short quiz to determine if you need to read this chapter. If the answers quickly come to mind, you can comfortably skip this chapter. You can check your answers in [answers](#data-structure-answers).
1. What are the three properties of a vector, other than its contents?
1. What are the four common types of atomic vectors? What are the two
rare types?
1. What are attributes? How do you get them and set them?
1. How is a list different from an atomic vector? How is a matrix different
from a data frame?
1. Can you have a list that is a matrix? Can a data frame have a column
that is a matrix?
##### Outline
* [Vectors](#vectors) introduces you to atomic vectors and lists, R's 1d
data structures.
* [Attributes](#attributes) takes a small detour to discuss attributes,
R's flexible metadata specification. Here you'll learn about factors,
an important data structure created by setting attributes of an atomic
vector.
* [Matrices and arrays](#matrices-and-arrays) introduces matrices and arrays,
data structures for storing 2d and higher dimensional data.
* [Data frames](#data-frames) teaches you about the data frame, the most
important data structure for storing data in R. Data frames combine
the behaviour of lists and matrices to make a structure ideally suited for
the needs of statistical data.
## Vectors {#vectors}
The basic data structure in R is the vector. Vectors come in two flavours: atomic vectors and lists. They have three common properties:
* Type, `typeof()`, what it is.
* Length, `length()`, how many elements it contains.
* Attributes, `attributes()`, additional arbitrary metadata.
They differ in the types of their elements: all elements of an atomic vector must be the same type, whereas the elements of a list can have different types.
NB: `is.vector()` does not test if an object is a vector. Instead it returns `TRUE` only if the object is a vector with no attributes apart from names. Use `is.atomic(x) || is.list(x)` to test if an object is actually a vector.
### Atomic vectors
There are four common types of atomic vectors that I'll discuss in detail: logical, integer, double (often called numeric), and character. There are two rare types that I will not discuss further: complex and raw. \index{atomic vectors} \index{vectors!atomic|see{atomic vectors}}
Atomic vectors are usually created with `c()`, short for combine: \indexc{c()}
```{r}
dbl_var <- c(1, 2.5, 4.5)
# With the L suffix, you get an integer rather than a double
int_var <- c(1L, 6L, 10L)
# Use TRUE and FALSE (or T and F) to create logical vectors
log_var <- c(TRUE, FALSE, T, F)
chr_var <- c("these are", "some strings")
```
Atomic vectors are always flat, even if you nest `c()`'s:
```{r}
c(1, c(2, c(3, 4)))
# the same as
c(1, 2, 3, 4)
```
Missing values are specified with `NA`, which is a logical vector of length 1. `NA` will always be coerced to the correct type if used inside `c()`, or you can create `NA`s of a specific type with `NA_real_` (a double vector), `NA_integer_` and `NA_character_`. \indexc{NA}
#### Types and tests
Given a vector, you can determine its type with `typeof()`. \indexc{typeof()}
Use "is" functions with care. `is.character()`, `is.double()`, `is.integer()`, `is.logical()` are ok. The following are suprising:
* `is.vector()` tests for vectors with no attributes apart from names
* `is.atomic()` tests for atomic vectors or NULL
* `is.numeric()` tests for the numerical-ness of a vector,
not whether it's built on top of an integer or double.
#### Coercion
All elements of an atomic vector must be the same type, so when you attempt to combine different types they will be __coerced__ to the most flexible type. Types from least to most flexible are: logical, integer, double, and character. \index{coercion}
For example, combining a character and an integer yields a character:
```{r}
str(c("a", 1))
```
When a logical vector is coerced to an integer or double, `TRUE` becomes 1 and `FALSE` becomes 0. This is very useful in conjunction with `sum()` and `mean()`
```{r}
x <- c(FALSE, FALSE, TRUE)
as.numeric(x)
# Total number of TRUEs
sum(x)
# Proportion that are TRUE
mean(x)
```
Coercion often happens automatically. Most mathematical functions (`+`, `log`, `abs`, etc.) will coerce to a double or integer, and most logical operations (`&`, `|`, `any`, etc) will coerce to a logical. You will usually get a warning message if the coercion might lose information. If confusion is likely, explicitly coerce with `as.character()`, `as.double()`, `as.integer()`, or `as.logical()`.
### Lists
Lists are different from atomic vectors because their elements can be of any type, including lists. You construct lists by using `list()` instead of `c()`: \index{lists} \index{vectors!lists|see{lists}}
```{r}
x <- list(1:3, "a", c(TRUE, FALSE, TRUE), c(2.3, 5.9))
str(x)
```
Lists are sometimes called __recursive__ vectors, because a list can contain other lists. This makes them fundamentally different from atomic vectors.
```{r}
x <- list(list(list(list())))
str(x)
is.recursive(x)
```
`c()` will combine several lists into one. If given a combination of atomic vectors and lists, `c()` will coerce the vectors to lists before combining them. Compare the results of `list()` and `c()`:
```{r}
x <- list(list(1, 2), c(3, 4))
y <- c(list(1, 2), c(3, 4))
str(x)
str(y)
```
The `typeof()` a list is `list`. You can test for a list with `is.list()` and coerce to a list with `as.list()`. You can turn a list into an atomic vector with `unlist()`. If the elements of a list have different types, `unlist()` uses the same coercion rules as `c()`.
Lists are used to build up many of the more complicated data structures in R. For example, both data frames (described in [data frames](#data-frames)) and linear models objects (as produced by `lm()`) are lists:
```{r}
is.list(mtcars)
mod <- lm(mpg ~ wt, data = mtcars)
is.list(mod)
```
### Exercises
1. What are the six types of atomic vector? How does a list differ from an
atomic vector?
1. What makes `is.vector()` and `is.numeric()` fundamentally different to
`is.list()` and `is.character()`?
1. Test your knowledge of vector coercion rules by predicting the output of
the following uses of `c()`:
```{r, eval=FALSE}
c(1, FALSE)
c("a", 1)
c(list(1), "a")
c(TRUE, 1L)
```
1. Why do you need to use `unlist()` to convert a list to an
atomic vector? Why doesn't `as.vector()` work?
1. Why is `1 == "1"` true? Why is `-1 < FALSE` true? Why is `"one" < 2` false?
1. Why is the default missing value, `NA`, a logical vector? What's special
about logical vectors? (Hint: think about `c(FALSE, NA_character_)`.)
## Attributes {#attributes}
All objects can have arbitrary additional attributes, used to store metadata about the object. Attributes can be thought of as a named list (with unique names). Attributes can be accessed individually with `attr()` or all at once (as a list) with `attributes()`. \index{attributes}
```{r}
y <- 1:10
attr(y, "my_attribute") <- "This is a vector"
attr(y, "my_attribute")
str(attributes(y))
```
The `structure()` function returns a new object with modified attributes: \indexc{structure()}
```{r}
structure(1:10, my_attribute = "This is a vector")
```
By default, most attributes are lost when modifying a vector:
```{r}
attributes(y[1])
attributes(sum(y))
```
The only attributes not lost are the three most important:
* Names, a character vector giving each element a name, described in
[names](#vector-names).
* Dimensions, used to turn vectors into matrices and arrays,
described in [matrices and arrays](#matrices-and-arrays).
* Class, used to implement the S3 object system, described in [S3](#s3).
Each of these attributes has a specific accessor function to get and set values. When working with these attributes, use `names(x)`, `dim(x)`, and `class(x)`, not `attr(x, "names")`, `attr(x, "dim")`, and `attr(x, "class")`.
#### Names {#vector-names}
You can name a vector in three ways: \index{attributes|names}
* When creating it: `x <- c(a = 1, b = 2, c = 3)`.
* By modifying an existing vector in place:
`x <- 1:3; names(x) <- c("a", "b", "c")`. \indexc{names()}
Or:
`x <- 1:3; names(x)[[1]] <- c("a")`. \indexc{names()}
* By creating a modified copy of a vector:
`x <- setNames(1:3, c("a", "b", "c"))`. \indexc{setNames()}
Names don't have to be unique. However, character subsetting, described in [subsetting](#lookup-tables), is the most important reason to use names and it is most useful when the names are unique.
Not all elements of a vector need to have a name. If some names are missing when you create the vector, the names will be set to an empty string for those elements. If you modify the vector in place by setting some, but not all variable names, `names()` will return NA (more specifically, NA_character_) for them. If all names are missing, `names()` will return `NULL`.
```{r}
y <- c(a = 1, 2, 3)
names(y)
v <- c(1, 2, 3)
names(v) <- c('a')
names(v)
z <- c(1, 2, 3)
names(z)
```
You can create a new vector without names using `unname(x)`, or remove names in place with `names(x) <- NULL`.
### Factors
One important use of attributes is to define factors. A factor is a vector that can contain only predefined values, and is used to store categorical data. Factors are built on top of integer vectors using two attributes: the `class`, "factor", which makes them behave differently from regular integer vectors, and the `levels`, which defines the set of allowed values. \index{factors|(}
```{r}
x <- factor(c("a", "b", "b", "a"))
x
class(x)
levels(x)
# You can't use values that are not in the levels
x[2] <- "c"
x
# NB: you can't combine factors
c(factor("a"), factor("b"))
```
Factors are useful when you know the possible values a variable may take, even if you don't see all values in a given dataset. Using a factor instead of a character vector makes it obvious when some groups contain no observations:
```{r}
sex_char <- c("m", "m", "m")
sex_factor <- factor(sex_char, levels = c("m", "f"))
table(sex_char)
table(sex_factor)
```
Sometimes when a data frame is read directly from a file, a column you'd thought would produce a numeric vector instead produces a factor. This is caused by a non-numeric value in the column, often a missing value encoded in a special way like `.` or `-`. To remedy the situation, coerce the vector from a factor to a character vector, and then from a character to a double vector. (Be sure to check for missing values after this process.) Of course, a much better plan is to discover what caused the problem in the first place and fix that; using the `na.strings` argument to `read.csv()` is often a good place to start.
```{r}
# Reading in "text" instead of from a file here:
z <- read.csv(text = "value\n12\n1\n.\n9")
typeof(z$value)
as.double(z$value)
# Oops, that's not right: 3 2 1 4 are the levels of a factor,
# not the values we read in!
class(z$value)
# We can fix it now:
as.double(as.character(z$value))
# Or change how we read it in:
z <- read.csv(text = "value\n12\n1\n.\n9", na.strings=".")
typeof(z$value)
class(z$value)
z$value
# Perfect! :)
```
Unfortunately, most data loading functions in R automatically convert character vectors to factors. This is suboptimal, because there's no way for those functions to know the set of all possible levels or their optimal order. Instead, use the argument `stringsAsFactors = FALSE` to suppress this behaviour, and then manually convert character vectors to factors using your knowledge of the data. A global option, `options(stringsAsFactors = FALSE)`, is available to control this behaviour, but I don't recommend using it. Changing a global option may have unexpected consequences when combined with other code (either from packages, or code that you're `source()`ing), and global options make code harder to understand because they increase the number of lines you need to read to understand how a single line of code will behave. \indexc{stringsAsFactors}
While factors look (and often behave) like character vectors, they are actually integers. Be careful when treating them like strings. Some string methods (like `gsub()` and `grepl()`) will coerce factors to strings, while others (like `nchar()`) will throw an error, and still others (like `c()`) will use the underlying integer values. For this reason, it's usually best to explicitly convert factors to character vectors if you need string-like behaviour. In early versions of R, there was a memory advantage to using factors instead of character vectors, but this is no longer the case. \index{factors|)}
### Exercises
1. An early draft used this code to illustrate `structure()`:
```{r}
structure(1:5, comment = "my attribute")
```
But when you print that object you don't see the comment attribute.
Why? Is the attribute missing, or is there something else special about
it? (Hint: try using help.) \index{attributes!comment}
1. What happens to a factor when you modify its levels?
```{r, results = "none"}
f1 <- factor(letters)
levels(f1) <- rev(levels(f1))
```
1. What does this code do? How do `f2` and `f3` differ from `f1`?
```{r, results = "none"}
f2 <- rev(factor(letters))
f3 <- factor(letters, levels = rev(letters))
```
## Matrices and arrays {#matrices-and-arrays}
Adding a `dim` attribute to an atomic vector allows it to behave like a multi-dimensional __array__. A special case of the array is the __matrix__, which has two dimensions. Matrices are used commonly as part of the mathematical machinery of statistics. Arrays are much rarer, but worth being aware of. \index{arrays|(} \index{matrices|see{arrays}}
Matrices and arrays are created with `matrix()` and `array()`, or by using the assignment form of `dim()`:
```{r}
# Two scalar arguments to specify rows and columns
a <- matrix(1:6, ncol = 3, nrow = 2)
# One vector argument to describe all dimensions
b <- array(1:12, c(2, 3, 2))
# You can also modify an object in place by setting dim()
c <- 1:6
dim(c) <- c(3, 2)
c
dim(c) <- c(2, 3)
c
```
`length()` and `names()` have high-dimensional generalisations:
* `length()` generalises to `nrow()` and `ncol()` for matrices, and `dim()`
for arrays. \indexc{nrow()} \indexc{ncol()} \indexc{dim()}
* `names()` generalises to `rownames()` and `colnames()` for matrices, and
`dimnames()`, a list of character vectors, for arrays. \indexc{rownames()}
\indexc{colnames()} \indexc{dimnames()}
```{r}
length(a)
nrow(a)
ncol(a)
rownames(a) <- c("A", "B")
colnames(a) <- c("a", "b", "c")
a
length(b)
dim(b)
dimnames(b) <- list(c("one", "two"), c("a", "b", "c"), c("A", "B"))
b
```
`c()` generalises to `cbind()` and `rbind()` for matrices, and to `abind()` (provided by the `abind` package) for arrays. You can transpose a matrix with `t()`; the generalised equivalent for arrays is `aperm()`. \indexc{cbind()} \indexc{rbind()} \indexc{abind()} \indexc{aperm()}
You can test if an object is a matrix or array using `is.matrix()` and `is.array()`, or by looking at the length of the `dim()`. `as.matrix()` and `as.array()` make it easy to turn an existing vector into a matrix or array.
Vectors are not the only 1-dimensional data structure. You can have matrices with a single row or single column, or arrays with a single dimension. They may print similarly, but will behave differently. The differences aren't too important, but it's useful to know they exist in case you get strange output from a function (`tapply()` is a frequent offender). As always, use `str()` to reveal the differences. \index{arrays!1d}
```{r}
str(1:3) # 1d vector
str(matrix(1:3, ncol = 1)) # column vector
str(matrix(1:3, nrow = 1)) # row vector
str(array(1:3, 3)) # "array" vector
```
While atomic vectors are most commonly turned into matrices, the dimension attribute can also be set on lists to make list-matrices or list-arrays: \index{arrays!list-arrays} \index{list-arrays}
```{r}
l <- list(1:3, "a", TRUE, 1.0)
dim(l) <- c(2, 2)
l
```
These are relatively esoteric data structures, but can be useful if you want to arrange objects into a grid-like structure. For example, if you're running models on a spatio-temporal grid, it might be natural to preserve the grid structure by storing the models in a 3d array. \index{arrays|)}
### Exercises
1. What does `dim()` return when applied to a vector?
1. If `is.matrix(x)` is `TRUE`, what will `is.array(x)` return?
1. How would you describe the following three objects? What makes them
different to `1:5`?
```{r}
x1 <- array(1:5, c(1, 1, 5))
x2 <- array(1:5, c(1, 5, 1))
x3 <- array(1:5, c(5, 1, 1))
```
## Data frames {#data-frames}
A data frame is the most common way of storing data in R, and if [used systematically](http://vita.had.co.nz/papers/tidy-data.pdf) makes data analysis easier. Under the hood, a data frame is a list of equal-length vectors. This makes it a 2-dimensional structure, so it shares properties of both the matrix and the list. This means that a data frame has `names()`, `colnames()`, and `rownames()`, although `names()` and `colnames()` are the same thing. The `length()` of a data frame is the length of the underlying list and so is the same as `ncol()`; `nrow()` gives the number of rows. \index{data frames|(}
As described in [subsetting](#subsetting), you can subset a data frame like a 1d structure (where it behaves like a list), or a 2d structure (where it behaves like a matrix).
### Creation
You create a data frame using `data.frame()`, which takes named vectors as input:
```{r}
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
str(df)
```
Beware `data.frame()`'s default behaviour which turns strings into factors. Use `stringsAsFactors = FALSE` to suppress this behaviour: \indexc{stringsAsFactors}
```{r}
df <- data.frame(
x = 1:3,
y = c("a", "b", "c"),
stringsAsFactors = FALSE)
str(df)
```
### Testing and coercion
Because a `data.frame` is an S3 class, its type reflects the underlying vector used to build it: the list. To check if an object is a data frame, use `class()` or test explicitly with `is.data.frame()`:
```{r}
typeof(df)
class(df)
is.data.frame(df)
```
You can coerce an object to a data frame with `as.data.frame()`:
* A vector will create a one-column data frame.
* A list will create one column for each element; it's an error if they're
not all the same length.
* A matrix will create a data frame with the same number of columns and rows as the matrix.
### Combining data frames
You can combine data frames using `cbind()` and `rbind()`: \indexc{cbind()} \indexc{rbind()}
```{r}
cbind(df, data.frame(z = 3:1))
rbind(df, data.frame(x = 10, y = "z"))
```
When combining column-wise, the number of rows must match, but row names are ignored. When combining row-wise, both the number and names of columns must match. Use `plyr::rbind.fill()` to combine data frames that don't have the same columns.
It's a common mistake to try and create a data frame by `cbind()`ing vectors together. This doesn't work because `cbind()` will create a matrix unless one of the arguments is already a data frame. Instead use `data.frame()` directly:
```{r}
bad <- data.frame(cbind(a = 1:2, b = c("a", "b")))
str(bad)
good <- data.frame(a = 1:2, b = c("a", "b"),
stringsAsFactors = FALSE)
str(good)
```
The conversion rules for `cbind()` are complicated and best avoided by ensuring all inputs are of the same type.
### Special columns
Since a data frame is a list of vectors, it is possible for a data frame to have a column that is a list: \index{data frames!list in column}
```{r}
df <- data.frame(x = 1:3)
df$y <- list(1:2, 1:3, 1:4)
df
```
However, when a list is given to `data.frame()`, it tries to put each item of the list into its own column, so this fails:
```{r, error = TRUE}
data.frame(x = 1:3, y = list(1:2, 1:3, 1:4))
```
A workaround is to use `I()`, which causes `data.frame()` to treat the list as one unit:
```{r}
dfl <- data.frame(x = 1:3, y = I(list(1:2, 1:3, 1:4)))
str(dfl)
dfl[2, "y"]
```
`I()` adds the `AsIs` class to its input, but this can usually be safely ignored. \indexc{I()}
Similarly, it's also possible to have a column of a data frame that's a matrix or array, as long as the number of rows matches the data frame: \index{data frames!array in column}
```{r}
dfm <- data.frame(x = 1:3, y = I(matrix(1:9, nrow = 3)))
str(dfm)
dfm[2, "y"]
```
Use list and array columns with caution: many functions that work with data frames assume that all columns are atomic vectors. \index{data frames|)}
### Exercises
1. What attributes does a data frame possess?
1. What does `as.matrix()` do when applied to a data frame with
columns of different types?
1. Can you have a data frame with 0 rows? What about 0 columns?
## Answers {#data-structure-answers}
1. The three properties of a vector are type, length, and attributes.
1. The four common types of atomic vector are logical, integer, double
(sometimes called numeric), and character. The two rarer types are
complex and raw.
1. Attributes allow you to associate arbitrary additional metadata to
any object. You can get and set individual attributes with `attr(x, "y")`
and `attr(x, "y") <- value`; or get and set all attributes at once with
`attributes()`.
1. The elements of a list can be any type (even a list); the elements of
an atomic vector are all of the same type. Similarly, every element of
a matrix must be the same type; in a data frame, the different columns
can have different types.
1. You can make "list-array" by assigning dimensions to a list. You can
make a matrix a column of a data frame with `df$x <- matrix()`, or
using `I()` when creating a new data frame `data.frame(x = I(matrix()))`.