-
Notifications
You must be signed in to change notification settings - Fork 978
/
datatable-reshape.Rmd
255 lines (156 loc) · 10.1 KB
/
datatable-reshape.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
---
title: "Efficient reshaping using data.tables"
date: "`r Sys.Date()`"
output:
rmarkdown::html_document:
theme: spacelab
highlight: pygments
css : css/bootstrap.css
vignette: >
%\VignetteIndexEntry{Vignette Title}
%\VignetteEngine{knitr::rmarkdown}
\usepackage[utf8]{inputenc}
---
```{r, echo = FALSE, message = FALSE}
require(data.table)
require(RCurl)
knitr::opts_chunk$set(
comment = "#",
error = FALSE,
tidy = FALSE,
cache = FALSE,
collapse=TRUE)
```
This vignette discusses the default usage of reshaping functions `melt` (wide to long) and `dcast` (long to wide) for *data.tables* as well as the **new extended functionalities** of melting and casting on *multiple columns* available from `v1.9.6`.
***
```{r echo=FALSE}
options(width=100)
```
## Data
We will load the data sets directly within sections.
## Introduction
The `melt` and `dcast` functions for *data.tables* are extensions of the corresponding functions from the [reshape2](http://cran.r-project.org/web/packages/reshape2/index.html) package.
In this vignette, we will
1. first briefly look at the default *melting* and *casting* of *data.tables* to convert them from *wide* to *long* format and vice versa,
2. then look at scenarios where the current functionalities becomes cumbersome and inefficient,
3. and finally look at the new improvements to both `melt` and `dcast` methods for *data.tables* to handle multiple columns simultaneously.
The extended functionalities are in line with *data.table's* philosophy of performing operations efficiently and in a straightforward manner.
#### Note: {.bs-callout .bs-callout-info}
From `v1.9.6` on, you don't have to load `reshape2` package to use these functions for *data.tables*. You just need to load `data.table`. If you've to load `reshape2` for melting or casting matrices and/or data.frames, then make sure to load it *before* loading `data.table`.
## 1. Default functionality
### a) `melt`ing *data.tables* (wide to long)
Suppose we have a `data.table` (artificial data) as shown below:
```{r}
DT = fread("https://raw.githubusercontent.com/wiki/Rdatatable/data.table/data/melt_default.csv")
DT
## dob stands for date of birth.
str(DT)
```
#
#### - Convert `DT` to *long* form where each `dob` is a separate observation.
We could accomplish this using `melt()` by specifying `id.vars` and `measure.vars` arguments as follows:
```{r}
DT.m1 = melt(DT, id.vars = c("family_id", "age_mother"),
measure.vars = c("dob_child1", "dob_child2", "dob_child3"))
DT.m1
str(DT.m1)
```
#### {.bs-callout .bs-callout-info}
* `measure.vars` specify the set of columns we would like to collapse (or combine) together.
* We can also specify column *indices* instead of *names*.
* By default, `variable` column is of type `factor`. Set `variable.factor` argument to `FALSE` if you'd like to return a *character* vector instead. `variable.factor` argument is only available in `melt` from `data.table` and not in the [`reshape2` package](http://github.com/hadley/reshape).
* By default, the molten columns are automatically named `variable` and `value`.
* `melt` preserves column attributes in result.
#
#### - Name the `variable` and `value` columns to `child` and `dob` respectively
```{r}
DT.m1 = melt(DT, measure.vars = c("dob_child1", "dob_child2", "dob_child3"),
variable.name = "child", value.name = "dob")
DT.m1
```
#### {.bs-callout .bs-callout-info}
* By default, when one of `id.vars` or `measure.vars` is missing, the rest of the columns are *automatically assigned* to the missing argument.
* When neither `id.vars` nor `measure.vars` are specified, as mentioned under `?melt`, all *non*-`numeric`, `integer`, `logical` columns will be assigned to `id.vars`.
In addition, a warning message is issued highlighting the columns that are automatically considered to be `id.vars`.
### b) `Cast`ing *data.tables* (long to wide)
In the previous section, we saw how to get from wide form to long form. Let's see the reverse operation in this section.
#### - How can we get back to the original data table `DT` from `DT.m`?
That is, we'd like to collect all *child* observations corresponding to each `family_id, age_mother` together under the same row. We can accomplish it using `dcast` as follows:
```{r}
dcast(DT.m1, family_id + age_mother ~ child, value.var = "dob")
```
#### {.bs-callout .bs-callout-info}
* `dcast` uses *formula* interface. The variables on the *LHS* of formula represents the *id* vars and *RHS* the *measure* vars.
* `value.var` denotes the column to be filled in with while casting to wide format.
* `dcast` also tries to preserve attributes in result wherever possible.
#
#### - Starting from `DT.m`, how can we get the number of children in each family?
You can also pass a function to aggregate by in `dcast` with the argument `fun.aggregate`. This is particularly essential when the formula provided does not identify single observation for each cell.
```{r}
dcast(DT.m1, family_id ~ ., fun.agg = function(x) sum(!is.na(x)), value.var = "dob")
```
Check `?dcast` for other useful arguments and additional examples.
## 2. Limitations in current `melt/dcast` approaches
So far we've seen features of `melt` and `dcast` that are based on `reshape2` package, but implemented efficiently for *data.table*s, using internal `data.table` machinery (*fast radix ordering*, *binary search* etc..).
However, there are situations we might run into where the desired operation is not expressed in a straightforward manner. For example, consider the *data.table* shown below:
```{r}
DT = fread("https://raw.githubusercontent.com/wiki/Rdatatable/data.table/data/melt_enhanced.csv")
DT
## 1 = female, 2 = male
```
And you'd like to combine (melt) all the `dob` columns together, and `gender` columns together. Using the current functionalty, we can do something like this:
```{r}
DT.m1 = melt(DT, id = c("family_id", "age_mother"))
DT.m1[, c("variable", "child") := tstrsplit(variable, "_", fixed=TRUE)]
DT.c1 = dcast(DT.m1, family_id + age_mother + child ~ variable, value.var = "value")
DT.c1
str(DT.c1) ## gender column is character type now!
```
#### Issues {.bs-callout .bs-callout-info}
1. What we wanted to do was to combine all the `dob` and `gender` type columns together respectively. Instead we are combining *everything* together, and then splitting them again. I think it's easy to see that it's quite roundabout (and inefficient).
As an analogy, imagine you've a closet with four shelves of clothes and you'd like to put together the clothes from shelves 1 and 2 together (in 1), and 3 and 4 together (in 3). What we are doing is more or less to combine all the clothes together, and then split them back on to shelves 1 and 3!
2. The columns to *melt* may be of different types, as in this case (character and integer types). By *melting* them all together, the columns will be coerced in result, as explained by the warning message above and shown from output of `str(DT.c1)`, where `gender` has been converted to *character* type.
3. We are generating an additional column by splitting the `variable` column into two columns, whose purpose is quite cryptic. We do it because we need it for *casting* in the next step.
4. Finally, we cast the data set. But the issue is it's a much more computationally involved operation than *melt*. Specifically, it requires computing the order of the variables in formula, and that's costly.
#
In fact, `base::reshape` is capable of performing this operation in a very straightforward manner. It is an extremely useful and often underrated function. You should definitely give it a try!
## 3. Enhanced (new) functionality
### a) Enhanced `melt`
Since we'd like for *data.tables* to perform this operation straightforward and efficient using the same interface, we went ahead and implemented an *additional functionality*, where we can `melt` to multiple columns *simultaneously*.
#### - `melt` multiple columns simultaneously
The idea is quite simple. We pass a list of columns to `measure.vars`, where each element of the list contains the columns that should be combined together.
```{r}
colA = paste("dob_child", 1:3, sep="")
colB = paste("gender_child", 1:3, sep="")
DT.m2 = melt(DT, measure = list(colA, colB), value.name = c("dob", "gender"))
DT.m2
str(DT.m2) ## col type is preserved
```
#### - Using `patterns()`
Usually in these problems, the columns we'd like to melt can be distinguished by a common pattern. We can use the function `patterns()`, implemented for convenience, to provide regular expressions for the columns to be combined together. The above operation can be rewritten as:
```{r}
DT.m2 = melt(DT, measure = patterns("^dob", "^gender"), value.name = c("dob", "gender"))
DT.m2
```
That's it!
#### {.bs-callout .bs-callout-info}
* We can remove the `variable` column if necessary.
* The functionality is implemented entirely in C, and is therefore both *fast* and *memory efficient* in addition to being *straightforward*.
### b) Enhanced `dcast`
Okay great! We can now melt into multiple columns simultaneously. Now given the data set `DT.m2` as shown above, how can we get back to the same format as the original data we started with?
If we use the current functionality of `dcast`, then we'd have to cast twice and bind the results together. But that's once again verbose, not straightforward and is also inefficient.
#### - Casting multiple `value.var`s simultaneously
We can now provide **multiple `value.var` columns** to `dcast` for *data.tables* directly so that the operations are taken care of internally and efficiently.
```{r}
## new 'cast' functionality - multiple value.vars
DT.c2 = dcast(DT.m2, family_id + age_mother ~ variable, value.var = c("dob", "gender"))
DT.c2
```
#### {.bs-callout .bs-callout-info}
* Attributes are preserved in result wherever possible.
* Everything is taken care of internally, and efficiently. In addition to being fast, it is also very memory efficient.
#
#### Multiple functions to `fun.aggregate`: {.bs-callout .bs-callout-info}
You can also provide *multiple functions* to `fun.aggregate` to `dcast` for *data.tables*. Check the examples in `?dcast` which illustrates this functionality.
#
***