vignettes/datatable-reshape.Rmd

---
title: "Efficient reshaping using data.tables"
date: "`r Sys.Date()`"
output: 
  rmarkdown::html_document:
    theme: spacelab
    highlight: pygments
    css : css/bootstrap.css
vignette: >
  %\VignetteIndexEntry{Vignette Title}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r, echo = FALSE, message = FALSE}
require(data.table)
require(RCurl)
knitr::opts_chunk$set(
  comment = "#",
  error = FALSE,
  tidy = FALSE,
  cache = FALSE,
  collapse=TRUE)
```

This vignette discusses the default usage of reshaping functions `melt` (wide to long) and `dcast` (long to wide) for *data.tables* as well as the **new extended functionalities** of melting and casting on *multiple columns* available from `v1.9.6`.

***

```{r echo=FALSE}
options(width=100)
```

## Data

We will load the data sets directly within sections. 

## Introduction

The `melt` and `dcast` functions for *data.tables* are extensions of the corresponding functions from the [reshape2](http://cran.r-project.org/web/packages/reshape2/index.html) package.  

In this vignette, we will

1. first briefly look at the default *melting* and *casting* of *data.tables* to convert them from *wide* to *long* format and vice versa,

2. then look at scenarios where the current functionalities becomes cumbersome and inefficient, 

3. and finally look at the new improvements to both `melt` and `dcast` methods for *data.tables* to handle multiple columns simultaneously.
    
The extended functionalities are in line with *data.table's* philosophy of performing operations efficiently and in a straightforward manner.

#### Note: {.bs-callout .bs-callout-info}

From `v1.9.6` on, you don't have to load `reshape2` package to use these functions for *data.tables*. You just need to load `data.table`. If you've to load `reshape2` for melting or casting matrices and/or data.frames, then make sure to load it *before* loading `data.table`.


## 1. Default functionality

### a) `melt`ing *data.tables* (wide to long)

Suppose we have a `data.table` (artificial data) as shown below:

```{r}
DT = fread("https://raw.githubusercontent.com/wiki/Rdatatable/data.table/data/melt_default.csv")
DT 
## dob stands for date of birth.

str(DT)
```


#

#### - Convert `DT` to *long* form where each `dob` is a separate observation.

We could accomplish this using `melt()` by specifying `id.vars` and `measure.vars` arguments as follows:

```{r}
DT.m1 = melt(DT, id.vars = c("family_id", "age_mother"), 
                measure.vars = c("dob_child1", "dob_child2", "dob_child3"))
DT.m1
str(DT.m1)
```

#### {.bs-callout .bs-callout-info}

* `measure.vars` specify the set of columns we would like to collapse (or combine) together.

* We can also specify column *indices* instead of *names*.

* By default, `variable` column is of type `factor`. Set `variable.factor` argument to `FALSE` if you'd like to return a *character* vector instead. `variable.factor` argument is only available in `melt` from `data.table` and not in the [`reshape2` package](http://github.com/hadley/reshape).

* By default, the molten columns are automatically named `variable` and `value`.

* `melt` preserves column attributes in result.

#

#### - Name the `variable` and `value` columns to `child` and `dob` respectively
 

```{r}
DT.m1 = melt(DT, measure.vars = c("dob_child1", "dob_child2", "dob_child3"), 
               variable.name = "child", value.name = "dob")
DT.m1
```

#### {.bs-callout .bs-callout-info}

* By default, when one of `id.vars` or `measure.vars` is missing, the rest of the columns are *automatically assigned* to the missing argument. 

* When neither `id.vars` nor `measure.vars` are specified, as mentioned under `?melt`, all *non*-`numeric`, `integer`, `logical` columns will be assigned to `id.vars`.

    In addition, a warning message is issued highlighting the columns that are automatically considered to be `id.vars`.

### b) `Cast`ing *data.tables* (long to wide)

In the previous section, we saw how to get from wide form to long form. Let's see the reverse operation in this section.

#### - How can we get back to the original data table `DT` from `DT.m`? 

That is, we'd like to collect all *child* observations corresponding to each `family_id, age_mother` together under the same row. We can accomplish it using `dcast` as follows:

```{r}
dcast(DT.m1, family_id + age_mother ~ child, value.var = "dob")
```

#### {.bs-callout .bs-callout-info}

* `dcast` uses *formula* interface. The variables on the *LHS* of formula represents the *id* vars and *RHS* the *measure*  vars. 

* `value.var` denotes the column to be filled in with while casting to wide format.

* `dcast` also tries to preserve attributes in result wherever possible.
 
# 

#### - Starting from `DT.m`, how can we get the number of children in each family?
 
You can also pass a function to aggregate by in `dcast` with the argument `fun.aggregate`. This is particularly essential when the formula provided does not identify single observation for each cell.

```{r}
dcast(DT.m1, family_id ~ ., fun.agg = function(x) sum(!is.na(x)), value.var = "dob")
```

Check `?dcast` for other useful arguments and additional examples.

## 2. Limitations in current `melt/dcast` approaches

So far we've seen features of `melt` and `dcast` that are based on `reshape2` package, but implemented efficiently for *data.table*s, using internal `data.table` machinery (*fast radix ordering*, *binary search* etc..).

However, there are situations we might run into where the desired operation is not expressed in a straightforward manner. For example, consider the *data.table* shown below:

```{r}
DT = fread("https://raw.githubusercontent.com/wiki/Rdatatable/data.table/data/melt_enhanced.csv")
DT
## 1 = female, 2 = male 
``` 

And you'd like to combine (melt) all the `dob` columns together, and `gender` columns together. Using the current functionalty, we can do something like this:

```{r}
DT.m1 = melt(DT, id = c("family_id", "age_mother"))
DT.m1[, c("variable", "child") := tstrsplit(variable, "_", fixed=TRUE)]
DT.c1 = dcast(DT.m1, family_id + age_mother + child ~ variable, value.var = "value")
DT.c1

str(DT.c1) ## gender column is character type now!
```

#### Issues {.bs-callout .bs-callout-info}

1. What we wanted to do was to combine all the `dob` and `gender` type columns together respectively. Instead we are combining *everything* together, and then splitting them again. I think it's easy to see that it's quite roundabout (and inefficient).

    As an analogy, imagine you've a closet with four shelves of clothes and you'd like to put together the clothes from shelves 1 and 2 together (in 1), and 3 and 4 together (in 3). What we are doing is more or less to combine all the clothes together, and then split them back on to shelves 1 and 3!

2. The columns to *melt* may be of different types, as in this case (character and integer types). By *melting* them all together, the columns will be coerced in result, as explained by the warning message above and shown from output of `str(DT.c1)`, where `gender` has been converted to *character* type.

3. We are generating an additional column by splitting the `variable` column into two columns, whose purpose is quite cryptic. We do it because we need it for *casting* in the next step.

4. Finally, we cast the data set. But the issue is it's a much more computationally involved operation than *melt*. Specifically, it requires computing the order of the variables in formula, and that's costly.

#

In fact, `base::reshape` is capable of performing this operation in a very straightforward manner. It is an extremely useful and often underrated function. You should definitely give it a try!

## 3. Enhanced (new) functionality
		
### a) Enhanced `melt`

Since we'd like for *data.tables* to perform this operation straightforward and efficient using the same interface, we went ahead and implemented an *additional functionality*, where we can `melt` to multiple columns *simultaneously*.

#### - `melt` multiple columns simultaneously

The idea is quite simple. We pass a list of columns to `measure.vars`, where each element of the list contains the columns that should be combined together.

```{r}
colA = paste("dob_child", 1:3, sep="")
colB = paste("gender_child", 1:3, sep="")
DT.m2 = melt(DT, measure = list(colA, colB), value.name = c("dob", "gender"))
DT.m2

str(DT.m2) ## col type is preserved
``` 

#### - Using `patterns()`

Usually in these problems, the columns we'd like to melt can be distinguished by a common pattern. We can use the function `patterns()`, implemented for convenience, to provide regular expressions for the columns to be combined together. The above operation can be rewritten as:

```{r}
DT.m2 = melt(DT, measure = patterns("^dob", "^gender"), value.name = c("dob", "gender"))
DT.m2
```

That's it!

#### {.bs-callout .bs-callout-info}

* We can remove the `variable` column if necessary.

* The functionality is implemented entirely in C, and is therefore both *fast* and *memory efficient* in addition to being *straightforward*.

### b) Enhanced `dcast`

Okay great! We can now melt into multiple columns simultaneously. Now given the data set `DT.m2` as shown above, how can we get back to the same format as the original data we started with?
 
If we use the current functionality of `dcast`, then we'd have to cast twice and bind the results together. But that's once again verbose, not straightforward and is also inefficient. 

#### - Casting multiple `value.var`s simultaneously

We can now provide **multiple `value.var` columns** to `dcast` for *data.tables* directly so that the operations are taken care of internally and efficiently.

```{r}
## new 'cast' functionality - multiple value.vars
DT.c2 = dcast(DT.m2, family_id + age_mother ~ variable, value.var = c("dob", "gender"))
DT.c2
```

#### {.bs-callout .bs-callout-info}

* Attributes are preserved in result wherever possible.

* Everything is taken care of internally, and efficiently. In addition to being fast, it is also very memory efficient.

# 

#### Multiple functions to `fun.aggregate`: {.bs-callout .bs-callout-info}

You can also provide *multiple functions* to `fun.aggregate` to `dcast` for *data.tables*. Check the examples in `?dcast` which illustrates this functionality.

# 

***