data_transformation.qmd

# Data transformation

Plotting a graph is easy. You just need to find the right library with the right function. However, it is sometimes not so easy to get your data into the form desired to generate a graph. In this chapter, we will cover some basic techniques in tidying data with `ggplot2`.

## What is tidy data?

Here's the definition of Tidy Data given by Hadley Wickham:

>A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data:
>
>* Each variable forms a column.
>
>* Each observation forms a row.
>
>* Each observational unit forms a value in the table.
>
*See [r4ds on tidy data](https://r4ds.had.co.nz/tidy-data.html){target="_blank"} for more info.*

**What are the advantages of tidy data?**

* Uniformity : It is easier to learn the tools that work with the data because they have a consistent way of storing data.

* Most built-in R functions work with vectors of values. Thus, having variables as columns/vectors allows R’s vectorized nature to shine.

**Take a look at the following data and can you tell whether this data is messy or not?**
```{r,echo=FALSE}
library(dplyr)
library(tidyr)
library(tibble)
library(readr)
library(MASS)

head(mtcars)
```

* In this data set,all the variables are parameters of cars. This means that they are not different variables, but are values of a common variable.

To transform the data, we use `pivot_longer`.

## `pivot_longer`

```{r}
mtcars |>  
  rownames_to_column("carname") |>
  pivot_longer(cols = !carname, names_to = "Parameters",values_to = "value") |>
  head()
```
Follow the simple two steps:

* Identify the column you want to keep as is. In this case, we want all variables to match car names.

* **Additionally**, notice that in the original data set, `carname` acts as the index of the data set. You would want to convert the index to a column using `rownames_to_column`.

* Create meaningful names for the two new columns. In our case, straightforwardly, column names go into `parameters` and corresponding values go into `value`.


## `pivot_wider`

`pivot_wider` is just the opposite of `pivot_longer`. Using `pivot_wider`, we can transform our tidy data back into the messy form as all distinct values in `Parameters` will become column names. 

`pivot_wider` is often used in the case such that one observation being recorded over multiple rows. Consider the following example:

```{r, echo=FALSE}
example <- tibble(
  Country = c("USA","USA","Canada","Canada"),
  Type = c("Case","Death","Case","Death"),
  Number = c(4,3,2,1))

example
```

```{r}
example |> pivot_wider(names_from = Type, values_from = Number)
```

It would make much more sense if `Case` and `Death` are separate features.

The main focus of this chapter is `pivot_longer` and `pivot_wider`. However, other fundamental functions in `dplyr` are also very important in manipulating your data set. In the following section, we will give an overview of the basics.

## Basic transformation functions

For the following sections, we will use data set `biopsy` from `MASS` for demonstration purpose.

```{r echo = FALSE}
head(biopsy)
```

### `rename`

Upon getting the data, we noticed that the names of the columns are very vague. After reading the documentation, we wanted to change the names of the column so that the viewer gets a sense of the values they’re referring to. We use `rename` to modify the column names.

```{r}
biopsy_new <- rename(biopsy,
       thickness = V1,cell_size = V2,
       cell_shape = V3, marg_adhesion = V4,
       epithelial_cell_size = V5, bare_nuclei = V6,
       chromatin = V7, norm_nucleoli = V8, mitoses = V9)

head(biopsy_new)
```

### `select`

`select` is column-wise operation. Specifically, only the columns that are specified will be returned.

In the biopsy data, we do not require the variables “chromatin” and “mitoses”. So, let’s drop them using a minus sign:

```{r}
#selecting all except the columns chromatin and mitoses
biopsy_new <- biopsy_new |> dplyr::select(-chromatin,-mitoses)

head(biopsy_new,5)
```

### `mutate`

The **mutate** function computes new variables from the already existing variables and adds them to the dataset. It gives information that the data already contained but was never displayed.

The variable `bare_nucleus` contains the values  from 1.00 to 10.00. If we wish to normalize the variable, we can use the mutate function:

```{r}
#normalize the bare nuclei values 
maximum_bare_nuclei<-max(biopsy_new$bare_nuclei,na.rm=TRUE)
biopsy_new <- biopsy_new |> mutate(bare_nuclei=bare_nuclei/maximum_bare_nuclei)

head(biopsy_new,5)
```

In some situations, your new variable might involve conditions. You can consider using `case_when` combined with `mutate`.

### `filter`

**filter** is a row-wise operation. It returns a modified copy that contains only certain rows. This function filters rows based on conditions supplied in its argument. The filter function takes the data frame as the first argument. The next argument contains one or more logical tests. The rows/observations that pass these logical tests are returned in the result of the filter function.

For our example, say we only want the data of those tumor cells that have clump thickness greater than 6.

```{r}
biopsy_new <- biopsy_new |> filter(thickness>5.5)

head(biopsy_new,5)
```

If you want to filter using multiple conditions, use logical operators: &(And), |(Or).

### `arrange`

**Arrange** reorders the rows of the data based on their contents in the ascending order by default.

Say in our example, the doctors would want to view the data in the order of the cell size of the tumor.

```{r}
#arrange in the order of V2:cell size
head(arrange(biopsy_new,cell_size))
```

In case you want your data in descending order, wrap your variable with `desc()`.

### `group_by` and `summarize`

The **summarize** function uses the data to create a new data frame with the summary statistics such as minimum, maximum, average, and so on. These statistical functions must be aggregate functions which take a vector of values as input and output a single value.

The **group_by** function groups the data by the values of the variables. This, along with summarize, makes observations about groups of rows of the dataset.

The doctors would want to see the maximum cell size and the thickness for each of the classes: benign and malignant. This can be done by grouping the data by class and finding the maximum of the required variables:

```{r}
biopsy_grouped <- group_by(biopsy_new,class)
summarize(biopsy_grouped, max(thickness), mean(cell_size), var(norm_nucleoli))
```

### `slice_max` (`slice_min`)

The **slice_max** function helps you to find the top n values of a specific column. Suppose we now want to see the top five biopsies with the biggest thickness. Notice in this case, since we have more than five rows with thickness 10, all of them are selected (for neatness, we only show first several rows).

```{r}
biopsy_new |> 
  slice_max(order_by = thickness,n=5) |> 
  head()
```

### join

Sometimes you will need to combine two data sets and this is when function join comes into play. There are four types of joins provided by `dplyr` and take a look at the following example. 

```{r}
# Main dataset
s77 <- data.frame(state.x77) |> 
  rownames_to_column("state") |>
  dplyr::select(-c(Illiteracy))

head(s77)
```

```{r}
# https://www.cookpolitical.com/2020-national-popular-vote-tracker
partyinfo <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vS3Z8Rq9xqOLISwoKdK0n6CFLBuPSCoXbbLeY8vhi-rzFS3ZFNEtR0BCdEbHcS-2Tlh5aPcnZbwBLao/pub?output=csv")
partyinfo <- partyinfo |>
  dplyr::select(state, called)

head(left_join(s77, partyinfo))
```

**s77** contains statistics of 50 states in the US and **partyinfo** holds information whether a state is democratic or republican. The two data sets are joined on common feature `state`. If you want to join on features with different names, specify using the argument `by`. For detailed explanations of differnet types of joins, refer to the [documentation](https://dplyr.tidyverse.org/reference/mutate-joins.html).

### Cases to tables

When you want to perform a Chi-squared test or create a paired mosaic plot, your data has to follow a table format. For example, the following table is in the correct format. The columns are anxiety statues and rows are class years. 

```{r,echo=FALSE}
library(Lock5withR)

e_1 <- SleepStudy |> 
  group_by(ClassYear,AnxietyStatus) |>
  summarise(n = n()) |>
  pivot_wider(names_from = AnxietyStatus, values_from = n)

e_1mat <- as.matrix(e_1[,2:4])
rownames(e_1mat) <- e_1$ClassYear

e_1mat
```

The following example demonstrates how you can convert cases to tables. Notice the starting data has a count for each of the categorical combinations.

```{r}
# watch out: summarise
rebates <- read_csv("https://data.ny.gov/resource/thd2-fu8y.csv")
rebate_counts <- rebates |> group_by(make, ev_type) |> 
  summarize(Freq = n())
head(rebate_counts)
```

By using `xtabs`, we are able to transform our data into a table ready for Chi-squared test or paired mosaic plot.

```{r}
head(xtabs(Freq ~ ., data = rebate_counts))
```