-
Notifications
You must be signed in to change notification settings - Fork 6
/
EDA2.Rmd
401 lines (201 loc) · 16.3 KB
/
EDA2.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
---
title: "Exploratory Data Analysis 2"
subtitle: "R Notes"
author: "Luc Anselin and Grant Morrison^[University of Chicago, Center for Spatial Data Science -- anselin@uchicago.edu,morrisonge@uchicago.edu]"
date: "08/08/2018"
output:
html_document:
fig_caption: yes
self_contained: no
toc: yes
toc_depth: 4
css: tutor.css
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
<br>
## Introduction
This notebook cover the functionality of the [Exploratory Data Analysis 2](https://geodacenter.github.io/workbook/2b_eda_multi/lab2b.html) section of the GeoDa workbook. We refer to that document for details on the methodology, references, etc. The goal of these notes is to approximate as closely as possible the operations carried out using GeoDa by means of a range of R packages.
The notes are written with R beginners in mind, more seasoned R users can probably skip most of the comments
on data structures and other R particulars. Also, as always in R, there are typically several ways to achieve a specific objective, so what is shown here is just one way that works, but there often are others (that may even be more elegant, work faster, or scale better).
For this notebook, we will use socioeconomic data about NYC from the GeoDa website. Our goal in this lab is show how to implement exploratory data analysis methods with three or more variables.
### Objectives
After completing the notebook, you should know how to carry out the following tasks:
- make a scatterplot matrix
- add different types of regression to a plot
- make a bubble plot
- make a 3d scatterplot
- make a parallel coordinates plot
- make conditional plots based off categorical variables
#### R Packages used
- **ggplot2**: To make statistical plots. We use this rather than base R for increased functionality and more aesthetically pleasing plots.
- **GGally**: To make a scatterplot matrix.
- **scatterplot3d**: Used to make a static 3d scattterplot
- **plotly**: Used to make an interactive 3d plot and a parallel coordinates plot.
#### R Commands used
Below follows a list of the commands used in this notebook. For further details
and a comprehensive list of options, please consult the
[R documentation](https://www.rdocumentation.org).
- **Base R**: `install.packages`, `library`, `read.csv`, `setwd`, `head`, `which`, `names`
- **ggplot2**: `ggplot`, `geom_point`, `ggtitle`, `labs`, `geom_smooth`, `geom_histogram`, `geom_smooth`
- **GGally**: `ggpairs`, `ggparcoord`
- **scatterplot3d**: `scatterplot3d`
- **plotly**: `plot_ly`, `add_markers`, `layout`
`
## Preliminaries
Before starting, make sure to have the latest version of R and of packages that are compiled for the matching version of R (this document was created using R 3.5.1 of 2018-07-02). Also, optionally, set a working directory, even though we will not
actually be saving any files.^[Use `setwd(directorypath)` to specify the working directory.]
### Load packages
First, we load all the required packages using the `library` command. If you don't have some of these in your system, make sure to install them first as well as
their dependencies.^[Use
`install.packages(packagename)`.] You will get an error message if something is missing. If needed, just install the missing piece and everything will work after that.
```{r}
library(foreign)
library(ggplot2)
library(GGally)
library(scatterplot3d)
library(plotly)
```
## Obtaining the Data from the GeoDa website
To get the data for this notebook, you will and to go to [NYC Data](https://geodacenter.github.io/data-and-lab/nyc/) The download format is a zipfile, so you will need to unzip it by double clicking on the file in your file finder. From there move the resulting folder titled: nyc into your working directory to continue. Once that is done, you can use the base R function: `read.csv` to read the data into your R environment. There are faster table reading functions, but for small datasets such as ours, `read.csv` is sufficient.
```{r}
nyc.data <- read.csv("nyc.csv")
head(nyc.data)
```
## Scatterplot Matrix
The first multivariate data plot we will be making is a scatterplot matrix. A scatterplot matrix is a plot that shows many scatterplots with different variables and typically will show the distribution of the data in some form along the diagonal of the matrix. To build this, we use the **GGally** library, which builds on the **ggplot2** library to make more intricate plots. We use the `ggpairs()` function to build our scatterplot matrices. The key inputs for this function are the dataset, the desired variables, and specifications for the upper and lower parts of the matrix. With the `upper =` and `lower =` specifications we can tell the function what types of regression we want and wheter the display should be a regresssion plot or a correlation statistic. We first do a line of best for the bottom part of the matrix and allow the default correlation statistic for the top part. In our `lower =` we use a `list()` because there are more specifications that can be give to `lower =`
```{r}
ggpairs(nyc.data[, c("kids2000", "pubast00", "hhsiz00")], lower = list(continuous = "smooth"))
```
Now, we provide the specification for the `upper =` as well as `lower =` to get a matrix full of regression plots, rather than the default correlation statistic. Both `upper =` and `lower =` follow the same form. `continuous = "smooth"` gives us a line of best fit in our scatterplots.
```{r}
ggpairs(nyc.data[, c("kids2000", "pubast00", "hhsiz00")], lower = list(continuous = "smooth"), upper = list(continuous = "smooth"))
```
We can also add a loess regression line to our plots too. The only change is instead of `continuous = "smooth"`, we input `continuous = "smooth_loess"`
```{r}
ggpairs(nyc.data[, c("kids2000", "pubast00", "hhsiz00")], lower = list(continuous = "smooth_loess"), upper = list(continuous = "smooth_loess"))
```
## Bubble Plots
A bubble plot is a scatterplot where the size of the bubble is a measure of a third variable. It is a way of exploring relationships between 3 variables and can be done for 4 if you add a fill color to the bubbles.
### Three variables
We will use the **ggplot2** library to make this plot. We first start with the `ggplot()` function and specified the dataset. We then specify the axis and bubble size using `aes()`. For a bubble plot we need to input a variable for the bubble size measure with `size =` inside of `aes()`. Now need use `geom_point()` to add the points to the plot. `ggtitle()` adds a title to the plot. `labs()` allows us to label the x and y axises for increases readability.
```{r}
ggplot(data = nyc.data, aes(x = kids2000, y = pubast00, size = rent2002)) +
geom_point(shape = 21) +
ggtitle("NYC socioeconomic plot") +
labs(x = "% households with kids under 18", y = "% households receiving gov. assisstence")
```
### Four variables
We can add a fill color to our plot to show relationships between 4 variables on our bubble plot. This is simple to add, we just set `fill =` to and variable inside of `aes()`
```{r}
ggplot(data = nyc.data, aes(x = kids2000, y = pubast00, size = rent2002, fill = rentpct02)) +
geom_point(shape = 21) +
ggtitle("NYC socioeconomic plot") +
labs(x = "% households with kids under 18", y = "% households receiving gov. assisstence")
```
## 3D Scatterplot
Our next plot is a 3D scatterplot. A 3D scatterplot is a scatterplot that shows relationships between 3 variables rather than 2 through 3D graphics. We will start by making static plots with the **scatterplot3d** library. We will then move on to interactive 3d scatterplots with the **plotly** library. The interactivity added by the **plotly** library makes it much easier to observe relationships in the data because we will be able to rotate the plot and get views from different angles, whereas in the static option we cannot necesarily get the full picture. T
### Static Plots
We will start with the **scatterplot3d** package to make a static 3d scatterplot. It is very simple to make the bare bones 3dscatterplot, we just specify the x, y, and z variables inside of the `scatterplot3d()` function with `x =`, `y =', and `z =`.
```{r}
scatterplot3d(x = nyc.data$kids2000, y = nyc.data$pubast00, z = nyc.data$rentpct02)
```
The 3d plot we made above is not very visually pleasing, so we can use some of the `scatterplot3d()` parameters to improve upon this. Fisrt we add a title with the `main =` parameter. Then we label the axises with `xlab =`, `ylab =`, and `zlab =`. The `pch =` parameter allows us to pick a shape for the point on the plot, as we have seen above the default is an open circle. We use a square which corresponds with the number 15. Finally we can change the color of the points with the `color =` parametet.
```{r}
scatterplot3d(x = nyc.data$kids2000, y = nyc.data$pubast00, z = nyc.data$rentpct02,
main = "NYC 3d Scatterplot",
xlab = "% households with kids",
ylab = "% household with gov. assistence",
zlab = "rent percent",
pch = 15,
color = "red")
```
### Interactive Plots
Now we move on to interactive 3d scatterplots. This can very useful in the visualization of the data. With **plotly** functionality, we can rotate the plot to get different angles on the 3d scatterplot and we can hover over each data point and get its coordinate information.
We begin with the `plot_ly()` function, which like the `ggplot()` function, we specify our dataset and the variables for the axises. We then use other functions from the **plotly** library to make the plot more readable. A key difference between the code format of **plotly** and **ggplot2** is that we use the pipe operator `%>%` in the additional add ons to our plot. This operator makes the preceding statement an argument in the following statement which is very useful in code readability.
```{r}
plot_ly(nyc.data, x = ~kids2000, y = ~pubast00, z = ~rent2002) %>%
add_markers() %>%
layout(scene = list(xaxis = list(title = '% household kids'),
yaxis = list(title = '% gov. asssistence'),
zaxis = list(title = 'rent2002')))
```
We can examine an additional variable with this visual. All we have to do is pick a variable for `color =` in the `plot_ly()` function.
```{r}
plot_ly(nyc.data, x = ~kids2000, y = ~pubast00, z = ~rent2002, color = ~rentpct02) %>%
add_markers() %>%
layout(scene = list(xaxis = list(title = '% household kids'),
yaxis = list(title = '% gov. asssistence'),
zaxis = list(title = 'rent2002')))
```
## Parallel Coordinates Plot
A Parallel Coordinates Plot or PCP takes to or more variables where each variable has a line and each observation has a point on each line. The points coresponding to an observation are connected by a line. These plots can be useful in identifying clusters, which are groups of lines that follow similar paths. These plots can be used to show relationships between many variables unlike the plots we have demonstrated up until now.
### Static Plot
To make our first PCP, we will use the **GGally** library to build a static plot. We will use the `ggparcoord()` function to make this plot. The parameters need for this function are the dataset and the variable columns listed in a vector. This function requires the indices of our variables as inputs. We can get these with the `which()` function, where we return the index that has a column name equal to our input name. We use the `==` operator to test if a name is equal or in other words the same as our input name, then `which()` will return the index. This is better than opening the data table and counting the columns, as it is faster and leaves out human error.
```{r}
a <- which(names(nyc.data)=="kids2000")
b <- which(names(nyc.data)=="pubast00")
c <- which(names(nyc.data)=="rent2002")
d <- which(names(nyc.data)=="rentpct02")
# alternatively could use select to make a new dataframe of desired variables,
# then use it as the only input for ggparcoord
ggparcoord(data = nyc.data, columns = c(a,b,c,d))
```
### Interactive Plot
To make our PCP interactive, we will use the **plotly** library. We will be able to select portions of each variable's line to highlight the connecting lines. We will also be able to move the variable lines to the right and left. This is useful when many of the lines are bunched up and patterns aren't quite visual. For this plot, we just need the `plotly()` function. We need to specify the dataset, the type of plot, and the dimensions. The most difficult to understand part of this code is the `dimension =` part as we have to use nested lists to everything together. This consist of a list of variables, then a list of specifications for each variables. in our case it is not too much, just the name of the axis and the variable specification.
```{r}
plot_ly(nyc.data,type = 'parcoords',
dimensions = list(
list(
label = '% kids', values = ~kids2000),
list(
label = '% gov. assisstence', values = ~pubast00),
list(
label = 'rent2002', values = ~rent2002),
list(
label = 'rent %', values = ~rentpct02)
)
)
```
## Conditional Plots
Conditional plots allow us the assess relationships between two or more variables. Multiple graphs are made based on the conditioning results of one or two variables.
For our exploratory analysis in R we need a categorical variable in order for the function to work. GeoDa supports an option that automatically assess the distribution of the data and uses it for means of conditioning in the conditional plots. So in other words it can a quantitative variables. Functionality such as this can be gain in R, but you will need to build a function that assigns a category to values of a given observation based on their position relative to the mean or distribution.
We will be doing something more simple. We will make a conditional plots based on a categories I have assigned to our dataset below. These are made up and only have the purpose of demonstrating how these plots are made. **cond1** is our first conditioning variable and **cond2** is our second.
```{r}
nyc.data$cond1 <- "lower"
nyc.data$cond2 <- "first"
nyc.data$cond1[27:55] <- "upper"
nyc.data$cond2[0:12] <- "second"
nyc.data$cond2[42:55] <- "second"
```
### Conditional Scatterplot
#### Linear Regression
To make our conditional scatterplot, we will use the `ggplot()` function, which is apart of the **ggplot2** library. To build out plot, we start with how we would for a scatterplot in **ggplo2** with the `ggplot()` function where we specify the dataset and the axis assignments with `aes()`. AFter this, we add the point layer with `geom_point()` and the regression line with `geom_smooth()`. Now to get the conditional plot we need the `facet_wrap()` function, where we input a vector of our conditioning variables. In our case, we will get four plots because we have two conditioning variables. One where cond1 is "lower" and cond2 is "first", one where cond1 is "lower" and cond2 is "second", one where cond1 is "upper" and cond2 first, and one where cond1 is "upper" and cond2 is "second". With this, we can look at scatterplots for different categorical variables.
```{r}
ggplot(nyc.data, aes(x = kids2000, y = pubast00)) +
geom_point() +
geom_smooth(method="lm") +
facet_wrap(c(~cond1, ~cond2))
```
#### loess regression
Here we just change `method =` to "loess" to get a loess regression instead of a line of best fit
```{r}
ggplot(nyc.data, aes(x = kids2000, y = pubast00)) +
geom_point() +
geom_smooth(method="loess") +
facet_wrap(c(~cond1, ~cond2))
```
### Conditional Histogram
The format for making a conditional histogram is similar to that for a conditional scatterplot. The only changes are that we list one variable instead of two in the case of the scatterplot and instead of `geom_smooth()`, we use `geom_histogram()` and the `aes()` is only one variable.
```{r}
ggplot(nyc.data, aes(kids2000)) +
geom_histogram() +
facet_wrap(c(~cond1, ~cond2))
```
For our histograms, you can change the number of bins by specifying `binwidth =` inside of the `geom_histogram()` function
```{r}
ggplot(nyc.data, aes(kids2000)) +
geom_histogram(binwidth = 20) +
facet_wrap(c(~cond1, ~cond2))
```