-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathggplot2_Scatterplots.Rmd
190 lines (152 loc) · 7.3 KB
/
ggplot2_Scatterplots.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
---
### ggplot2 Series 1 - Scatterplots
output: html_document
---
By: Thalia
```{r global_options, include=FALSE}
knitr::opts_chunk$set(warning=FALSE, message=FALSE)
```
```{r, warning = FALSE}
# Load libraries
library(ggplot2)
library(ggfortify)
library(gridExtra) # Use to arrange the ggplots
```
ggplot only wok with dataframes but not individual vectors. All the data needed to make the plot is contained within the dataframe. The dataset we will use for this tutorial is from Kaggle: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data
```{r}
# Load data for demonstration
house_data <- read.csv('house_price.csv', header = TRUE)
head(house_data[c('GrLivArea', 'OverallCond', 'SalePrice', 'GarageType')], 3)
```
Columns we will use in this tutorial:
- GrLivArea: Above grade (ground) living area square feet
- OverallCond: Overall condition rating
- SalePrice: the property's sale price in dollars
- GarageType: Type of Garage location
#### Basic Syntax
```{r}
# Initialize ggplot
gg_init <- ggplot(house_data, aes(x=GrLivArea, y=SalePrice))
```
Note ggplot itself does not generate any plot. If you print out the ggplot object we just defined, it will draw an blank plot. with **aes** we told ggplot what is the x and y axis we want to use for our plot. To make a scatterplot, we need to add an geom layter on top of the initial blank plot.
```{r, fig.width=5, fig.height=3, fig.align='center'}
# Basic syntax of scatterplots
gg_init + geom_point()
```
Let's turn off the 1e+06 notation
```{r, fig.width=5, fig.height=3, fig.align='center'}
options(scipen=999) # turn off scientific notation like 1e+06
gg_init + geom_point()
```
Now we get a scatter plot. We can further customize it.
#### Change the shape of dots
```{r, fig.width=22, fig.height=6, fig.align='center'}
# Basic syntax of scatterplots
gg1 <- gg_init + geom_point(shape=1)
gg2 <- gg_init + geom_point(shape=5, size=2)
gg3 <- gg_init + geom_point(shape=17, size=1.5, colour='steelblue')
grid.arrange(gg1, gg2, gg3, ncol=3)
```
Some other available shapes are listed below:
```{r, echo = FALSE, out.width="500px", fig.align='center', fig.cap='...'}
knitr::include_graphics('geom_point.png')
```
#### Add smoothed conditional mean to the scatter plots
For detailed info check: https://ggplot2.tidyverse.org/reference/geom_smooth.html
```{r, fig.width=22, fig.height=6, fig.align='center'}
# Add regression lines to scatterplots
gg4 <- gg_init + geom_point(shape=1) + geom_smooth() # by default use method 'loess' and formula 'y ~ x'
gg5 <- gg_init + geom_point(shape=5, size=2) +
geom_smooth(method=lm, colour='rosybrown3')
gg6 <- gg_init + geom_point(shape=17, size=1.5, colour='steelblue') +
geom_smooth(method=lm, colour='rosybrown3', se=FALSE) # Remove the shaded confidence region
grid.arrange(gg4, gg5, gg6, ncol=3)
```
What if we want to use different colors to reflect the value in another column?
```{r, fig.width=15, fig.height=5, fig.align='center'}
# Numerical(quantitative) value: OverallCond
gg7 <- gg_init + geom_point(aes(col=OverallCond)) + # Color the dots by OverallCond
geom_smooth(col="steelblue")
# Categorical(qualitative) value: Condition1
gg8 <- gg_init + geom_point(aes(col=GarageType)) + # Color the dots by SaleType
geom_smooth(col="steelblue") # change color palette
grid.arrange(gg7, gg8, ncol=2)
```
#### Adjust x and y axis limits
From the charts you can see that the fitted line is greatly affected by the points with GRLivArea > 4000, which seems to be outliers. Let's take a closer look at the data.
```{r, fig.width=6, fig.height=3, fig.align='center'}
# zoom in to the region of interest without deleting the points
gg9 <- gg_init + geom_point(aes(col=GarageType)) + # Color the dots by SaleType
geom_smooth(col="steelblue") + # change color palette + coord_cartesian(xlim=c(0,
coord_cartesian(xlim=c(0, 4000))
print(gg9)
```
It seems to be safe to delete these outliers.
```{r, fig.width=6, fig.height=3, fig.align='center'}
# let's reset the xlim to delete the points outside the range
gg10 <- gg_init + geom_point(aes(col=GarageType)) + # Color the dots by SaleType
geom_smooth(col="steelblue") +
xlim(c(0, 4000))
print(gg10)
```
#### Customize tick marks and labels
```{r, fig.width=6, fig.height=3, fig.align='center'}
# Reset the breaks for x axis
gg11 <- gg_init + geom_point(aes(col=GarageType)) + # Color the dots by SaleType
geom_smooth(col="steelblue") +
scale_x_continuous(limits=c(0, 4000), breaks=seq(0, 4000, 500))
print(gg11)
# Reset the tick marks for y axis
gg12 <- gg_init + geom_point(aes(col=GarageType)) + # Color the dots by SaleType
geom_smooth(col="steelblue") +
scale_x_continuous(limits=c(0, 4000), breaks=seq(0, 4000, 500)) +
scale_y_continuous(breaks=seq(0, 800000, 200000),
labels = function(x){paste0(x/1000, 'K')},
expand = c(0.1, 0.1))
print(gg12)
```
#### Add title and axis labels
```{r, fig.width=24, fig.height=7, fig.align='center'}
# Method 1
gg13 <- gg12 + labs(title = "Sales Price V.S. Living Area SF",
subtitle = "From housing price prediction dataset",
x = "Living area square feet: GrLivArea",
y = "Sales Price")
# Method 2
gg14 <- gg12 + ggtitle(label = 'Sales Price V.S. Living Area SF',
subtitle = "From housing price prediction dataset") +
xlab('Living area square feet: GrLivArea') +
ylab('Sales Price')
grid.arrange(gg13, gg14, ncol=2)
```
#### Adjust the size, color, font and position of title, legend and axis labels
```{r, fig.width=5, fig.height=4, fig.align='center'}
gg15 <- gg14 + theme(
legend.position= "bottom", # Try: c(.5, .5), "None", "right", "top", "left"
plot.title = element_text(color="bisque4",
size=10, face="bold.italic",
hjust = 0.5),
plot.subtitle = element_text(color="bisque4",
size=8, face="italic",
hjust = 0.5),
axis.title.x = element_text(color="azure4", size=10, face="bold"),
axis.title.y = element_text(color="slategrey", size=10, face="bold")
)
print(gg15)
```
#### Customize the entire theme using ggthemes
You can easily change the theme without setting all the attibutes yourself using ggthemes.
Reference: https://ggplot2.tidyverse.org/reference/ggtheme.html
Some of commonly used themes:
```{r, fig.width=22, fig.height=6}
# install.packages('ggthemes', dependencies = TRUE)
library(ggthemes)
theme_1 <- gg15 + theme_light()
theme_2 <- gg15 + theme_dark()
#theme_3 <- gg15 + theme_minimal()
#theme_4 <- gg15 + theme_gray()
#theme_5 <- gg15 + theme_bw()
#theme_6 <- gg15 + theme_void()
grid.arrange(theme_1, theme_2, ncol=2)
```
END