forked from nuitrcs/R-fundamentals-summer-workshop
-
Notifications
You must be signed in to change notification settings - Fork 1
/
Day4_stats_and_data_visualization.qmd
958 lines (625 loc) · 29.3 KB
/
Day4_stats_and_data_visualization.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
---
output:
html_document:
df_print: paged
code_download: TRUE
toc: true
toc_depth: 1
editor_options:
chunk_output_type: console
---
```{r, setup, include=FALSE}
knitr::opts_chunk$set(
eval=FALSE, warning=FALSE, error=FALSE
)
```
# Brief intro to statistics in R
## How to approach statistics in R
R is a great language for statistics. Most statistical methods that you'll want to use are probably available in R, either in base R or in a package.
There are so many statistical methods implemented in R, and new ones are constantly developed, that we can't teach you everything. (Although we do offer a workshop on [statistical modeling in R](https://github.com/nuitrcs/r-statistical-modeling)).
Learning statistics in R is not about memorizing a set of functions. Once you know base R (as you do now!), you can use any statistical method in R. Some methods may have more or less complexity, but you can do it.
To use statistics in R, you first need to know statistics. Say you already do and you want to use linear regression.
1. The first step is to Google "linear regression in R." That search will result in many resources (which is one of the great things of using R!).
2. Once you find a good resource, it's all a matter of using your knowledge of R to implement any statistical method.
Throughout the process, it's important that you don't lose yourself in the code and forget to use your statistical knowledge. Are you using the method in an appropriate way? Are you satisfying the key assumptions? Are you using the right tests and diagnostics?
## Example method: linear regression
To start familiarizing yourself with statistics in R, let's take the example of linear regression, which is a commonly used method and is helpful to illustrate some key aspects about statistics in R.
Here we're focusing on how to implement statistical methods in R. We're not covering the statistics of linear regression. That said, our team does have statisticians and we do teach workshops focused on statistics.
Let's use the `penguins` dataset that we've used before from the package `palmerpenguins`.
```{r}
library(palmerpenguins)
data(penguins)
head(penguins)
```
To compute a linear regression, we use the `lm()` function (linear model) and the special formula syntax.
A formula object specifies a relationship between variables. A formula is specified using the "tilde operator" `~`. A very simple example of a formula is shown below:
```
outcome_variable ~ predictor_variable
```
The *precise* meaning of this formula depends on exactly what you want to do with it, but in broad terms it means "the outcome variable, analysed in terms of the predictor variable". That said, although the simplest and most common form of a formula uses the "one variable on the left, one variable on the right" format, there are others.
With this knowledge of the formula syntax, we can predict body mass with flipper and bill length:
```{r}
reg1 <- lm(body_mass_g ~ flipper_length_mm + bill_length_mm , data=penguins)
reg1
```
To get more than just the coefficient estimates in the output, use the summary function on the resulting regression object:
```{r}
summary(reg1)
```
The result of the regression is an lm model object, which contains multiple pieces of information. We can access these different components if we need to.
```{r}
names(reg1)
```
```{r}
reg1$coefficients
```
The result of the summary function is a different object, with some different components:
```{r}
names(summary(reg1))
```
But we can still extract the information
```{r}
summary(reg1)$coefficients
```
### EXERCISE
What is the Pearson correlation coefficient between flipper length and bill length?
Hint: Break down this problem into steps. Select the columns for correlation and remove any NA values that might interfere. Then find out the R function you would need to calculate the correlation coefficient between 2 vectors.
```{r}
library(palmerpenguins)
data(penguins)
```
### EXERCISE
Perform a Welch two sample t-test to check if there is a statistically significant difference in body mass between male and female penguins. Assume that the variance of the two groups is unequal. The test should be two-sided with a confidence level of 95%. Get the p-value for this test.
Hint: Find out the R function you would need to calculate the t-test statistic between 2 vectors. Use `names()` to access the p-value of your test statistic.
```{r}
```
## Resources to learn more about statistics in R
You can learn more in our workshop on [statistical modeling in R](https://github.com/nuitrcs/r-statistical-modeling). We also have a [resource guide on linear regression](https://sites.northwestern.edu/researchcomputing/resource-guides/resource-guide-linear-regression/), which includes materials for R. Further, we have a [resource guide on R](https://sites.northwestern.edu/researchcomputing/resource-guides/resource-guide-r/), which includes statistical aspects. Finally, you can always use our [free consultation service](https://services.northwestern.edu/TDClient/30/Portal/Requests/ServiceDet?ID=93) for questions about research using statistical methods and their implementation in R.
# Data Visualization
Often, plotting your data will be much more informative than summary statistics alone. The R package *ggplot2* is the most widely used and aesthetically pleasing graphics framework available in R. It relies on a structure called the "grammar of graphics". Essentially, it follows a layered approach to describe and construct the visualization.
Here is a handy [cheat sheet for ggplot2](https://statsandr.com/blog/files/ggplot2-cheatsheet.pdf)! Most users rely on the cheat-sheet because it is often difficult to remember the exact syntax and options.
Load the *ggplot2* package in your workspace as below.
```{r}
# install.packages("ggplot2")
library(ggplot2)
```
## The grammar of graphics
![](https://miro.medium.com/v2/resize:fit:2000/format:webp/1*mcLnnVdHNg-ikDbHJfHDNA.png){width="665"}
To build or describe any visualization with one or more dimensions, we can use the components as follows.
1. **Data**: Always start with the data, identify the dimensions you want to visualize.
2. **Aesthetics**: Confirm the axes based on the data dimensions, positions of various data points in the plot. Also check if any form of encoding is needed including size, shape, color and so on, which are useful for plotting multiple data dimensions.
3. **Scale:** Do we need to scale the potential values, use a specific scale to represent multiple values or a range? For example, you can scale the data to log of the original values.
4. **Geometric objects:** These are popularly known as 'geoms'. This would cover the way we would depict the data points on the visualization. Should it be points, bars, lines and so on?
5. **Statistics:** Do we need to show some statistical measures in the visualization like measures of central tendency, spread, confidence intervals?
6. **Facets:** Do we need to create subplots based on specific data dimensions?
7. **Coordinate system:** What kind of a coordinate system should the visualization be based on ---should it be Cartesian or polar?
You don't have to remember everything - refer to your handy [cheat sheet for ggplot2](https://statsandr.com/blog/files/ggplot2-cheatsheet.pdf)!
# Understanding the `{ggplot}` Syntax
The syntax for constructing ggplots could be puzzling if you are a beginner or work primarily with base graphics. The main difference is that, unlike base graphics:
- `{ggplot}` works with dataframes and not individual vectors. All the data needed to make the plot are typically contained within the dataframe supplied to the `ggplot()` function itself or can be supplied to respective geoms.
- The second noticeable feature is that you can keep enhancing the plot by adding more layers (and themes) to an existing plot created using the `ggplot()` function.
- The order of the layers matters, the first command/layer is executed first, and so on. *Sometimes, this can make a difference in your final plot.*
Let's initialize a basic ggplot based on the midwest dataset:
```{r}
# Setup
data(midwest, package = "ggplot2")
View(midwest)
```
```{r}
ggplot(midwest) # Initialize a plot by calling the ggplot() function
```
A blank ggplot is drawn. ggplot knows we want a plot with a given data set!
![](https://miro.medium.com/v2/resize:fit:2000/format:webp/1*mcLnnVdHNg-ikDbHJfHDNA.png){width="665"}
```{r}
ggplot(midwest) + # Initialize a plot by calling the ggplot() function
aes(x=area, y=poptotal) # area and poptotal are columns in 'midwest'
```
Now ggplot knows the columns we want to plot!
Notice how layering works with the + operator (not %\>%).
Even though `x` and `y` are specified, there are no points or lines in it. This is because ggplot doesn't assume that you want a scatterplot or a line chart to be drawn. We have only told ggplot what dataset to use and what columns should be used for X and Y. We haven't explicitly asked it to draw any points.
Also note that `aes()` function is used to specify the X and Y axes. That's because, any information that is part of the source dataframe has to be specified inside the `aes()` function.
![](https://miro.medium.com/v2/resize:fit:2000/format:webp/1*mcLnnVdHNg-ikDbHJfHDNA.png){width="665"}
# Make a simple scatterplot
Let's make a scatterplot on top of the blank ggplot by adding points using a geom layer called `geom_point`
```{r}
ggplot(midwest) +
aes(x=area, y=poptotal) +
geom_point()
```
But the scientific notation doesn't look too appealing - no problem, you can set an R option to turn it off! Options settings allow the user to set and examine a variety of global *options* which affect the way in which R computes and displays its results. Typically you would include these options on top of the script, after loading the packages that you're going to use.
```{r}
options(scipen=999) # turn off scientific notation like 1e+06
ggplot(midwest) +
aes(x=area, y=poptotal) +
geom_point()
```
We got a basic scatterplot, where each point represents a county. However, it lacks some basic components such as the plot title, meaningful axis labels, etc. Moreover, most of the points are concentrated on the bottom portion of the plot, which is not so helpful. You will see how to rectify these in upcoming steps.
![](https://miro.medium.com/v2/resize:fit:2000/format:webp/1*mcLnnVdHNg-ikDbHJfHDNA.png){width="665"}
## Using geoms to make other plot types
Like `geom_point()`, there are many such geom layers such as `geom_line` , `geom_bar`, `geom_density`, `geom_histgram`, etc.
```{r}
# histogram of county area
ggplot(midwest) +
aes(x=area) +
geom_histogram(bins=100)
```
```{r}
# density plot of county area
ggplot(midwest) +
aes(x=area) +
geom_density()
```
```{r}
# make a boxplot
ggplot(midwest) +
aes(x=area, y=state) +
geom_boxplot()
```
```{r}
# make a violinplot
ggplot(midwest) +
aes(x=area, y=state) +
geom_violin()
```
## EXERCISE
Load the penguins data from before by loading the `palmerpenguins` package.
Make a scatter plot of flipper length and body mass with point representations for each observation.
```{r}
library(palmerpenguins)
data(penguins)
```
## Adding a statistic to your plot
For now, let's just add a smoothing layer using `geom_smooth(method='lm')`. Since the `method` is set as `lm` (short for [*linear model*](http://r-statistics.co/Linear-Regression.html)), it draws the line of best fit.
```{r}
g <- ggplot(midwest) +
aes(x=area, y=poptotal) +
geom_point() +
geom_smooth(method="lm") # set se=FALSE to turnoff confidence bands
g
```
The line of best fit is in blue. Can you find out what other `method` options are available for `geom_smooth` (note: see `?geom_smooth`)?
## EXERCISE
Add a linear fit line to the `penguin_plot` from before.
```{r}
```
## EXERCISE
What if we just want a smooth (non-linear) fit to the plot above? Hint: use method `"loess"`.
```{r}
```
## Adjusting the X and Y axes limits
You might have noticed that the majority of points lie in the bottom of the chart which doesn't really look nice. So, let's change the Y-axis limits to focus on the lower half.
The X and Y axes limits can be controlled in 3 ways.
### **Method 1**: By deleting the points outside the range
*This will change the lines of best fit or smoothing lines as compared to the original data.*
This can be done by `xlim()` and `ylim()`. You can pass a numeric vector of length 2 (with max and min values) or just the max and min values themselves.
```{r}
g <- ggplot(midwest, aes(x=area, y=poptotal)) +
geom_point() +
geom_smooth(method="lm")
g
# Delete the points outside the limits
g +
xlim(c(0, 0.1)) +
ylim(c(0, 1000000)) # deletes points
```
In this case, the chart was not built from scratch but rather was built on top of `g`. This is because, the previous plot was stored as `g`, a ggplot object, which when called will reproduce the original plot. Using ggplot, you can add more layers, themes and other settings on top of this plot.
Did you notice that the line of best fit became more horizontal compared to the original plot? This is because, when using `xlim()` and `ylim()`, the points outside of the specified range are deleted and will not be considered while drawing the line of best fit (using `geom_smooth(method='lm')`). This feature might come in handy when you wish to know how the line of best fit would change when some extreme values (or outliers) are removed.
### **Method 2**: Zooming In
The other method to adjust the X and Y axes is to zoom in to the region of interest *without* deleting the points. This is done using `coord_cartesian()`.
Let's store this plot as `g1`.
```{r}
g <- ggplot(midwest, aes(x=area, y=poptotal)) +
geom_point() +
geom_smooth(method="lm") # set se=FALSE to turnoff confidence bands
# Zoom in without deleting the points outside the limits.
# As a result, the line of best fit is the same as the original plot.
g1 <- g +
coord_cartesian(xlim=c(0,0.1), ylim=c(0, 1000000)) # zooms in
g1
```
Since all points were considered, the line of best fit did not change.
### **Method 3**: Using the scales layer
Scales in *ggplot2* control the mapping from data to aesthetics. They take your data and turn it into something that you can see, like size, color, position, or shape. They also provide the tools that let you interpret the plot: the axes and legends.
You can generate plots with ggplot2 without knowing how scales work, but understanding scales and learning how to manipulate them will give you much more control. To get an in-depth understanding of how scales work in *ggplot2* see [this book chapter on scales and guides](https://ggplot2-book.org/scales-guides).
An important property of *ggplot2* is the principle that every aesthetic in your plot is associated with exactly one scale. For instance, when you write this:
```{r}
ggplot(midwest, aes(x=area, y=poptotal)) +
geom_point()
```
*ggplot2* adds a default scale for each aesthetic used in the plot:
```{r}
ggplot(midwest, aes(x=area, y=poptotal)) +
geom_point() +
scale_x_continuous() +
scale_y_continuous()
```
Inside the `scale` layer, you can add limits as:
```{r}
scale_plot <- ggplot(midwest, aes(x=area, y=poptotal)) +
geom_point() +
scale_x_continuous(limits=c(0, 0.1)) +
scale_y_continuous(limits=c(0, 1000000)) +
geom_smooth(method="lm")
scale_plot
```
Notice that, similarly to method 1, *this method deletes the points outside of the range.*
Position scales have many more features that you can [explore](https://ggplot2-book.org/scales-position).
![](https://miro.medium.com/v2/resize:fit:2000/format:webp/1*mcLnnVdHNg-ikDbHJfHDNA.png){width="665"}
## EXERCISE
Limit the flipper length to values between 180 and 220 in `penguin_plot` **without** removing the original points from the best-fit line calculation.
```{r}
```
## Change the Title and Axes Labels
Let's continue with our previous scatter plot of the midwest data and add the plot title and labels for the X and Y axes. This can be done in one go using the `labs()` function with `title`, `x`, and `y` arguments. Another option is to use `ggtitle()`, `xlab()`, and `ylab()`.
```{r}
g <- ggplot(midwest, aes(x=area, y=poptotal)) +
geom_point() +
geom_smooth(method="lm") # set se=FALSE to turnoff confidence bands
g1 <- g +
coord_cartesian(xlim=c(0,0.1), ylim=c(0, 1000000)) # zooms in
# Add Title and Labels
g1 +
labs(title="Area Vs Population",
subtitle="From midwest dataset",
y="Population",
x="Area",
caption="Midwest Demographics")
# or
g1 +
ggtitle("Area Vs Population",
subtitle="From midwest dataset") +
xlab("Area") +
ylab("Population")
```
Excellent! So here is the full function call:
```{r}
ggplot(midwest, aes(x=area, y=poptotal)) +
geom_point() +
geom_smooth(method="lm") +
coord_cartesian(xlim=c(0,0.1), ylim=c(0, 1000000)) +
labs(title="Area Vs Population",
subtitle="From midwest dataset",
y="Population",
x="Area",
caption="Midwest Demographics")
```
## EXERCISE
Add suitable axis labels and titles to `penguin_plot`.
```{r}
```
## Change the Color and Size of Points
### Set a static color or size
We can change the aesthetics of a geom layer by modifying the respective geoms. Let's change the color of the points and the line to a static value.
```{r}
ggplot(midwest, aes(x=area, y=poptotal)) +
geom_point(col="plum3") +
geom_smooth(method="lm", col="aquamarine3") +
coord_cartesian(xlim=c(0, 0.1), ylim=c(0, 1000000)) +
labs(title="Area Vs Population",
subtitle="From midwest dataset",
y="Population",
x="Area",
caption="Midwest Demographics")
```
Let's try new colors and also change the size of the points:
```{r}
ggplot(midwest, aes(x=area, y=poptotal)) +
geom_point(col="steelblue", size=3) + # Set static color and size for points
geom_smooth(method="lm", col="firebrick") + # change the color of line
coord_cartesian(xlim=c(0, 0.1), ylim=c(0, 1000000)) +
labs(title="Area Vs Population",
subtitle="From midwest dataset",
y="Population",
x="Area",
caption="Midwest Demographics")
```
## EXERCISE
Change the color of the points to red and the transparency (alpha) to 0.5 by setting parameters within `geom_point` for a plot of flipper_length_mm vs body_mass_g.
```{r}
```
### Change the Color To Reflect Categories in Another Column
If we want the color to change based on another column in the source dataset (`midwest`), it must be specified inside the `aes()` function.
```{r}
gg <- ggplot(midwest, aes(x=area, y=poptotal)) +
geom_point(aes(colour=state), size=2) + # Vary color based on state categories.
geom_smooth(method="lm", col="firebrick", linewidth=2) +
coord_cartesian(xlim=c(0, 0.1), ylim=c(0, 1000000)) +
labs(title="Area Vs Population",
subtitle="From midwest dataset",
y="Population",
x="Area",
caption="Midwest Demographics")
gg
```
Now each point is colored based on the `state` it belongs to because of `aes(colour=state)`. Not just color, but `size`, `shape`, `stroke` (thickness of boundary), and `fill` (fill color) can be used to discriminate groupings.
As an added benefit, the legend is added automatically. If needed, it can be removed by setting the `legend.position` to `None` from within a `theme()` function. We'll talk about themes later.
### Scales for color
Just like you can add scales for position, you can also add [scales for color](https://ggplot2-book.org/scales-colour).
The default scale for continuous fill scales is `scale_fill_continuous()` which in turn defaults to `scale_fill_gradient()`. As a consequence, these three commands produce the same plot using a gradient scale:
```{r}
gg
gg + scale_fill_continuous()
gg + scale_fill_gradient()
```
You can change the color palette entirely using a different color scale:
```{r}
gg + scale_colour_brewer(palette = "Set1") # change color palette
```
More of such palettes can be found in the RColorBrewer package.
```{r}
library(RColorBrewer)
head(brewer.pal.info, 10) # show 10 palettes
gg + scale_colour_brewer(palette = "PuOr")
```
*When selecting colors for your visualizations, in addition to other factors, it's important to consider [friendly colors for people with color vision deficiency](https://www.tableau.com/blog/examining-data-viz-rules-dont-use-red-green-together).*
## EXERCISE
Change the color of the points according to the species column for a plot of flipper_length_mm vs body_mass_g.
```{r}
```
# Themes
Let's make the graph more elegant! Let's use themes to clean up the background.
`theme_pubr()` from the `{ggpubr}` package is a nice option:
```{r}
#install.packages("ggpubr")
library(ggpubr)
ggplot(data=penguins,
aes(x = flipper_length_mm,
y = body_mass_g)) +
geom_point(aes(color = species)) +
geom_smooth(method="lm", aes(color = species)) +
labs(x = "Flipper length (mm)",
y = "Body mass (g)",
title = "Penguins Data") +
theme_pubr()
```
Themes can be [customized](https://rpubs.com/mclaire19/ggplot2-custom-themes) to every detail - the [possibilities](https://ggplot2.tidyverse.org/reference/theme.html) are endless!
The [`{ggsci}`](https://cran.r-project.org/web/packages/ggsci/vignettes/ggsci.html) package offers a collection of high-quality color palettes inspired by colors used in scientific journals, data visualization libraries, science fiction movies, and TV shows. But there are several built-in themes in ggplot that you can use:
```{r}
ggplot(data=penguins,
aes(x = flipper_length_mm,
y = body_mass_g,
color = species)) +
geom_point() +
geom_smooth(method="lm") +
labs(x = "Flipper length (mm)",
y = "Body mass (g)",
title = "Penguins Data") +
theme_classic()
```
## Fonts
Themes are extremely versatile. For example, you change the axis font styles using custom inputs:
```{r}
ggplot(data=penguins,
aes(x = flipper_length_mm,
y = body_mass_g,
color = species)) +
geom_point() +
theme_classic() +
theme(axis.title.x = element_text(color="dodgerblue",
size=15,
face="italic",
family="Comic Sans MS"))
```
Notice how the theme custom input overrides the defaults set in `theme_classic()`.
Remember that the order of layering matters! The last passed input will be retained.
```{r}
ggplot(data=penguins,
aes(x = flipper_length_mm,
y = body_mass_g,
color = species)) +
geom_point() +
theme(axis.title.x = element_text(color="dodgerblue",
size=15,
face="italic",
family="Comic Sans MS")) +
theme_classic()
```
## EXERCISE
Add a theme of your choosing from [here](https://r-charts.com/ggplot2/themes/) to `penguin_plot`.
Reminder: if you are using a new package, first install and load that package to access the themes inside it.
```{r, eval=FALSE}
penguin_plot <- ggplot(data = penguins,
aes(x = flipper_length_mm,
y = body_mass_g)) +
geom_point(aes(color=species))
penguin_plot + _________
```
# Separating by Facets
The graph above still looks quite messy. Maybe we want to separate out the three species into individual panels:
```{r}
ggplot(data=penguins,
aes(x = flipper_length_mm,
y = body_mass_g)) +
geom_point(aes(color = species),
size = 2,
alpha = 0.8) +
geom_smooth(method="lm",
aes(color=species,
fill=species),
alpha=0.1) +
theme_classic() +
facet_grid(~species) + # facet by species -- note the use of a formula!!
labs(x = "Flipper length (mm)",
y = "Body mass (g)",
title = "Penguins Data")
```
You can separate by more than one column. Note also the difference between "color" and "fill" in the example below:
```{r}
ggplot(data=penguins,
aes(x = flipper_length_mm,
y = body_mass_g,
group = species)) +
geom_point(aes(color = species),
size = 2,
alpha = 0.8) +
geom_smooth(method="lm",
aes(color=species),
fill="black",
alpha=0.15) +
facet_grid(~species + island) +
labs(x = "Flipper length (mm)",
y = "Body mass (g)",
title = "Penguins Data") +
theme_classic()
```
## EXERCISE
Separate the plot above into facets by sex. Make sure to remove observations that have no sex specified.
```{r}
```
# Bonus 1: Saving your beautiful plots!
Suppose you created a bunch of plots and now you want to save them, you can use the [`ggsave()`](https://ggplot2.tidyverse.org/reference/ggsave.html) function:
```{r}
# save the plot object as a variable
p <-
ggplot(data = penguins,
aes(x = flipper_length_mm,
y = body_mass_g,
group = species)) +
geom_point(aes(color = species),
size = 2,
alpha = 0.8) +
geom_smooth(method = "lm",
aes(color = species, fill = species),
alpha = 0.1) +
facet_grid( ~ species) + # facet by species -- note the use of a formula!!
labs(x = "Flipper length (mm)",
y = "Body mass (g)",
title = "Penguins Data") +
theme_classic()
ggsave("my_beautiful_plot.pdf", plot=p, height = 4, width = 6, device="pdf", useDingbats=F) # setting useDingbats to FALSE lets you edit plot fonts in Illustrator
```
![](https://miro.medium.com/v2/resize:fit:2000/format:webp/1*mcLnnVdHNg-ikDbHJfHDNA.png){width="665"}
# Bonus 2: Arranging multiple plots
If you are preparing several different plots for publication, perhaps you want to arrange them in a grid before exporting and saving them.
Let's create another plot first:
```{r}
# We already have a plot p
p
# create a new plot to display the sample size for each species
p_count <- ggplot(penguins, aes(species)) +
geom_bar(aes(fill=species)) + # this is coloring the bars by species
theme_classic() +
scale_color_brewer(palette = "Accent") +
labs(title="Count of penguin species")
p_count
```
Now let's put the plots together using the packages `{grid}` and `{gridExtra}`:
```{r}
# arrange plot p and plot_new
library(gridExtra)
library(grid)
grid.arrange(p_count, p, ncol=2)
grid.arrange(p_count, p, nrow=2)
```
You can create very [complicated layouts](https://cran.r-project.org/web/packages/gridExtra/vignettes/arrangeGrob.html) if needed. You can also use the `{patchwork}` package to same effect.
# PRACTICE EXERCISES
Several of these exercises will require you to Google or look up the documentation, especially for custom theme inputs.
Load in the data set below:
```{r}
chic <- readr::read_csv(file="data/chicago_weather_data.csv")
```
```{r}
dim(chic)
str(chic)
View(chic)
```
### EXERCISE 1
Generate a scatter plot of temperature vs date and add color the points `firebrick`
```{r}
```
### EXERCISE 2
Now add a title to the plot and appropriate axis labels.
```{r}
```
### EXERCISE 3
Make the main title bold and grey in color
```{r}
```
### EXERCISE 4
Get rid of axis ticks. Hint: look up the `theme()` and `axis.ticks.y`
```{r}
```
### EXERCISE 5
Limit the range of the axis. Hint: See `ylim()`, `scale_x_continuous()` or `coord_cartesian()`
```{r}
```
### EXERCISE 6
Limit the range of the x-axis from 1995 to 2005. Hint: See `ylim()`, `scale_x_continuous()` or `coord_cartesian()`
```{r}
```
### EXERCISE 7
Color the plot based on season
```{r}
```
### EXERCISE 8
Turn off the legend title. See `legend.title` parameter in `theme()`
```{r}
```
### EXERCISE 9
Change the title of the legend to be more informative
```{r}
```
### EXERCISE 10
Set the theme to make the graph more clean in appearance
```{r}
```
### EXERCISE 11
Create a single row of 4 plots showing data from each year separately. Hint: use facets
```{r}
```
### EXERCISE 12
Make the above plot so that you have a 2x2 grid of plots instead of a row
```{r}
```
### EXERCISE 13
Fit a smooth line to each facet plot. Change the color of the line to `black` and set the fill option to be translucent with `alpha`
```{r}
```
### EXERCISE 14
Set the parameter `scales` to "free" inside the `facet_wrap()` function. What do you notice?
```{r}
```
### EXERCISE 15
Create a grid of plots using 2 variables `year` and `season`
```{r}
```
### EXERCISE 16
Create 2 unrelated plots of your choice from the data. And then arrange them together as you might for a publication.
```{r}
```
### EXERCISE 17
Plot temperature vs date, color by season, and then choose a color palette instead of the default colors
```{r}
```
### EXERCISE 18
Create a boxplot of o3 values by season. Then flip the coordinates for the boxplot
```{r}
```
### EXERCISE 19
Create a violinplot similar to above, that is not trimmed at the ends. Color each season differently.
```{r}
```
### EXERCISE 20
Add the mean and standard deviation for each season to the plot above
```{r}
```
### EXERCISE 21
Generate a scatter plot of temperature vs date and add a trend line to the plot
```{r}
```
### EXERCISE 22
Increase or decrease the smoothness of the trend line
```{r}
```
### EXERCISE 23
Create a density plot of `o3` and color the density lines by factor season. Fill in the density plots also colored by factor season.
```{r}
```
### EXERCISE 24
Add a summary statistic mean for each season to the plot above
```{r}
```
### EXERCISE 25
Save your plot from above with appropriate dimensions
```{r}
```