From 5f4255aca8da941bd3b7b58a06bb5711192ff93f Mon Sep 17 00:00:00 2001 From: "Juan F. Fung" Date: Mon, 2 Oct 2023 14:06:39 -0400 Subject: [PATCH 1/6] Add summary of Intro to R in the Narrative of instructor notes #477 --- instructors/instructor-notes.md | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/instructors/instructor-notes.md b/instructors/instructor-notes.md index db0fb4aa9..28bdbda38 100644 --- a/instructors/instructor-notes.md +++ b/instructors/instructor-notes.md @@ -2,7 +2,9 @@ title: Instructor Notes --- -## Dataset +## Narrative + +### Dataset The data used for this lesson are a slightly cleaned up version of the SAFI Survey Results available on GitHub. The original data is on @@ -11,8 +13,6 @@ SAFI Survey Results available on GitHub. The original data is on This lesson uses `SAFI_clean.csv`. The direct download link for the data file is: [https://raw.githubusercontent.com/datacarpentry/r-socialsci/main/episodes/data/SAFI_clean.csv](https://raw.githubusercontent.com/datacarpentry/r-socialsci/main/episodes/data/SAFI_clean.csv). -## Narrative - ### Before we start - The main goal here is to help the learners be comfortable with the RStudio @@ -25,6 +25,12 @@ This lesson uses `SAFI_clean.csv`. The direct download link for the data file is ### Intro to R +- The main goal is to introduce users to the various objects in R, from atomic types + to creating your own objects. +- While this episode is foundational, be careful not to get caught in the weeds as the + variety of types and operations can be overwhelming for new users, especially before + they understand how this fits into their own "workflow." + ### Starting with data The two main goals for this lessons are: From 2de44499a420a48c5c0532b3ee574db8a64f783d Mon Sep 17 00:00:00 2001 From: "Juan F. 
Fung" Date: Mon, 2 Oct 2023 14:12:25 -0400 Subject: [PATCH 2/6] Added suggested lesson plans to instructor notes #477 --- instructors/instructor-notes.md | 23 +++++++++++++++++++++++ 1 file changed, 23 insertions(+) diff --git a/instructors/instructor-notes.md b/instructors/instructor-notes.md index 28bdbda38..ec6db7b52 100644 --- a/instructors/instructor-notes.md +++ b/instructors/instructor-notes.md @@ -69,6 +69,29 @@ The two main goals for this lessons are: - Note that his lesson was community-contributed and remains a work in progress. As such, it could benefit from feedback from instructors and/or workshop participants. +## Lesson Plans + +The lesson contains much more material than can be taught in a day. Instructors will +need to pick an appropriate subset of episodes to use in a standard one-day course. + +Suggested path for half-day course: + +- Before we Start +- Introduction to R +- Starting with Data + +Suggested path for full-day course: + +- Before we Start +- Introduction to R +- Starting with Data +- Data Wrangling with dplyr +- (OPTIONAL) Data Wrangling with tidyr +- Data Visualisation with ggplot2 + +For a two-day workshop, it may be possible to cover all of the episodes. Feedback from +the community on successful lesson plans is always appreciated! + ## Technical Tips and Tricks Show how to use the 'zoom' button to blow up graphs without constantly resizing From ee13ef608b77a8afc0a12ff81afd1e4018c48e29 Mon Sep 17 00:00:00 2001 From: "Juan F. Fung" Date: Mon, 2 Oct 2023 14:15:16 -0400 Subject: [PATCH 3/6] Changed lesson timing in Summary and Schedule to cite suggested lesson plans in Instructor Notes #477 --- index.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/index.md b/index.md index 5ab6eb3d9..b428a76d0 100644 --- a/index.md +++ b/index.md @@ -8,8 +8,10 @@ less time, and with less pain. The lessons below were designed for those interested in working with social sciences data in R. 
This is an introduction to R designed for participants with no -programming experience. These lessons can be taught in a day (~ 6 -hours). They start with some basic information about R syntax, the +programming experience. These lessons can be taught in a half-day, +full-day, or over a two-day workshop (see +[Instructor Notes](https://datacarpentry.org/r-socialsci/instructor/instructor-notes.html). +They start with some basic information about R syntax, the RStudio interface, and move through how to import CSV files, the structure of data frames, how to deal with factors, how to add/remove rows and columns, how to calculate summary statistics from a data From 02dfb96b8269159ab3f4415572ea9d841e75b218 Mon Sep 17 00:00:00 2001 From: "Juan F. Fung" Date: Tue, 3 Oct 2023 09:21:22 -0400 Subject: [PATCH 4/6] Fix typo instructor notes heading for Introduction to R --- instructors/instructor-notes.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/instructors/instructor-notes.md b/instructors/instructor-notes.md index ec6db7b52..e6e86a6ef 100644 --- a/instructors/instructor-notes.md +++ b/instructors/instructor-notes.md @@ -23,7 +23,7 @@ This lesson uses `SAFI_clean.csv`. The direct download link for the data file is sure that learners are in the correct working directory, and that they create a `data` (all lowercase) subfolder. -### Intro to R +### Introduction to R - The main goal is to introduce users to the various objects in R, from atomic types to creating your own objects. From d713551de5d88a82114253764e7cbe7b931955fd Mon Sep 17 00:00:00 2001 From: "Juan F. Fung" Date: Tue, 3 Oct 2023 09:25:02 -0400 Subject: [PATCH 5/6] Add missing close parenthesis to Summary --- index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/index.md b/index.md index b428a76d0..c1ad674fb 100644 --- a/index.md +++ b/index.md @@ -10,7 +10,7 @@ those interested in working with social sciences data in R. 
This is an introduction to R designed for participants with no programming experience. These lessons can be taught in a half-day, full-day, or over a two-day workshop (see -[Instructor Notes](https://datacarpentry.org/r-socialsci/instructor/instructor-notes.html). +[Instructor Notes](https://datacarpentry.org/r-socialsci/instructor/instructor-notes.html)). They start with some basic information about R syntax, the RStudio interface, and move through how to import CSV files, the structure of data frames, how to deal with factors, how to add/remove From c8a259c127ff38dd75fdef22b3f775fe04b3702e Mon Sep 17 00:00:00 2001 From: Juan Fung Date: Tue, 3 Oct 2023 14:07:11 -0400 Subject: [PATCH 6/6] Remove handout from learners subdirectory and navigation in config file --- config.yaml | 2 +- learners/R-handout.Rmd | 934 ----------------------------------------- 2 files changed, 1 insertion(+), 935 deletions(-) delete mode 100644 learners/R-handout.Rmd diff --git a/config.yaml b/config.yaml index 6e8f5d053..31084e1ba 100644 --- a/config.yaml +++ b/config.yaml @@ -71,10 +71,10 @@ episodes: # Information for Learners learners: - reference.md -- R-handout.Rmd # Information for Instructors instructors: +- instructor-notes.md # Learner Profiles profiles: diff --git a/learners/R-handout.Rmd b/learners/R-handout.Rmd deleted file mode 100644 index 896a7f380..000000000 --- a/learners/R-handout.Rmd +++ /dev/null @@ -1,934 +0,0 @@ ---- -title: Code Handout - R for Social Scientists -output: - html_document: - df_print: paged - code_download: yes ---- - -```{r, include=FALSE} -knitr::opts_chunk$set(fig.width = 3, fig.height = 3, message = FALSE, warning = FALSE, eval = FALSE) -``` - -This document contains all of the core functions that were covered in the R for Social Scientists workshop. -Each function is presented alongside an example of -how it can be used. 
It is split into the following 4 sections: - -- Introduction to R - -- Starting with Data - -- Data Wrangling - -- Data Visualization - -Each section has instructions to load all necessary libraries or data, so you can start from any of the 4 points linked above. - -## Introduction to R - -The first section covers core programming concepts and functions in Base R and does not require any data or libraries to be loaded. - -### Creating Objects - -- `<-` -- "assignment arrow", assigns a value (vector, dataframe, single value) to the name of a variable - -```{r} -x <- 3 -``` - -- `c()` -- the "concatenate" function combines inputs to form a vector, the - values have to be the same data type. - -```{r} -animals <- c("bird", "cat", "dog") -numbers <- c(1, 14, 57, 89) -logicals <- c(TRUE, FALSE, TRUE, TRUE) -``` -- `+` -- addition and other mathematical operators can be repeated on every value - in a vector. - -```{r} -y <- c(1, 2, 3) -z <- x + y -``` - -### Inspecting Objects - -- `str()` -- compact display of the structure of an R object - -```{r} -str(animals) -``` - -- `class()` -- returns the type of element of any R object - -```{r} -class(logicals) -``` - -- `typeof()` -- returns the data type or storage mode of any R object - -```{r} -typeof(numbers) -``` - -### Functions in R - -- `args()` -- returns the arguments of a function - -```{r} -args(round) -``` - -- named arguments -- the name of the argument the function expects - - You can choose to not name your arguments, **if** you know the **exact** - order they should be in! - - However, we generally discourage this. - -- `round()` -- round a decimal or fraction to a specified number of digits - -```{r} -# Either of these work, since the digits argument is named explicitly. -round(3.14159, digits = 2) -round(digits = 2, 3.14159) - -# This does not work, since the arguments are not named and in the incorrect order. 
-round(2, 3.14159) -``` - -### Functions to Summarize Data - -- `sqrt()` -- returns the square root of a numeric variable - -```{r} -sqrt(numbers) -``` - -- `mean()` -- returns the mean of a numeric variable - - You can add the `na.rm` argument, to remove `NA` values before calculating - the mean. - -```{r} -mean(numbers) -``` - -- `max()` -- returns the maximum of a numeric variable - - You can add the `na.rm` argument, to remove `NA` values before calculating - the max. - -```{r} -max(numbers) -``` - -- `sum()` -- returns the sum of a numeric variable - - You can add the `na.rm` argument, to remove `NA` values before calculating - the sum. - -```{r} -sum(numbers) -``` - -- `length()` -- returns the length of a vector (of any datatype) - -```{r} -length(animals) -``` - -- Additional summary functions include: - - `var()` -- find the variance of a numerical variable - - `sd()` -- finds the standard deviation of a numerical variable - - `IQR()` -- find the innerquartile range (Q3 - Q1) of a numerical variable - - `median()` -- finds the median of a numerical variable - -### Subsetting Data - -- `[]` -- used to subset elements from a vector - -- `X:Y` -- used to retrieve a "slice" of a vector starting at X and continuing through Y - -```{r} -animals[3] -# selects the third element - -animals[2:3] -# selects the second and third element - -animals[c(1, 3)] -# selects the first and third element -``` - -- relational operators -- return logical values indicating where a relation is - satisfied. 
The most commonly used logical operators for data analysis are as follows: - - `==` means "equal to" - - `!=` means "not equal to" - - `>` or `<` means "greater than" or "less than" - - `>=` or `<=` means "greater than or equal to" or "less than or equal to" - -```{r} -animals == "dog" - -animals != "cat" - -numbers > 4 - -numbers <= 12 -``` - -- logical operators -- join subset criteria together - - `&` means "and" -- where two criteria must **both** be satisfied - - `|` means "or" -- where at least one criteria must be satisfied - -```{r} -numbers > 4 & numbers < 20 - -animals == "dog" | animals == "cat" -``` - -- `%in%` -- the "inclusion operator", allows you to test if any of the elements - of a search vector (on the left hand side) are found in the target vector (on - the right hand side). - - The levels of the target vector must be included in a vector (`c()`). - -```{r} -possessions <- c("car", "bicycle", "radio", "television", "mobile_phone") - -possessions %in% c("car", "bicycle", "motorcycle") -``` - -### Missing Data - -- `is.na()` -- returns a vector of logical values indicating which elements of - a vector have `NA` values - - Often combined with `!`, where the `!` negates the previous statement (e.g. - `!TRUE` is equal to `FALSE`). - -```{r} -missing <- c(1, 3, NA, 7, 12, NA) - -is.na(missing) - -!is.na(missing) -``` - -- `na.omit()` -- removes the observations with `NA` values - -```{r} -na.omit(missing) -``` - -- `complete.cases()` -- returns a vector of logical values indicating which - elements of a vector **are not** missing (`NA`) values - -```{r} -complete.cases(missing) -``` - -## Starting with Data - -In this section, we begin working with data. -All data examples are in the context of the Palmer Penguins, found -[here (link)](https://allisonhorst.github.io/palmerpenguins/index.html). - -### Packages - -Packages (also called libraries) expand the capabilities of R beyond the functions that come when you install it. 
Each needs to be downloaded and installed only once but loaded into each R session. - -- `install.packages()` -- install a new package - -- `library()` -- loads packages into your `R` session - -```{r, message=FALSE, warning=FALSE} -# Install packages (not run) -# Delete `#` from lines below if missing packages -#install.packages("tidyverse") -#install.packages("lubridate") -#install.packages("palmerpenguins") - -# Load libraries -library(tidyverse) -library(lubridate) -library(palmerpenguins) #load Palmer Penguins data as `penguins` -``` - -### Inspecting Data - -- `dim()` - returns a vector with the number of rows as the first element, - and the number of columns as the second element (the **dim**ensions of - the object) - -```{r} -dim(penguins) -``` - -- `nrow()` - returns the number of rows -- `ncol()` - returns the number of columns - -```{r} -nrow(penguins) -ncol(penguins) -``` - -- `head()` - displays the first 6 rows of the dataframe -- `tail()` - displays the last 6 rows of the dataframe - -```{r} -head(penguins) -tail(penguins) -``` - -- `names()` - returns the all of the names of an object (both row and column) -- `colnames()` - returns column names for dataframes (without row names) - -```{r} -names(penguins) -colnames(penguins) -``` - -- `glimpse()` - provides a preview of the data, where column names are presented - with their associated data types, and the entries from each column are printed - in each row - -```{r} -glimpse(penguins) -``` - -- `str()` - returns the structure of the object and information about the class, - the names and data types of each column, and a preview of the first entries of - each column - -```{r} -str(penguins) -``` - -- `summary()` - provides summary statistics for each column - - Note: summary statistics for character variables are not meaningful, as they - simply state the number of observations (length) of the variable - -```{r} -summary(penguins) -``` - -### Subsetting Data in Dataframes - -- `[]` -- selects rows and 
columns from a dataframe - - The first entry is the row number, the second entry is the column number(s), - and they are separated with a comma. - -```{r} -# Selects the element in the first row, second column -penguins[1, 2] - -# Selects every element in the fourth row -penguins[4, ] - -# Selects every element in the third column -penguins[, 3] -``` - -- `[[]]` -- selects a column from a dataframe - - Inside the brackets you can pass either the number of the column or the - name of the column (in quotations) - -```{r} -penguins[[1]] - -penguins[["island"]] -``` - -- `$` -- selects a column from a dataframe, where the name of the dataframe is - on the left and the name of the column is on the right - -```{r} -penguins$body_mass_g -``` - -### Working with Different Data Types - -- `factor()` -- creates a categorical variable from a character or numeric - variable, variable has a factor datatype - - the values (level) of the factor levels is specified in the `levels` - argument, where the levels must be specified in a vector (using `c()`) - - Note: the order you wish for the levels to appear is how you should list - them in the `levels` argument, you can also specify `ordered = TRUE` to - ensure the levels remain in this order - -```{r} -penguins$year_fct <- factor(penguins$year, - levels = c("2007", "2008", "2009"), - ordered = TRUE) -``` - -- `as.factor()` -- creates a categorical variable from a character or numeric - variable, variable has a factor datatype - - does not allow for you to specify the order of the levels - - defaults to alphabetical ordering for factor levels - -```{r} -penguins$year_fct <- as.factor(penguins$year) -``` - -- `levels()` -- returns the levels of a factor variable in the - order they were stored - - Note: this function will not work for character variables - -```{r} -levels(penguins$year_fct) -``` - -- `nlevels()` -- returns the number of levels of a factor variable - - Note: this function will not work for character variables - 
-```{r} -nlevels(penguins$year_fct) -``` - -- `as.character()` -- creates a character variable from a numeric or factor - variable - -```{r} -penguins$species_chr <- as.character(penguins$species) -``` - -- `ymd()` -- transforms dates stored as character or numeric variables to dates - - Note: to use this function, dates must be stored in year-month-day format - - The function does well with heterogeneous formats (as seen below), but - formats where some of the entries are not in double digits may not be parsed - correctly. - -```{r} -x <- c("2009-01-01", "2009-01-02", "2009-01-03") -ymd(x) -``` - -- `day()` -- extracts the day (number) of a date variable - -```{r} -day(x) -``` - -- `month()` -- extracts the month (number) of a date variable - -```{r} -month(x) -``` - -- `year()` -- extracts the year of a date variable - -```{r} -year(x) -``` - -### Basic Data Visualization (see Data Visualization section for more) - -- `plot()` -- a generic function for plotting R objects - - In this lesson `plot()` was used to create bargraphs of categorical - variables. - -```{r} -plot(penguins$species) -``` - -## Data Wrangling - -This section continues using the [Palmer Penguins data](https://allisonhorst.github.io/palmerpenguins/index.html) and introduces concepts and functions to explore, clean and summarize data, many of which come from the `dplyr` and `plyr` tidyverse libraries. 
- -### Packages - -- `library()` -- loads packages into your `R` session - -```{r, message=FALSE, warning=FALSE} -library(tidyverse) -library(palmerpenguins) -``` - -### Inspecting Data - -- `glimpse()` -- shows a summary of the dataset, the number of rows and columns, - variable names, and the first 10 entries of each variable - -```{r} -glimpse(penguins) -``` - -### Verbs of Data Wrangling - -- `%>%` -- the "pipe" operator, joins sequences of data wrangling steps together, - works with any function that has `data = ` as the first argument -- `select()` -- selects variables (columns) from a dataframe - -```{r} -penguins %>% - select(species) -``` - -- `filter()` -- filters observations (rows) out of / into a dataframe, where - the inputs (arguments) are the conditions to be satisfied in the data that are - kept - -**Logical operators:** Filtering for certain observations (e.g. flights from a -particular airport) is often of interest in data frames where we might want to -examine observations with certain characteristics separately from the rest of -the data. To do so, you can use the `filter` function and a series of **logical -operators**. The most commonly used logical operators for data analysis are as -follows: - -- `==` means "equal to" - -- `!=` means "not equal to" - -- `>` or `<` means "greater than" or "less than" - -- `>=` or `<=` means "greater than or equal to" or "less than or equal to" - -```{r} -# It's nice to have a new line for each condition, so your code is easier to read! 
-penguins %>% -filter(species == "Adelie", - body_mass_g > 3000, - year == 2008) -``` - -- `mutate()` -- creates new variables or modifies existing variables - -```{r} -penguins %>% - filter(is.na(bill_length_mm) != TRUE, - is.na(bill_depth_mm) != TRUE) %>% - mutate(body_mass_kg = body_mass_g / 1000) -``` - -- `group_by()` -- groups the dataframe based on levels of a categorical variable, - usually used alongside `summarize()` - -```{r, eval=FALSE} -penguins %>% - group_by(island) -``` - -- summarize()`-- creates data summaries of variables in a dataframe, for grouped summaries use alongside`group\_by()\` - -```{r} -penguins %>% - filter(is.na(body_mass_g) != TRUE) %>% - group_by(island) %>% - summarize(mean_mass = mean(body_mass_g)) - -``` - -- `ungroup()` -- removes the grouping of a dataframe, typically used after group - summaries when additional ungrouped operations are required - -```{r} -penguins %>% - filter(is.na(body_mass_g) != TRUE) %>% - group_by(island) %>% - summarize(mean_mass = mean(body_mass_g)) %>% - ungroup() -``` - -- `arrange()` -- orders a dataframe based on the values of a numerical variable, - paired with `desc()` to order in descending order - -```{r} -penguins %>% - filter(is.na(body_mass_g) != TRUE) %>% - group_by(island) %>% - summarize(mean_mass = mean(body_mass_g)) %>% - arrange(desc(mean_mass)) -``` - -Chain multiple operations together with `%>%` to create specific outputs without extra steps or dataframes being created along the way. 
- -```{r} -penguins %>% - select(species, island, body_mass_g, sex, year) %>% - filter(island == "Torgersen", - is.na(body_mass_g) != TRUE) %>% - group_by(species, year) %>% - summarize(mean_mass = mean(body_mass_g), - median_mass = median(body_mass_g), - observations = n()) %>% - arrange(desc(mean_mass)) -``` - -### Other Data Wrangling Tools - -- `count()` -- counts the number of observations (rows) of the different levels - of a categorical variable - - can add `sort = TRUE` to sort the table in descending order (similar to - using `arrange(desc())` ) - -```{r} -penguins %>% - count(species) -``` - -- `sample_n()` -- selects $n$ rows from the dataframe, based on the value of - `size` specified - -```{r} -penguins %>% - sample_n(size = 10) -``` - -- `replace_na()` -- replaces NA values with the value specified - - The values to be replaced must be passed to the function (input) as a - `list()` object. - -```{r} -penguins %>% - replace_na(list(bill_length_mm = "no_measurement", - bill_depth_mm = "no_measurement")) %>% - glimpse() -``` - -- `separate_rows()` -- separates a variable with multiple values based on the - delimiter specified. - - - Variables whose entries are stored as a list with commas or semicolons are - great candidates for this function! - -- `rowSums()` -- forms row sums for numeric variables - - - Note: In the lesson `rowSums()` was used on a `logical` variable, because - logical values can be numerically represented as 0 (FALSE) and 1 (TRUE) - -```{r} -x <- tibble(x1 = 3, x2 = c(4:1, 2:5)) -rowSums(x) -``` - -### Pivoting Dataframes - -- `pivot_wider()` -- transforms a dataframe from long to wide format - - takes three principal arguments: - 1. the *data* (often passed by a `%>%`) - 2. the *names\_from* column variable whose values will become new column names - 3. the *values\_from* column variable whose values will fill the new column - variables. 
- - Further arguments include `values_fill` which, if set, fills in missing - values with the value provided. - -```{r} -wide <- penguins %>% - mutate(island_logical = TRUE) %>% - pivot_wider(names_from = species, - values_from = island_logical, - values_fill = list(island_logical = FALSE)) - -glimpse(wide) -``` - -- `pivot_longer()` -- transforms a dataframe from wide to long format - - takes four principal arguments: - 1. the data - 2. *cols* are the names of the columns we use to fill the a new values variable - (or to drop). - 3. the *names\_to* column variable we wish to create from the *cols* provided. - 4. the *values\_to* column variable we wish to create and fill with values - associated with the *cols* provided. - -```{r} -wide %>% - pivot_longer(cols = Adelie:Gentoo, - names_to = "species", - values_to = "island_logical") -``` - -### Extracting Data - -- `write_csv()` -- writes a dataframe to a csv file, output into the file path - specified - -```{r} -write_csv(wide, path = "data/penguins_wide.csv") -``` - -### Importing Data - -- `read_csv()` -- function to import a csv file. - - First argument is the path to the data, passed as a character - (inside quotations). - - You can specify what values should be considered missing, using the `na` - argument. - -```{r} -penguins_wide <- read_csv("data/penguins_wide.csv") -``` - -## Data Visualization with ggplot2 - -This section continues using the [Palmer Penguins data](https://allisonhorst.github.io/palmerpenguins/index.html) to introduce key features of the ggplot2 package for visualizing data -and provide examples combining data wrangling and visualization. 
- -### Packages - -```{r} -library(tidyverse) -library(palmerpenguins) -``` - -### Foundations of `ggplot()` - -- `ggplot()` -- a function to create the shell of a visualization, where - specific variables are mapped to different aspects of the plot - -- `aes()` -- aesthetics that can be used when creating a `ggplot()`, where the - aesthetics can either be hard coded (e.g. `color = "blue"`) or associated with - a variable (e.g. `color = sex`). - - - The following are the aesthetic options for *most* plots: - - `x` -- variable to use for x axis - - `y` -- variable to use for y axis - - `alpha` -- changes transparency - - `color` -- produces colored outline - - `fill` -- fills with color - - `group` -- used with categorical variables, similar to color - -```{r} -#nothing should appear on this plot except the axes and labels -penguins %>% - ggplot(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) -``` - -- **`+`** -- an important aspect creating a `ggplot()` is to note that the - `geom_XXX()` function is separated from the `ggplot()` function with a plus - sign, `+`. - - - `ggplot()` plots are constructed in series of layers, where the plus sign - separates these layers. - - Generally, the `+` sign can be thought of as the end of a line, so you - should always hit enter/return after it. While it is not mandatory to move - to the next line for each layer, doing so makes the code a lot easier to - organize and read. 
- -- `geom_point()` -- adds a scatter plot; see full explanation later in list - -```{r, fig.width=6} -penguins %>% - ggplot(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) + - geom_point() -``` - -### Geometric Objects to Visualize the Data - -- `geom_histogram()` -- adds a histogram to the plot, - where the observations are binned into ranges of values and then frequencies - of observations are plotted on the y-axis - - You can specify the number of bins you want with the `bins` argument - -```{r} -penguins %>% - ggplot(aes(x = bill_length_mm)) + - geom_histogram(bins = 20) -``` - -- `geom_boxplot()` -- adds a boxplot to the plot, where observations are - aggregated (summarized), the min, Q1, median, Q3, and maximum are plotted as the - box and whiskers, and "outliers" are plotted as points. - - You can plot a vertical boxplot by specifying the `x` variable, or a - horizontal boxplot by specifying the `y` variable. - - Note: the min and max may not be included in the whiskers, if they are - deemed to be "outliers" based on the $1.5 \\times \\text{IQR}$ rule. - -```{r} -# Horizontal boxplot -penguins %>% - ggplot(aes(x = bill_length_mm)) + - geom_boxplot() - -# Vertical boxplot -penguins %>% - ggplot(aes(y = bill_length_mm)) + - geom_boxplot() -``` - -- `geom_density()` -- adds a density curve to the plot, where the probability - density is plotted on the y-axis (so the density curve has a total area of one). - - By default this creates a density curve without shading. By specifying a - color in the `fill` argument, the density curve is shaded. 
- - Can be thought of as the "one group" violin plot (see below) - -```{r, warning=FALSE, message=FALSE} -penguins %>% - ggplot(aes(x = bill_length_mm)) + - geom_density(fill = "tomato") -``` - -- `geom_violin()` -- plots violins for each level of a categorical variable - - Can be thought of as a hybrid mix of `geom_boxplot()` and `geom_density()`, - as the density is displayed, but it is reflected to provide a plot similar in - nature to a boxplot. - - To obtain violins stacked vertically, declare the categorical variable as `y`. - To obtain side-by-side violins, declare the categorical variable as `x`. - -```{r} -# Stacked vertically -penguins %>% - ggplot(aes(x = bill_length_mm, y = species)) + - geom_violin() - -# Side-by-side -penguins %>% - ggplot(aes(y = bill_length_mm, x = species)) + - geom_violin() -``` - -- `geom_bar()` -- creates a barchart of a categorical variable - - Can produce stacked barcharts by specifying a variable as the `fill` - aesthetic. - - Can change from stacked barchart to a side-by-side barchart by specifying - `position = "dodge"`. - - If your data are already in counts (e.g. output from `count()`), then you - can specify the `stat = "identity"` argument inside `geom_bar()`. - -```{r} -# Stacked barchart -penguins %>% - ggplot(aes(x = species)) + - geom_bar(aes(fill = sex)) - -# Side-by-side barchart -penguins %>% - ggplot(aes(x = species)) + - geom_bar(aes(fill = sex), - position = "dodge") - -# If data are raw counts -penguins %>% - count(species, sex) %>% - ggplot(aes(x = species, y = n)) + - geom_bar(aes(fill = sex), - stat = "identity", - position = "dodge") -``` - -- `geom_point()` -- plots each observation as an (x, y) point, used to create - scatterplots - - Can use `alpha` to increase the transparency of the points, to reduce - overplotting. 
- - Can specify `aes`thetics inside of `geom_point()` for local aesthetics (point - level) or inside of `ggplot()` for global aesthetics (plot level) - -```{r} -penguins %>% - ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) + - geom_point(aes(color = species)) -``` - -- `geom_jitter()` -- plots each observation as an (x, y) point and adds a small - amount of jitter around the point - - Useful so that we can see each point in the locations where there are - overlapping points. - - Can specify the `width` and `height` of the jittering using the optional - arguments. - -```{r} -penguins %>% - ggplot(aes(y = body_mass_g, x = species)) + - geom_violin() + - geom_jitter(aes(color = sex), width = 0.25, height = 0.25) -``` - -- `geom_smooth()` -- plots a line over a set of points, draws the readers eye - to a specific trend - - The methods we will use are "lm" for a linear model (straight line), and - "loess" for a wiggly line - - By default, the smoother gives you gray SE bars, to remove these add - `se = FALSE` - -```{r, fig.width=6} -penguins %>% - ggplot(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) + - geom_point() + - geom_smooth(method = "lm") -``` - -- `facet_wrap()` -- creates subplots of your original plot, based on the levels - of the variable you input - - To facet by one variable, use `~variable`. - - To facet by two variables, use `variable1 ~ variable2`. - - If you prefer for your facets to be organized in rows or columns, use the - `nrow` and/or `ncol` arguments. 
- -```{r, fig.width=12} -penguins %>% - ggplot(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) + - geom_point() + - geom_smooth(method = "lm") + - facet_wrap(~island, nrow = 1) -``` - -### Plot Characteristics - -- `labs()` -- specifies the plot labels, possible labels are: x, y, color, fill, - title, and subtitle - -```{r, fig.width=6} -penguins %>% - ggplot(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) + - geom_point() + - geom_smooth(method = "lm") + - labs(x = "Bill Length (mm)", - y = "Bill Depth (mm)", - color = "Penguin Species") -``` - -- `theme_bw()` -- changes the plotting background to the classic dark-on-light - ggplot2 theme. - - This theme may work better for presentations displayed with a projector. - - Other common themes are `theme_minimal()`, `theme_light()`, and `theme_void()`. - -```{r} -penguins %>% - ggplot(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) + - geom_point() + - geom_smooth(method = "lm") + - labs(x = "Bill Length (mm)", - y = "Bill Depth (mm)", - color = "Penguin Species") + - theme_bw() -``` - -- `theme()` -- adjust individual theme elements - - Possible options are: - - `panel.grid` -- controls the grid lines (`panel.grid = element_blank()` - removes grid lines) - - `text` -- specifies font size for the entire plot (e.g. 
- `text = element_text(size = 16)` - - `axis.text.x` -- specifies the font size for the x-axis text - - `axis.text.y` -- specifies the font size for the y-axis text - - `plot.title` -- specifies aspects of the plot title, can use - `plot.title = element_text(hjust = 0.5)` to centre the title - -```{r} -penguins %>% - ggplot(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) + - geom_point() + - geom_smooth(method = "lm") + - labs(x = "Bill Length (mm)", - y = "Bill Depth (mm)", - color = "Penguin Species") + - theme_bw() + - theme(axis.text.x = element_text(size = 12), - axis.text.y = element_text(size = 12)) -``` - -### Exporting Plots - -- `ggsave()` -- convenient function for saving a plot - - Unless specified, defaults to the last plot that was made. - - Uses the size of the current graphics device to determine the size of the - plot. - -```{r} -plot1 <- penguins %>% - ggplot(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) + - geom_point() + - geom_smooth(method = "lm") + - facet_wrap(~island, nrow = 1) - -ggsave(path = "images/faceted_plot.png", plot = plot1) -``` - -