This is a use case and tutorial for the R package remake.
The remake package is "Make-like build management, reimagined for R". Using the package allows you to:
- Change parts of your workflow and only update what changed
- Make your workflow reproducible since it incorporates the data importing, analysis and reporting in the same pipeline
This tutorial will assume you have a working understanding of the following topics:
- The gapminder dataset
- Writing functions in R
- Generating plots using ggplot2
- Using the dplyr package for data manipulation
If you need a refresher on any of these topics, a great resource that covers them is provided by Data Carpentry. We will also use rmarkdown to generate an HTML report from our R code.
The example uses Jenny Bryan's excerpt of the Gapminder data. The data is available as an R package and also as CSV files in the data directory of this repository. Each row describes one country in one year, at 5-year intervals, with the variables country, continent, year, life expectancy at birth (lifeExp), total population (pop) and GDP per capita (gdpPercap).
In this use case, we will
- pose questions we want to ask
- determine the data we need to use
- determine the plots and analyses needed to answer those questions
- write R code to generate these plots and analyses
- write the YAML file required for remake
- Are there differences in the life expectancy trends over time by continent?
- How do the trends in life expectancy differ between 4 countries in Africa?
- We have new data starting in 1982. After the data from 1982 to 2007 is added, are there changes in trends?
We had the Gapminder data through 1977 and did the analysis. Then our collaborator sent us the data from 1982 to 2007. So, we have two files:
- gapminder1952-1977.csv: Gapminder data from 1952 to 1977
- gapminder1982-2007.csv: Gapminder data from 1982 to 2007
In order to create your workflow you need to describe the beginning, intermediate and end points of your analysis, and how they flow together. In remake, these pieces are called targets, rules and dependencies.
- "Targets": What are you going to generate? These can be files or R objects.
- "Rules": How are you going to generate your targets? What functions do you need?
- "Dependencies": What do you need to generate your targets? These may be other targets that you need for a particular target. Bear in mind that you might have several intermediate targets to produce your final target.
Targets are anything we need to produce at the end and throughout the pipeline:
- the first target we need is the data frame of the imported CSV
- the next target we need is the transformed data frame
- a plot of average life expectancy per continent over time
- a plot of life expectancy for the 4 countries over time
- the last target is the rendered HTML output
The rules are the commands that you need to use to create all your targets. You can create an R file with all the functions you are going to use.
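As a rough illustration of what such a function file might contain, here is a minimal sketch of R/function.R. Only the three function names come from this tutorial; the function bodies, the column names used for plotting, and the `filename` arguments are assumptions made for the example. The plotting functions save the figure themselves because their targets are files rather than R objects.

```r
# R/function.R (hypothetical contents; only the function names come from the tutorial)
library(dplyr)
library(ggplot2)

# Average life expectancy for each continent and year.
mean_lifeExp_by_continent <- function(data) {
  data %>%
    group_by(continent, year) %>%
    summarise(mean_lifeExp = mean(lifeExp), .groups = "drop")
}

# Line plot of the continent averages, written to `filename` (assumed argument).
plot_mean_lifeExp <- function(data,
                              filename = "figures/mean_lifeExp_by_continent.png") {
  p <- ggplot(data, aes(x = year, y = mean_lifeExp, colour = continent)) +
    geom_line() +
    labs(x = "Year", y = "Mean life expectancy (years)")
  # Create the output directory if it does not exist yet.
  dir.create(dirname(filename), showWarnings = FALSE, recursive = TRUE)
  ggsave(filename, plot = p, width = 7, height = 5)
  invisible(p)
}

# Line plot of life expectancy for the selected countries, written to `filename`.
plot_by_country <- function(data, countries,
                            filename = "figures/plot_by_country.png") {
  selected <- filter(data, country %in% countries)
  p <- ggplot(selected, aes(x = year, y = lifeExp, colour = country)) +
    geom_line() +
    labs(x = "Year", y = "Life expectancy (years)")
  dir.create(dirname(filename), showWarnings = FALSE, recursive = TRUE)
  ggsave(filename, plot = p, width = 7, height = 5)
  invisible(p)
}
```

Keeping each rule as a small function that takes its inputs as arguments makes it straightforward to name those inputs as targets in the YAML file.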
The YAML file tells remake everything it needs to know.
At the beginning of your YAML file you need to list the packages to load and the source files that contain the functions for your rules.
packages:
  - dplyr
  - ggplot2
  - rmarkdown
sources:
  - R/function.R
After setting up what remake needs we can start defining our targets.
The all target is the final output. In this case we are creating an HTML report.
In order to create this report, we first need to import the data. In this case the target is gapminder and the rule is read.csv.
packages:
  - dplyr
  - ggplot2
  - rmarkdown
sources:
  - R/function.R
targets:
  all:
    depends:
      - report.html
  gapminder:
    command: read.csv(file = "gapminder1952-1977.csv")
The next targets we need are our transformed data frame and the plots. The rules for the plots are the plotting functions we defined in the function file.
packages:
  - dplyr
  - ggplot2
  - rmarkdown
sources:
  - R/function.R
targets:
  all:
    depends:
      - report.html
  gapminder:
    command: read.csv(file = "gapminder1952-1977.csv")
  mean_lifeExp_by_continent_data:
    command: mean_lifeExp_by_continent(gapminder)
  figures/mean_lifeExp_by_continent.png:
    command: plot_mean_lifeExp(mean_lifeExp_by_continent_data)
  figures/plot_by_country.png:
    command: plot_by_country(gapminder, I(countries = c("South Africa", "Morocco", "Algeria", "Nigeria")))
Finally, we want to create the output, which is the HTML report.
packages:
  - dplyr
  - ggplot2
  - rmarkdown
sources:
  - R/function.R
targets:
  all:
    depends:
      - report.html
  gapminder:
    command: read.csv(file = "gapminder1952-1977.csv")
  mean_lifeExp_by_continent_data:
    command: mean_lifeExp_by_continent(gapminder)
  figures/mean_lifeExp_by_continent.png:
    command: plot_mean_lifeExp(mean_lifeExp_by_continent_data)
  figures/plot_by_country.png:
    command: plot_by_country(gapminder, I(countries = c("South Africa", "Morocco", "Algeria", "Nigeria")))
  report.html:
    depends:
      - figures/mean_lifeExp_by_continent.png
      - figures/plot_by_country.png
    command: render("report.Rmd")
Set your working directory to the folder that contains the YAML file.
In the R console, install and load remake:
install.packages("devtools")
devtools::install_github("richfitz/remake")
library(remake)
In the console, run make():
make()
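Beyond building everything, it can be handy to build or inspect a single target. The sketch below assumes that remake is installed and that the YAML file is saved as remake.yml (remake's default file name, which this tutorial does not state explicitly) in the working directory. The guard variable `can_build` is only there so the snippet does nothing when those assumptions do not hold.

```r
# Check the assumptions before calling remake: package installed, remake.yml present.
can_build <- requireNamespace("remake", quietly = TRUE) &&
  file.exists("remake.yml")

if (can_build) {
  # Build the default target (`all`), i.e. everything needed for report.html.
  remake::make()

  # Build one named target only; targets it depends on are built first if needed.
  remake::make("mean_lifeExp_by_continent_data")

  # Retrieve a built R object from remake's store to inspect it.
  continent_means <- remake::fetch("mean_lifeExp_by_continent_data")
  print(head(continent_means))
}
```

Running make() a second time without changing anything should leave the existing targets untouched, which is a quick way to confirm the pipeline is up to date.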
Remake uses the DiagrammeR package to visualize your workflows!
Make sure DiagrammeR is installed.
devtools::install_github('rich-iannone/DiagrammeR')
You can diagram your whole pipeline using the diagram() function from remake.
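A minimal sketch of the call, assuming remake and DiagrammeR are both installed and the YAML file is saved as remake.yml (remake's default file name) in the working directory. The variable `can_diagram` is a hypothetical guard added for this example.

```r
# Only attempt the diagram when both packages and the remake file are available.
can_diagram <- requireNamespace("remake", quietly = TRUE) &&
  requireNamespace("DiagrammeR", quietly = TRUE) &&
  file.exists("remake.yml")

if (can_diagram) {
  # Draw the targets and the dependencies between them.
  remake::diagram()
}
```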
Our current pipeline looks like this:
One of the best things about remake is how easy it is to re-run your whole pipeline on a new data set.
Try running this on the Gapminder data from 1982 to 2007 by changing the command of the gapminder target:
gapminder:
  command: read.csv(file = "gapminder1982-2007.csv")
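Swapping the file analyses the new period on its own. To look at trends across the full 1952 to 2007 range, the two files would need to be stacked first. The function below is a hypothetical helper, not part of the tutorial's function file, sketched under the assumption that both CSV files have identical columns including country and year.

```r
# Hypothetical helper: read several CSV files with identical columns
# and stack them into one data frame, ordered by country and year.
read_and_combine <- function(files) {
  pieces <- lapply(files, read.csv, stringsAsFactors = FALSE)
  combined <- do.call(rbind, pieces)
  combined <- combined[order(combined$country, combined$year), ]
  rownames(combined) <- NULL
  combined
}

# Example call (file names taken from the tutorial):
# gapminder_all <- read_and_combine(c("gapminder1952-1977.csv",
#                                     "gapminder1982-2007.csv"))
```

A function like this could serve as the rule for the gapminder target, so that the downstream plots and the report are rebuilt from the combined data.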
In the RMarkdown file, the R code
- imports the data
- has a function that transforms the data for the analysis
- has a function that plots the life expectancy for 4 countries over time
- imports a plot of average life expectancy per continent over time that was generated with another R script
Dependencies are an important component of remake. Dependencies ensure that a target gets remade every time any of its dependencies changes. With remake there are implicit and explicit dependencies. Implicit dependencies are the objects and files that are referenced as arguments in a target's command. E.g.,
gapminder:
  command: read.csv(file = "data/gapminder.csv")
  # "data/gapminder.csv" is an implicit dependency
mean_lifeExp_by_continent_data:
  command: mean_lifeExp_by_continent(gapminder)
  # gapminder is an implicit dependency
You can also have explicit dependencies. When the creation of an object/output is dependent on one or more objects that are never referenced in a function, you must explicitly state these dependencies. E.g.,
report.html:
  depends:
    - figures/mean_lifeExp_by_continent.png
    - figures/plot_by_country.png
  command: render("report.Rmd")
  # figures/mean_lifeExp_by_continent.png and figures/plot_by_country.png are
  # explicit dependencies
  # "report.Rmd" is an implicit dependency
Not every argument that you pass to a function needs to be a dependency. For example, when we specify which countries we want to plot, we do not want remake to look for dependencies called South Africa, Morocco, Algeria, and Nigeria, because these are values of the country column in the gapminder dataset, not targets. To tell remake that certain arguments should not be treated as dependencies, wrap them in the function I().
figures/plot_by_country.png:
  command: plot_by_country(gapminder, I(countries = c("South Africa", "Morocco", "Algeria", "Nigeria")))
  # using the I() function to tell remake that the countries argument values should be treated as is
  # not as dependencies