If you know of a good resource that you want to share, feel free to submit a pull request! If you're not sure how to do that, you can also submit an issue with the link to the resource and a quick intro to why it's helpful for you.
This doc is based on my google doc of resources and inspired by a coversation I had on twitter.
I'm slowly working on coverting the original doc!
- News
- GIS and online mapping
- Data visualization- point and click
- Rstats
- Links
- Mentorship
- Podcasts
- Interviews
- Packages
- Blogs
- Books
- Style guides
- Helpful file functions
- Extracting data from PDFs
- Excel in R
- Googlesheets
- Dates and times
- Dataframes/tables
- Conditionals
- ggplot2
- GIS in R
- Cleaning data
- Modeling in R
- Links
- Data
- Civic tech projects & Open source contributing
- General tips + tricks
- GovTech - articles about technology in government
- Fivethirtyeight
- Upshot - NYT's data analysis
- Flowing Data - beautiful data visualization, both articles and tutorials
- HuffPost Data
- WaPo Wonkblog
- Brookings
- Urban Institute data/viz
- NPR Political data and technology
- LA Times data desk
- The Guardian datablog
- Data journalism awards
- 7 types of people who are out of work, categorized with data
- Smart Cities DIVE
- Pudding
- Tableau public - Tableau desktop is free with a student email address! woohoo!
- Gephi
- UMD treemap
- Chart.js
- Raw graphs
- Timeline
- Data points: visualization that means something by Nathan Yau
- Visualize this by Nathan Yau
- Anything by Edward Tufte
- Designer's guide to creating charts and diagrams by Nigel Holmes
- Cool infographics by Randy Krum
- TED talk: The beauty of data visualization by David McCandless
- The functional art by Alberto Cairo
- Post: Visualizing incomplete and missing data by Nathan Yau
- "How To Use Color To Prove Your Point, From A Data Viz Expert" by Meg Miller
- "When communicating with color, balance can be a path to accuracy" by Chir Barncard
- XKCD Color Survey Results
- "A Simple Web Developer's Color Guide" by Laura Elizabth
- "Your Friendly Guide to Colors in Data Visualisation" by Lisa Charlotte Rost
- "Visual Reasoning Visual Reasoning Lab at the University of Wisconsin – Madison" by Karen Schloss. Related Data Stories podcast
- "Viz Palette for Data Visualization Color" by Elijah Meeks
- "Interaction of Color" by Josef Albers (print and Apple app)
- "How To Avoid Equidistant HSV Colors" by Gregor Aisch
- "On rainbows" by Charlie Lloyd
- Colorpicker for Data
- Chroma.js
- ColorBrewer: color advice for maps
- Viz Palette
- Colorgorical
- Adobe color wheel
There is no use having a visualization if folks can't read it.
- "Accessible Colors for Data Visualization" by Zach Grosser
- Design.blog post on inclusive design
- How to Meet WCAG (Web Content Accessability Guidelines): quick reference)
- WebAIM (we accessibility in mind)
- Super list of color contrast tools
- Many of the color tools above have specific filters and toolsets for making your web and print content
- awesome-R - github repo with SO MANY good R resources! Mostly packages, some other things at the end.
- RStudio online learning
- Data helpers - A site by Angela Bassa with a list of people willing to help/mentor!
- Dataframed - datacamp's podcast
- Not so standard deviations
- Data skeptic
- Partially derivative
- Linear digressions
- Data Stories
- ROpenSci packages
- Emo: emojis in R because why not?
- Tidyverse packages
- Broom- tidy model output
- Markovify- use this for future Markov chain poems! Re: Malcolm's tweet
- "Top 40" New Package Picks
- Write your own packages!
- forecast and forecastHybrid
- ROpenSci
- RWeekly
- David Robinson
- Datacamp blog
- Datacamp tech
- RStudio
- Mara Averick
- Simply statistics
- Andrew Gelman
- Civil stat
- Citizen statistician
- Maelle Salmon
- Live free or dichotomize
- Caitlin Hudon
- Julia Silge
- Thomas Lin Pedersen
- Robert J Hyndman
- R for data science by Hadley Wickham
- Advanced R by Hadley Wickham
- Happy Git with R by Jenny Bryan and the STAT 545 TAs
- Stat 545 - not quite a book, but same idea by Jenny Bryan
- R Packages by Hadley Wickham
- Graphics Cookbook for R by Winston Chang
- Data visualization: a practical introduction by Kieran Healy
- R programming for data science by Roger Peng
- Exploratory data analysis with R by Roger Peng
- An intro to statistical and data sciences via R by Chester Ismay and Albert Y. Kim
- Tidy text mining by Julia Silge and David Robinson
- Intro to empirical Bayes by David Robinson
- Curated list of R books
- Computational linear algebra course
- The fundamentals of data visualization by Claus Wilke
- Geocomputation with R by Robin Lovelace, Jakub Nowosad, Jannes Muenchow
- Introduction to Open Data Science by The Ocean Health Index Team
- Efficient R Programming by Colin Gillespie, Robin Lovelace
- Forecasting: Principles and Practice by Rob J. Hyndman, George Athanasopoulos
- Tidyverse style guide
- Specifically: tidyverse error style guide
- Google R style guide
- R style guide
- Data sharing style guide
file.path() # takes folder names and returns a path to your file
file.choose() # pulls up finder so you can point to your file
download.file(url, dest_path) # makes downloading files reproducible
jsonlite
is a good package to use!
fromJSON()
takes in a JSON file (can be a URL) and returns an R listtoJSON()
takes in a file and converts it to JSON
Formatting JSON files:
minify()
{"a":1,"b":2,"c":{"x":5,"y":6}}
prettify()
,pretty = TRUE
by default intoJSON()
{ "a": 1, "b": 2, "c": { "x": 5, "y": 6 } }
ROpenSci has an awesome package to use tabula in R-- tabulizer
Tabula is also awesome on its own.
# example code from github
library("tabulizer")
f <- system.file("examples", "data.pdf", package = "tabulizer")
out1 <- extract_tables(f)
out2 <- extract_tables(f, pages = 1, guess = FALSE, method = "data.frame")
extract_areas() # turns the pdf into an R graphic so you can select the tables on a page, if the whole page isn't a table!
out3 <- extract_text(f, page = 3) # gets just the text from the table, without turning it into a dataframe
cat(out3, sep = "\n")
## len supp dose
## 4.2 VC 0.5
## 11.5 VC 0.5
## 7.3 VC 0.5
## 5.8 VC 0.5
## 6.4 VC 0.5
## 10.0 VC 0.5
## 11.2 VC 0.5
## 11.2 VC 0.5
## 5.2 VC 0.5
## 7.0 VC 0.5
## 16.5 VC 1.0
## 16.5 VC 1.0
## 15.2 VC 1.0
## 17.3 VC 1.0
## 22.5 VC 1.0
## 3
How to install tabulizer:
(On Windows)
From github vignette:
In command prompt, install Chocolately if you don't already have it:
@powershell -NoProfile -ExecutionPolicy Bypass -Command "iex ((new-object net.webclient).DownloadString('https://chocolatey.org/install.ps1'))" && SET PATH=%PATH%;%ALLUSERSPROFILE%\chocolatey\bin
Then install java!
choco install jdk7 -y
To set an environment variable in R use
Sys.setenv(JAVA_HOME = "C:/Program Files/Java/jdk1.8.0_92")
if that doesn't work, look at the other options in the vignette.
Then install the dev version of the package (not yet on CRAN)
if (!require("ghit")) {
install.packages("ghit")
}
# on 64-bit Windows
ghit::install_github(c("ropensci/tabulizerjars", "ropensci/tabulizer"), INSTALL_opts = "--no-multiarch")
# elsewhere
ghit::install_github(c("ropensci/tabulizerjars", "ropensci/tabulizer"))
Using the xlsx package to create an excel file
readxl package - tidyverse w/ excel
excel_sheets() # gives a character vector of the sheet names
read_excel() # reads in excel files
# args: sheets (defines which sheet you want)
col_names = TRUE
col_types ("blank" to skip a column)
skip (n rows to skip)
# put all sheets in a list to keep them together
my_workbook <- lapply(excel_sheets("data.xlsx"),
read_excel,
path = "data.xlsx")
XLConnect - dynamic connection between R and excel.
createSheet(book, “name”)
loadWorkbook(“path”)
getSheets(book)
writeWorksheet () # function (object, data, sheet, startRow = 1, startCol = 1, header = TRUE,
rownames = NULL)
writeWorkbook(object, “path”)
renameSheet() # function (object, sheet, newName)
removeSheet() # function (object, sheet)
There's a vignette with lots of good info.
# uses oauth to access drive
gs_auth() # run to authorize
From the github site, an overview of the functions:
gs_ls() # list sheets
gs_title() # register a sheet by title
gs_key() # register a sheet by key
gs_url() # register a sheet by url
... tbd
strptime
to convert character strings to POSIXct.
strptime(x, format, tz = "")
Formatting POSIXct objects:
- %Y: 4-digit year (1982)
- %y: 2-digit year (82)
- %m: 2-digit month (01)
- %d: 2-digit day of the month (13)
- %A: weekday (Wednesday)
- %a: abbreviated weekday (Wed)
- %B: month (January)
- %b: abbreviated month (Jan)
- %H: hours as a decimal number (00-23)
- %I: hours as a decimal number (01-12)
- %M: minutes as a decimal number
- %S: seconds as a decimal number
- %T: shorthand notation for the typical format %H:%M:%S
- %p: AM/PM indicator
Example use:
# from r documentation
x <- c("1jan1960", "2jan1960", "31mar1960", "30jul1960")
z <- strptime(x, "%d%b%Y")
Also use lubridate
for working with date objects!
There's helpful info on the tidyverse site as well as a nice vignette
Lubridate makes it easy to manipulate and work with dates.
It has some of the same capacity as strptime
, although I actually prefer the former.
ymd(20101215)
#> [1] "2010-12-15"
mdy("4/1/17")
#> [1] "2017-04-01"
Lubridate also makes timezones less confusing!
time <- ymd_hms("2010-12-13 15:30:30")
time
#> [1] "2010-12-13 15:30:30 UTC"
# Changes printing
with_tz(time, "America/Chicago")
#> [1] "2010-12-13 09:30:30 CST"
# Changes time
force_tz(time, "America/Chicago")
#> [1] "2010-12-13 15:30:30 CST"
You can also pull out specific parts of a date object.
bday <- dmy("14/10/1979")
month(bday)
#> [1] 10
wday(bday, label = TRUE)
#> [1] Sun
#> Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
Say you want to turn a time into a numerical representation (i.e. 8:30 AM would be 8.50). Use lubridate!
time <- hm("8:30")
hour(time) + minute(time)/60
#> 8.5
str()
is a helpful starter function for working with nested lists, but we can certainly improve upon it. First, using the max.level
argument specifies how much information you want about the list. For example, the output from str(..., max.level=0)
is something along the lines of List of x
, replace x with the length of your list.
A more interactive way to work with nested lists is with {listviewer}, a package that lets you easily explore your list.
library(listviewer)
jsonedit(df, mode = "view")
The options for the mode
argument are code, form, text, tree, and view. Even though you specify the mode, you can easily switch within the interactive list view if you think something else would work better for your particular use.
In R, you can write an if/else statement, similar to the way you would write it in Python. It helps me to think about it this way because I started programming in Python first.
It would look something like this:
if (condition) {
do something
} else if (condition) {
do something
} else {
do something
}
But let's say you want to define a variable in a dataframe based on a conditional. It is way easier with tidyverse packages!
You can use mutate()
with ifelse(condition, if-do, else-do)
For example, let's say you want to assign high & low with a conditional based on some values. You could do:
library(dplyr)
df %>%
mutate(x = ifelse(var > 5, "high", "low"))
# this assigns "high" to anything strictly higher than 5
You can also use a nested ifelse()
for more complicated conditionals, although that is discouraged. More on that in a bit...
In this example, I wanted the final variable to say "White subjects", "Nonwhite subjects", or "Unknown". I started with two different variables. One said whether or not race was known, and the other gave race, if known.
library(tidyverse)
biopics <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/biopics/biopics.csv")
biopics <- biopics %>% # start with our original data
mutate(race_plotting = ifelse(race_known == "Unknown", "Unknown", ifelse(subject_race == "White", "White subjects", "Nonwhite subjects"))) # this says, if race_known is unknown, stop there and just mark unknown. But if we know the race, check to see if the race is coded as white. If it is, mark white and stop there. If not, mark nonwhite.
Instead of getting lost in nested statements, try this instead! Found in Github
df <- mutate(df, newvar = derivedFactor(
"group1" = oldvar %in% c("a", "b", "c"),
"group2" = oldvar %in% c("d", "e", "f"),
"group3" = oldvar %in% c("g", "h", "i"),
.default = NA
))
case_when()
is another good alternative that draws on some SQL syntax. Here's an example from Stack Overflow
library(dplyr) # >= 0.7.0
mtcars %>%
mutate(cg = case_when(carb <= 2 ~ "low",
carb > 2 ~ "high"))
How to change the order of the bars in geom_bar()
!!!!
# from https://stackoverflow.com/questions/5208679/order-bars-in-ggplot2-bar-graph
require(forcats)
ggplot(data, aes(x = fct_infreq(var))) + geom_bar()
# from https://rstudio-pubs-static.s3.amazonaws.com/7433_4537ea5073dc4162950abb715f513469.html
x$name <- factor(x$name, levels = x$name[order(x$val)])
ggplot(x, aes(x = name, y = val)) + theme_bw() + geom_bar(stat = "identity")
To look into: gghighlight in this blog post! Looks like an easier way to highlight certain lines/bars instead of manually coding that.
Cookbook of good ggplot2
hacks here!!
- GIS in R by Nick Eubank.
- Robin Lovelace: Creating maps in R
- R Spatial
- For geocoding, try opencage per Maelle's suggestion!
# forward geocoding
output <- opencage_forward(placename = "Sarzeau")
# reverse geocoding
output2 <- opencage_reverse(latitude = 51.5034070,
longitude = -0.1275920)
# adding parameters
results3 <- opencage_forward(placename = "Berlin", country = "DE")
But for now, I'm using a function I found online. Example:
I wanted to geocode the locations of the Boston community centers recently. I found the names here. Using the chrome extension for CSS selector gadget, I scraped the names into R.
community_centers_url <- read_html("https://www.boston.gov/community-centers") # site with names
centers <- community_centers_url %>%
html_nodes(".cd-t") %>% # css selector
html_text() # I want the text
Then I used a combination of functions to geocode.
url <- function(address, return.call = "json", sensor = "false") {
root <- "http://maps.google.com/maps/api/geocode/"
u <- paste(root, return.call, "?address=", address, "&sensor=", sensor, sep = "")
return(URLencode(u))
} # calling google maps API
geoCode <- function(address,verbose=FALSE) {
if(verbose) cat(address,"\n")
u <- url(address)
doc <- getURL(u)
x <- fromJSON(doc,simplify = FALSE)
if(x$status=="OK") {
lat <- x$results[[1]]$geometry$location$lat
lng <- x$results[[1]]$geometry$location$lng
location_type <- x$results[[1]]$geometry$location_type
formatted_address <- x$results[[1]]$formatted_address
return(c(lat, lng, location_type, formatted_address))
} else {
return(c(NA,NA,NA, NA))
}
} # actually getting the addresses
I'm sure there's a cleaner way to do this, I'll work on using opencage and then updating this page!
Another method of geocoding is using ggmap package and Google API. First, register for a key here. After you get the key, this script will geocode for you (up to the API limit which is around 2500/day)
library(ggmap)
register_google(key = "my_key")
for(i in 1:nrow(turnout_data)){
#location is a string like "123 University Ave, Gainesville FL"
#This works best if you format your locations the same way you would for Google Maps/Waze/Tom Tom
result <- geocode(my_data_frame$location[i], output = "latlona", source = "google")
my_data_frame$lon[i] <- as.numeric(result[1])
my_data_frame$lat[i] <- as.numeric(result[2])
turnout_data$geoAddress[i] <- as.character(result[3])
}
Tidycensus for demographic data.
# median income in Boston census tracts
bos_inc <- get_acs(geography = "tract",
variables = c(medincome = "B19013_001"),
state = "MA",
county = "Suffolk")
# see available variables for 2016 acs 5-year
load_variables(year = 2016, dataset = "acs5")
Tigris to load the shapefiles to map census data!
options(tigris_use_cache = TRUE) # doesn't cache by default
bos_tract <- tracts(state = "MA", county = "Suffolk")
You can join the two with tigris::geo_join
. Regular joins don't work with spatial data.
bos_joined <- geo_join(bos_tract, bos_inc, by = "GEOID")
And if I want to exclude a particular tract (it's mostly water, NA for all demographics) that makes my maps looks ugly :(
bos_joined <- subset(bos_joined, NAME.1 != "Census Tract 9901.01, Suffolk County, Massachusetts")
dplyr::filter
doesn't work with spatial data!!
Principles of tidy data:
- Each row is an observation, each column in an attribute/variable
- Exactly one type of observational unit per table
- Column headers should be variable names, not values (ie eye color vs blue)
The tidyverse
has understandable and compatible functions for dealing with data cleaning!
Sometimes you have a table that is "wide" or "narrow," and you want it to be the other way. Use spread and gather!
From the RStudio blog:
library(tidyr)
library(dplyr)
messy <- data.frame(
name = c("Wilbur", "Petunia", "Gregory"),
a = c(67, 80, 64),
b = c(56, 90, 50)
)
messy
#> name a b
#> 1 Wilbur 67 56
#> 2 Petunia 80 90
#> 3 Gregory 64 50
# We have three variables (name, drug and heartrate), but only name is currently in a column. We use gather() to gather the a and b columns into key-value pairs of drug and heartrate:
# gather(data, key, value, ..., na.rm = FALSE, convert = FALSE,
factor_key = FALSE)
messy %>%
gather(drug, heartrate, a:b)
#> name drug heartrate
#> 1 Wilbur a 67
#> 2 Petunia a 80
#> 3 Gregory a 64
#> 4 Wilbur b 56
#> 5 Petunia b 90
#> 6 Gregory b 50
Conversely, you also might want to spread
your data.
spread(data, key, value, fill = NA, convert = FALSE, drop = TRUE,
sep = NULL)
Let's say you have a variable for date-time, but you want one variable for date and the other for time.
In the case of the vignette, they wanted to separate location and time (randomized variables) in the table.
tidier <- messy %>%
gather(key, time, -id, -trt)
tidier %>% head(8)
#> id trt key time
#> 1 1 treatment work.T1 0.08514
#> 2 2 control work.T1 0.22544
#> 3 3 treatment work.T1 0.27453
#> 4 4 control work.T1 0.27231
#> 5 1 treatment home.T1 0.61583
#> 6 2 control home.T1 0.42967
#> 7 3 treatment home.T1 0.65166
#> 8 4 control home.T1 0.56774
tidy <- tidier %>%
separate(key, into = c("location", "time"), sep = "\\.")
tidy %>% head(8)
#> id trt location time time
#> 1 1 treatment work T1 0.08514
#> 2 2 control work T1 0.22544
#> 3 3 treatment work T1 0.27453
#> 4 4 control work T1 0.27231
#> 5 1 treatment home T1 0.61583
#> 6 2 control home T1 0.42967
#> 7 3 treatment home T1 0.65166
#> 8 4 control home T1 0.56774
The opposite of separate()
is unite()
.
Let's say I have a dataframe where one of the variables is a full address. I want only the rows from Minnesota.
Found in a SO question
df %>%
filter(str_detect(address, "Minnesota"))
# "detects" the string "Minnesota" in the address column
What if we have NA values and blank cells, both meant to be NA?
Answer found in a SO question
df <- df[!(is.na(df$var) | df$var==""), ]
janitor
is an awesome data cleaning package. As it says on the github site, it was built with beginners in mind, but can be useful for more experienced R users by making routing processes a little faster. I now routinely add clean_names()
every time I load a csv.
read_csv("data.csv") %>%
clean_names()
In fact, when I'm loading lots of csvs I'll often write my own function that does that automatically. Something like:
clean_read <- function(data) {
x <- read_csv(data) %>%
clean_names()
return(x)
}
If you're dealing with messy data, especially hard coded data, you might see a lot of duplicates. In that case get_dupes()
from janitor
will be your new best friend!
General format for linear models in R lm(response ~ explanatory, data = data)
- Include all variables as explanatory variables
lm(response ~ ., data = data)
- Include all but one (of a few) variables as explanatory variables
lm(response ~ . -vars_to_exclude, data = data)
Linear model assumptions
- Errors are normally distributed
- Errors have mean zero
- Errors have constant variance
- Common fix for non constant variance is
log(response)
- Error terms are independent
David Robinson's broom
package gives model output in a "tidy" format.
The tidy()
function gives the typical model summary as a tidy data frame-- great for further analysis!
> library(fivethirtyeight)
> data("hate_crimes")
> library(broom)
> mod <- lm(hate_crimes_per_100k_splc ~ gini_index, data = hate_crimes)
> tidy(mod)
term estimate std.error statistic p.value
1 (Intercept) -1.527463 0.7833043 -1.950025 0.05741966
2 gini_index 4.020510 1.7177215 2.340606 0.02374447
Then instead of having to look for the fitted values in the nested structure of mod
, you can get them + more in a tidy data frame with augment()
!
> head(augment(mod))
.rownames hate_crimes_per_100k_splc gini_index .fitted .se.fit .resid .hat .sigma .cooksd .std.resid
1 1 0.12583893 0.472 0.3702175 0.04512572 -0.24437860 0.03499099 0.2410643 0.0192795907 -1.0312203
2 2 0.14374012 0.422 0.1691920 0.06752783 -0.02545191 0.07835622 0.2439314 0.0005134118 -0.1098987
3 3 0.22531995 0.455 0.3018689 0.03520103 -0.07654890 0.02129211 0.2436851 0.0011190988 -0.3207498
4 4 0.06906077 0.458 0.3139304 0.03543832 -0.24486962 0.02158013 0.2410927 0.0116131848 -1.0261863
5 5 0.25580536 0.471 0.3661970 0.04407070 -0.11039166 0.03337397 0.2433762 0.0037397389 -0.4654371
6 6 0.39052330 0.457 0.3099099 0.03527586 0.08061342 0.02138273 0.2436546 0.0012466081 0.3377963
Then you can easily add fitted values and residuals to plots that help assess model quality. Fewer steps is good!
- US gov data
- Charlottesville open data
- Open data DC
- Open data Minneapolis
- Open data St Paul
- MN geospatial commons
- Met Council (Twin Cities)
- Boston Open Data
- World Bank databank
- Fivethirthyeight
- Buzzfeed
- Seth Stephens-Davidowitz: google trends data
- ProPublica datasets- only some are free
- Awesome public datasets
- Sunlight labs- before they closed
- Public APIs
- The Guardian: global development data
- Open elections- creating a central database of US election data at the precinct level. Uses Python + roles for non-coding.
- Mapping prejudice- digitizing and mapping racial covenants in MPLS home deeds. No coding required!
- Contributr- A shiny app for finding beginner Github issues to contribute to. Github link
- Contributing code to the tidyverse, blog post by Jim Hester
- Guidelines for telling a great data science story
- Beginner mistakes
- How to call bullshit on big data: a practical guide
- The blissful ignorance of the narrative fallacy
- Emoji cheat sheet
-
Python
-
HTML/CSS
-
Regex
-
SQL
-
Jekyll + blogging