-
Notifications
You must be signed in to change notification settings - Fork 153
/
Copy pathimport_data.Rmd
216 lines (146 loc) · 12.1 KB
/
import_data.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
# (PART) Working with Data {-}
# Importing Data {#import}
![](images/banners/banner_import.png)
*This chapter originated as a community contribution created by [ ZhangZhida](https://github.com/ZhangZhida){target="_blank"}*
*This page is a work in progress. We appreciate any input you may have. If you would like to help improve this page, consider [contributing to our repo](contribute.html).*
## Overview
This section covers how to import data from built-in R sources, local files, web sources and databases.
## Import built-in dataset
R comes with quite a lot of built-in datasets, which R users can play around with. You are probably familiar with many of the built-in datasets like `iris`, `mtcars`, `beavers`, `dataset`, etc. Since datasets are preloaded, we can manipulate them directly. To see a full list of built-in R datasets and their descriptions, please refer to [The R Datasets Package](https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html){target="_blank"}. We can also run `data()` to view the full list.
The most convenient option for viewing is `??datasets` since provides a list of datasets in the Help pane. Clicking on a dataset will bring up its help file. There's lots of important information about the sources of the data and the meaning of the variables in these help files, so be sure to check them out.
Most datasets are [lazy-loaded](https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Data-in-packages), which means that although they don't appear as objects in the global environment, they are there when you reference them. However, for some packages, you must use `data()` to access the datasets, as follows:
```{r, eval=FALSE}
library(pgmm)
data(wine)
```
This is a common source of frustration for students: "I installed the library and loaded the package but the data's not there!" Forewarned is forearmed. Packages that we use that fall in this category include: `lawstat`, `pgmm`, and others. (Submit a PR to add to this list.)
## Import local data
This section covers base R functions for reading data. For tidyverse versions (`read_csv`, `read_delim`, `read_table`, etc.) see the [Data Import chapter](https://r4ds.had.co.nz/data-import.html) of *R for Data Science.)
### Import text file
The function `read.table()` is the most general function for reading text files. To use this function, we need to specify how we read the file. In other words, we need to specify some basic parameters like `sep`, `header`, etc. `sep` represents the separator, and `header` is set to `TRUE` if we want to read the first line as the header information. Other parameters are also useful in different cases. For example, `na.strings` indicates strings should be regarded as NA values.
```{r}
df <- read.table("data/MusicIcecream.csv", sep=",", header=TRUE)
head(df)
```
### Import CSV file
A Comma-Separated Values file (CSV) is a delimited text file that uses a comma to separate values. We can easily read a CSV file with built-in R functions.
The `read.csv()` function provides two useful parameters. One is `header`, which can be set to `FALSE` if there is no header. The other is `sep`, which specifies the separator. For example, we can specify the separator to be `sep="\t` if the CSV file value is seperated by the tab character. The default value of `header` and `sep` are `TRUE` and `","`, respectively.
`read.csv2()` is another function for reading CSV files. The difference between `read.csv()` and `read.csv2` is that, the former uses the tab `"\t"` as the separator, while the latter one uses the semicolon `";"`. This serves as an easy shortcut for different CSV formats used in different regions.
Let's see an example on reading a standard CSV file:
```{r}
df <- read.csv("data/MusicIcecream.csv")
head(df)
```
A small note while reading multiple files: let R know your current directory by using `setwd()`. Then, you can read any file in this directory by directly using the name of the file, without specifying the location.
### Import JSON file
A JSON file is a file that stores simple data structures and objects in JavaScript Object Notation (JSON) format, which is a standard data interchange format. For example, `{"name":"Vince", "age":23, "city":"New York"}` is an object with JSON format. In recent years, JSON has become the mainstream format to transfer data on websites.
To read a JSON file, we can use the `jsonlite` package. The `jsonlite` package is a JSON parser/generator optimized for the web. Its main strength is that it implements a bidirectional mapping between JSON data and the most important R data types. In the example below, the argument `simplifyDataFrame = TRUE` will directly transform a list of JSON objects into a dataframe. If you want to know more about the arguments `simplifyVector` and `simplifyMatrix`, which provide flexible control on other R data formats to transform to, please refer to [Getting started with JSON and jsonlite](https://cran.r-project.org/web/packages/jsonlite/vignettes/json-aaquickstart.html){target="_blank"}.
```{r, message=FALSE, warning=FALSE}
library(jsonlite)
# read JSON data
raw_json_data <- fromJSON(txt = "data/WaterConsumptionInNYC.json", simplifyDataFrame = TRUE)
# transform JSON to Data Frame
df <- as.data.frame(raw_json_data)
head(df)
```
## Import web data
### Read a data file directly into the workspace
Let's take the example of `Water Consumption In The New York City`, which is on the [NYC Open Data website](https://data.cityofnewyork.us/Environment/Water-Consumption-In-The-New-York-City/ia2d-e54m){target="_blank}.
We can import data from a URL just as we do with local data files.
```{r, warning=FALSE, message=FALSE}
library(tidyverse)
# specify the URL link to the data source
url <- "https://data.cityofnewyork.us/api/views/ia2d-e54m/rows.csv"
# read the URL
df <- read_csv(url)
head(df)
```
### Read data from an API
The best option here is to look for a package that has set this up for you, such as **WHO**, **atus**, and many others. Need ideas? [These R packages import sports, weather, stock data and more](https://www.computerworld.com/article/3109890/these-r-packages-import-sports-weather-stock-data-and-more.html){target="_blank"}, is a great place to start looking for such packages.
If such a package does not exist for your data, use the **httr** to facilitate the API calls. The RStudio webinar [Extracting Data from the Web Part I](https://resources.rstudio.com/webinars/extracting-data-web-part1){target="_blank"} is an excellent resource for learning **httr** as is Sharon Machlis's article, [Get API data with R: No R package for the API you want? It’s easy to write your own function with the httr and jsonlite packages](https://www.infoworld.com/article/3434627/get-api-data-with-r.html){target="_blank"}.
### Scrape an HTML table using `rvest`
Sometimes we wish to import data that appears as an HTML table on a web page. It might be a little messy, so best to first check if there's another means for importing the data before moving forward. If not, `rvest` makes the process as painless as possible.
Here's a simple example. Suppose we wish to work with the borough data found on Wikipedia's [Boroughs of New York City](https://en.wikipedia.org/wiki/Boroughs_of_New_York_City) page.
First we read the page, find the tables, and then parse them with `html_table`:
```{r, message=FALSE, warning=FALSE}
library(tidyverse)
library(rvest)
nyctables <- read_html("https://en.wikipedia.org/wiki/Boroughs_of_New_York_City") %>%
html_nodes("table") %>%
html_table(fill = TRUE)
```
`nyctables` is a list with three elements, one for each table on the page.
Next we can check each list item until we find what we want, consulting the original web page to get a sense of where our table is located. (There are other methods for identifying what you need from a web page in more complex situations. See Additional Resources below.)
It turns out that the table we want is the first list element:
```{r}
mytable <- nyctables[[1]]
head(mytable, 3)
```
We can see that the column names are all the same due to the merged header in the original. We'll fix the column names and remove the rows we don't need:
```{r}
colnames(mytable) <- c("borough", "county", "population", "gdp_total", "gdp_per_capita",
"land_sq_miles", "land_sq_km", "density_sq_miles", "density_sq_km")
# remove unneeded rows
mytable <- mytable %>% slice(-c(1, 2, 10))
# convert character to numeric data where appropriate
mytable <- mytable %>%
mutate_at(vars(population:density_sq_km), parse_number)
```
Now we're good to go. Let's draw a plot!
```{r, warning=FALSE, message=FALSE}
options(scipen = 999) # turn off scientific notation
mytable %>%
slice(1:5) %>%
select(borough, gdp_per_capita, land_sq_miles, population) %>%
gather(var, value, -borough) %>%
ggplot(aes(value, fct_reorder2(borough, var=="gdp_per_capita", value, .desc = FALSE),
color = borough)) + geom_point() + ylab("") +
facet_wrap(~var, ncol = 1, scales = "free_x") +
guides(color = FALSE)
```
**Additional Resources**
[Excellent webinar from RStudio on using `rvest`](https://www.rstudio.com/resources/webinars/extracting-data-from-the-web-part-2/) -- covers how to use the structure of the HTML and CSS on the page to scrape the information that you need, as well as using additional `rvest` functions such as `html_text()`, `html_name()`, `html_attrs()`, `html_children()`, etc.
## Import data from database
R provides packages to manipulate data from relational databases like PostgreSQL, MySQL, etc. One of those packages is `odbc` package, which is one database interface for communication between R and relational database management systems. More resources on package: [odbc](https://db.rstudio.com/odbc/){target="_blank"}.
Before we connect to a local database, we must satisfy the requirement of the ODBC driver, through which our R package can communicate with the database. To get help on how to install ODBC driver on systems like Windows, Linux, MacOS, please refer to this document: [Install ODBC Driver](https://cran.r-project.org/web/packages/odbc/readme/README.html){target="_blank"}.
After we installed the ODBC driver, with `odbc` and `DBI` packages, we are able to manipulate the database. To read a table in the database, we usually take steps as follows. First, we build the connection to the database using `dbConnect()` function. Then, we can do some exploratory operations like listing all tables in the database. To query the data we want, we can send a SQL query into the database. Then we can retrieve the desired data and `dfFetch()` provides control on how many records to retrieve at a time. Finally, we finish reading and close the connection.
```{r, eval=FALSE}
library(odbc)
library(DBI)
# build connection with database
con <- dbConnect(odbc::odbc(),
driver = "PostgreSQL Driver",
database = "test_db",
uid = "postgres",
pwd = "password",
host = "localhost",
port = 5432)
# list all tables in the test_db database
dbListTables(con)
# read table test_table into Data Frame
data <- dbReadTable(con, "test_table")
# write an R Data Frame object to an SQL table
# here we write the built-in data mtcars to a new_table in DB
data <- dbWriteTable(con, "new_table", mtcars)
# SQL query
result <- dbSendQuery(con, "SELECT * FROM test_table")
# Retrieve the first 10 results
first_10 <- dbFetch(result, n = 10)
# Retrieve the rest of the results
rest <- dbFetch(result)
# close the connection
dbDisconnect(con)
```
## More resources
- Import local file:
[This R Data Import Tutorial Is Everything You Need](https://www.datacamp.com/community/tutorials/r-data-import-tutorial#Getting){target="_blank"}
- Import JSON file:
[Getting started with JSON and jsonlite](https://cran.r-project.org/web/packages/jsonlite/vignettes/json-aaquickstart.html){target="_blank"}
- Import web data:
[The RCurl Package](http://www.omegahat.net/RCurl/){target="_blank"}
- Import database file
[Databases using R](https://db.rstudio.com/){target="_blank"}
- Documentation on odbc package
[odbc](https://db.rstudio.com/odbc/){target="_blank"}
- Install ODBC Driver On Your System
[Install ODBC Driver](https://cran.r-project.org/web/packages/odbc/readme/README.html){target="_blank"}