-
Notifications
You must be signed in to change notification settings - Fork 0
/
Lesson 6 Reading and Writing Data Files.Rmd
257 lines (185 loc) · 8.45 KB
/
Lesson 6 Reading and Writing Data Files.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
---
title: "R Notebook: Reading and Writing Data Files"
output: html_notebook
---
```{r}
library(tidyverse)
library(xlsx)
```
Load a small spending file three ways
```{r}
# Read in a csv file, store the data frame in the variable "spending"
# col_names argument is TRUE if the first row of the file contains column headers / names
# TRUE is the default value, but I am showing it here for instructional purposes
# Quote is the character used to quote text fields, this can be handy if you have large
# text fields that might contain formatting (carriage returns, form feeds or line breaks)
# or non-standard characters for your locale.I used the backtick (`), but the tilde (~)
# is another good option.
spending <- read_csv("SomeSpending.csv", col_names=TRUE, quote = "`")
# Call spec() to see how the parser guessed field types:
spec(spending)
```
The call to spec() is telling us that the parser detected that the file is comma-delimited, which is correct!
However, the parser thought our 'date' field was a character field, and 'item' should be an double, rather than an integer. The other fields are typed correctly, although the 'cat' field contains a string that should be treated as a category or factor.
Let's make a few changes to the code so that the file is read and parsed correctly:
```{r}
spending <- read_csv(
"SomeSpending.csv",
col_types = list(
date = col_date(format = "%m/%d/%Y"),
amount = col_double(),
item = col_integer(),
desc = col_character(),
cat = col_factor()
)
)
spec(spending)
```
```{r}
spending
```
This imaginary person's life is a LOT more fun (and a lot more expensive) than mine! :)
We don't have to load a file to look at its spec(). This is really handy when we want to see how readr will parse a file before we load it! That way, we can make the formatting changes in the first readr call.
```{r}
# Call spec() before loading the file (same file, in tab-delimited text format):
spec_tsv("SomeSpending.txt", col_names = TRUE)
# Load the file with the correct data types:
spending_txt <- read_tsv(
"SomeSpending.txt",
col_types = list(
date = col_date(format = "%m/%d/%Y"),
amount = col_double(),
item = col_integer(),
desc = col_character(),
cat = col_factor()
)
)
# Show the first five rows of the tibble data frame
head(spending_txt, 5)
```
```{r}
# Loading an Excel version of the file
# The basic version, using sheetIndex to specify the sheet:
excel_spending <- read.xlsx("SomeSpending.xlsx", sheetIndex = 1)
# The xlsx library DOES NOT create a tibble (special tidyverse data.frame)
# Instead, it creates an old-fashioned R data.frame object
# To see its structure and data types, we use the str() command
str(excel_spending)
```
The read.xlsx function parsed the date correctly! It isn't as fancy as tibble wrt numeric fields, and doesn't differentiate between integers and floats. It interpreted both 'desc' and 'cat' as strings.
To cast cat as a factor, we can reassign the value using as.factor:
```{r}
# Convert the 'cat' column to a categorical variable:
excel_spending$cat <- as.factor(excel_spending$cat)
str(excel_spending)
```
Now all our data types are accurate!
```{r}
# A few other useful things to know about reading Excel files:
# Read in only the first three columns:
excel_spending_subset <- read.xlsx("SomeSpending.xlsx", sheetIndex = 1, colIndex = 1:3)
head(excel_spending_subset, 3)
# Specify the starting row (handy for skipping some rows, OR
# for opening files that have big headers or extra formatting
# or graphics). If skipping some rows, you need to specify
# column names, or the last column you skipped will turn into
# a header.
excel_spending_subset2 <- read.xlsx("SomeSpending.xlsx",
sheetIndex = 1,
startRow = 3,
header = FALSE)
# assign column names
colnames(excel_spending_subset2) <- c("date", "amount", "item", "desc", "cat")
# don't forget our categorical variable...
excel_spending_subset2$cat <- as.factor(excel_spending_subset2$cat)
head(excel_spending_subset2, 3)
```
Let's make changes to our tibbles so we can write them to new files!
Go here for a detailed explanation of tibble vs. data.frame:
https://www.r-bloggers.com/2020/05/create-and-convert-tibbles/
NOTE: when they say 'the tibbles package needs to be loaded' - you've taken care of that by loading library(tidyverse)!
```{r}
# Spending tibble
print(spending)
# Using dplyr's mutate function to create a new column called "expensive" based on the value of "amount"
# REMEMBER, you can either pass the spending tibble as an argument, OR use the pipe operator %>%
# NOTE, if you have more than two options for a new variable, you can use case_when rather than ifelse
# https://www.statology.org/conditional-mutating-r/
spending <- spending %>%
mutate(expensive = ifelse(amount > 300, 1, 0))
# Spending tibble with new column, 'expensive'
print(spending)
```
Now, we can also use dplyr to easily summarize the most expensive categories!
First, we pass the tibble to the 'group_by' function, to arrange it by category, then we summarize the amount and expensive columns to get the total amounts and # of expensive purchases by category.
This small snippet of code is an introduction to dplyr's power.
```{r}
cat_spending_summary <- spending %>%
group_by(cat) %>%
summarise(sum(amount), sum(expensive))
colnames(cat_spending_summary) <- c("cat", "total_spending", "num_expensive")
print(cat_spending_summary)
```
Our friend sure knows how to party, but might be spending a little too much on fun.
Still, what about annual or one-time purchases?
Those aren't as repetitive, so a few of them isn't the worst.
Let's add another column for those.
This time, we'll have to scrutinize each entry to decide whether or not the label applies:
```{r}
print(spending)
# Using old-fashioned assignment to create a new column (also works on data.frames)
spending$occasional <- c(1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1)
print(spending)
```
Both of our new columns are typed 'double'; let's save space by converting them to logical.
```{r}
spending$expensive <- as.logical(spending$expensive)
spending$occasional <- as.logical(spending$occasional)
# And, re-summarize:
cat_spending_summary <- spending %>%
group_by(cat) %>%
summarise(sum(amount), sum(expensive), sum(occasional))
colnames(cat_spending_summary) <- c("cat", "total_spending", "num_expensive", "num_occasional")
print(cat_spending_summary)
```
So even though our friend spent a lot on fun, 75% of those purchases were infrequent.
```{r}
# Summarizing using some functions from last week's lesson :)
writeLines(
paste("Grand total recorded spending in 2018 file:",
paste0('$',
format(
sum(cat_spending_summary$total_spending),
big.mark = ","))))
```
Ok, let's write our new tibbles to a csv file, and an excel file
```{r}
# Writing 'spending' to a csv file
write_csv(spending, "SomeSpending2.csv")
# And read it back...
quickread <- read_csv("SomeSpending2.csv")
quickread
```
Once again, reading in the file, readr didn't know our categorical variable was categorical, or that the 'item' variable was an integer, instead of a double. But everything else looks good.
```{r}
# Writing 'spending' to an Excel file, and 'cat_spending_summary' to the second sheet
# Creating data.frame versions of our tibbles:
spending_df <- as.data.frame(spending)
cat_spending_summary_df <- as.data.frame(cat_spending_summary)
write.xlsx(spending_df,
file = "SomeSpending2.xlsx",
sheetName = "All Data",
row.names = FALSE, # don't save the row names, they're just numbers
append = FALSE)
# Add a second data set (now, append must be = TRUE)
write.xlsx(cat_spending_summary_df,
file = "SomeSpending2.xlsx",
sheetName = "By Category",
row.names = FALSE,
append = TRUE)
```
Open the file in Excel, and check it out!
Also, the xlsx R library has a lot of functionality that allows for creation of nifty things in Excel documents using code! If you want to learn how to add lots more formatting and functionality,
**including plots** to your R-generated Excel spreadsheets, check out the following tutorial:
https://www.learnbyexample.org/read-and-write-excel-files-in-r/#the-xlsx-package
xlsx Docs: https://www.rdocumentation.org/packages/xlsx/versions/0.6.5