---
title: 'Worksheet 2: Data Structures'
author: "Amanda Regan"
date: "2023-01-05"
output: pdf_document
---
_This is the second in a series of worksheets for History 8510 at Clemson University. The goal of these worksheets is simple: practice, practice, practice. The worksheet introduces concepts and techniques and includes prompts for you to practice in this interactive document. When you are finished, you should change the author name (above), knit your document to a PDF, and upload it to Canvas._
## Subsetting, Loops & Control Structures, and Functions
### Subsetting
Subsetting is the process of retrieving just the parts (a subset) of a large vector, list, or data frame. R has numerous operators for subsetting information or data structures. They allow you to perform complex operations with a small amount of code but can be hard to master because subsetting operators (`[[`, `[`, and `$`) interact differently with different vector types.
We'll be talking about vectors throughout the first part of this worksheet and it might be useful to refresh our memories about what a vector is.
> _A vector is a list of variables, and the simplest data structure in R. A vector consists of a collection of numbers, arithmetic expressions, logical values or character strings for example._
#### Selecting Multiple Elements
Let's start by creating a simple vector called x.
```{r}
x <- c(15.4, 25, 2, 8.35, 5, 383, 10.2)
```
If you type `x` into your console, you will see that all of the values are printed out and preceded by a `[1]`. What the `[1]` refers to is that first number, 15.4. Its position in the list is 1. Each number in our list has an index number and that is how you retrieve specific positions in this vector.
For example, we can use a positive integer to return elements at specific positions. Let's grab the values in the 3rd and 5th positions.
```{r}
x[c(3,5)]
```
We can also use functions to manipulate our vector. Here we use `order()`, which returns the positions of the values from smallest to largest, to print the values contained in the vector in ascending order.
```{r}
x[order(x)]
```
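It can help to look at what `order()` returns on its own. The sketch below uses a small made-up vector, `temps` (not part of the worksheet data), to show that `order()` gives back positions rather than values, and that subsetting with those positions has the same effect as calling `sort()`.

```{r}
# `temps` is a hypothetical vector used only for illustration
temps <- c(30, 10, 20)

# order() returns the positions of the elements from smallest to largest
order(temps)

# using those positions as indices puts the values in ascending order
temps[order(temps)]

# sort() is a shortcut that returns the sorted values directly
sort(temps)
```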
Duplicate indices will return duplicate values.
```{r}
x[c(3,3)]
```
(@) Create your own vector and return three values using the subsetting method above.
```{r}
```
Negative integers allow us to exclude values at certain positions.
```{r}
x[-c(3, 1)]
```
(@) What happened here? Describe this in your own words.
>
You can use either positive or negative integers to subset a vector but you **cannot mix the two**.
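To see what "cannot mix" means in practice, here is a minimal sketch using a made-up vector called `ages`. Mixing the two kinds of index raises an error, so the call is wrapped in `tryCatch()` to capture the error message instead of stopping the document from knitting. The names `ages` and `mixed` are assumptions for this example only.

```{r}
# `ages` is a hypothetical vector used only for illustration
ages <- c(21, 34, 45, 52)

# mixing a negative and a positive index is an error; tryCatch() lets us
# capture the error message as text rather than halting
mixed <- tryCatch(
  ages[c(-1, 2)],
  error = function(e) conditionMessage(e)
)
mixed

# using only positive or only negative indices works fine
ages[c(2, 3)]
ages[-c(1, 4)]
```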
We can assign logical values to each value in a vector and use those to subset the data. Logical vectors select elements where the corresponding logical value is `TRUE`. Remember, we created a vector earlier and assigned it to x. Now, below, we assign logical values to each of the values in that vector. We're doing this by hand here, but you can imagine a scenario down the road where you use this technique to apply `TRUE` or `FALSE` values to a huge dataset based on some principle. When we run this, only the values marked `TRUE` are printed out.
```{r}
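# keep only the elements of x whose position is marked TRUE
# (one logical value for each of the seven elements of x)
x[c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE)]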
```
We can also subset to get values that match a particular condition. Below, each item in the vector is tested against this condition and R returns the values for which it is `TRUE`.
```{r}
x[x > 7]
```
(@) What is going on in each of the lines above? Describe it in your own words.
>
Subsetting with nothing (empty brackets) returns the original vector. This isn't that useful here but can come in handy when you have more complex structures like matrices and arrays.
```{r}
x[]
```
If your vector is named, you can also use character vectors to return elements with matching names.
```{r}
# give each of the seven elements of x a name, "a" through "g"
(y <- setNames(x, letters[1:7]))
y[c("d")]
```
(@) What happened here? Explain this in your own words.
#### Matrices
We can subset higher dimensional structures, like matrices, too. First let's define a matrix. In R, **a matrix** is a collection of elements of the same data type (numeric, character, or logical) arranged into a fixed number of rows and columns. Since you are only working with rows and columns, a matrix is called two dimensional. To create a matrix you can use the `matrix()` function.
```{r}
matrix(1:9, byrow = TRUE, nrow = 3)
```
In the `matrix()` function:
* The first argument is the collection of elements that R will arrange into the rows and columns of the matrix. Here, we use `1:9` which is a shortcut for `c(1, 2, 3, 4, 5, 6, 7, 8, 9)`.
* The argument `byrow` indicates that the matrix is filled by the rows. If we want the matrix to be filled by the columns, we just place `byrow = FALSE`.
* The third argument `nrow` indicates that the matrix should have three rows.
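The effect of `byrow` is easiest to see side by side. This is a small sketch with made-up object names (`by_row` and `by_col`) that fills the same six numbers both ways.

```{r}
# filled row by row: 1, 2, 3 go across the first row
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)
by_row

# filled column by column (R's default): 1, 2 go down the first column
by_col <- matrix(1:6, nrow = 2, byrow = FALSE)
by_col
```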
(@) You try now: create a matrix that has five rows with numbers that range from 25 to 49.
```{r}
```
(@) Can you create a matrix that has 3 rows with only even numbers from 2 to 18?
```{r}
```
##### Subsetting Matrices
The most common way to subset matrices and arrays is to use a simple generalization of the same subsetting method we used for vectors above. We can subset by supplying an index for each dimension, separated by a comma. So it looks like this:
```{r}
a <- matrix(1:9, nrow=3)
colnames(a) <- c("A", "B", "C")
a
```
Here is our matrix. It's similar to the one we used above except I named each column using the `colnames()` function.
So to subset this matrix and get just the first two rows I can do:
```{r}
a[1:2,]
```
If I want to get the first two columns, I can do:
```{r}
a[,1:2]
```
I could also exclude a column:
```{r}
a[,-2]
```
Or get just one single value from the matrix:
```{r}
a[2,3]
```
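Because the columns of a matrix can be named with `colnames()`, names can also stand in for positions when subsetting. The sketch below uses a separate, made-up matrix `m` so it doesn't overwrite `a`.

```{r}
# `m` is a hypothetical matrix used only for illustration
m <- matrix(1:6, nrow = 2)
colnames(m) <- c("X", "Y", "Z")

# select a single column by name (returns a plain vector)
m[, "Y"]

# select several columns by name (returns a matrix)
m[, c("X", "Z")]

# combine a row position with a column name to get one value
m[2, "Z"]
```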
These skills are important because dataframes have the characteristics of both lists and matrices.
So let's load the Gay Guides data from our class package.
```{r}
library(DigitalMethodsData)
data("gayguides")
```
To subset this data we could pull two particular columns like this:
```{r}
#gayguides[2:3]
```
That returns columns 2 and 3 (including every single row!).
If we use two indices it'll behave more like a matrix.
```{r}
gayguides[1:3,]
```
(@) Why is this different? What do we get compared to the previous example?
>
(@) Can you find just the city and state for the first 5 entries?
```{r}
```
(@) How about the street address and type for rows 2,555 to 2,560?
```{r}
```
(@) Load another dataset from `DigitalMethodsData`. Subset it to find some element of the data. Write the code below and then explain what element of the data you wanted to subset and why:
```{r}
```
>
Another useful operator to know is `$` which we used a little bit in the previous worksheet. `$` is shorthand for accessing a variable in a dataframe.
So for example both of these return the same column of values (the first as a one-column dataframe, the second as a plain vector). One is just easier to type. Because the output here is so large, I've commented these two lines out. Uncomment them and then run the block.
```{r R.options=list(max.print=10)}
#gayguides["title"]
#gayguides$title
```
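To compare the ways of pulling out a column without printing thousands of rows, here is a sketch on a tiny made-up dataframe called `places` (its contents are invented for illustration and are not from `gayguides`).

```{r}
# `places` is a hypothetical dataframe used only for illustration
places <- data.frame(
  title = c("Cafe A", "Bar B", "Club C"),
  city  = c("Greenville", "Atlanta", "Greenville")
)

# single brackets with a name keep the dataframe structure
class(places["title"])

# $ and double brackets both pull the column out as a plain vector
class(places$title)
identical(places$title, places[["title"]])
```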
Notice the above output is _huge_. We can combine `$` and `[` to drill down more specifically like we did earlier.
```{r}
gayguides$city[100:110]
```
(@) What does this code do?
>
We can also use `$` to create a new column in `gayguides`.
```{r}
gayguides$mynewcolumn <- TRUE
gayguides$mynewcolumn[1:10]
```
What does this code do? It creates a new column in the gayguides data called `mynewcolumn` and it assigns `TRUE` to every row (all 60k). This is a bit of an unrealistic example. You probably won't want to add the same value to 60k+ rows, but the concept here is useful and we'll use it later in this worksheet.
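A slightly more realistic variation is to fill the new column with the result of a comparison, so each row gets its own `TRUE` or `FALSE`. The sketch below uses a made-up dataframe, `visits`, with invented column names and values.

```{r}
# `visits` is a hypothetical dataframe used only for illustration
visits <- data.frame(
  city = c("Greenville", "Atlanta", "Greenville"),
  year = c(1975, 1982, 1990)
)

# a single value is repeated for every row
visits$checked <- TRUE

# a comparison produces one TRUE/FALSE per row
visits$after1980 <- visits$year > 1980
visits
```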
### Control Structures: Loops & Choices
Loops and control structures can be indispensable tools when writing a script that does something for you. For example, maybe you want to write a script that goes out to a website, grabs a primary source, downloads it, and saves it to a file then repeats that hundreds of times until you have a set of primary sources. To write that kind of script you'd need loops and control structures to tell your program when to start, stop, and move to the next source.
Loops and control structures are one of those programming patterns that appear in almost every language. The syntax and style change, but the core idea doesn't. This is a concept I'd recommend you master and it'll come in handy for years to come.
**Choices** and **loops** are both known as types of **control structures.**
Choices, like `if`, allow you to run different code depending on the input.
Loops, like `for` and `while`, allow you to repeatedly run code, typically with changing options.
Let's start with choices.
#### Choices
The most important choice statement in R is an `if` statement.
An `if` statement allows you to check the condition of something and then determine what should be done based on that input.
The basic format is this:
```
if (condition) true_action
#OR
if (condition) true_action else false_action
```
And it works like this
```{r}
if (2 < 3) print(TRUE)
if (3 < 2) print(TRUE)
```
The first example above works. Why? Because 2 is indeed less than 3. The second line has no output. Why? Because the condition wasn't met. 3 is not less than 2 and so it didn't print anything out.
But we could ask it to do something else instead:
```{r}
if (3 < 2) print(TRUE) else print("This is clearly false.")
```
Most often, `if` statements are more complex (known as compound statements) and so they use curly braces like this:
```{r}
x <- 71
if (x > 90) {
  "A"
} else if (x > 50) {
  "C"
} else {
  "F"
}
```
(@) You are teaching History 100 for Clemson. Write an `if` statement that calculates whether the enrollment cap of 30 on your class has been met.
```{r}
```
(@) Create a list of presidents and store it in a variable. Use an `if` statement to check if the number of presidents in the list is more than 5. If it's not, indicate that too.
```{r}
```
(@) You can also use an `if` statement to check a variable in a dataframe. How might you use an `if` statement to check if the GayGuides dataset contains any year after 1990? (Hint: first try to figure out how to determine the latest year in the dataframe and then build an `if` statement that checks that and prints something out depending on whether it's true or false. You should think about what kind of value is contained in the Year column, how to access it, and how to check for the latest value.)
```{r}
```
(@) Reflect on the question above. How did you figure this out? What was the process you went through to build this chunk of code?
>
#### Loops
A **loop** is used to iterate over items in a vector. Loops typically follow this form:
```
for (item in vector) perform_action
```
For each item in the vector, perform_action is called once and the value of the item is updated each time.
For example,
```{r}
for (i in 1:3) {
  print(i)
}
```
What does this loop do? `1:3` is shorthand for 1, 2, 3. Run it in your console to see its output. This code says, for every number in the range 1:3, print the number.
We could do this with character vectors too.
```{r}
presidents <- c("George Washington", "John Adams", "Thomas Jefferson")
for (p in 1:length(presidents)) {
  print(presidents[p])
}
```
(@) Why does this work? What does `length()` do and why do we use it on `presidents`?
>
(@) Create a character vector that contains the title of each of your classes this semester. Can you write a `for` loop that prints out "I am enrolled in" and the name of each of your classes? ("I am enrolled in History 8150", "I am enrolled in History 8000".....etc). Hint: you'll need a function to combine the output and some text of your choosing inside the function.
```{r}
```
Sometimes we want our loop to terminate itself early. There are two statements that help us do that:
* `next` exits the current iteration
* `break` exits the entire for loop
```{r}
for (i in 1:10) {
  if (i < 3)
    next
  print(i)
}
```
`next` skips the rest of the current iteration whenever its condition is met. So in the above example it skips the numbers 1 and 2 because they are less than 3. But in some cases we may not want to skip entries but rather exit the loop entirely. Something like this:
```{r}
for (i in 1:10) {
  if (i < 3)
    next
  if (i >= 5)
    break
  print(i)
}
```
(@) What happened here? Why did the loop only print 3 and 4?
(@) In the state population data, can you write a loop that pulls only states with populations between 200,000 and 250,000 in 1800? You could accomplish this a few different ways. Try to use next and break statements.
```{r}
```
(@) What if we wanted to loop through the gay guides data? Can you write a loop that iterates through the cities in the Gay Guides data and returns the title of any location in Greenville?
```{r}
```
`while` is another useful tool for writing loops. `while` will run a loop _while_ a condition is true. Once that condition is no longer true, it exits. For example:
```{r}
i <- 1
while (i < 6) {
  print(i)
  i <- i + 1
}
```
It's important to note here that `i` is defined outside the loop. Its initial value is 1 but each time the loop runs it gets 1 added to it. So the first time it runs, i = 1 and the loop prints the value then adds 1 to i. Now i = 2 but since that is still less than 6 the loop continues and prints the value then adds 1 to i. Eventually, i = 6 and since 6 is not less than 6 the loop exits without printing.
`while` doesn't get used as often in R because there is often a more efficient way to do the same thing. But in other languages `while` gets used much more often. It's a good pattern to be familiar with even if you don't use it frequently in R.
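As one illustration of "a more efficient way", the counting done by the `while` loop above can be written as a `for` loop over `1:5`, which manages the counter for us. Many R functions also work on a whole vector at once, so some loops can be dropped entirely. The names `k` and `counts` below are made up for this sketch.

```{r}
# a for loop over 1:5 prints the same numbers as the while loop,
# without having to create and update a counter by hand
for (k in 1:5) {
  print(k)
}

# arithmetic in R works on every element of a vector at once,
# so adding 1 to each number needs no loop at all
counts <- 1:5
counts + 1
```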
## Functions
A **function** is a set of statements that are organized together to perform a specific task. You've already been using functions because R has a number of built-in functions that are available to users. For example, `print()` is a function.
To be more specific, a **function** is a set of code that performs a task and returns a result. As you advance your programming skills you'll probably have certain tasks that you perform frequently. Once you've run a chunk of code several times it's good practice to turn it into a function so it can be used repeatedly.
A function in R takes any arguments that may be necessary to perform the actions contained within the function, performs those actions, and then returns the result.
For example, the below function takes a number as an input, multiplies it by 5 and then returns the result.
```{r}
myfunction <- function(y) {
  myvalue <- y * 5
  print(myvalue)
}
```
When you run this, nothing is printed. Why? You've defined the function but you haven't called it yet. We can do that like this:
```{r}
myfunction(5)
```
You'll notice that the variable we created inside the function, `myvalue`, doesn't show up. Unless we write code asking R to return the value of that variable, it'll run invisibly in the background and only give us back the result we ask for. This comes in handy when you are writing more complex functions and need to store bits of data temporarily for a calculation that is being run by the function.
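One way to check this for yourself is with `exists()`, which reports whether an object with a given name can be found. The sketch below defines a separate, made-up function `times_five` with an internal variable `inner_value`; both names are assumptions for this example, and it presumes you haven't already created an object called `inner_value` in your workspace.

```{r}
# `times_five` is a hypothetical function used only for illustration
times_five <- function(n) {
  inner_value <- n * 5   # exists only while the function is running
  inner_value            # last expression, so this is what comes back
}

result <- times_five(4)
result

# the variable created inside the function is not in our workspace
exists("inner_value")
```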
Here's another example:
```{r}
historyStudents <- function(studentname, class) {
  # sep = "" joins the pieces without inserting extra spaces between them
  statement <- paste("Hello ", studentname, ". Welcome to ", class, "!", sep = "")
  statement
}
```
(@) Can you run this function?
```{r}
```
There are several components to a function.
* **Function Name**: This is the actual name of the function. It is stored in the R environment as an object with this name. In the above example, the function name is `historyStudents`.
* **Arguments**: An argument is a placeholder. When a function is invoked, you pass a value to the argument. Arguments are optional; that is, a function may contain no arguments. Also arguments can have default values. In the example above, the function takes two arguments: a student's name and the name of a class.
* **Function Body**: The function body contains a collection of statements that defines what the function does. In the above example the function body is the code inside the `{` and `}` brackets.
* **Return Value**: The return value of a function is the last expression in the function body to be evaluated. In the above example, our return value is `statement` which prints our welcome statement using the names we provided in the arguments.
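The list above mentions that arguments can have default values. Here is a minimal sketch of what that looks like, using a made-up function `welcome` modeled on `historyStudents`; the default course name is only an example.

```{r}
# `welcome` is a hypothetical function used only for illustration;
# `class` has a default value, so it can be left out when calling
welcome <- function(studentname, class = "History 8510") {
  paste0("Hello ", studentname, ". Welcome to ", class, "!")
}

welcome("Ada")                    # uses the default class
welcome("Ada", "History 8000")    # overrides the default
```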
(@) Write a function that takes a string of text and counts the number of characters. The function should return "There are xx characters in that string."
```{r}
```
(@) Reflect on the process of building the above function. How did you go about figuring this out?
>
The body of a function can use any of the techniques we've learned in this worksheet, like loops and `if` statements. For example:
```{r}
grade <- function(x) {
  if (x > 90) {
    "A"
  } else if (x > 50) {
    "C"
  } else {
    "F"
  }
}
grade(85)
```
We could run this function as many times as we want.
```{r}
grade(95)
grade(75)
```
(@) In the example below, why doesn't George Washington's grade get printed out?
```{r}
GeorgeWashington <- grade(60)
```
>
Here's a more complex example:
```{r}
gg.locations <- function (city, year) {
  data(gayguides)
  # check every row for a matching city and year
  for (i in 1:length(gayguides$city)) {
    if (gayguides$city[i] == city && gayguides$Year[i] == year) {
      # paste() already separates its pieces with a space
      print(paste("Found a location called", gayguides$title[i]))
    } else {
      next
    }
  }
}
```
(@) Write a function that uses the things you've learned in this worksheet. The function should accept two arguments, city and year. It should pull all the locations in that city and year and return a statement that says "Found a location called xxx". (Where x is the title of a location in that year and city.)
```{r}
```
(@) Use the Boston Women Voter dataset. Write a function that accepts the name of an occupation. Your function should return a new dataframe that includes the records for every voter listed with that occupation.
```{r}
```