forked from anabento/R_Bootcamp
-
Notifications
You must be signed in to change notification settings - Fork 0
/
M5ManipulatingData.Rmd
98 lines (63 loc) · 4.38 KB
/
M5ManipulatingData.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
---
title: ""
author: ""
date: "`r Sys.Date()`"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, eval=FALSE)
```
#<span style="color:cadetblue">Manipulating and exploring data</span>
***
This module will cover:
- loading data
- calculate simple stats in base `R`
- plot function in base `R`
which will require the following skills already covered:
- logic statements
- manipulating data
- assigning an object
###<span style="color:cadetblue">Loading tabular data </span>
Before we jump into calculating summary statistics and simple data visualization, we need to import or load data into `R`. External data, often in table form, needs to be loaded into R. Common ways to read data frames into R uses `read.csv` or `read.table`.
<span style="color:orangered">**Practice**:</span> Write code to load a .csv file into `R`. How does `R` know where the file is located?
There are lots of tricks to loading data depending on the data types. For more information check out [datacamp's tutorial](https://www.datacamp.com/community/tutorials/r-data-import-tutorial#gs.4uaBTV8).
###<span style="color:cadetblue">Iterative process of data exploration</span>
Data exploration is a key step in scientific inquiry. Effective exploration might include transforming the data by grouping replicate populatons, calculating summary statistics, etc. Data cleaning (ie. removing missing values) should be done **prior** to this step and done in `R`. Data cleaning in `R` means you'll have a record of how the data was changed.
![data science cycle](data-science.png)
From the chart above, we can see that data visualization is also an important step to understanding your data. The rest of this module will be spent exploring a new data set using the skills we have already covered, and building on them to create plots of the data.
###<span style="color:cadetblue">Loading data from a package</span>
We'll explore the `flights` data in the data package called [`nycflights13`](https://cran.r-project.org/web/packages/nycflights13/index.html). The `flights` object contains airline on-time data for all flights leaving airports in the New York City, NY area in 2013.
You will most likely have to install this package by using the RStudio GUI or `install.packages()` command.
```{r}
install.packages("nycflights13") #install package
library(nycflights13) #load library
```
You'll notice the `nycflights13` package includes other metadata:
- airlines
- airports
- planes
- weather
You can access them through the package using double colons following the package name (`nycflights13::`), but they are also loaded in the workspace automatically.
You'll explore the data as a team of two, a driver and navigator. The driver will type the code, while the navigator checks for errors, and suggests answers. A good navigator will troubleshoot syntax questions by reading manual pages, googling, etc. Throughout the exercise you will be asked to switch roles with your teammate.
###<span style="color:cadetblue">Your problem</span>
Your data exploration should be working towards describing the broad trends in the data, along with answering specific questions. You should work with your partner to develop a common set of goals. If time allows, you'll present your findings to the clinic.
Data exploration might include answering the following:
1. Exploring Data Structure
* What is the data structure? columns and rows?
* How many total flights, total flights per airport or airline carrier?
* What are the different flight destinations?
* etc.
2. Exploring Data with Summary Statistics
* What is the max, min and mean in flight delays? How does this break down by airline, airport or both?
* What is the most common flight destination? Least common?
* What is the average number of flights per month?
* Other descriptive statistics built into `R` include: mean, sd, var, min, max, median, range, and quantile
3. Exploring Data with Basic Visualizations
* scatter plot: How does flight delay change with time? with weather? Do these relationships change with airport?
* histogram: What is the distribution of flights with time? to a destination?
* boxplot: What is the distribution of flight delay by airline?
* etc.
Get started by opening the **manipulating_data.R** script.