-
Notifications
You must be signed in to change notification settings - Fork 3
/
09-analysis-outside-of-EdSurvey.Rmd
110 lines (80 loc) · 7 KB
/
09-analysis-outside-of-EdSurvey.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
# Analysis Outside `EdSurvey` {#analysisOutsideEdSurvey}
Last edited: July 2023
**Suggested Citation**<br></br>
Lee, M. Introduction. In Bailey, P. and Zhang, T. (eds.), _Analyzing NCES Data Using EdSurvey: A User's Guide_.
`EdSurvey` gives users functions to efficiently analyze education survey data. Although `EdSurvey` allows for rudimentary data manipulation and analysis, this chapter will discuss how to integrate other R packages into `EdSurvey`. As this chapter will demonstrate, this functionality is especially useful for data processing and manipulation in popular R packages such as `dplyr`.
## Integration With Any Other Package
By calling the function `getData()`, one can extract a `light.edsurvey.data.frame`: a `data.frame`-like object containing requested variables, weights, and each weight's associated replicate weights. This `light.edsurvey.data.frame` can be not only manipulated as with other `data.frame` objects but also used with packaged `EdSurvey` functions. As noted in [Chapter 6](#retrievingAllVariablesInADataset), setting the arguments `dropOmittedLevels` and `defaultConditions` to `FALSE` ensures that the values that would normally be removed are included. The argument `addAttributes = TRUE` ensures the extraction of necessary survey design attributes, including the replicate weights, PSU variables, and strata variables.
```{r readNAEPSource}
library(EdSurvey)
sdf <- readNAEP(path = system.file("extdata/data", "M36NT2PM.dat", package = "NAEPprimer"))
gddat <- getData(data = sdf, varnames = c('composite', 'dsex', 'b017451', 'origwt'),
addAttributes = TRUE, dropOmittedLevels = FALSE)
```
The base R function `gsub` allows users to substitute one string for another. The following step recodes "Every day" to "Seven days a week". The `head` function reveals the first 6 values of the recoded variable `b017451` accessed by the `$` operator:
```{r baseRecodeColumn}
# 1. Recode a Column Based on a String
gddat$b017451 <- gsub(pattern = "Every day", replacement = "Seven days a week",
x = gddat$b017451)
head(x = gddat$b017451)
```
After manipulating the data, you can use a `light.edsurvey.data.frame` with any `EdSurvey` function. As shown in the previous example, after retrieving a dataset, it can be used with most other R package functions, but occasionally one might encounter errors. A helper function to circumvent these errors is `rebindAttributes`.
## Applying `rebindAttributes` to Use `EdSurvey` Functions With Manipulated Data Frames
The `rebindAttributes` function allows users to reassign the survey data attributes required by `EdSurvey` to a data frame that might have had its attributes stripped during the manipulation process. After rebinding the attributes, all variables---including those outside the original dataset---are available for `EdSurvey` analytical functions.
For example, a user might want to run a linear model using `composite`, the default weight `origwt`, the variable `dsex`, and the categorical variable `b017451` recoded into a binary variable. To do so, we can return a portion of the `sdf` survey data as the `gddat` object. Next, use the base R function `ifelse` to conditionally recode the variable `b017451` by collapsing the levels `"Never or hardly ever"` and `"Once every few weeks"` into one level (`"Rarely"`) and all other levels into `"At least once a week"`.
```{r rebindAttributesPrep9, cache=FALSE, warning=FALSE}
gddat <- getData(data = sdf, varnames = c("dsex", "b017451", "origwt", "composite"),
dropOmittedLevels = TRUE)
gddat$studyTalk <- ifelse(gddat$b017451 %in% c("Never or hardly ever",
"Once every few weeks"),
"Rarely", "At least once a week")
```
From there, apply `rebindAttributes` from the attribute data `sdf` to the manipulated data frame `gddat`. The new variables are now available for use in `EdSurvey` analytical functions:
```{r rebindAttributes9, cache=FALSE, warning=FALSE}
gddat <- rebindAttributes(data = gddat, attributeData = sdf)
lm2 <- lm.sdf(formula = composite ~ dsex + studyTalk, data = gddat)
summary(object = lm2)
```
## Integration With `dplyr`
One popular package for data manipulation in the R ecosystem is `dplyr`. Given its ubiquity, it merits noting common errors that one might encounter when performing analyses using `EdSurvey` together with `dplyr`.
Let's say a user is interested in predicting how often a student talks about studies at home based on their gender and disability status. The following example demonstrates how to predict whether a student talks about studies at home (`b017451`) based on their sex (`dsex`) and whether they have an individualized education plan (`iep`) using the weight `origwt`. The dependent variable `b017451` specified using the outcome level of the regression with `I(b017451 == "Never or hardly ever")`:
```{r dplyr, message=FALSE}
library(dplyr)
library(tidyr)
```
```{r dplyr2, error=TRUE}
gddat <- getData(data = sdf, varnames = c("dsex", "b017451", "iep", "lep", "origwt", "composite"),
addAttributes = TRUE, dropOmittedLevels = TRUE)
```
The `dplyr` function `unite()` takes multiple variables and concatenates them, similar to the base R function `paste0()`. The `%>%` (pipe) operator allows an object to be passed forward to another function call.
```{r combineVariables}
# Unite columns
gddat <- gddat %>% unite(col = "combinedVar", dsex, iep, sep = "_")
table(gddat$combinedVar)
```
```{r logit1, error = TRUE}
# Specify level in I()
logit1 <- logit.sdf(formula = I(b017451 == "Never or hardly ever") ~ combinedVar,
data = gddat)
```
When we attempt to run the logistic regression, `EdSurvey` returns an error that it cannot locate the survey weights for this data frame. After creating a new variable, `EdSurvey` can no longer access the survey attributes needed to complete this analysis. To remedy, apply `rebindAttributes` from the attribute data `sdf` to the manipulated data frame `gddat`:
```{r rebind2}
gddat <- rebindAttributes(data = gddat, attributeData = sdf)
logit1 <- logit.sdf(formula = I(b017451 =="Never or hardly ever") ~ combinedVar,
data = gddat)
```
Other functions, such as `rowwise()`, `group_by()`, and `ungroup()` silently override the class of the `light.edsurvey.data.frame`, causing the attributes to be inaccessible. In the following example, we use `mutate()` to create a new variable `mrpcmAverage` that calculates the mean of each row's plausible values:
```{r getDataExample2}
gddat <- getData(data = sdf,
varnames = c("dsex", "b017451", "iep", "lep", "origwt", "composite"),
addAttributes = TRUE, dropOmittedLevels = TRUE)
gddat <- gddat %>%
rowwise() %>%
mutate(mrpcmAverage = mean(c(mrpcm1, mrpcm2, mrpcm3, mrpcm4, mrpcm5), na.rm = TRUE))
class(gddat)
```
The function `rebindAttributes()`reapplies survey attributes and prepares the data for use with `EdSurvey` analysis functions.
```{r rebind4}
gddat <- rebindAttributes(data = gddat, attributeData = sdf)
class(gddat)
```