# Pooling the Polls Over an Election Campaign {#campaign}
```{r campaign_setup,message=FALSE,cache=FALSE}
library("rstan")
library("tidyverse")
library("lubridate")
library("stringr")
library("pscl")
```
This is an example of pooling polls (public opinion surveys) over time to estimate public opinion.
The example comes from @Jackman2005a and @Jackman2009a; it uses Australian polls of political party preferences taken between the 2004 and 2007 Australian general elections.
The model is similar to, and a simplified form of, the poll-aggregation methods used most notably by FiveThirtyEight, the Huffington Post, Daily Kos, the New York Times Upshot, and others in recent US elections [@Linzer2013a].

## Data

The data are Australian opinion polls of party preferences taken between the 2004 and 2007 federal elections, from the **pscl** package.

```{r AustralianElectionPolling}
data("AustralianElectionPolling", package = "pscl")
glimpse(AustralianElectionPolling)
```

The election results for the House of Representatives are "first preference" vote shares.[^auselecsrc]

```{r elections}
elections <-
  list(
    `2007` = list(
      date = ymd(20071124),
      ALP = 43.4,
      Lib = 36.3,
      Nat = 5.5,
      Green = 7.8,
      FamilyFirst = 2,
      Dems = 0.7,
      OneNation = 0.3,
      sampleSize = 12419863
    ),
    `2004` = list(
      date = ymd(20041009),
      ALP = 37.6,
      Lib = 40.5,
      Nat = 5.9,
      Green = 7.2,
      FamilyFirst = 2.0,
      Dems = 1.2,
      OneNation = 1.2,
      sampleSize = 11715132
    )
  )
```
```{r}
START_DATE <- elections[["2004"]][["date"]]
AustralianElectionPolling <-
  AustralianElectionPolling %>%
  mutate(
    midDate = as.Date(startDate + difftime(endDate, startDate) / 2),         # midpoint of the field period
    ALP_se = sqrt(ALP * (100 - ALP) / sampleSize),                           # sampling SE of the ALP percentage
    time = as.integer(difftime(midDate, START_DATE, units = "days")) + 1L,   # days since the 2004 election
    pollster = as.integer(factor(org))                                       # integer code for each polling house
  )
```
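
As a quick sanity check (a small addition, not part of the original analysis), the poll midpoints and day indexes should fall between the two elections, and each polling organisation should map to a single integer code:

```{r campaign_check}
# Poll midpoints, day indexes, and the mapping from organisation to integer code.
range(AustralianElectionPolling$midDate)
range(AustralianElectionPolling$time)
count(AustralianElectionPolling, org, pollster)
```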

## Model

The Stan model is:

```{r campaign_mod,results='hide',cache.extra=tools::md5sum("stan/campaign.stan")}
campaign_mod <- stan_model("stan/campaign.stan")
```
```{r echo=FALSE,results='asis'}
campaign_mod
```

The house effects are given a prior with mean zero and a standard deviation of 7.5 percentage points; that standard deviation places roughly 95% of the prior mass between -15 and +15, which would be a very large house effect.
Day-to-day movements in public opinion are expected to be small, mostly within $\pm 2$ percentage points.
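
In outline, the polls are pooled through a latent daily series of ALP support with additive house effects, following @Jackman2005a. A sketch of the state-space form suggested by the data passed to the model below (the exact parameterization in `stan/campaign.stan` may differ, for example in the priors placed on the scale parameters):
$$
\begin{aligned}
y_i &\sim \mathrm{Normal}\left(\xi_{t_i} + \delta_{h_i},\ s_i^2\right), & i &= 1, \dots, N, \\
\xi_t &\sim \mathrm{Normal}\left(\xi_{t-1},\ \tau^2\right), & t &= 2, \dots, T, \\
\delta_h &\sim \mathrm{Normal}\left(0,\ \zeta^2\right), & h &= 1, \dots, H,
\end{aligned}
$$
where $y_i$ is the ALP share in poll $i$, taken on day $t_i$ by house $h_i$ with sampling standard error $s_i$; $\xi_t$ is latent ALP support on day $t$; $\delta_h$ is the house effect of pollster $h$; $\tau$ is the scale of day-to-day movements; $\zeta$ is the scale of the house effects; and $\xi_1$ and $\xi_T$ are anchored at the 2004 and 2007 election results.
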
## Estimation
```{r campaign_data}
campaign_data <- within(list(), {
  y <- AustralianElectionPolling$ALP            # observed ALP share (percent)
  s <- AustralianElectionPolling$ALP_se         # sampling standard errors
  time <- AustralianElectionPolling$time        # day of each poll
  house <- AustralianElectionPolling$pollster   # polling house of each poll
  H <- max(AustralianElectionPolling$pollster)  # number of polling houses
  N <- length(y)                                # number of polls
  T <- as.integer(difftime(elections[["2007"]][["date"]],
                           elections[["2004"]][["date"]],
                           units = "days")) + 1 # number of days between the elections
  xi_init <- elections[["2004"]][["ALP"]]       # anchor: 2004 election result
  xi_final <- elections[["2007"]][["ALP"]]      # anchor: 2007 election result
  delta_loc <- 0                                # prior hyperparameters passed to the Stan model
  tau_scale <- sd(y)
  zeta_scale <- 5
})
```
```{r campaign_fit,results='hide'}
campaign_fit <- sampling(campaign_mod, data = campaign_data,
                         chains = 1,
                         init = list(list(xi = rep(mean(campaign_data$y), campaign_data$N))))
```
```{r}
campaign_fit
```
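
Only one chain was run, so it is worth a quick check of the sampler diagnostics before interpreting the estimates. The snippet below is a minimal sketch (not part of the original analysis); for serious use you would run several chains and compare them.

```{r campaign_diagnostics}
# Rank parameters by Rhat and look at effective sample sizes; Rhat well above 1
# or very small n_eff would flag convergence problems.
summary(campaign_fit)$summary %>%
  as.data.frame() %>%
  rownames_to_column("parameter") %>%
  select(parameter, mean, sd, n_eff, Rhat) %>%
  arrange(desc(Rhat)) %>%
  head()
```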

Plot the posterior of the latent daily share supporting the ALP, with the individual polls overlaid.

```{r campaign_plot_xi}
xi <- summary(campaign_fit, pars = "xi")$summary %>%
  as.data.frame() %>%
  rownames_to_column("parameter") %>%
  filter(!str_detect(parameter, ",2\\]")) %>%
  mutate(date = START_DATE + row_number() - 1)

ggplot() +
  geom_ribbon(data = xi,
              mapping = aes(x = date, ymin = mean - 2 * sd, ymax = mean + 2 * sd),
              alpha = 0.3) +
  geom_line(data = xi, mapping = aes(x = date, y = mean)) +
  geom_point(data = AustralianElectionPolling,
             mapping = aes(x = midDate, y = ALP, colour = org))
```

## Questions

1. Which polling firm is the most biased? Do any polling firms have 95 percent credible intervals for their house effects that exclude zero?
1. This is a retrospective model. How would you turn this into a forecasting model? How would you evaluate its performance?
1. Model this with a logit-transformed outcome. Do your results change? Would you have expected the results to change much?
1. Model this with all parties and a multinomial response.
1. Model this with a varying-slope model.
1. Adjust the model to (1) allow for outliers in polls, and (2) allow for sudden changes in sentiment in the polling average.
1. Currently the only effect of polling firms is through their bias. How would you alter the model to allow some polling firms to have more variable polls? How is this similar to weighting different firms differently?
1. Why is it necessary to constrain the "house effects" to average zero? What would happen if you left them unconstrained?
1. It is known that the polls can be systematically biased in elections, meaning that the average of polls can differ from the true result beyond what is implied by sampling uncertainty. How would you incorporate that into the model?
1. Instead of fixing $\xi_1$ and $\xi_T$, we could treat elections as another public opinion survey. What values of $y$, $s$, and $\delta$ would you give to elections? Model and compare the results to the original method.

[^auselecsrc]: Sources: [2007](https://en.wikipedia.org/wiki/Results_of_the_Australian_federal_election,_2007_%28House_of_Representatives%29), [2004](https://en.wikipedia.org/wiki/Full_national_lower_house_results_for_the_2004_Australian_federal_election).