---
title: "Gender pay gap analysis"
output:
  html_document:
    toc: true # contents
    toc_float: true # contents float
    number_sections: true
    df_print: paged # paged dataframes
    css: styles.css # use css file from this project
  pdf_document: default
---
# Introduction
This report analyses gender pay gap data for UK companies, as reported by the
companies themselves.

The data is from https://gender-pay-gap.service.gov.uk/viewing/download

I explore the trends and patterns within the data and make comparisons between
Scotland and the rest of the UK.
## Load in packages and data
```{r}
library(tidyverse)
library(janitor)
library(infer)
```
```{r}
# here is how I would read the data files one at a time
gender_pay_2017_18 <- read_csv("data/UK Gender Pay Gap Data - 2017 to 2018.csv")
gender_pay_2018_19 <- read_csv("data/UK Gender Pay Gap Data - 2018 to 2019.csv")
# read in multiple files in one go:
list_of_files <- list.files(
  path = "data",        # specify the folder path
  pattern = "\\.csv$",  # only return files that end in .csv
  full.names = TRUE     # folder path attached to the beginning of the file name
)
# id = "file_name" adds a column recording which file each row came from
pay_combined <- read_csv(list_of_files, id = "file_name")
pay_combined
```
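To illustrate what the `id` argument does, here is a minimal sketch using two tiny made-up CSV files written to a temporary folder. The folder name, file names and values are all hypothetical and are not part of the real data.

```{r}
library(readr)

# Hypothetical demo files written to a temporary folder (not the real data)
demo_dir <- file.path(tempdir(), "pay_gap_demo")
dir.create(demo_dir, showWarnings = FALSE)
write_csv(data.frame(EmployerName = "A Ltd", DiffMeanHourlyPercent = 10.5),
          file.path(demo_dir, "Demo Data - 2017 to 2018.csv"))
write_csv(data.frame(EmployerName = "B Ltd", DiffMeanHourlyPercent = -2.0),
          file.path(demo_dir, "Demo Data - 2018 to 2019.csv"))

demo_files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# The rows of both files are stacked, and file_name records the source file
demo_combined <- read_csv(demo_files, id = "file_name", show_col_types = FALSE)
demo_combined
```

Reading a vector of files this way only works when the files share the same columns, which I assume is the case for the yearly downloads.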
## Clean the data
```{r}
pay_trimmed <- pay_combined %>%
  clean_names() %>%
  # take the first year listed in the file name
  mutate(year_starting = as.numeric(str_extract(file_name, "[0-9]+")), .before = 1) %>%
  # select the desired columns
  select(contains(c("year", "post", "percent", "quartile", "size")))
pay_trimmed
```
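As a quick check of the year extraction, here is a small sketch applying the same regular expression to two of the file names used earlier. The `demo_` variable names are just for illustration.

```{r}
library(stringr)

demo_file_names <- c(
  "data/UK Gender Pay Gap Data - 2017 to 2018.csv",
  "data/UK Gender Pay Gap Data - 2018 to 2019.csv"
)

# str_extract() returns only the first run of digits in each string,
# so the second year in each file name is ignored
demo_years <- as.numeric(str_extract(demo_file_names, "[0-9]+"))
demo_years
```

This relies on the folder path containing no digits. If it did, the first run of digits in the path would be picked up instead of the year.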
## Find out the Scottish postcodes
Scottish postcode areas are listed in many places; I noted them down from here:
https://www.townscountiespostcodes.co.uk/postcodes-in-scotland/
```{r}
scottish_postcodes <- c("AB", "DD", "PH", "FK", "G", "PA", "DG", "KA", "ML", "EH", "KY", "IV", "TD", "ZE", "HS", "KW")
```
Cut down the postcode column to just the first section and use this to determine whether a company is in Scotland or not:
```{r}
pay_region <- pay_trimmed %>%
  mutate(post_code = str_extract(post_code, "[A-Z0-9]+")) %>%   # first section of the postcode
  mutate(post_code = str_remove_all(post_code, "[0-9]+")) %>%   # drop the digits
  mutate(region = if_else(post_code %in% scottish_postcodes, "Scotland", "Rest of UK"),
         .after = post_code) %>%
  select(-post_code)
pay_region
```
# What pay differences can we see in Scotland year on year?
diff_mean_hourly_percent: mean % difference between male and female hourly pay (negative = women's mean hourly pay is higher)
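As I understand the reporting rules, this figure is the difference between men's and women's mean hourly pay expressed as a percentage of men's mean hourly pay. Here is a small worked example with made-up hourly rates (the values are hypothetical, not taken from the data):

```{r}
# Hypothetical mean hourly rates for one employer (not taken from the data)
demo_male_mean_hourly   <- 20
demo_female_mean_hourly <- 17

# Gap as a percentage of men's mean hourly pay
demo_gap <- (demo_male_mean_hourly - demo_female_mean_hourly) /
  demo_male_mean_hourly * 100
demo_gap
```

Swapping the two rates would give a negative value, which corresponds to women's mean hourly pay being higher.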
```{r}
pay_region %>%
  filter(region == "Scotland") %>%
  ggplot() +
  geom_boxplot(aes(x = year_starting, y = diff_mean_hourly_percent, group = year_starting),
               colour = "blue", alpha = 0.5) +
  labs(
    title = "Gender pay differences across Scottish companies",
    subtitle = "2017 - 2023\n",
    x = "\nFinancial year starting",
    y = "Mean % difference in hourly pay\n(M - F)"
  ) +
  scale_x_continuous(breaks = c(2017, 2018, 2019, 2020, 2021, 2022, 2023)) +
  theme_minimal() +
  theme(panel.grid = element_blank(),
        axis.line = element_line(colour = "black")) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  coord_flip()
```
Let's check how many Scottish records there are for each year, including 2023:
```{r}
pay_region %>%
  filter(region == "Scotland") %>%
  group_by(year_starting) %>%
  summarise(nrows = n())
```
# Are there any differences in bonuses given in Scottish companies?
male_bonus_percent: % of male employees paid a bonus
female_bonus_percent: % of female employees paid a bonus
```{r}
pay_region %>%
  filter(region == "Scotland") %>%
  select(male_bonus_percent, female_bonus_percent) %>%
  # keep only rows where both percentages are non-zero
  filter(male_bonus_percent != 0 & female_bonus_percent != 0) %>%
  pivot_longer(cols = c("male_bonus_percent", "female_bonus_percent"),
               names_to = "gender",
               values_to = "bonus_percent") %>%
  ggplot() +
  geom_boxplot(aes(x = gender, y = bonus_percent, colour = gender),
               linewidth = 1.5, show.legend = FALSE) +
  labs(
    title = "Bonuses paid across Scottish companies",
    subtitle = "2017 - 2023\n",
    x = "",
    y = "% of employees paid a bonus"
  ) +
  # named labels tie each label to its column rather than relying on ordering
  scale_x_discrete(labels = c(female_bonus_percent = "Women",
                              male_bonus_percent = "Men")) +
  theme_minimal() +
  theme(panel.grid = element_blank(),
        axis.line = element_line(colour = "black"))
```
We can see that there is very little difference between the percentage of men and the percentage of women paid a bonus across Scottish companies (among records where both percentages are non-zero).
# Does company size affect pay disparity?
```{r}
pay_region %>%
  filter(region == "Scotland") %>%
  # order the size bands from smallest to largest
  mutate(employer_size = factor(employer_size,
                                levels = c("Less than 250", "250 to 499", "500 to 999",
                                           "1000 to 4999", "5000 to 19,999",
                                           "20,000 or more", "Not Provided"))) %>%
  filter(!employer_size %in% c("Not Provided", NA)) %>%
  ggplot() +
  geom_boxplot(aes(x = employer_size, y = diff_mean_hourly_percent), colour = "blue") +
  labs(
    title = "Gender pay differences across Scottish companies",
    subtitle = "2017 - 2023\n",
    x = "\nCompany size",
    y = "Mean % difference in hourly pay\n(M - F)"
  ) +
  theme_minimal() +
  theme(panel.grid = element_blank(),
        axis.line = element_line(colour = "black")) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  coord_flip()
```
This plot appears to show that the larger a company is, the larger the pay gap between men and women (with men earning more per hour).
# How does Scotland compare to the rest of the UK year on year?
```{r}
pay_region %>%
  # set the factor order before grouping so Scotland is listed first in the legend
  mutate(region = factor(region, levels = c("Scotland", "Rest of UK"))) %>%
  group_by(region, year_starting) %>%
  # na.rm = TRUE guards against any missing values in the pay gap column
  summarise(mean_diff_hourly_across_companies = mean(diff_mean_hourly_percent, na.rm = TRUE),
            .groups = "drop") %>%
  ggplot() +
  geom_line(aes(x = year_starting, y = mean_diff_hourly_across_companies, colour = region),
            linewidth = 2) +
  scale_x_continuous(breaks = c(2017, 2018, 2019, 2020, 2021, 2022, 2023)) +
  labs(
    title = "Average gender pay differences between Scotland and the rest of the UK",
    subtitle = "2017 - 2023\n",
    x = "\nFinancial year starting",
    y = "Overall mean % difference in hourly pay\n(M - F)",
    colour = ""
  ) +
  theme_minimal() +
  theme(panel.grid = element_blank(),
        axis.line = element_line(colour = "black")) +
  ylim(c(0, 15))
```
# How does Edinburgh compare to London when it comes to the gender pay gap?
First, divide the region variable into London, Edinburgh and the rest of the UK:
```{r}
# https://postcodefinder.net/england/london
london_postcodes <- c("BR", "CR", "DA", "E", "EC", "EN", "HA", "IG", "KT", "N",
                      "NW", "RM", "SE", "SM", "SW", "TN", "TW", "UB", "W", "WC")
pay_lon_ed <- pay_trimmed %>%
  # keep only the first run of letters (the postcode area); stripping digits from
  # the whole first section would turn e.g. "EC1A" into "ECA" and miss London
  mutate(post_code = str_extract(post_code, "[A-Z]+")) %>%
  mutate(region = case_when(
    post_code %in% london_postcodes ~ "London",
    post_code == "EH" ~ "Edinburgh",
    .default = "Rest of UK"), .after = post_code) %>%
  mutate(region = as.factor(region)) %>%
  select(-post_code)
pay_lon_ed
```
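Some central London postcodes have a first section that ends in a letter (for example EC1A or SW1A), so it matters how the postcode area is derived. The sketch below compares two ways of getting the area from a few sample postcodes. The sample postcodes and the `demo_` names are only for illustration.

```{r}
library(stringr)

demo_postcodes <- c("EH1 1AA", "G2 4JR", "EC1A 1BB", "SW1A 2AA")

# Option 1: first run of letters only, i.e. the postcode area
demo_area <- str_extract(demo_postcodes, "[A-Z]+")

# Option 2: first section of the postcode with every digit removed
demo_outward  <- str_extract(demo_postcodes, "[A-Z0-9]+")
demo_stripped <- str_remove_all(demo_outward, "[0-9]+")

data.frame(demo_postcodes, demo_area, demo_stripped)
```

Both options agree for the Edinburgh and Glasgow examples, but only the first returns "EC" and "SW" for the two London ones.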
```{r}
pay_lon_ed %>%
  filter(region != "Rest of UK") %>% # filter out irrelevant companies
  ggplot() +
  geom_boxplot(aes(x = year_starting, y = diff_mean_hourly_percent, group = year_starting),
               alpha = 0.5) +
  facet_wrap(~ region) +
  labs(
    title = "Gender pay differences across Edinburgh and London companies",
    subtitle = "2017 - 2023\n",
    x = "\nFinancial year starting",
    y = "Mean % difference in hourly pay\n(M - F)"
  ) +
  scale_x_continuous(breaks = c(2017, 2018, 2019, 2020, 2021, 2022, 2023)) +
  theme_minimal() +
  theme(panel.grid = element_blank(),
        axis.line = element_line(colour = "black")) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  coord_flip()
```
We can see here that, across the years 2017 to 2023, the disparity in hourly pay
between men and women is very similar between Edinburgh- and London-based companies.
Specifically, the values of mean percentage difference in pay are mostly positive
which indicates men are paid more than women.
```{r}
pay_lon_ed %>%
  filter(region != "Rest of UK") %>%
  group_by(region) %>%
  summarise(mean_diff_hourly_across_companies = mean(diff_mean_hourly_percent),
            sd_diff_hourly_across_companies = sd(diff_mean_hourly_percent))
```
The total mean difference for Edinburgh companies across all years is 12.2% (SD: 13.3%) and for London companies is 13.6% (SD: 14.7%). These positive values indicate men being paid more hourly on average.
## Hypothesis test comparing means between Edinburgh and London
```{r}
hyp_test_data <- pay_lon_ed %>%
  filter(region != "Rest of UK") %>%
  select(region, diff_mean_hourly_percent)

# visualise data
hyp_test_data %>%
  ggplot(aes(x = region, y = diff_mean_hourly_percent)) +
  geom_boxplot()
```
We can see that the interquartile ranges of the two distributions overlap substantially, and
that the London values are much more widely spread.

I will now create the null distribution:
```{r}
null_distribution <- hyp_test_data %>%
  specify(diff_mean_hourly_percent ~ region) %>%
  # the null hypothesis is that there is no relationship, i.e. they are independent
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("London", "Edinburgh"))
head(null_distribution)
```
We can see here that we have 1000 (only 6 shown) replicated samples, and each has
a difference in means shown (stat).
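
To make the permutation idea concrete, here is a rough base R sketch of the same kind of procedure on simulated data. It is only an illustration of the technique, not the `infer` internals, and the simulated values, group sizes and `demo_` names are all assumptions.

```{r}
set.seed(123)

# Simulated data purely for illustration (not the real pay gap figures)
demo_data <- data.frame(
  region = rep(c("London", "Edinburgh"), times = c(30, 20)),
  value  = c(rnorm(30, mean = 14, sd = 14), rnorm(20, mean = 12, sd = 13))
)

# Difference in group means, London minus Edinburgh
demo_diff_in_means <- function(value, region) {
  mean(value[region == "London"]) - mean(value[region == "Edinburgh"])
}

# Shuffle the region labels 1000 times and recompute the difference each time;
# shuffling breaks any real link between region and value
demo_null_stats <- replicate(
  1000,
  demo_diff_in_means(demo_data$value, sample(demo_data$region))
)

head(demo_null_stats)
```

Because the labels are shuffled at random, the resulting differences should be centred around zero, which is what the null hypothesis of independence implies.
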
Now we have our null distribution, let’s calculate our observed statistic and then
visualise the null distribution and where the observed statistic lies on the distribution.
```{r}
observed_stat <- hyp_test_data %>%
  specify(diff_mean_hourly_percent ~ region) %>%
  calculate(stat = "diff in means", order = c("London", "Edinburgh"))
observed_stat
```
So the observed statistic is that the overall mean percentage difference in London
is 1.4 percentage points higher than in Edinburgh. I will now test whether this is
significant at the 0.05 level.
```{r}
null_distribution %>%
  visualise() +
  shade_p_value(obs_stat = observed_stat, direction = "right")
```
The visualisation shows that the observed statistic lies towards the right-hand
edge of the null distribution. Under the null hypothesis (that the difference
between the cities is 0), there would therefore be only a small probability
of getting a value at least as extreme as the observed one.
Let’s calculate the p-value to be sure.
```{r}
p_value <- null_distribution %>%
  get_p_value(obs_stat = observed_stat, direction = "right")
p_value
```
The p-value is less than 0.05, meaning this difference is statistically significant
at the 0.05 level. I can therefore reject the null hypothesis and conclude that there
is enough evidence in the data to suggest that the overall mean percentage difference
in hourly pay (male - female) is greater across London companies than across
Edinburgh companies.
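
For reference, my understanding is that a one-sided p-value with `direction = "right"` is simply the proportion of permuted statistics that are at least as large as the observed one. Here is a minimal sketch with made-up numbers (the ten "null" values and the observed value are hypothetical, chosen only to make the arithmetic easy to follow):

```{r}
# Made-up permuted statistics and observed statistic, for illustration only
demo_null <- c(-2.1, -0.8, -0.3, 0.0, 0.4, 0.9, 1.2, 1.6, 2.3, 3.0)
demo_observed <- 1.4

# Share of null statistics at least as large as the observed statistic
demo_p_right <- mean(demo_null >= demo_observed)
demo_p_right
```

Here three of the ten made-up values are at least 1.4, so the sketch gives 0.3. With the 1000 permutations used in the real test, the smallest non-zero p-value that could be reported this way is 0.001.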