---
title: "Influence of Social Norms and Community Interactions on Crime Rates: A Statistical Exploration"
author: "Oddi Livia"
date: "2024-11-07"
output:
  html_document:
    toc: true
    toc_depth: 3
    toc_float: true
    number_sections: false
    theme: paper
    highlight: tango
    df_print: paged
  word_document: default
  pdf_document: default
urlcolor: magenta
linkcolor: cyan
geometry: margin=1.25cm
fontsize: 12pt
header-includes:
- \usepackage{bbold}
- \usepackage{mdframed, xcolor}
- \usepackage{graphicx}
- \mdfsetup{frametitlealignment=\center}
- \usepackage{multirow}
- \definecolor{shadecolor}{rgb}{0.89,0.8,1}
- \newcommand{\Prob}{\mathbb{P}}
- \newcommand{\Exp}{\mathbb{E}}
- \newcommand{\Var}{\mathbb{V}\mathrm{ar}}
- \newcommand{\Cov}{\mathbb{C}\mathrm{ov}}
- \newcommand{\blue}{\textcolor{blue}}
- \newcommand{\darkgreen}{\textcolor[rgb]{0,.5,0}}
- \newcommand{\gray}{\textcolor[rgb]{.3,.3,.3}}
- \newcommand{\blueA}{\textcolor[rgb]{0,.1,.4}}
- \newcommand{\blueB}{\textcolor[rgb]{0,.3,.6}}
- \newcommand{\blueC}{\textcolor[rgb]{0,.5,.8}}
- \newcommand{\evidenzia}{\textcolor[rgb]{0,0,0}}
- \newcommand{\nero}{\textcolor[rgb]{0,0,0}}
- \newcommand{\darkyel}{\textcolor[rgb]{.4,.4,0}}
- \newcommand{\darkred}{\textcolor[rgb]{.6,0,0}}
- \newcommand{\blueDek}{\textcolor[rgb]{0.6000000, 0.7490196, 0.9019608}}
- \newcommand{\purpLarry}{\textcolor[rgb]{0.6901961, 0.2431373, 0.4784314}}
- \newcommand{\lightgray}{\textcolor[rgb]{.8,.8,.8}}
- \newcommand{\bfun}{\left\{\begin{array}{ll}}
- \newcommand{\efun}{\end{array}\right.}
editor_options:
  markdown:
    wrap: 72
---
### Introduction
This project aims to explore how social norms and community interactions
influence crime rates. By examining the social fabric and relational
dynamics within communities, we gain deeper insights into crime that go
beyond environmental or individual factors.<br> Focusing on the
sociological perspective, this study investigates the relationship
between community cohesion, social norms, and crime rates, highlighting
the impact of social structures and collective behaviors on criminal
activity.
#### **Objective**
The primary objective is to explore the connections between community
cohesion and crime dynamics. This involves analyzing how variations in
community cohesion correlate with crime rates and how social structures
within communities contribute to these patterns. The study aims to use a
Bayesian Hierarchical Model to account for different levels of social
interactions, from individual to community-wide scales, to better
understand these relationships.
#### **Dataset**
The crime dataset used in this project is obtained from the [UCI Machine
Learning
Repository](https://archive.ics.uci.edu/dataset/183/communities+and+crime),
specifically the *Communities and Crime* dataset. This dataset is
fetched using Python as per the provided instructions on the UCI website
and then uploaded to a platform suitable for analysis in R.
The dataset contains 128 variables chosen for their potential link to
crime, spanning community characteristics and law enforcement metrics,
and covers the following types of data:
- ***Social Cohesion Indicators*** : Data on community engagement,
participation in local events, sense of community, and social trust
derived from surveys.
- ***Socio-economic Data*** : Information on income distribution,
educational attainment, and employment rates within communities.
- ***Crime Statistics*** : Detailed crime reports categorized by type
and intensity, including data on locations and times.
The target variable, *`Per Capita Violent Crimes`*, was calculated using
population data and the sum of violent crimes (murder, rape, robbery,
and assault). Due to inconsistencies in rape counts, some cities, mainly
from the Midwestern USA, were excluded. All numeric data were normalized
to a 0.00-1.00 range using an unsupervised, equal-interval binning
method, preserving the distribution and skew of each attribute but not
the relationships between different attributes. Extreme values more than
3 standard deviations from the mean were capped at 1.00 or 0.00.
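As an illustration only (this reproduces the idea, not the exact UCI
preprocessing pipeline), the capping-and-rescaling described above could
be sketched in R as:
```{r eval=FALSE}
# Illustrative sketch: cap values beyond 3 SDs, then rescale to 0.00-1.00
rescale_01 <- function(x) {
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  x <- pmin(pmax(x, m - 3 * s), m + 3 * s)  # cap extreme values
  (x - min(x)) / (max(x) - min(x))          # equal-interval map to [0, 1]
}
```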
Due to time and memory constraints, the following variables were
selected for this project:
- ***Social Cohesion Indicators*** : *Teen_2Par, YoungKids_2Par,
Families_2Parents, Large_Families, Working_mom, Illegitimate_Births*
- ***Socio-economic Data*** : *Median_Income, Employed, Unemployment,
Below_Poverty, Degree_BS_Or_More, Inc_from_inv, Speak_Eng_Only,
Poor_English, Welfare_Public_Assist*
### Exploratory Data Analysis
The initial step of this project involved performing Exploratory Data
Analysis (EDA) to understand the structure and distribution of the data,
identify patterns, and detect any anomalies or outliers. This analysis
provided valuable insights and helped in making informed decisions for
data preprocessing and model building.
```{r message=FALSE, warning=FALSE, include=FALSE}
#Libraries
library(clubSandwich)
library(ggplot2)
library(corrplot)
library(stats)
library(brms)
library(scales)
library(bayesplot)
library(e1071)
library(reshape2)
library(GGally)
library(plotly)
library(patchwork)
library(posterior)
library(lme4)
library(broom.mixed)
library(gamlss)
library(glmmTMB)
library(DHARMa)
library(coda)
library(rjags)
```
Part of the preprocessing of the dataset, done in Google Colab using
Python (`SDS2_preprocessing.ipynb`), ensured the selection of an
appropriate number of variables and that the data were clean, consistent,
and suitable for further analysis.<br> The resulting reduced dataset,
which will be used for the project, includes the following variables:
```{r echo=FALSE, message=FALSE, warning=FALSE}
data = read.csv("crime_data.csv")
head(data,5)
data$State = as.factor(data$State)
#Because the States are encoded as numbers but I need them as categorical data
```
As a first step, observations with zeros or ones in the target or other
variables were removed to ensure data consistency, reducing the dataset
from 1994 observations to 1038.<br> After this filtering, a *Normal Q-Q
Plot* and a *Residuals vs Fitted values* plot were drawn to check whether
the distribution meets the normality and homoscedasticity assumptions:
```{r echo=FALSE, message=FALSE, warning=FALSE}
data = data[!apply(data == 0 | data == 1, 1, any), ]
#Let's see the residuals
residuals_lm = lm(target ~ Employed, data = data)$residuals
qqnorm(residuals_lm)
qqline(residuals_lm)
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
#Linear model
lm_model_target <- lm(target ~ Employed, data = data)
residuals_lm <- residuals(lm_model_target)
fitted_lm <- fitted(lm_model_target)
ggplot(data, aes(x = fitted_lm, y = residuals_lm)) +
geom_point(alpha = 0.5) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
labs(title = "Residuals vs Fitted Values",
x = "Fitted Values",
y = "Residuals") +
theme_minimal()
```
The **Normal Q-Q plot** shows that residuals deviate from the reference
line, particularly at the tails, suggesting some non-normality that
might impact model assumptions. In the **Residuals vs. Fitted Values
plot**, the residuals are scattered around zero, indicating general
support for homoscedasticity, although there is slight variation across
the fitted values. This variation could signal minor heteroscedasticity.
Applying a **Box-Cox transformation** can help mitigate these issues by
making the residuals closer to normal and stabilizing their variance.
This transformation improves the model's overall fit and makes parameter
estimates more reliable, potentially leading to more accurate
predictions:
$y(\lambda) =
\begin{cases}
\frac{y^{\lambda} - 1}{\lambda} & \text{if } \lambda \neq 0 \\
\log(y) & \text{if } \lambda = 0
\end{cases}$
where:
- $y(\lambda)$ is the transformed variable.
- $y$ is the original variable (must be positive for the
transformation).
- $\lambda$ is the transformation parameter, typically estimated by
  maximizing the profile log-likelihood so that the transformed data
  are as close to normal as possible (0.26 in our case; see the sketch
  below).
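As a sketch of how $\lambda$ can be obtained (the exact call used during
preprocessing is an assumption), `MASS::boxcox` profiles the
log-likelihood over a grid of candidate values:
```{r eval=FALSE}
# Profile the Box-Cox log-likelihood and pick the lambda maximizing it;
# this should recover a value close to the 0.26 used below
library(MASS)
bc <- boxcox(lm(target ~ Employed, data = data),
             lambda = seq(-1, 1, 0.01), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]
lambda_hat
```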
```{r echo=FALSE, message=FALSE, warning=FALSE}
data$bc_target = (data$target^(0.26)-1)/0.26
residuals_bc = lm(bc_target ~ Employed, data = data)$residuals
qqnorm(residuals_bc, main = "Q-Q Plot of Box-Cox Transformed Residuals")
qqline(residuals_bc)
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
#Linear model
lm_model <- lm(bc_target ~ Employed, data = data)
residuals_bc <- residuals(lm_model)
fitted_bc <- fitted(lm_model)
ggplot(data, aes(x = fitted_bc, y = residuals_bc)) +
geom_point(alpha = 0.5) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
labs(title = "Residuals vs Fitted Values of Box-Cox transformation",
x = "Fitted Values",
y = "Residuals") +
theme_minimal()
```
The **Normal Q-Q plot** and **Residuals vs. Fitted Values plot** after
the Box-Cox transformation show clear improvements. In the Q-Q plot,
residuals align more closely with the reference line, especially in the
middle, indicating a closer-to-normal distribution. Minor deviations
remain at the tails but are less severe than before. The Residuals vs.
Fitted plot now shows a more consistent spread around zero, with no
evident pattern, supporting homoscedasticity.
Overall, these improvements suggest that the Box-Cox transformation has
helped the model better meet normality and constant variance
assumptions, enhancing its reliability and predictive robustness.
Subsequently, histograms and boxplots were plotted for each variable to
visualize the distributions:
**Histograms**
```{r echo=FALSE, message=FALSE, warning=FALSE}
par(mfrow=c(2, 2))
first_half_vars = names(data[, setdiff(1:8, which(names(data) %in% c("State", "target")))])
for (i in first_half_vars) {
hist(data[[i]], main=paste("Histogram of", i), xlab=i, col="skyblue", breaks=30, probability=TRUE)
}
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
par(mfrow=c(2, 2))
second_half_vars = names(data[, setdiff(9:length(names(data)), which(names(data) %in% c("State", "target", "bc_target")))])
for (i in second_half_vars) {
hist(data[[i]], main=paste("Histogram of", i), xlab=i, col="skyblue", breaks=30, probability=TRUE)
}
```
**Boxplots**
```{r echo=FALSE, message=FALSE, warning=FALSE}
par(mfrow=c(2, 2))
first_half_vars <- names(data)[1:(length(names(data)) / 2)]
first_half_vars <- first_half_vars[!first_half_vars %in% c("State", "target")]
for (i in first_half_vars) {
boxplot(data[[i]], xlab=i, col="darkblue",
main=paste("Boxplot of", i))
}
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
par(mfrow=c(2, 2))
second_half_vars <- names(data)[((length(names(data)) / 2) + 1):length(names(data))]
second_half_vars <- second_half_vars[!second_half_vars %in% c("State", "target", "bc_target")]
for (i in second_half_vars) {
boxplot(data[[i]], xlab=i, col="darkblue",
main=paste("Boxplot of", i))
}
```
Overall, the histograms and boxplots reveal significant skewness in
several variables, such as `Large_Families`, `Poor_English`,
`Welfare_Public_Assist`, `Below_Poverty`, `Illegitimate_Births`, and
`Speak_Eng_Only`. Applying normalization or scaling could help reduce
the imbalance in these distributions.
```{r echo=FALSE, message=FALSE, warning=FALSE}
# Normalize the data
normalize <- function(x) {
return ((x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE))
}
df = data[, c("YoungKids_2Par", "Teen_2Par", "Employed", "Below_Poverty", "Degree_BS_Or_More", "Inc_from_inv", "Speak_Eng_Only", "Illegitimate_Births", "Large_Families", "Poor_English", "Families_2Parents", "Working_mom", "Median_Income", "Unemployment", "Welfare_Public_Assist")]
normalized_data = as.data.frame(lapply(df, normalize))
not_normalized_data = data[, c("State", "target", "bc_target")]
data = cbind(normalized_data, not_normalized_data)
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
columns_to_exclude <- c("State", "target", "bc_target")
data_subset <- data[, !(names(data) %in% columns_to_exclude)]
means <- apply(data_subset, 2, mean)
sds <- apply(data_subset, 2, sd)
summary_stats <- data.frame(
Variable = colnames(data_subset),
Mean = means,
SD = sds
)
print(summary_stats)
```
Now we can look at how the distributions changed after the
normalization:
**Histograms**
```{r echo=FALSE, message=FALSE, warning=FALSE}
par(mfrow=c(2, 2))
first_half_vars = names(data[, setdiff(1:8, which(names(data) %in% c("State", "target")))])
for (i in first_half_vars) {
hist(data[[i]], main=paste("Histogram of", i), xlab=i, col="skyblue", breaks=30, probability=TRUE)
}
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
par(mfrow=c(2, 2))
second_half_vars = names(data[, setdiff(9:length(names(data)), which(names(data) %in% c("State", "target", "bc_target")))])
for (i in second_half_vars) {
hist(data[[i]], main=paste("Histogram of", i), xlab=i, col="skyblue", breaks=30, probability=TRUE)
}
```
**Boxplots**
```{r echo=FALSE, message=FALSE, warning=FALSE}
par(mfrow=c(2, 2))
first_half_vars <- names(data)[1:(length(names(data)) / 2)]
first_half_vars <- first_half_vars[!first_half_vars %in% c("State", "target")]
for (i in first_half_vars) {
boxplot(data[[i]], xlab=i, col="darkblue",
main=paste("Boxplot of", i))
}
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
par(mfrow=c(2, 2))
second_half_vars <- names(data)[((length(names(data)) / 2) + 1):length(names(data))]
second_half_vars <- second_half_vars[!second_half_vars %in% c("State", "target", "bc_target")]
for (i in second_half_vars) {
boxplot(data[[i]], xlab=i, col="darkblue",
main=paste("Boxplot of", i))
}
```
**State vs target**
Then, it was helpful to investigate the relationship between `target`
and `State`, the categorical stratifying variable:
```{r echo=FALSE, message=FALSE, warning=FALSE}
par(mfrow=c(1,1))
state_mapping = c(
"1" = "Alabama", "2" = "Alaska", "3" = "Arizona", "4" = "Arkansas", "5" = "California", "6" = "Colorado", "7" = "Connecticut", "8" = "Delaware",
"9" = "Florida", "10" = "Georgia", "11" = "Hawaii", "12" = "Idaho", "13" = "Illinois", "14" = "Indiana", "15" = "Iowa", "16" = "Kansas",
"17" = "Kentucky", "18" = "Louisiana", "19" = "Maine", "20" = "Maryland", "21" = "Massachusetts", "22" = "Michigan", "23" = "Minnesota", "24" = "Mississippi",
"25" = "Missouri", "26" = "Montana", "27" = "Nebraska", "28" = "Nevada", "29" = "New Hampshire", "30" = "New Jersey", "31" = "New Mexico", "32" = "New York",
"33" = "North Carolina", "34" = "North Dakota", "35" = "Ohio", "36" = "Oklahoma", "37" = "Oregon", "38" = "Pennsylvania", "39" = "Rhode Island", "40" = "South Carolina",
"41" = "South Dakota", "42" = "Tennessee", "43" = "Texas", "44" = "Utah", "45" = "Vermont", "46" = "Virginia", "47" = "Washington", "48" = "West Virginia",
"49" = "Wisconsin", "50" = "Wyoming"
)
# Filter state labels to match the unique states in your data
state_labels = state_mapping[as.character(unique(data$State))]
colfunc = colorRampPalette(c("green", "orange", "red"))  # low -> green, high -> red
# Color each state's box by its median target so the colors match the legend
state_medians = tapply(data$target, data$State, median)
colors = colfunc(100)[as.numeric(cut(state_medians, breaks = 100))]
plot(data$State, data$target, main = "State and Target relationship",
     xlab = "", ylab = "Target", col = colors, pch = 16, cex = 0.6,
     xaxt = 'n')
axis(1, at = 1:length(state_labels), labels = FALSE)
text(x = 1:length(state_labels), y = par("usr")[3] - 0.05,
     labels = state_labels, srt = 45, adj = 1, xpd = TRUE, cex = 0.7)
mtext("States", side = 1, line = 4)
legend("topright", legend = c("High crime rate", "Medium crime rate", "Low crime rate"),
       fill = rev(colfunc(3)), cex = 0.5)
```
From the plot, states like `Arizona`, `Michigan`, and `Pennsylvania`
show higher median crime rates (in red), suggesting that these areas
might experience socio-economic or cultural factors that contribute to
higher incidences of crime.
In contrast, states like `Montana`, `Wyoming`, and `Vermont`, indicated
in green, have lower median crime rates. These states might benefit from
stronger community cohesion, effective law enforcement, or other
socio-economic factors that mitigate crime.
The plot hints at a potential influence of climate on crime rates. For
example, states with harsher winters (like `Vermont` and `Wyoming`)
might have lower crime rates, supporting the CLASH model's theory that
significant seasonal variation promotes future-oriented behaviors and
self-control. Conversely, states with milder climates and less seasonal
variation might experience higher crime rates due to reduced need for
long-term planning and increased impulsivity.<br> This theory posits
that consistent climates with less variation require less future
planning, leading to a "faster" life strategy characterized by
present-focused behaviors and reduced self-control, which can contribute
to higher rates of aggression and violence.
[1](https://news.osu.edu/how-does-climate-affect-violence-researchers-offer-new-theory/)
**Correlation plot**
Then, to gain a clearer understanding of the relationships between
various socio-economic factors and their impact on crime rates, a
correlation plot was employed. This visual representation helps to
identify significant positive and negative correlations within the
dataset, providing a foundation for more detailed analysis.
```{r echo=FALSE, message=FALSE, warning=FALSE}
par(mfrow=c(1, 1))
par(mar=c(5, 4, 3, 2))
#"target" and "bc_target" exhibit really similar correlation patterns, I choose the original one for interpretability
data_without_state <- data[, !(names(data) %in% c("State", "bc_target"))]
correlations = cor(data_without_state)
corrplot(correlations, method = "color",
order = "hclust",
cex.main = 1,
cex.axis = 0.75,
tl.cex = 0.70)
mtext("Correlation Matrix", side=2, line=3, las=0, cex=1.2)
```
The correlation plot reveals several key insights. There are strong
positive relationships between `Families_2Parents`, `YoungKids_2Par`,
and `Teen_2Par`, indicating that communities with a high percentage of
two-parent families also have many young kids and teens in such
households; this consistency in family structures, where households with
younger children are also likely to have teenagers, points to stable
family environments. Similarly, higher educational attainment
(`Degree_BS_Or_More`) correlates with higher `Inc_from_inv`, reflecting
that individuals with higher education levels are likely to have more
investment-related income, which aligns with general socio-economic
trends.
Conversely, significant negative correlations are observed between
`Below_Poverty` and `Median_Income`, `Employed`, and
`Families_2Parents`, indicating that higher income, employment rates,
and stable family structures are associated with lower poverty levels.
Regarding crime rates (`target`), there are positive correlations with
`Below_Poverty`, `Unemployment`, `Welfare_Public_Assist`, and
`Illegitimate_Births`, suggesting that higher levels of poverty,
unemployment, reliance on public assistance, and instances of
illegitimate births are linked to increased crime rates. In contrast,
negative correlations between target and variables like
`Degree_BS_Or_More`, `Median_Income`, and `Employed` show that higher
education, income, and employment levels are associated with lower crime
rates, reflecting the socio-economic benefits of stability and
education.
Other notable correlations include a positive relationship between
`Poor_English` and `Large_Families`, which could indicate that families
with language barriers tend to have more children. However, the
relationship between `Illegitimate_Births` and `Below_Poverty` appears
neutral or only slightly positive, pointing to potential socio-economic
challenges but not a strong association.
### Bayesian Analysis
In this section of the project, we will employ a Hierarchical Bayesian
Model to analyze the relationships between various socio-economic
factors and crime rates. Hierarchical Bayesian models are particularly
powerful for this type of analysis because they allow us to account for
both fixed effects and random effects, making them ideal for data that
is grouped or nested, such as ours, which is grouped by state.
#### Hierarchical Bayesian Model
A hierarchical Bayesian model includes both fixed effects, which
represent overall effects estimated across all groups, and random
effects, which account for variations within each group. In this
context, our fixed effects include socio-economic and demographic
variables such as `YoungKids_2Par`, `Teen_2Par`, `Employed`,
`Below_Poverty`, `Degree_BS_Or_More`, `Inc_from_inv`, `Speak_Eng_Only`,
`Illegitimate_Births`, `Large_Families`, `Poor_English`,
`Families_2Parents`, `Working_mom`, `Median_Income`, `Unemployment`, and
`Welfare_Public_Assist`. The random effects are represented by the
`State` variable, allowing the relationship between these predictors and
the crime rates to vary across different states.
#### Model Specification
Given that the target variable (the normalized per capita violent crime
rate) is a continuous variable between 0 and 1, we have chosen the beta
distribution for our response variable. The beta family is well-suited
for modeling proportions and rates constrained to the (0, 1) interval.
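Formally, writing $s(i)$ for the state of observation $i$, the model for
the rescaled target $y_i$ is:

$$
\begin{aligned}
y_i &\sim \text{Beta}\big(\mu_i \phi_i,\ (1-\mu_i)\phi_i\big) \\
\text{logit}(\mu_i) &= \beta_1 + \textstyle\sum_{j=2}^{16} \beta_j \, x_{j,i} + u_{s(i)} \\
u_s &\sim \mathcal{N}(0, \sigma^2_{\text{state}})
\end{aligned}
$$

where $\mu_i$ is the mean and $\phi_i$ the precision of the beta
distribution, so that $\mathbb{E}[y_i] = \mu_i$ and larger $\phi_i$
implies lower dispersion.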
#### Priors
We have selected weakly informative priors for our model to incorporate
some prior knowledge while still allowing the data to inform the
posterior estimates significantly. Specifically:
- **Normal(0, 1)** prior for the fixed effects coefficients (class =
"b"). This prior assumes that the coefficients are normally
distributed with a mean of 0 and a standard deviation of 1,
reflecting an assumption that most effects are small but allowing
for the possibility of larger effects.
- **Gamma(1, 0.01)** prior for the phi parameter, which controls the
dispersion of the beta distribution for each observation. This
choice mitigates the risk of extremely small values, ensuring a more
stable estimation.
- **Normal(0, tau_state)** prior for the random effects associated
with states. This prior captures the variability across states while
maintaining a focus on the overall mean effect.
- **Gamma(1, 1)** prior for the standard deviation of the random
effects (class = "sd"). This prior is selected for its ability to
maintain positive values, reflecting the inherent property of
standard deviations.
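A quick simulation makes these prior choices concrete (a supplementary
check, not part of the original analysis):
```{r eval=FALSE}
# What the stated priors imply, by direct simulation
set.seed(123)
summary(rgamma(1e4, shape = 1, rate = 0.01))  # phi: mean 100, mass well away from 0
summary(rgamma(1e4, shape = 1, rate = 1))     # sd_state: positive, mean 1
summary(rnorm(1e4, 0, 1))                     # fixed-effect coefficients
```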
## **MODELS**
### **BASE MODEL**
We first started with a ***basic hierarchical model*** where the target
variable is rescaled to the interval (0.001, 0.999), so that it is
compatible with the beta likelihood, which is defined on the open
interval (0, 1), without materially affecting the results:
```{r message=FALSE, warning=FALSE, include=FALSE}
data$State = as.factor(data$State)
data$bc_target_bayes = rescale(data$bc_target, to = c(0.001, 0.999))
```
```{r message=FALSE, warning=FALSE, include=FALSE}
model_string_diag <- "
model {
# Likelihood for each observation
for (i in 1:N) {
# Linear predictor with logit transformation
logit[i] <- beta[1] +
beta[2] * YoungKids_2Par[i] +
beta[3] * Teen_2Par[i] +
beta[4] * Employed[i] +
beta[5] * Below_Poverty[i] +
beta[6] * Degree_BS_Or_More[i] +
beta[7] * Inc_from_inv[i] +
beta[8] * Speak_Eng_Only[i] +
beta[9] * Illegitimate_Births[i] +
beta[10] * Large_Families[i] +
beta[11] * Poor_English[i] +
beta[12] * Families_2Parents[i] +
beta[13] * Working_mom[i] +
beta[14] * Median_Income[i] +
beta[15] * Unemployment[i] +
beta[16] * Welfare_Public_Assist[i] +
state_effect[State[i]] # Random effect for State
# Compute mu[i] from the logit transformation with capping
mu[i] <- max(1e-5, min(exp(logit[i]) / (1 + exp(logit[i])), 1 - 1e-5)) # Bound mu[i]
# Beta distribution for the response variable, with clamped phi[i]
bc_target_bayes[i] ~ dbeta(mu[i] * max(phi[i], 1e-3), (1 - mu[i]) * max(phi[i], 1e-3)) # Clamp phi[i] to avoid extremely small values
# Prior for phi[i] - gamma distribution for each observation
phi[i] ~ dgamma(1, 0.01) # Adjusted gamma prior to avoid extremely small values
}
# Priors for beta coefficients
for (j in 1:16) {
beta[j] ~ dnorm(0, 1) # Normal prior for the fixed effects
}
# Random effects for states
for (s in 1:S) {
state_effect[s] ~ dnorm(0, tau_state) # Random effect for states
}
# Hyperparameters for state random effects
sd_state ~ dgamma(1, 1) # Gamma prior for sd_state
tau_state <- pow(sd_state, -2) # Convert to precision
}
"
writeLines(model_string_diag, con = "model_diag.jags")
```
```{r message=FALSE, warning=FALSE, include=FALSE}
jags_data <- list(
N = nrow(data),
S = length(unique(data$State)),
bc_target_bayes = data$bc_target_bayes,
YoungKids_2Par = data$YoungKids_2Par,
Teen_2Par = data$Teen_2Par,
Employed = data$Employed,
Below_Poverty = data$Below_Poverty,
Degree_BS_Or_More = data$Degree_BS_Or_More,
Inc_from_inv = data$Inc_from_inv,
Speak_Eng_Only = data$Speak_Eng_Only,
Illegitimate_Births = data$Illegitimate_Births,
Large_Families = data$Large_Families,
Poor_English = data$Poor_English,
Families_2Parents = data$Families_2Parents,
Working_mom = data$Working_mom,
Median_Income = data$Median_Income,
Unemployment = data$Unemployment,
Welfare_Public_Assist = data$Welfare_Public_Assist,
State = as.numeric(factor(data$State))
)
# Initial values
inits <- function() {
list(
beta = rnorm(16, 0, 1), # Normal initialization for beta
phi = rgamma(nrow(data), 1, 0.01), # Gamma initialization for phi
sd_state = rgamma(1, 1, 1), # Gamma initialization for sd_state (positive)
state_effect = rnorm(length(unique(data$State)), 0, 1) # Normal for state_effect
)
}
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
params_diag <- c("beta", "sd_state", "state_effect")
jags_model_diag <- jags.model("model_diag.jags", data = jags_data, inits = inits, n.chains = 3)
update(jags_model_diag, 1000) #20% of the iterations
samples_diag <- coda.samples(jags_model_diag, variable.names = params_diag, n.iter = 5000)
summary(samples_diag)
```
The summary of this Bayesian model provides insight into the relative
influence of various predictors on the target variable. A few key
predictors emerge as particularly impactful. For example,
`Illegitimate_Births` and `Large_Families` show strong positive
associations with the target, meaning higher values of these variables
tend to increase the predicted outcome. This positive effect is
consistent across the samples, as indicated by relatively narrow
credible intervals that do not include zero. On the other hand,
variables like `Families_2Parents` and `Inc_from_inv` exhibit clear
negative effects, suggesting that higher values in these predictors are
associated with a decrease in the target. The confidence in these
negative relationships is underscored by credible intervals that remain
below zero, reinforcing the idea that these variables reliably
contribute to lowering the predicted outcome.
There are, however, some predictors with more ambiguous or mixed
effects. For instance, variables such as `Teen_2Par` and `Median_Income`
have wider credible intervals that encompass zero, indicating they may
not exert a consistent or strong influence on the target. This
uncertainty suggests that, within the context of this model, these
predictors do not contribute significantly to explaining the variation
in the outcome.
Additionally, the model incorporates random effects for `State`, which
capture variability at the state level that could arise from unobserved
regional factors. This addition helps control for regional differences,
thereby refining the accuracy of the fixed effects. By adjusting for
state-level variability, the model can offer a more accurate assessment
of the impact of individual predictors while accounting for unmeasured
state-specific influences.
The model’s structure, particularly the use of a beta distribution for
the target with an individual dispersion parameter (`phi`) for each
observation, reflects an approach tailored to data that lie between 0
and 1. This setup helps address variability effectively across
observations and enhances the model's robustness in capturing the
nuances of the response variable. Overall, the model reveals that
certain predictors, such as `Illegitimate_Births` and
`Families_2Parents`, play significant roles, while others appear to have
a more marginal or uncertain impact. The inclusion of both fixed and
random effects makes the model a well-rounded framework, capable of
balancing individual-level and state-level variability, thus enhancing
the reliability of its predictions and parameter estimates.
#### **Model check diagnostics**
For the model check, a *Posterior Predictive check plot* and the *Deviance Information Criterion (DIC)* were employed.<br> The Posterior Predictive check plot compares the observed data with data generated by the model, helping to assess how well the model captures the underlying structure of the data.<br> The DIC is a statistical measure used to evaluate the predictive accuracy of a Bayesian model, taking into account both the goodness of fit and the complexity of the model.
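Concretely, the DIC combines the posterior mean deviance $\bar{D}$ (fit)
with the effective number of parameters $p_D$ (complexity penalty):

$$\mathrm{DIC} = \bar{D} + p_D, \qquad p_D = \bar{D} - D(\bar{\theta}),$$

so lower values indicate a better trade-off between fit and complexity.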
```{r message=FALSE, warning=FALSE, include=FALSE}
model_string <- "
model {
# Likelihood for each observation
for (i in 1:N) {
# Linear predictor with logit transformation
logit[i] <- beta[1] +
beta[2] * YoungKids_2Par[i] +
beta[3] * Teen_2Par[i] +
beta[4] * Employed[i] +
beta[5] * Below_Poverty[i] +
beta[6] * Degree_BS_Or_More[i] +
beta[7] * Inc_from_inv[i] +
beta[8] * Speak_Eng_Only[i] +
beta[9] * Illegitimate_Births[i] +
beta[10] * Large_Families[i] +
beta[11] * Poor_English[i] +
beta[12] * Families_2Parents[i] +
beta[13] * Working_mom[i] +
beta[14] * Median_Income[i] +
beta[15] * Unemployment[i] +
beta[16] * Welfare_Public_Assist[i] +
state_effect[State[i]] # Random effect for State
y_rep[i] ~ dbeta(mu[i] * phi[i], (1 - mu[i]) * phi[i]) # Predicted values
# Compute mu[i] from the logit transformation with capping
mu[i] <- max(1e-5, min(exp(logit[i]) / (1 + exp(logit[i])), 1 - 1e-5)) # Bound mu[i]
# Beta distribution for the response variable, with clamped phi[i]
bc_target_bayes[i] ~ dbeta(mu[i] * max(phi[i], 1e-3), (1 - mu[i]) * max(phi[i], 1e-3)) # Clamp phi[i] to avoid extremely small values
# Prior for phi[i] - gamma distribution for each observation
phi[i] ~ dgamma(1, 0.01) # Adjusted gamma prior to avoid extremely small values
}
# Priors for beta coefficients
for (j in 1:16) {
beta[j] ~ dnorm(0, 1) # Normal prior for the fixed effects
}
# Random effects for states
for (s in 1:S) {
state_effect[s] ~ dnorm(0, tau_state) # Random effect for states
}
# Hyperparameters for state random effects
sd_state ~ dgamma(1, 1) # Gamma prior for sd_state
tau_state <- pow(sd_state, -2) # Convert to precision
}
"
writeLines(model_string, con = "model.jags")
```
```{r message=FALSE, warning=FALSE, include=FALSE}
jags_data <- list(
N = nrow(data),
S = length(unique(data$State)),
bc_target_bayes = data$bc_target_bayes,
YoungKids_2Par = data$YoungKids_2Par,
Teen_2Par = data$Teen_2Par,
Employed = data$Employed,
Below_Poverty = data$Below_Poverty,
Degree_BS_Or_More = data$Degree_BS_Or_More,
Inc_from_inv = data$Inc_from_inv,
Speak_Eng_Only = data$Speak_Eng_Only,
Illegitimate_Births = data$Illegitimate_Births,
Large_Families = data$Large_Families,
Poor_English = data$Poor_English,
Families_2Parents = data$Families_2Parents,
Working_mom = data$Working_mom,
Median_Income = data$Median_Income,
Unemployment = data$Unemployment,
Welfare_Public_Assist = data$Welfare_Public_Assist,
State = as.numeric(factor(data$State))
)
inits <- function() {
list(
beta = rnorm(16, 0, 1),
phi = rgamma(nrow(data), 1, 0.01),
sd_state = rgamma(1, 1, 1),
state_effect = rnorm(length(unique(data$State)), 0, 1)
)
}
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
params_ppc <- c("beta", "sd_state", "state_effect", "y_rep")
jags_model_ppc <- jags.model("model.jags", data = jags_data, inits = inits, n.chains = 3)
update(jags_model_ppc, 1000)
samples_ppc <- coda.samples(jags_model_ppc, variable.names = params_ppc, n.iter = 5000)
y_rep_matrix <- as.matrix(samples_ppc)[, grep("y_rep", colnames(as.matrix(samples_ppc)))]
y_rep_mean <- apply(y_rep_matrix, 2, mean)
#PP-CHECK
ppc_data <- data.frame(
observed = data$bc_target_bayes,
predicted = y_rep_mean
)
plot1 <- ggplot(ppc_data, aes(x = observed)) +
geom_density(aes(y = ..density..), color = "darkblue", fill = "darkblue", alpha = 0.3) +
geom_density(aes(x = predicted, y = ..density..), color = "lightblue", fill = "lightblue", alpha = 0.3) +
labs(title = "Posterior Predictive Check for the Base Model",
x = "Crime Rate",
y = "Density") +
annotate("text", x = 0.7, y = 3, label = "Observed Data", color = "darkblue", hjust = 0) +
annotate("text", x = 0.7, y = 3.2, label = "Model Predictions", color = "lightblue", hjust = 0) +
theme_minimal()
print(plot1)
```
This posterior predictive check plot compares the density of the
observed crime rate data (in dark blue) with the model's predictions (in
light blue). The model's predictive distribution closely follows the
general shape of the observed data, suggesting that the model captures
the main characteristics of the data. However, it slightly overestimates
the density in the mid-range (around 0.5) and underestimates it in some
lower and upper parts of the distribution. These discrepancies indicate
that while the model provides a reasonable fit, there may be room for
refinement to better capture the tails of the distribution.
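As a complementary numeric check (not part of the original analysis), a
posterior predictive p-value can compare a test statistic of the
observed data with its distribution over replicated datasets; this
sketch reuses the `y_rep_matrix` object created above:
```{r eval=FALSE}
# Posterior predictive p-value for the standard deviation:
# values near 0 or 1 would flag misfit in the spread of the data
T_obs <- sd(data$bc_target_bayes)
T_rep <- apply(y_rep_matrix, 1, sd)  # one sd per posterior draw of y_rep
mean(T_rep >= T_obs)
```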
```{r echo=FALSE, message=FALSE, warning=FALSE}
params_diag <- c("beta", "sd_state", "state_effect")
jags_model_diag <- jags.model("model_diag.jags", data = jags_data, inits = inits, n.chains = 3)
update(jags_model_diag, 1000)
samples_diag <- coda.samples(jags_model_diag, variable.names = params_diag, n.iter = 5000)
#DIC
dic_result <- dic.samples(jags_model_diag, n.iter = 5000)
#dic.samples returns per-observation contributions, so sum them for the totals
total_deviance <- sum(dic_result$deviance)
total_penalty <- sum(dic_result$penalty)
#final DIC
dic <- total_deviance + total_penalty
print(paste("Single DIC value for the model:", dic))
```
The Raftery-Lewis diagnostic is used to calculate the number of
iterations required for the Markov Chain Monte Carlo (MCMC) sampler to
reach sufficient precision and convergence. It estimates how many
iterations are needed to recover the quantiles of the posterior
distribution with a specified accuracy and probability. We employ it
here to check whether the chosen number of samples is adequate; in this
case the required sample size is 3746.
```{r message=FALSE, warning=FALSE, include=FALSE}
raftery_diag <- raftery.diag(samples_diag)
print(raftery_diag)  # paste() would mangle this multi-element diagnostic object
```
#### **Convergence diagnostic**
For the evaluation of the MCMC convergence *Traceplot*, *density plot*
and *Rhat* from the model summary were used.
***Traceplot***
```{r echo=FALSE, message=FALSE, warning=FALSE}
samples_matrix = as.matrix(samples_diag)
all_params <- colnames(samples_matrix)
params <- all_params[grepl("beta|sd_state|state_effect", all_params)]
for (i in seq(1, length(params), by = 4)) {
end = min(i + 3, length(params))
param_subset = params[i:end]
traceplot = mcmc_trace(as.mcmc(samples_matrix), pars = param_subset)
print(traceplot)
}
```
The *trace plots* provide valuable insights into the convergence and
mixing of the MCMC chains for the Bayesian hierarchical model. Each plot
represents the sampling process for different parameters across the
three chains.
The frequent crossing over of chains indicates good mixing, suggesting
that the MCMC sampler is exploring the parameter space efficiently.
There are no signs of divergence or significant drift, which would be
evident if the chains moved in a consistent direction without crossing.
Instead, the chains hover around a stable mean, indicating convergence.
Furthermore, the chains appear stationary, with fluctuations occurring
around a consistent mean, suggesting that the MCMC process has likely
reached a stable distribution. This visual evidence supports the
expectation of a high effective sample size (ESS), implying that the
estimates are reliable. While the specific metric for ESS isn't
displayed in the trace plots, the overall visual evidence strongly
indicates good chain mixing and parameter stability.
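Although ESS is not displayed in the trace plots, it can be computed
directly from the same samples with `coda` (a supplementary check, not
part of the original diagnostics):
```{r eval=FALSE}
# Effective sample size per monitored parameter, combined across chains
ess <- effectiveSize(samples_diag)
head(sort(ess))  # parameters with the smallest ESS deserve the closest scrutiny
```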
***Density plot***
```{r echo=FALSE, message=FALSE, warning=FALSE}
density_colors <- c("#E41A1C", "#377EB8", "#4DAF4A", "#984EA3")
for (i in seq(1, length(params), by = 4)) {
end <- min(i + 3, length(params))
param_subset <- params[i:end]
density_plot <- mcmc_dens_overlay(samples_diag, pars = param_subset) +
scale_color_manual(values = density_colors) +
ggtitle("Density Plots")
print(density_plot)
}
```
The *density plots* of the posterior distributions for the parameters
reveal several important insights:
- Firstly, the absence of multimodal behavior is evident, indicating
that the MCMC chains are sampling from a single mode of the
posterior distribution. This is beneficial as it suggests there are
no issues related to multiple modes, which can complicate the
interpretation of results.
- Secondly, the overlapping density curves from different chains show
strong agreement among the chains. This overlap further supports the
notion of convergence, affirming that all chains are sampling from
the same posterior distribution.
- Lastly, the smooth and unimodal shapes of the density plots suggest
that the parameter estimates are well-defined and stable. The
density plots illustrate the uncertainty around the parameter
estimates, with narrower peaks indicating more precise estimates.
#### **Error check**
To validate the accuracy of our Bayesian hierarchical model, we perform
a comprehensive error check using various statistical metrics. By
extracting posterior samples and summarizing key statistics such as
mean, median, standard deviation (SD), mean absolute deviation (MAD),
Monte Carlo Standard Error (MCSE), and Effective Sample Size (ESS), we
can assess the convergence and precision of our parameter estimates.
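These quantities are related: for a posterior mean, the Monte Carlo
standard error shrinks with the effective sample size,

$$\mathrm{MCSE} \approx \frac{\mathrm{SD}}{\sqrt{\mathrm{ESS}}},$$

so a small MCSE relative to the posterior SD indicates that sampling
noise is negligible.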
```{r echo=FALSE, message=FALSE, warning=FALSE}
samples_matrix <- as.matrix(samples_diag)
posterior_samples <- as_draws_df(samples_matrix)
summary_stats <- summarize_draws(posterior_samples,
"mean", "median", "sd", "mad", "mcse_mean", "mcse_sd", "rhat", "ess_bulk", "ess_tail")
print(summary_stats, n = Inf)
```
The summary of the beta parameters reveals valuable insights into the
model's findings. The intercept, with a mean of 0.13, indicates a
positive baseline effect on the response variable. Notably, `beta[6]`,
representing the impact of individuals with a Bachelor’s degree or
higher, has a mean estimate of 0.06, suggesting a slight positive
influence on the outcome.
Conversely, the parameter for `Families_2Parents` (`beta[12]`) exhibits
a substantial negative effect, with a mean of -0.46. This indicates that
a higher share of two-parent families is associated with a decrease in
the response variable, consistent with the protective role of stable
family structures. Similarly, `beta[2]`, which corresponds to the effect
of `YoungKids_2Par`, has a mean of -0.03, suggesting a small negative
impact.
In terms of uncertainty, the standard deviations (SD) for most
parameters are relatively low, indicating that the estimates are stable.
The R-hat values around 1 further confirm that the chains have