Hierarchical normal models and Stan (#116)

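* Scope of applicability of tests for homogeneity of variance in normal populations
* Build quantile tables for probability distributions
* Areal data analysis
* Hierarchical normal models
* Fix Duplicate chunk label
* Compare results
* Replace function pairs with function mcmc_pairs
* Add references
* tweak
* summary function
* Fix typo
* Add more explanation
* tweak table
* Compared with the nlme package, the lme4 and blme packages perform stricter checks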
XiangyunHuang · Oct 8, 2023 · 9497342
1 parent 439375d
commit 9497342

Showing 9 changed files with 901 additions and 8 deletions.
diff --git a/_quarto.yml b/_quarto.yml
@@ -64,6 +64,7 @@ book:
         - analyze-text-data.qmd
         - analyze-time-series-data.qmd
         - analyze-spatial-data.qmd
+        - analyze-areal-data.qmd
     - part: "Optimization Modeling"
       chapters:
         - statistical-computation.qmd
@@ -73,6 +74,7 @@ book:
       chapters:
         - probabilistic-reasoning-framework.qmd
         - generalized-linear-models.qmd
+        - hierarchical-normal-models.qmd
         - mixed-effects-models.qmd
         - generalized-additive-models.qmd
         - gaussian-processes-regression.qmd

diff --git a/analyze-areal-data.qmd b/analyze-areal-data.qmd
@@ -0,0 +1,137 @@
+# Areal Data Analysis {#sec-analyze-areal-data}
+
+## Analysis of the Scottish Lip Cancer Data {#sec-scotland-lip-cancer}
+
+> Everything is related to everything else, but near things are more related than distant things.
+>
+> --- Waldo Tobler [@Tobler1970]
+
+::: {#spatial-areal-data .callout-note title="Spatial areal data analysis"}
+Bayesian modeling of spatial areal data
+
+-   Bayesian spatial and spatio-temporal GLMMs with possible extremes [glmmfields](https://github.com/seananderson/glmmfields)
+-   Bayesian spatial analysis [geostan](https://github.com/ConnorDonegan/geostan/)
+-   [Spatial Models in Stan: Intrinsic Auto-Regressive Models for Areal Data](https://mc-stan.org/users/documentation/case-studies/icar_stan.html)
+-   [Exact sparse CAR models in Stan](https://github.com/mbjoseph/CARstan) [web documentation](https://mc-stan.org/users/documentation/case-studies/mbjoseph-CARStan.html)
+-   [Spatial Models in Stan: Intrinsic Auto-Regressive Models for Areal Data](https://github.com/stan-dev/example-models/tree/master/knitr/car-iar-poisson) [web documentation](https://mc-stan.org/users/documentation/case-studies/icar_stan.html) original data and code for the Scottish lip cancer analysis referenced above, with the [code](https://github.com/stan-dev/example-models/tree/master/knitr/car-iar-poisson) updated to use CmdStanR
+-   [Spatial modeling of areal data. Lip cancer in Scotland](https://www.paulamoraga.com/book-geospatial/sec-arealdataexamplespatial.html) modeling with INLA
+-   [CAR models Scotland Lip cancer dataset](https://rafaelcabral96.github.io/nigstan/sar-and-car-models.html#car-models) modeling with Stan
+-   Spatial econometrics: [areal data analysis](https://rsbivand.github.io/emos_talk_2304/bivand_emos_230419.pdf) [on-the-use-of-r-for-spatial-econometrics](https://github.com/rsbivand/emos_talk_2304)
+:::
+
+The response variable follows a Poisson distribution.
+
+-   BYM-INLA [@blangiardo2013;@moraga2020]
+-   BYM-Stan [@morris2019; @donegan2022; @cabral2022]
+
+The dataset records the number of lip cancer cases in 56 districts of Scotland during 1975-1980, aggregated by district.
+
+```{r}
+library(sf)
+scotlips <- st_read('data/scotland/scotland.shp', crs = st_crs("EPSG:27700"))
+str(scotlips)
+```
+
+```{r}
+#| label: fig-lip-cancer-map
+#| fig-cap: Distribution of lip cancer cases across Scottish districts
+#| fig-width: 5
+#| fig-height: 5
+#| fig-showtext: true
+
+library(ggplot2)
+ggplot() +
+  geom_sf(data = scotlips, aes(fill = Observed)) +
+  scale_fill_viridis_c() +
+  theme_minimal()
+```
+
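+## Analysis of Crime Rates in US States
+
+Survey data whose response variable follows a Gaussian distribution [@bivand2001].
+
+The dataset USArrests records, for each US state in 1973, the number of arrests per 100,000 residents for murder (Murder), assault (Assault), and rape (Rape), together with the percentage of the population living in urban areas (UrbanPop, which can be read as the urbanization rate).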
+
+```{r}
+#| echo: false
+#| label: tbl-us-arrests
+#| tbl-cap: "Dataset USArrests (excerpt)"
+
+us_arrests <- data.frame(
+  state_name = rownames(USArrests),
+  state_region = state.region,
+  USArrests, check.names = FALSE
+)
+
+knitr::kable(head(us_arrests), col.names = c(
+  "State", "Region", "Murder arrests", "Assault arrests", "Urbanization rate (%)", "Rape arrests"
+), row.names = FALSE)
+```
+
+```{r}
+#| label: fig-us-arrests-sf
+#| fig-cap: Distribution of arrests for assault
+#| fig-showtext: true
+#| fig-width: 7
+#| fig-height: 4
+
+library(sf)
+# State boundaries
+us_state_sf <- readRDS("data/us-state-map-2010.rds")
+# Observations
+us_state_df <- merge(x = us_state_sf, y = us_arrests,
+  by.x = "NAME", by.y = "state_name", all.x = TRUE)
+
+ggplot() +
+  geom_sf(
+    data = us_state_df, aes(fill = Assault), color = "gray80", lwd = 0.25) +
+  scale_fill_viridis_c(option = "plasma", na.value = "white") +
+  theme_void()
+```
+
+Relationship between arrests for assault and the urbanization rate across US states in 1973: correlation analysis.
+
+```{r}
+#| label: fig-us-arrests-point
+#| fig-cap: Relationship between the assault arrest rate and the urbanization rate
+#| fig-width: 7
+#| fig-height: 5.5
+#| code-fold: true
+#| echo: !expr knitr::is_html_output()
+#| fig-showtext: true
+
+library(ggrepel)
+ggplot(data = us_arrests, aes(x = UrbanPop, y = Assault)) +
+  geom_point(aes(color = state_region)) +
+  geom_text_repel(aes(label = state_name), size = 3, seed = 2022) +
+  theme_classic() +
+  labs(x = "Urbanization rate (%)", y = "Arrests for assault", color = "Region")
+```
+
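+Alaska and Hawaii share no border with any other state and are therefore isolated, so both are excluded from the spatial autocorrelation analysis below.
+
+```{r}
+# State centers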
+centers48 <- subset(
+  x = data.frame(x = state.center$x, y = state.center$y),
+  subset = !state.name %in% c("Alaska", "Hawaii")
+)
+# Observations
+arrests48 <- subset(
+  x = USArrests,
+  subset = !rownames(USArrests) %in% c("Alaska", "Hawaii")
+)
+```
+
+```{r}
+#| message: false
+
+library(spData)
+library(spdep)
+# Neighbours: the 4 nearest state centers (KNN)
+k4.48 <- knn2nb(knearneigh(as.matrix(centers48), k = 4))
+# Moran I test
+moran.test(x = arrests48$Assault, listw = nb2listw(k4.48))
+# Permutation test for Moran's I statistic
+moran.mc(x = arrests48$Assault, listw = nb2listw(k4.48), nsim = 499)
+```
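A minimal sketch of the statistic that `moran.test()` estimates in the chunk above, written in base R so that it runs without spdep. It assumes row-standardized weights (the default `style = "W"` of `nb2listw()`) and Euclidean distances between state centers. The names `coords`, `W`, `moran_i` and `expected_i` are illustrative and not part of the commit, and ties in distance may be broken differently from `knearneigh()`, so the value illustrates the formula rather than reproducing spdep output exactly.

```r
# Observations and coordinates for the 48 contiguous states (base R datasets)
keep <- !state.name %in% c("Alaska", "Hawaii")
coords <- cbind(state.center$x, state.center$y)[keep, ]
x <- USArrests$Assault[keep]
n <- length(x)
k <- 4

# Pairwise Euclidean distances; a state is never its own neighbour
d <- as.matrix(dist(coords))
diag(d) <- Inf

# Row-standardized weight matrix: each of the k nearest neighbours gets weight 1/k
W <- matrix(0, nrow = n, ncol = n)
for (i in seq_len(n)) {
  nb <- order(d[i, ])[seq_len(k)]
  W[i, nb] <- 1 / k
}

# Moran's I = (n / S0) * sum_ij w_ij z_i z_j / sum_i z_i^2, with S0 = sum_ij w_ij
z <- x - mean(x)
moran_i <- (n / sum(W)) * sum(W * outer(z, z)) / sum(z^2)

# Expected value of Moran's I under the null hypothesis of no spatial autocorrelation
expected_i <- -1 / (n - 1)

c(moran_i = moran_i, expected_i = expected_i)
```

A value of `moran_i` well above `expected_i` points to positive spatial autocorrelation; the significance assessment itself is what `moran.test()` and `moran.mc()` add on top of this point estimate.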
diff --git a/common-statistical-tests.qmd b/common-statistical-tests.qmd
@@ -111,6 +111,20 @@ qnorm(p = 1 - 0.05, mean = 0, sd = 1)
 1 - pnorm(q = u)
 ```
 
+::: callout-important
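+For a random variable $X$ that follows the standard normal distribution, the cumulative distribution function is:
+
+$$
+P(X \leq u)= \Phi(u) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{u}\mathrm{e}^{-t^2/2}\mathrm{d}t
+$$
+
+Given the probability $p = 0.95$, the corresponding lower quantile can be computed with the function `qnorm()`.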
+
+```{r}
+qnorm(p = 0.95, mean = 0, sd = 1)
+```
+:::
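As a rough check of the callout above, the following sketch confirms that `qnorm()` and `pnorm()` are inverses of each other and that numerically integrating the standard normal density up to the quantile recovers $p$. The variable names are illustrative only.

```r
p <- 0.95

# Lower quantile u such that P(X <= u) = p for X ~ N(0, 1)
u <- qnorm(p, mean = 0, sd = 1)

# Round trip through the distribution function
cdf_closed <- pnorm(u)

# The same probability from the integral definition of the distribution function
std_normal_density <- function(t) exp(-t^2 / 2) / sqrt(2 * pi)
cdf_numeric <- integrate(std_normal_density, lower = -Inf, upper = u)$value

c(u = u, cdf_closed = cdf_closed, cdf_numeric = cdf_numeric)
```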
+
 #### Unknown variance
 
 $$
@@ -150,6 +164,25 @@ qt(p = 1 - 0.05, df = n - 1)
 
 ::: callout-note
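 In 1908 the British statistician William Sealy Gosset (1876-1937) published the paper "The Probable Error of a Mean" [@Gosset1908] in the journal Biometrika under the pseudonym Student. The paper presents the sampling distributions of the sample variance $s^2$ and the sample standard deviation $s$ for an independent, identically distributed normal sample $x_1, \ldots, x_n \stackrel{i.i.d}{\sim} \mathcal{N}(\mu,\sigma^2)$, and derives the t distribution from the fact that the sample mean and standard deviation are uncorrelated, which marks the birth of the t distribution. For this outstanding contribution to small-sample theory, W. S. Gosset was included in the biographical collection Statisticians of the Centuries [@Heyde2001].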
+
+```{r}
+#| label: tbl-t-quantile
+#| tbl-cap: Quantile table of the $t$ distribution
+#| echo: false
+
+vec_prob <- c(
+  0.75, 0.80, 0.90, 0.95,
+  0.975, 0.99, 0.995, 0.999
+)
+vec_df <- 1:10
+
+tmp <- mapply(FUN = qt,
+  p = vec_prob,
+  MoreArgs = list(df = vec_df), SIMPLIFY = TRUE
+)
+row.names(tmp) <- vec_df
+knitr::kable(tmp, row.names = TRUE, col.names = vec_prob, digits = 4)
+```
 :::
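The quantile table in the chunk above can also be built with `outer()`, which makes the row and column meaning explicit. This is only a sketch of an alternative, not the commit's approach; `tab` and `ref` are illustrative names, and `ref` repeats the `mapply()` call from the chunk so the two results can be compared.

```r
vec_prob <- c(0.75, 0.80, 0.90, 0.95, 0.975, 0.99, 0.995, 0.999)
vec_df <- 1:10

# Rows index degrees of freedom, columns index probabilities
tab <- outer(vec_df, vec_prob, FUN = function(df, p) qt(p = p, df = df))
dimnames(tab) <- list(df = as.character(vec_df), p = as.character(vec_prob))

# Same numbers as the mapply() construction used in the chunk
ref <- mapply(FUN = qt, p = vec_prob,
  MoreArgs = list(df = vec_df), SIMPLIFY = TRUE)
same_values <- isTRUE(all.equal(unname(tab), unname(ref)))

# Look up a single entry by name: df = 1, p = 0.975
tab["1", "0.975"]
```

Named dimensions allow lookups such as `tab["5", "0.95"]` without counting columns.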
 
 ### Variance tests for normal populations
@@ -181,6 +214,41 @@ qchisq(p = 1 - 0.05, df = n -1)
 1 - pchisq(q = chi, df = n -1)
 ```
 
+::: callout-important
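+R provides computations for many statistical distributions, so quantile tables no longer need to be looked up; the values can be computed on the fly. The quantile $\chi^2_p(n)$ of the $\chi^2$ distribution with $n$ degrees of freedom at probability $p$ satisfies
+
+$$
+P(\chi^2(n) \leq \chi^2_p(n)) = p
+$$
+
+With 1 degree of freedom and probability 0.05, the corresponding (lower) quantile can be computed with the quantile function `qchisq()`.
+
+```{r}
+qchisq(p = 0.05, df = 1)
+```
+
+In the same way, a quantile table of the $\chi^2$ distribution can be produced, see @tbl-chisq-quantile; the computed quantiles are rounded to 4 decimal places.
+
+```{r}
+#| label: tbl-chisq-quantile
+#| tbl-cap: Quantile table of the $\chi^2$ distribution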
+#| echo: false
+
+vec_prob <- c(
+  0.005, 0.01, 0.025, 0.05, 0.1,
+  0.9, 0.95, 0.975, 0.99, 0.995
+)
+vec_df <- 1:10
+
+tmp <- mapply(FUN = qchisq,
+  p = vec_prob,
+  MoreArgs = list(df = vec_df), SIMPLIFY = TRUE
+)
+row.names(tmp) <- vec_df
+knitr::kable(tmp, row.names = TRUE, col.names = vec_prob, digits = 4)
+```
+:::
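A small sketch, under no assumptions beyond base R, that checks the quantile definition in the callout above in two ways: a round trip through `pchisq()`, and the identity that a $\chi^2(1)$ variable is the square of a standard normal variable, so that $\chi^2_p(1) = \left(\Phi^{-1}\left(\frac{1+p}{2}\right)\right)^2$. The variable names are illustrative.

```r
p <- 0.05

# Lower quantile of the chi-square distribution with 1 degree of freedom
q <- qchisq(p, df = 1)

# Definition check: P(chi^2(1) <= q) should give back p
round_trip <- pchisq(q, df = 1)

# Same quantile via the standard normal: P(Z^2 <= q) = 2 * pnorm(sqrt(q)) - 1 = p
q_from_normal <- qnorm((1 + p) / 2)^2

c(q = q, round_trip = round_trip, q_from_normal = q_from_normal)
```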
+
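 ### Tests for the mean when the population distribution is unknown
 
 With the mean and variance available, why introduce location and scale parameters? They describe the problem more generally and widen its scope: when testing under an unknown or poorly understood population distribution, the analysis is no longer restricted to summary quantities such as the mean and variance.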
@@ -531,7 +599,7 @@ flowchart LR
   B1 --> C1(Mean test)
   C1 --> D2(Equal variances) --> E2(F test)
   C1 --> D3(Unequal variances) --> E3(F test)
-  B1 --> C2(Variance test) --> E4(Bartlett test)
+  B1 --> C2(Variance test) --> E4(Hartley test\n Bartlett test\n Modified Bartlett test\n Levene test)
   B2 --> C3(Mean test) --> E5(Kruskal-Wallis rank-sum test\n Friedman rank-sum test\n Quade test)
   B2 --> C4(Variance test) --> E7(Fligner-Killeen test)
 ```
@@ -644,15 +712,24 @@ logLik(fit_gls)
 
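 ### Variance tests for normal populations
 
-The variance tests for unknown population distributions described later can also be used here.
+When the populations are normally distributed, four parametric tests are in common use:
+
+1.  Hartley test: all groups must have the same sample size.
+2.  Bartlett test: group sample sizes may be equal or unequal, but each group must contain at least 5 observations.
+3.  Modified Bartlett test: applicable whether sample sizes are large or small, equal or unequal.
+4.  Levene test: equivalent to a one-way between-group analysis of variance applied to the absolute deviations of the observations from their group centers; it is more robust than the Bartlett test.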
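The list above can be made concrete with a short base R sketch. It assumes the built-in `InsectSprays` data (6 groups of 12 observations), which is not necessarily the data used in this chapter, and all object names are illustrative. The Levene statistic is computed in its original form, with absolute deviations from group means; `car::leveneTest()` defaults to deviations from group medians, so its output will generally differ. The Hartley statistic is shown without a p-value because base R has no table of its critical values.

```r
dat <- InsectSprays  # columns: count (response), spray (factor with 6 levels)

# Hartley: ratio of the largest to the smallest group variance
# (requires equal group sizes, which holds here)
group_var <- tapply(dat$count, dat$spray, var)
hartley_fmax <- max(group_var) / min(group_var)

# Bartlett: parametric test of homogeneity of variance from the stats package
bartlett <- bartlett.test(count ~ spray, data = dat)

# Levene: one-way ANOVA on absolute deviations from the group means
dat$abs_dev <- abs(dat$count - ave(dat$count, dat$spray, FUN = mean))
levene <- anova(lm(abs_dev ~ spray, data = dat))
levene_f <- levene[["F value"]][1]
levene_p <- levene[["Pr(>F)"]][1]

c(hartley_fmax = hartley_fmax,
  bartlett_p = bartlett$p.value,
  levene_f = levene_f, levene_p = levene_p)
```

Writing the Levene test as an explicit `lm()` fit shows why it is described as a one-way analysis of variance: the only change from an ordinary ANOVA is the response variable.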
+
+::: callout-tip
+The nonparametric tests for homogeneity of variance, which are designed for unknown population distributions, can also be used here.
+:::
 
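 Let $x_1,\cdots,x_{n_1}$ be a sample from the population $\mathcal{N}(\mu_1,\sigma_1^2)$, $y_1,\cdots,y_{n_2}$ a sample from the population $\mathcal{N}(\mu_2,\sigma_2^2)$, and $z_1,\cdots,z_{n_3}$ a sample from the population $\mathcal{N}(\mu_3,\sigma_3^2)$.
 
 $$
 \sigma_1^2 = \sigma_2^2 = \sigma_3^2 \quad vs. \quad \sigma_1^2,\sigma_2^2,\sigma_3^2 \quad  \text{not all equal}
 $$
 
-The Bartlett test `bartlett.test()` requires normally distributed populations and tests whether the group variances differ significantly, that is, it is a test of homogeneity of variance. It is a parametric test suitable for multiple samples. Compared with the Bartlett test, the Levene test is more robust.
+The Bartlett test `bartlett.test()` requires normally distributed populations and tests whether the group variances differ significantly, that is, it is a test of homogeneity of variance. It is a parametric test suitable for multiple samples.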
 
 ```{r}
 # Three samples
@@ -1018,11 +1095,11 @@ apply(Wish, MARGIN = 1:2, var)
 | [W. Kruskal](https://en.wikipedia.org/wiki/William_Kruskal)              | USA        | 1919-10-10 | 2005-04-21 | 85   | Kruskal-Wallis test            |
 | [George E. P. Box](https://en.wikipedia.org/wiki/George_E._P._Box)       | UK, USA    | 1919-10-18 | 2013-03-28 | 93   | Box-Pierce test                |
 | [C. R. Rao](https://en.wikipedia.org/wiki/C._R._Rao)                     | India, USA | 1920-09-10 | 2023-08-22 | 102  | Score test                     |
-| [M. Wilk](https://en.wikipedia.org/wiki/Martin_Wilk)                     | Canada     | 1922-12-18 | 2013-02-19 | 90   | Shapiro-Wilk normality test    |
+| [M. Wilk](https://en.wikipedia.org/wiki/Martin_Wilk)                     | Canada     | 1922-12-18 | 2013-02-19 | 90   | Shapiro-Wilk test              |
 | [J. Durbin](https://en.wikipedia.org/wiki/James_Durbin)                  | UK         | 1923-06-30 | 2012-06-23 | 88   | Durbin test                    |
 | [L. Le Cam](https://en.wikipedia.org/wiki/Lucien_Le_Cam)                 | France     | 1924-11-18 | 2000-04-25 | 75   | Asymptotic theory              |
 | [H. Lilliefors](https://en.wikipedia.org/wiki/Hubert_Lilliefors)         | USA        | 1928-06-14 | 2008-02-23 | 80   | Lilliefors test                |
-| [S. S. Shapiro](https://en.wikipedia.org/wiki/Samuel_Sanford_Shapiro)    | USA        | 1930-07-13 | \-         | 93   | Shapiro-Wilk normality test    |
+| [S. S. Shapiro](https://en.wikipedia.org/wiki/Samuel_Sanford_Shapiro)    | USA        | 1930-07-13 | \-         | 93   | Shapiro-Wilk test              |
 
 : Scholars who made important contributions to the theory of hypothesis testing {#tbl-statistican tbl-colwidths="\[20,15,15,15,7,28\]"}
 

diff --git a/data/scotland/scotland.dbf b/data/scotland/scotland.dbf
diff --git a/data/scotland/scotland.shp b/data/scotland/scotland.shp
diff --git a/data/scotland/scotland.shx b/data/scotland/scotland.shx