---
title: "Adapted reproduction script from Neumann & Evert (2021)"
author: "Stella Neumann & Stefan Evert, adapted by the QuanTOR team"
date: "16 May 2020 / 9 Feb 2025"
output:
  html_document:
    fig_height: 7
    fig_width: 7
    number_sections: yes
    toc: yes
    toc_float: yes
    code_folding: hide
---
This is a slightly adapted version of Neumann & Evert's (2021) reproduction script available from https://www.stephanie-evert.de/PUB/NeumannEvert2021/ (RMarkdown notebook `analysis_proceedings.Rmd` in ZIP archive [`analysis_scripts.zip`](https://www.stephanie-evert.de/PUB/NeumannEvert2021/data/analysis_scripts.zip)). In particular, it uses the original support functions in `multivar_utils.R`, whereas our own replication study builds on the new R package `gmatools`. We have made the following changes to the reproduction script:
- interactive 3D plots were removed, so package `rgl` is no longer needed
- since our reproduction and replication study does not extend to the interactive scatterplot and weights viewers, data preparation for these viewers was excluded
- some small adaptations were necessary to make the script work with our new data set `ice_preprocessed.rda`; in particular, our ICE9 data set is reduced to the original three ICE3 components
- minor bugs in `multivar_utils.R`, which are caught by more recent versions of R than the one used by Neumann & Evert (2021), had to be fixed
- PDF plots are saved to subdirectory `pdf_repro/` and can easily be compared with our main reproduction/replication based on `gmatools`
```{r setup, include=FALSE, cache=FALSE}
knitr::opts_chunk$set(cache = TRUE)
knitr::opts_chunk$set(dev.args=list(pointsize=12)) # adjust graphics device
```
```{r setupScript, include=FALSE, cache=FALSE}
source("multivar_utils.R")
library(data.table)
library(MASS) # for LDA
library(e1071) # for SVM
library(ggplot2) # for modern-style lattice plots
library(magrittr) # work more naturally with mvar.space objects
library(DT)
library(gplots) # for heatmap.2
## and some reasonable colour palettes
library(colorspace)
library(corpora)
seaborn.pal <- corpora.palette("seaborn")
muted.pal <- corpora.palette("muted")
bright.pal <- corpora.palette("bright")
default.pal <- muted.pal
```
```{r utils, include=FALSE, cache=FALSE}
## wrapper function for saving plots to PDF file
save.pdf <- function (file, ..., out.dir="pdf_repro") {
  if (!is.null(out.dir)) file <- sprintf("%s/%s", out.dir, file)
  invisible(dev.copy2pdf(file=file, ..., out.type="cairo"))
}
```
# The ICE data set
Load the preprocessed data set.
```{r loadData, echo=1}
var.names <- load("ice_preprocessed.rda")
cat(paste(var.names, collapse=", "), "\n")
```
All metadata variables are already coded as factors with a sensible ordering of categories, so no further pre-processing is required here. The data set also includes rainbow colours for text categories and readable feature names. There are `r nrow(Meta)` texts. See `prepare_data.Rmd` for details about the distribution of metadata categories and text lengths.
## Reduce to ICE3
Since the original reproduction script expects a data set covering only the ICE3 components, we have to reduce data matrices, metadata, and the list of language variety types.
```{r reduceToICE3}
ice3.comp <- qw("icehk icejam icenz")
idx <- Meta$variety %in% types.variety[ice3.comp]
Meta <- droplevels(Meta[idx, ])
Features <- Features[idx, ]
M <- M[idx, ]
Z <- Z[idx, ]
ZL <- ZL[idx, ]
rand.idx <- rank(rand.idx[rand.idx %in% which(idx)])
types.variety <- types.variety[ice3.comp]
types.shortvar <- types.shortvar[ice3.comp]
```
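The re-ranking of `rand.idx` is the least obvious step in this chunk. The toy sketch below assumes that `rand.idx` holds a random permutation of row numbers (its construction is not shown in this script) and illustrates how filtering followed by `rank()` yields a valid permutation of the reduced rows. All `toy.*` names are hypothetical.

```r
## toy data: 6 rows, of which 4 are kept
toy.keep <- c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE)
toy.rand <- c(4, 2, 6, 1, 5, 3)       # ASSUMPTION: random permutation of 1..6

## keep only entries that point to surviving rows (in their random order) ...
toy.sub <- toy.rand[toy.rand %in% which(toy.keep)]
## ... then map the old row numbers to positions in the reduced data set
toy.new <- rank(toy.sub)

print(toy.sub)   # old row numbers: 4 6 1 3
print(toy.new)   # new row numbers: 3 4 1 2
```

Because the reduced data keep their original row order, the rank of an old row number among the surviving rows is exactly its new row number.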
# Dimensions of variation
## Unsupervised PCA
A standard PCA based on z-scores reveals dimensions of register variation that correspond fairly well to the broad text categories in ICE.
```{r normalPCA}
PCA <- mvar.space(Z)
Z.pca <- mvar.projection(PCA, space="both")
mvar.pairs(Z.pca, 1:4, Meta=Meta, col=textcat20, pch=variety,
pch.vals=c(1, 3, 4), col.vals=rainbow.20,
cex=.6, legend.cex=.7, iso=TRUE, compact=TRUE)
save.pdf("pca4z_type.pdf")
```
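For readers without access to `multivar_utils.R`, the following is a minimal base-R analogue of a PCA on z-scores, using the built-in `iris` measurements as stand-in data. It is only a sketch: the internals of `mvar.space()` and `mvar.projection()` are not shown in this script and may differ, and all `toy.*` names are hypothetical.

```r
## toy data: four numeric measurements from the built-in iris data set
toy.X <- as.matrix(iris[, 1:4])
toy.Z <- scale(toy.X)                 # z-scores: mean 0, sd 1 per feature

toy.pca <- prcomp(toy.Z, center=FALSE, scale.=FALSE)
toy.scores <- toy.pca$x               # coordinates of the items in PCA space
toy.basis <- toy.pca$rotation         # orthonormal basis vectors (feature weights)

## proportion of variance captured by each principal component
toy.R2 <- toy.pca$sdev^2 / sum(toy.pca$sdev^2)
print(round(toy.R2, 3))
```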
The PCA based on log-transformed z-scores looks quite similar. Individual outliers are reduced and the main dimensions in the top left panel seem to show a little more structure. Therefore, we will exclusively use the log-transformed z-scores from now on.
```{r logPCA}
PCA <- mvar.space(ZL)
ZL.pca <- mvar.projection(PCA, space="both")
mvar.pairs(ZL.pca, 1:4, Meta=Meta,
col=textcat20, col.vals=rainbow.20,
pch=variety, pch.vals=c(1, 3, 4),
cex=.6, legend.cex=.7, iso=TRUE, compact=TRUE)
save.pdf("pca4_type.pdf")
```
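The transform used to build `ZL` is not shown in this script. As an illustration of why a log transform dampens outliers while leaving small values almost untouched, here is a sketch that assumes a signed log transform of the form $\mathrm{sign}(z) \cdot \log(1 + |z|)$; the function name and the toy values are hypothetical.

```r
## ASSUMPTION: signed log transform sign(z) * log(1 + |z|);
## the actual definition behind ZL is not shown in this notebook
toy.signed.log <- function (z) sign(z) * log1p(abs(z))

toy.z <- c(-8, -1, 0, 0.5, 1, 8)      # toy z-scores including two outliers
print(round(toy.signed.log(toy.z), 2))
```

Such a transform is monotonic and symmetric around zero, so the ordering of texts on each feature is preserved while extreme z-scores are pulled in.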
Do the main dimensions of variation also capture differences between the language varieties? A few regions occupied by a single variety likely correspond to discrepancies between text categories in the three corpora.
```{r logPCAvar}
mvar.pairs(ZL.pca, 1:4, Meta=Meta,
col=variety, col.vals=bright.pal,
pch=variety, pch.vals=c(1, 3, 4),
cex=.6, legend.cex=.8, iso=TRUE, compact=TRUE)
save.pdf("pca4_var.pdf")
```
# LDA discriminant for text categories
## Minimally supervised LDA and rotation
We now carry out an LDA by text category. Since there are 20 distinct categories, the LDA yields up to 19 discriminant dimensions (one fewer than the number of categories), far more than can be inspected in a single scatterplot matrix.
```{r typeLDA}
lda.type <- mvar.discriminant(ZL, Meta$textcat20)
ByType <- mvar.space(ZL, lda.type, normalize=TRUE)
ByType.M <- mvar.projection(ByType, "both")
lda.type.P <- mvar.basis(ByType, "space")
mvar.pairs(ByType.M, 1:6, Meta=Meta,
col=textcat20, pch=variety, pch.vals=c(1, 3, 4), col.vals=rainbow.20,
cex=.6, legend.cex=.4, iso=TRUE, compact=TRUE)
save.pdf("lda6type_type.pdf", width=8, height=8)
```
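The number of discriminants is bounded by the number of categories minus one. A small sketch with `MASS::lda()` on the built-in `iris` data (3 classes, hence 2 discriminants) shows this; it does not reproduce `mvar.discriminant()`, whose internals are not shown here, and the `toy.*` names are hypothetical.

```r
library(MASS)                          # for lda()

## toy data: 3 classes, 4 features -> at most 3 - 1 = 2 discriminants
toy.lda <- lda(iris[, 1:4], grouping=iris$Species)
print(ncol(toy.lda$scaling))           # number of discriminant dimensions

## proportion of between-group variance captured by each discriminant
toy.trace <- toy.lda$svd^2 / sum(toy.lda$svd^2)
print(round(toy.trace, 3))
```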
There are `r ncol(ByType$basis)` LDA dimensions, most of which capture interesting and substantial differences between text categories. An SVM classifier shows that together they separate the 20 text categories fairly well (classification accuracy > 70%, although this estimate is optimistic because the LDA itself was trained on all texts).
```{r typeLDAsvm}
res.type <- svm(ByType.M[, 1:19], Meta$textcat20, kernel="radial", cross=10)
svm.report <- function (res) {
  acc <- mean(res$accuracies)
  cat(sprintf("Mean accuracy: %.1f%%\n", acc))
  cat("Cross-validation folds:\n")
  print(round(res$accuracies, 1))
}
svm.report(res.type)
```
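To see what `svm.report()` works with, the sketch below runs a 10-fold cross-validated SVM from `e1071` on the built-in `iris` data. The toy data and the `toy.*` names are hypothetical; only the `svm()` call mirrors the one above.

```r
library(e1071)                         # for svm()

set.seed(42)                           # folds are assigned at random
toy.svm <- svm(as.matrix(iris[, 1:4]), iris$Species, kernel="radial", cross=10)

print(round(toy.svm$accuracies, 1))    # one accuracy (in %) per fold
print(toy.svm$tot.accuracy)            # overall cross-validated accuracy (in %)
```

The per-fold values are already percentages, which is why `svm.report()` prints their mean without further scaling.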
However, this is not a useful perspective on the high-dimensional data set: it is impossible to grasp a `r ncol(ByType$basis)`-dimensional visualization intuitively and it would not provide a substantial reduction from the original feature space. The first 5 or 6 dimensions provide a discrimination accuracy well above 60%, and the first 4 are already close to 60%. We will therefore focus on LDA dims 1 to 4.
```{r typeLDAsvm6}
res.type <- svm(ByType.M[, 1:4], Meta$textcat20, kernel="radial", cross=10)
svm.report(res.type)
```
In order to estimate how much information is lost by focusing on these dimensions, we compute pairwise discrimination quality between text categories. The only practicable way seems to be to find a single LDA discriminant dimension for each category pair and compute classification accuracy and Cohen's $d$:
```{r discriminatePairwise}
discriminate.categories <- function (M, cats, digits=NULL) {
  stopifnot(length(cats) == 2, all(cats %in% levels(Meta$textcat20)))
  idx <- Meta$textcat20 %in% cats
  y <- droplevels(Meta$textcat20[idx])
  x <- M[idx, , drop=FALSE]
  res <- predict(lda(x, y)) # LDA classification + dimension scores
  acc <- 100 * sum(res$class == y) / length(y)
  d <- abs(cohen.d(res$x[y == cats[1]], res$x[y == cats[2]]))
  if (!is.null(digits)) {
    acc <- round(acc, digits)
    d <- round(d, digits)
  }
  data.frame(acc=acc, d=d, cat1=cats[1], cat2=cats[2], row.names=NULL,
             stringsAsFactors=FALSE)
}
discriminate.pairwise <- function (M, cats, sort=FALSE, digits=NULL) {
  cat.pairs <- combn(cats, 2, simplify=FALSE)
  res <- lapply(cat.pairs, discriminate.categories, M=M, digits=digits)
  res <- do.call(rbind, res)
  if (sort) res <- res[order(res$d), ]
  res
}
```
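The `cohen.d()` helper used above is not defined in this notebook. The sketch below shows one common definition (difference of means divided by the pooled standard deviation) applied to a two-class LDA on the built-in `iris` data. The pooled-SD variant is an assumption and the helper may differ in detail; all `toy.*` names are hypothetical.

```r
library(MASS)                          # for lda()

## ASSUMPTION: Cohen's d with pooled standard deviation
toy.cohen.d <- function (x, y) {
  nx <- length(x); ny <- length(y)
  s.pooled <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / s.pooled
}

## two-category toy problem: a single LDA discriminant separates the groups
toy.idx <- iris$Species %in% c("versicolor", "virginica")
toy.y <- droplevels(iris$Species[toy.idx])
toy.x <- as.matrix(iris[toy.idx, 1:4])
toy.fit <- lda(toy.x, toy.y)
toy.res <- predict(toy.fit, toy.x)     # predicted classes and discriminant scores

toy.acc <- 100 * mean(toy.res$class == toy.y)
toy.d <- abs(toy.cohen.d(toy.res$x[toy.y == "versicolor"],
                         toy.res$x[toy.y == "virginica"]))
cat(sprintf("accuracy = %.1f%%, |d| = %.2f\n", toy.acc, toy.d))
```

With 20 text categories, `combn()` produces `choose(20, 2)` = 190 such pairs.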
In the full LDA space, all pairs of text categories can be discriminated fairly well, but with fewer dimensions some categories collapse. We obtain pairwise discrimination scores for different numbers of LDA dimensions and combine them into a single data frame. Explore this table interactively in order to find text categories that collapse due to our focus on 4 dimensions.
```{r discriminateCategoriesLDA}
res <- discriminate.pairwise(ByType.M, types.textcat20, digits=2)
res.6 <- discriminate.pairwise(ByType.M[, 1:6], types.textcat20, digits=2)
res.4 <- discriminate.pairwise(ByType.M[, 1:4], types.textcat20, digits=2)
stopifnot(all.equal(res[, qw("cat1 cat2")], res.6[, qw("cat1 cat2")]),
all.equal(res[, qw("cat1 cat2")], res.4[, qw("cat1 cat2")]))
res$acc.6 <- res.6$acc; res$d.6 <- res.6$d
res$acc.4 <- res.4$acc; res$d.4 <- res.4$d
discrim.table <- res[order(res$d.4), ]
datatable(discrim.table, options=list(order=list(list(8, "asc"))),
caption = "Text category discrimination in full LDA space")
```
```{r, fig.width=6, fig.height=6}
discrim.mat <- matrix(0, length(types.textcat20), ncol=length(types.textcat20),
dimnames=list(types.textcat20, types.textcat20))
with(discrim.table, {
discrim.mat[cbind(cat1, cat2)] <<- acc.4
discrim.mat[cbind(cat2, cat1)] <<- acc.4
})
discrim.pal <- heat_hcl(20, h=c(-20, 90), l=c(20,100))
heatmap.2(discrim.mat, zlim=c(0, 100), col=discrim.pal, margins=c(12, 12),
cexRow=1.2, cexCol=1.2, srtRow=30, srtCol=60, trace="none",
keysize=1, key.xlab="discrimination accuracy",
main="discrimination accuracy in 4 LDA dims")
save.pdf("lda4type_discrimination.pdf", width=8, height=8)
```
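The heatmap matrix is filled by indexing with a two-column character matrix, which addresses cells by row and column name. A toy sketch of this base-R idiom, with hypothetical category names and accuracies:

```r
toy.cats <- c("a", "b", "c")
toy.mat <- matrix(0, length(toy.cats), length(toy.cats),
                  dimnames=list(toy.cats, toy.cats))
toy.pairs <- data.frame(cat1=c("a", "a", "b"), cat2=c("b", "c", "c"),
                        acc=c(90, 75, 60), stringsAsFactors=FALSE)

## a two-column character matrix addresses cells by (row name, column name)
toy.mat[cbind(toy.pairs$cat1, toy.pairs$cat2)] <- toy.pairs$acc
toy.mat[cbind(toy.pairs$cat2, toy.pairs$cat1)] <- toy.pairs$acc  # mirror image
print(toy.mat)
```

The diagonal keeps its initial value of 0, since no category is paired with itself.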
In an extension of our previous methodology, we now apply a rotation in the reduced target space so that interesting structure is better aligned with the subspace dimensions (instead of visually picking out an “axis” of interest). Since dims 1 and 2 are strongly correlated (Pearson $r$ = `r cor(ByType.M[,1], ByType.M[,2])`), a PCA-based rotation should align the diagonal axis with the first dimension. Note that a full PCA rotation in all four dimensions would lose too much of the discriminative structure brought out by the LDA (where the first dimension maximises the ratio of between-group and within-group variance). In an earlier version of the analysis we also flipped the two dimensions after rotation so that the largest variance is on the horizontal axis in the top-left panel of the scatterplot matrix and in 3D plots. However, it is much clearer to have the main dimension of variation as dimension 1.
```{r typeLDA4rotation}
ByType4 <- ByType %>% mvar.basis("space") %>% extract(, 1:4) %>% mvar.space(ZL, .)
ByType4 %<>% mvar.rotation("pca", dims=1:2) # %>% mvar.rotation("swap", dims=1:2)
ByType4.M <- mvar.projection(ByType4, "both")
ByType4.P <- mvar.basis(ByType4, "space")
```
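The effect of a PCA rotation restricted to the first two dimensions can be illustrated on simulated scores. This is a sketch of the general idea, not of `mvar.rotation()` itself, whose implementation is not shown here. It rotates the scores directly, whereas the script rotates the basis of the space. All `toy.*` names and the simulated data are hypothetical.

```r
set.seed(1)
## toy scores in a 4-dimensional space; dims 1 and 2 are strongly correlated
toy.n <- 200
toy.d1 <- rnorm(toy.n)
toy.S <- cbind(toy.d1, 0.8 * toy.d1 + 0.3 * rnorm(toy.n), rnorm(toy.n), rnorm(toy.n))
toy.S <- scale(toy.S, scale=FALSE)     # centre the columns

## PCA restricted to dims 1:2 gives a 2x2 orthogonal matrix
toy.R2d <- eigen(cov(toy.S[, 1:2]))$vectors
toy.R <- diag(4)
toy.R[1:2, 1:2] <- toy.R2d             # dims 3 and 4 are left untouched

toy.rot <- toy.S %*% toy.R
print(round(cor(toy.S[, 1], toy.S[, 2]), 3))      # before rotation
print(round(cor(toy.rot[, 1], toy.rot[, 2]), 3))  # after rotation: ~ 0
```

After the rotation, the first dimension carries the larger share of the variance within the plane spanned by dims 1 and 2, while dims 3 and 4 are unchanged.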
This four-dimensional latent space is the basis for all further analysis and interpretation.
```{r typeLDA4}
mvar.pairs(ByType4.M, 1:4, Meta=Meta,
col=textcat20, pch=variety, pch.vals=c(1, 3, 4), col.vals=rainbow.20,
cex=.6, legend.cex=.7, iso=TRUE, compact=TRUE)
save.pdf("lda4type_type.pdf", width=8, height=8)
```
There appear to be two overlapping “cigars” formed by the written and spoken texts. A scatterplot matrix colour-coded for mode shows this clearly.
```{r typeLDA4mode}
mvar.pairs(ByType4.M, 1:4, Meta=Meta,
col=mode, pch=variety, pch.vals=c(1, 3, 4), col.vals=bright.pal,
cex=.6, legend.cex=.8, iso=TRUE, compact=TRUE)
save.pdf("lda4type_mode.pdf", width=8, height=8)
```
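We create a custom version of the scatterplot matrix for inclusion in the paper, showing only the top row of the scatterplot matrix separately for spoken and written texts. The original code wrote it directly to a PDF file to ensure proper font sizes and layout; in this adapted version the `cairo_pdf()` call is commented out and the plot is copied to PDF with `save.pdf()` instead.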
```{r customLDA4scatterplot}
# cairo_pdf(file="pdf_repro/lda4type_for_paper.pdf", width=12, height=8)
pch.vec <- c(1, 3, 4)[Meta$variety]
col.vec <- rainbow.20[Meta$textcat20]
plot.panel <- function (d, idx, cex=1, # -> 3:4 aspect ratio
                        xlim=c(-2.05, 2.0), ylim=c(-3.1, 2.3)) {
  plot(ByType4.M[idx, d], ByType4.M[idx, 1],
       pch=pch.vec[idx], col=col.vec[idx],
       xlim=xlim, ylim=ylim, cex=cex,
       xlab="", ylab="", main="", xaxt="n", yaxt="n")
}
textcat.W <- unique(Meta$textcat20[Meta$mode == "written"])
idx.lW <- types.textcat20 %in% textcat.W
par(mfrow=c(2, 4), mar=c(0, 0, 0, 0)+.2)
idx.W <- Meta$mode == "written"
plot.panel(2, idx.W)
text(-2, -0.4, "Dim 1", cex=1.2, srt=90, font=2)
plot.panel(3, idx.W)
plot.panel(4, idx.W)
plot(0, 0, type="n", ann=FALSE, bty="n", xaxt="n", yaxt="n")
legend(0, 0, xjust=0.5, yjust=0.5, cex=1.3,
title="Written Texts", bty="n",
legend=types.textcat20[idx.lW],
fill=rainbow.20[idx.lW], border=rainbow.20[idx.lW])
idx.S <- Meta$mode == "spoken"
plot.panel(2, idx.S)
text(-2, -0.4, "Dim 1", cex=1.4, srt=90, font=2)
text(0, 2.3, "Dim 2", cex=1.4, font=2)
plot.panel(3, idx.S)
text(0, 2.3, "Dim 3", cex=1.4, font=2)
plot.panel(4, idx.S)
text(0, 2.3, "Dim 4", cex=1.4, font=2)
plot(0, 0, type="n", ann=FALSE, bty="n", xaxt="n", yaxt="n")
legend(0, 0, xjust=0.5, yjust=0.5, cex=1.3,
title="Spoken Texts", bty="n",
legend=types.textcat20[!idx.lW],
fill=rainbow.20[!idx.lW], border=rainbow.20[!idx.lW])
save.pdf("lda4type_for_paper.pdf", width=12, height=8)
# invisible(dev.off())
```
## LDA on original text categories
The original 32 text categories in the fine-grained ICE design schema were aggregated into 20 broader categories, which are more manageable in our visualisation-based approach. This seems to be corroborated by the fact that no differences between the finer sub-categories are visible in our four LDA dimensions: they are spread out and intermingled evenly across the broader category. However, another explanation is that our LDA -- based on the 20-category scheme -- has ignored differences between the subcategories, aiming to reduce variability within the broader category.
Here, we carry out an alternative LDA using the 32-category scheme as a “gold standard” and compare the first four latent dimensions. The dimensions are automatically reordered and flipped to best match those of the original LDA.
```{r type32LDA}
lda.type32 <- mvar.discriminant(ZL, Meta$textcat32)
ByType32 <- mvar.space(ZL, lda.type32[, 1:4], normalize=TRUE)
lda.type32.P <- mvar.basis(ByType32, "space") # original orthogonalised dims
ByType32 %<>% mvar.rotation("pca", dims=1:2) %>% mvar.rotation("match", basis=ByType4.P)
ByType32.M <- mvar.projection(ByType32, "both")
ByType32.P <- mvar.basis(ByType32, "space")
mvar.pairs(ByType32.M, 1:4, Meta=Meta,
col=textcat32, pch=variety, pch.vals=c(1, 3, 4), col.vals=rainbow.32,
cex=.6, legend.cex=.55, iso=TRUE, compact=TRUE)
save.pdf("lda4type32_type.pdf", width=8, height=8)
```
Overall, the visualisation looks reassuringly similar. We do get better separation between the more fine-grained categories, but the overall shape remains the same. It thus seems safe to report our findings based on the 20-category scheme.
We compute similarity of the two LDA analyses as the (fractional) number of common dimensions (see SIGIL Unit #7 for details), which indicates a very good match between the two spaces.
```{r LDAsimilarity}
mvar.similarity(ByType4.P, ByType32.P)
```
The expected $R^2$ for projection between the subspaces is `r sprintf("%.1f%%", 100 * mvar.similarity(ByType4.P, ByType32.P, "R2"))`. The vector of singular values shows that the two subspaces (almost) share three dimensions, with relatively high cosine similarity in the fourth dimension.
```{r LDAsimilarityDetails}
mvar.similarity(ByType4.P, ByType32.P, method="sigma")
```
Verify that we indeed get the same result without matching the basis vectors beforehand:
```{r LDAsimilarityCheck}
mvar.similarity(lda.type.P[, 1:4], lda.type32.P, method="sigma")
```
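The invariance checked above follows from a general property: the singular values of $\mathbf{A}^T \mathbf{B}$ for two orthonormal bases are the cosines of the principal angles between the subspaces and do not depend on the choice of basis within each subspace. The sketch below demonstrates this on random toy subspaces. How `mvar.similarity()` aggregates these values into a number of common dimensions or $R^2$ is not shown in this script, and all `toy.*` names are hypothetical.

```r
set.seed(1)
## orthonormal basis of a 2-dimensional subspace of a 5-dimensional space
toy.A <- qr.Q(qr(matrix(rnorm(10), 5, 2)))
## the same subspace, described by a rotated basis
toy.theta <- pi / 5
toy.rotA <- toy.A %*% matrix(c(cos(toy.theta), sin(toy.theta),
                               -sin(toy.theta), cos(toy.theta)), 2, 2)

## singular values of t(A) %*% B are the cosines of the principal angles
toy.sigma <- function (A, B) svd(crossprod(A, B))$d

print(round(toy.sigma(toy.A, toy.rotA), 6))   # same subspace, different basis

toy.B <- qr.Q(qr(matrix(rnorm(10), 5, 2)))    # unrelated random subspace
print(round(toy.sigma(toy.A, toy.B), 3))
```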
# Interpreting the dimensions
## Feature weights and boxplots
The standard approach in multidimensional analysis is to interpret the feature weights (or “factor loadings”) of each latent dimension directly. In our case, these weights are the coordinates of the orthogonal basis vectors of the LDA space. The barplot below visualizes only features $i$ that have a substantial weight $|p_{ij}| \geq .1$ in at least one dimension $j$. Keep in mind that feature weights are relative within each basis vector (because $\|\mathbf{p}_{\bullet j}\|_2 = 1$); a discriminant characterised by consistently large values of many different features would assign relatively low weights to all of them.
```{r LDAweights, fig.height=4}
ByType4.P <- mvar.basis(ByType4, "space")
idx.weights <- apply(abs(ByType4.P), 1, max) >= .1 # only show features with substantial weight
ggbar.weights(ByType4.P, feature.names=feature.names, names=paste("Dim", 1:4),
idx=idx.weights, ylim=c(-.75, .4))
save.pdf("lda4type_weights.pdf", width=12, height=8)
```
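The caveat about relative weights can be made concrete with a toy calculation: a unit-length basis vector that draws equally on $k$ features has weights of $1/\sqrt{k}$ each, which quickly falls below the display threshold of .1 as $k$ grows. The feature counts below are hypothetical and chosen only for illustration.

```r
## a discriminant that draws equally on k features has weights 1 / sqrt(k)
toy.equal.weights <- function (k) rep(1 / sqrt(k), k)

toy.few <- toy.equal.weights(4)        # each weight is 0.5
toy.many <- toy.equal.weights(400)     # each weight is 0.05

## both are unit vectors, but only the first passes the |weight| >= .1 filter
print(c(sum(toy.few^2), sum(toy.many^2)))
print(c(max(abs(toy.few)) >= .1, max(abs(toy.many)) >= .1))
```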
For comparison, we show the first two original LDA dimensions before PCA rotation (but only for the features selected above).
```{r LDAweightsOrig, fig.height=3, fig.width=6}
ggbar.weights(lda.type.P[, 1:2], feature.names=feature.names,
idx=idx.weights, ylim=c(-.75, .4),
names=c("Original LDA dim 1", "Original LDA dim 2"))
save.pdf("lda6type_weights.pdf", width=12, height=4.5)
```
For the paper, we create individual plots for each dimension with suitable margins and labelling. They are directly written to PDF files for optimal formatting.
```{r LDAweightsByDim}
dim2label <- c("LD1" = "Dim 1: conceptual speaking / conceptual writing",
"LD2" = "Dim 2: dialogic written / neutral",
"LD3" = "Dim 3: descriptive-narrative / instructive-regulative",
"LD4" = "Dim 4: neutral / online production")
for (d in seq_len(ncol(ByType4.P))) {
  cairo_pdf(file=sprintf("pdf_repro/lda4type_weights_dim%d.pdf", d),
            width=12, height=4)
  p <- ByType4.P[, d, drop=FALSE] %>%
    ggbar.weights(feature.names=feature.names, names=paste("Dim", d),# names=dim2label[d],
                  idx=idx.weights, ylim=c(-.75, .4), main=dim2label[d])
  print(p + theme(axis.text.x=element_text(size=14)))
  dev.off()
}
```
As we've argued before, especially for LDA-based dimensions, it is more informative to visualise how each feature pushes the texts from some category or group towards the positive or negative end of a dimension. We use a wrapper around `ggbox.features()` to pick out categories and plot accordingly.
```{r ggboxSelected, fig.width=12, fig.height=8}
ggbox.selected <- function (M, Meta, weights, cats,
                            variable="short20", colours=rainbow.20,
                            what="contribution", main="",
                            group.labels=FALSE, ...) {
  stopifnot(all(cats %in% names(colours)),
            all(cats %in% levels(Meta[[variable]])))
  group.vec <- as.character(Meta[[variable]])
  Meta$grouping <- factor(ifelse(group.vec %in% cats, group.vec, "other"),
                          levels=c("other", cats))
  col.values <- c("#666666", colours[cats])
  ggbox.features(M, Meta, what=what,
                 weights=weights, id.var="id",
                 group=grouping, group.palette=col.values,
                 feature.names=feature.names,
                 main=main, group.labels=group.labels, ...) +
    theme(strip.text.x=element_text(angle=70, hjust=0.2, vjust=0.2))
}
ggbox.selected(ZL, Meta, ByType4.P[, 1], select=idx.weights,
cats=qw("conv,acad", sep=",\\s*"),
main=dim2label["LD1"])
save.pdf("lda4type_box_example1.pdf", width=12, height=8)
```
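What `what="contribution"` computes inside `ggbox.features()` is not shown in this notebook. A plausible reading, assumed here, is that the contribution of a feature to a text's position is the feature value multiplied by the feature weight, so that the contributions of all features add up to the text's coordinate on the dimension. The toy sketch below illustrates that reading; the data and `toy.*` names are hypothetical.

```r
set.seed(1)
toy.Z <- matrix(rnorm(5 * 3), 5, 3,
                dimnames=list(paste0("text", 1:5), c("f1", "f2", "f3")))
toy.w <- c(f1=0.6, f2=-0.8, f3=0)      # unit-length weight vector

## ASSUMPTION: contribution = feature value x feature weight
toy.contrib <- sweep(toy.Z, 2, toy.w, "*")

## contributions of all features add up to the position on the dimension
toy.score <- drop(toy.Z %*% toy.w)
print(round(cbind(toy.contrib, score=toy.score), 2))
```

Under this reading, a feature with a large weight contributes little for texts whose feature value is close to the mean, which is what the boxplots are meant to reveal.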
We can also plot other dimensions, or select subsets of texts (e.g. a particular variety). The code chunk below illustrates relevant parameters, focusing on the third LDA dimension.
```{r ggboxSelectedExamples, fig.width=12, fig.height=8}
ggbox.selected(ZL, Meta, ByType4.P[, 3],
cats=qw("conv,acad", sep=",\\s*"),
select=idx.weights, subset=(shortvar == "NZ"),
main=sprintf("%s - New Zealand", dim2label["LD3"]))
save.pdf(file="lda4type_box_example2.pdf", width=12, height=8)
ggbox.selected(ZL, Meta, ByType4.P[, 3],
cats=qw("acad,popSci,creat", sep=",\\s*"),
variable="short12", colours=rainbow.12,
select=idx.weights,
main=sprintf("%s", dim2label["LD3"]))
save.pdf(file="lda4type_box_example3.pdf", width=12, height=8)
```
Boxes can be formed based on arbitrary metadata variables, provided that we have a suitably labelled vector of colour codes. Here we create one bespoke box with modified formatting for inclusion in the paper:
```{r ggboxSelectedForPaper, fig.width=12, fig.height=8}
ggbox.selected(ZL, Meta, ByType4.P[, 1], select=idx.weights,
cats=qw("conv,news", sep=",\\s*"),
main=dim2label["LD1"], ylim=c(-1.1, 1.0), group.labels=TRUE) +
theme(axis.text.x=element_text(angle=52, hjust=1), legend.position="none")
save.pdf("lda4type_box_figure4.pdf", width=12, height=9)
```