-
Notifications
You must be signed in to change notification settings - Fork 0
/
Part-II-Writeup.Rmd
1163 lines (886 loc) · 70.6 KB
/
Part-II-Writeup.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Final Data Analysis II"
subtitle: "Final Data Analysis"
author: "Yaoyao Fan, Lucie Jacobson, Zining Ma, Jiajun Song"
date: "`r Sys.Date()`"
output:
tufte::tufte_handout:
citation_package: natbib
latex_engine: xelatex
tufte::tufte_book:
citation_package: natbib
latex_engine: xelatex
link-citations: yes
---
```{r setup, include=FALSE}
library(tufte)
# invalidate cache when the tufte version changes
knitr::opts_chunk$set(tidy = FALSE, cache.extra = packageVersion('tufte'),
echo = F,cache=T)
options(htmltools.dir.version = FALSE)
library(tidyverse)
library(gridExtra)
library(mice)
library(randomForest)
library(car)
library(ggmosaic)
```
```{r read-data, echo=FALSE}
#Read in the training data:
load("paintings_train.Rdata")
load("paintings_test.Rdata")
#The Code Book is in the file `paris_paintings.md` provides more information about the data.
```
# Summary.
We are four art consultants analyzing the prices of auctioned paintings in Paris from the years 1764 to 1780. The principal objective of our analysis is to predict the final sale price of auctioned paintings in 18th century Paris, identifying the driving factors of painting prices and thereby determining instances of under- and over-valuation.
## Data.
The data utilized in the analysis is provided by Hilary Coe Cronheim and Sandra van Ginhoven, Duke University Art, Art History & Visual Studies PhD students, as part of the Data Expeditions project sponsored by the Rhodes Information Initiative at Duke. To begin, there are three subsets of the complete data set - one subset for training, one subset for testing, and one subset for validation. The training subset, which is utilized during exploratory data analysis and initial modelling, is comprised of 1,500 observations (paintings) of 59 variables that provide information pertaining to the origin and characteristics of the artworks.^[Detailed descriptions of all variables are available in the attached MD file, `paris_painting_codebook.md`.]
## Research Question.
What are significant predictors for the final auction sale of a given painting in Paris from the years 1764 to 1780? Is the resulting statistical model diagnostically adequate for the prediction of the sale price for a given painting?
## Why Our Work is Important.
"Speaking in the most basic economic terms, high demand and a shortage of supply creates high prices for artworks. Art is inherently unique because there is a limited supply on the market at any given time" ^[referenced from "Art Demystified: What Determines an Artwork’s Value?", available at https://news.artnet.com/market/art-demystified-artworks-value-533990]. Indisputably, art is extremely important across cultural and economic spheres. Art history provides exposure to and generates appreciation for historical eras and global culture, and thus correct art valuation provides a standard metric for both the trained and the untrained eye to distinguish amongst historical artworks, consequently influencing the framework of modern art as well.
# Exploratory Data Analysis.
Using EDA and any numerical summaries, get to know the data - identify what you might consider the 10 best variables for predicting `logprice` using scatterplots with other variables represented using colors or symbols, scatterplot matrices or conditioning plots.
## Response Variable.
To begin, we analyze the selected response variable, `price`, and the log-transformation of `price`, to ensure that the response variable is approximately normally distributed.
```{r price-histogram, fig.margin = TRUE, echo = FALSE, fig.cap="Histogram of Painting Price Fetched at Auction (Sales Price in Livres)", fig.width=6.5, fig.height=3.5}
#Clean and convert price to numeric.
paintings_train$price = as.numeric(gsub(",","",paintings_train$price))
xfit <- seq(-100, max(paintings_train$price), length = 100)
yfit <- dnorm(xfit, mean = mean(paintings_train$price), sd = sd(paintings_train$price))
ggplot() +
geom_histogram(aes(x = paintings_train$price),bins = 15, fill = "slategray3") +
geom_line(aes(x = xfit, y = yfit*2000*1500), col = "darkblue", size = 1) +
labs(x = "Livres", y = "Count")
```
```{r price-qq, fig.margin = TRUE, echo = FALSE, fig.cap="Normal probability plot of Painting Price Fetched at Auction (Sales Price in Livres)", fig.width=6.5, fig.height=3.5}
#Generate a normal probability plot.
ggplot() +
geom_qq(aes(sample = paintings_train$price), col = "dodgerblue4") +
geom_qq_line(aes(sample = paintings_train$price)) + labs(x = "Theoretical Quantiles", y = "Sample Quantiles")
```
From *Figure 1*, we observe that the distribution of the variable `price`, with range from 1 to 29000 (note: 1 livre sterling is approximately equal to $1.30 U.S. dollars), is strongly skewed to the right. This is corroborated by the normal probability plot for the data, which fails to conform to a linear trend. This is expected, as it is reasonable to assume that on average, prices of paintings at auction will fall within a reasonable budget range: the entire range, however, has a lower bound greater than 0 and potentially no upper bound - the price can be whatever an individual is willing and able to pay for a particular painting.
Given the histogram for `price` is strongly skewed, we now consider the log-transformation of the variable. Logarithmic transformation is a convenient means of transforming a highly skewed variable into a more closely normally-distributed variable, and this transformation is commonly used in economics and business for price data.
```{r logprice-histogram, fig.margin = TRUE, echo = FALSE, fig.cap="Histogram of Log Painting Price Fetched at Auction (Sales Price in Livres)", fig.width=6.5, fig.height=3.5}
#Generate a histogram. Outline for superimposed normal curve from:
xfit <- seq(min(paintings_train$logprice), max(paintings_train$logprice), length = 100)
yfit <- dnorm(xfit, mean = mean(paintings_train$logprice), sd = sd(paintings_train$logprice))
ggplot() +
geom_histogram(aes(x = paintings_train$logprice),bins = 15, fill = "slategray3", color = "black") +
geom_line(aes(x = xfit, y = yfit*0.5*1500), col = "darkblue", size = 1) +
labs(x = "Livres(Log)", y = "Count")
```
```{r logprice-qq, fig.margin = TRUE, echo = FALSE, fig.cap="Normal probability plot of Log Painting Price Fetched at Auction (Sales Price in Livres)", fig.width=6.5, fig.height=3.5}
#Generate a normal probability plot.
ggplot() +
geom_qq(aes(sample = paintings_train$logprice), col = "dodgerblue4") +
geom_qq_line(aes(sample = paintings_train$logprice)) + labs(x = "Theoretical Quantiles", y = "Sample Quantiles")
```
The histogram of the variable `logprice` now exhibits significantly less skew, and much more closely approximates the normal distribution. We also observe that the normal probability plot for the data follows a general linear trend, except in the tail areas of the distribution. We conclude that the conditions for inference regarding the distribution of the variable of interest are sufficiently met, and we continue with the exploratory data analysis.
## Data Manipulation.
To begin data manipulation, we categorize variables based on data type and analyze.
We first consider all character variables. We observe that the variable `lot` should be numeric. We then determine which character variables should be categorical factor variables, where the number of unique levels is restricted to less than 15^[We omit variables `sale`, `subject`, `authorstandard`, `material`, `mat` at this step. Further analysis determines that these variables cause multicollinearity and interpretability issues, and furthermore do not have sufficient numbers of observations in all levels to generate robust estimates.](this is an arbitrary cut-off point, but is necessary - variables with too many levels will not have enough observations in every level to generate robust estimates). To initially handle "NA" and blank observations, we:
- impute a value of "Unknown" to all "n/a" variables for `authorstyle`,
- a value of unknown ("X") to all blank observations for `winningbiddertype`,
- a value of unknown ("X") to all blank observations for `endbuyer`,
- a value of "Unknown" to all blank observations for `type_intermed`,
- a value of "Other" to all blank observations for `Shape`, and
- a value of "other" to all blank observations for `materialCat`.
```{marginfigure, echo=TRUE}
$Data Type$ | $Count$ |
---------------|----------|
character | 17 |
categorical | 10 |
continuous | 32 |
```
```{r, echo = FALSE}
chr_vars <- names(paintings_train)[map_lgl(paintings_train, ~ typeof(.x) == "character")]
chr_vars <- chr_vars[-c(2)] # `lot` should be numeric.
## Look for levels; determine if can be categorical.
uniques <- lapply(paintings_train[chr_vars], unique)
n.uniques <- sapply(uniques, length)
## We only want categories with less than 15 levels.
chr_vars <- chr_vars[n.uniques < 15]
df_chr <- paintings_train[chr_vars]
## Handle the "Unknown".
df_chr$authorstyle[df_chr$authorstyle == "n/a"] = "Unknown"
df_chr$winningbiddertype[df_chr$winningbiddertype == ""] = "X"
df_chr$endbuyer[df_chr$endbuyer == ""] = "X"
df_chr$type_intermed[df_chr$type_intermed == ""] = "Unknown"
df_chr$Shape[df_chr$Shape == ""] = "Other"
df_chr$materialCat[df_chr$materialCat == ""] = "other"
## Convert to factor.
df_chr <- df_chr %>% map_df(as.factor)
```
Our initial data analysis reveals that there are 7 unique levels for the variable `Shape`. We observe that two levels are "round" and "ronde", and two levels are "oval" and "ovale". We learn that "ronde" is the French word for "round" and "ovale" is the French word for "oval", and thus we combine observations in the respective levels. The resulting levels are: "squ_rect", "round", "oval", "octagon", "miniature", and "Other".
Similarly, multiple levels of the variable `authorstyle` are quite similar: "in the taste of", "in the taste", and "taste of": thus, we group all of these unique levels into one level, "in the taste of". A summary table of the character variables is presented below.
We then coerce all variables in the character type data frame to be of type factor.
```{r, echo = FALSE}
shape = df_chr$Shape
shape[shape == "ovale"] <- "oval"
shape[shape == "ronde"] <- "round"
shape <- droplevels(shape)
df_chr <- mutate(df_chr, Shape = shape)
style <- df_chr$authorstyle
style[style %in% c("in the taste", "taste of")] <- "in the taste of"
style <- droplevels(style)
df_chr <- mutate(df_chr, authorstyle = style)
```
`r margin_note("Summary of All Initial Character Variables. Note that here X and Unknown both stand for missingness or data not available. Such imputation may lead to bias in prediction. We should be careful with these variables.")`
```{r}
options(knitr.kable.NA = '')
knitr::kable(summary(df_chr)[, c(1:4)], format = "markdown")
knitr::kable(summary(df_chr)[, c(5:10)], format = "markdown")
```
## Missing Data.
We now identify factor, continuous, and discrete numeric variables, and generate a large data frame with all variables coerced to appropriate type. Let us determine which variables have unknown and/or missing data:
```{r, echo = FALSE}
num_vars <- names(paintings_train)[map_lgl(paintings_train, ~ typeof(.x) != "character")]
## Find factor variables.
uniques <- lapply(paintings_train[num_vars], unique)
n.uniques <- sapply(uniques, length)
fct_vars <- num_vars[n.uniques <= 3]
df_fct <- paintings_train[fct_vars] %>% map_df(as.factor)
## Numerical variables.
ctn_vars <- num_vars[n.uniques > 3]
df_ctn <- paintings_train[ctn_vars]
#df_ctn$year <- as.factor(df_ctn$year)
```
```{r, echo = FALSE}
df <- cbind(df_chr, df_fct, df_ctn)
```
```{r, echo = FALSE, fig.width = 10, fig.height = 6, fig.cap="Determining NA Observations in the Data"}
image(t(is.na(df) | df == "Unknown"), axes=FALSE)
axis(1, at = (0:(ncol(df)-1))/(ncol(df)-1),labels = colnames(df), las=3, cex.axis=0.6)
```
From *Figure 5*, we observe that the variables
`authorstyle`, `type_intermed`, `Interm`, `Height_in`, `Width_in`, `Surface_Rect`, `Diam_in`, `Surface_Rnd` and `Surface`
all have unknown and/or missing data. We will analyze these variables further, beginning with `authorstyle`.
From *Figure 6* we observe that data is not missing at random; the missingness is associated with our response. Thus, we cannot simply omit observations and we need to further analyze these predictors.
```{r, fig.margin=TRUE, fig.width=6.5, fig.height=3.5, fig.cap="Missingness Effect on Response",warning=FALSE}
df %>%
select(winningbiddertype, endbuyer,
Shape, materialCat,
logprice) %>%
gather(key = "key", value = "value", -logprice) %>%
ggplot(aes(x = value %in% c("X","Other","other"), y = logprice)) +
geom_boxplot(fill = "lightblue") +
labs(x = "If Missing", y="logprice") +
facet_wrap(~ key, scales = "free")
```
```{r, warning=FALSE, echo = FALSE,fig.margin=TRUE,fig.width=6.5, fig.height=3.5, fig.cap="Counts of Author Style for Auctioned Paintings"}
Authorstyle_df <- select(df, authorstyle)
Authorstyle_df$authorstyle <- reorder(Authorstyle_df$authorstyle, Authorstyle_df$authorstyle, function(x) -length(x))
Authorstyle_hist <- ggplot(data = Authorstyle_df, aes(x = authorstyle)) + geom_histogram(stat = "count", fill = "slategray3") + theme(axis.text.x = element_text(angle = 90)) + labs(x = "Author Style Category", y = "Counts") + theme(plot.title = element_text(hjust = 0.5))
Authorstyle_hist
```
From *Figure 7*, we observe that the majority of the observations for the variable `authorstyle` are "Unknown", with very few (or no) observations in the remaining levels. Consequently, this variable will likely not contribute much information for the prediction of `logprice` in any specified model, and the minimal number of observations included in the levels may generate extreme standard errors. Given this, we select not to include this term in model specification.
We will continue to analyze variables in the data set with significant numbers of `NA` observations.
Here, we observe that the majority of observations for `Diam_in`, the diameter of a painting in inches, and `Surface_Rnd`, the surface of a round painting, are `NA`. We note that the variable `Surface`, the surface of a painting in squared inches, effectively captures information for the size of a given painting. Including this variable in subsequent model specification captures information provided by the following variables:
`Height_in`, `Width_in`, `Surface_Rect` and `Surface_Rnd`. Thus, we will include `Surface` in subsequent model specification and omit variables that are directly related to `Surface` to avoid issues of multicollinearity.
```{marginfigure, echo=TRUE}
$Variable$ | $Number of Missing$ |
---------------|----------------------|
`Diam_in` | 1469 |
`Surface_Rnd` | 1374 |
```
For "NA" values in `Surface`, we use the package "mice"^[MICE is utilized under the assumption that the missing data are Missing at Random, MAR, and integrates the uncertainty within this assumption into its multiple imputation algorithm (referenced at https://stats.idre.ucla.edu/wp-content/uploads/2016/02/multipleimputation.pdf).] in R. MICE, Multivariate Imputation via Chained Equations, is considered more robust than imputing a single value (in practice, the mean of the data) for every missing value.
```{r, warning = FALSE, echo = FALSE,fig.margin=TRUE,fig.width=6.5,fig.height=3.5,fig.cap="Painting Price and Intermediary Involvement"}
count_Interm <- select(df, logprice, Interm, type_intermed) %>% filter(!is.na(Interm))
count_Interm$Interm <- ifelse(count_Interm$Interm == 0, "No", "Yes")
Interm_hist <- ggplot(data = count_Interm, aes(x = Interm)) + geom_histogram(stat = "count", fill = "slategray3") + labs(x = "Intermediary Involvement", y = "Counts")
Interm_boxplot <- ggplot(data = count_Interm, aes(x = Interm, y = logprice, fill = Interm)) + geom_boxplot(show.legend = F) + labs(x = "Intermediary Involvement", y = "Log Price (Livres)")
grid.arrange(Interm_hist, Interm_boxplot, ncol = 2)
```
We now consider `Interm`, a binary variable that indicates whether an intermediary is involved in the transaction of a painting. This variable consists of 395 `NA` observations, 960 `0` (no) observations, and 145 `1` (yes) observations. Given this, we observe that many auctioned painting sales appear to occur without the involvement of an intermediary. This information is directly related to `type_intermed`, the type of intermediary (B = buyer, D = dealer, E = expert), and is only valid for the observations where an intermediary is involved in the transaction of a painting. Consequently, we select to omit `type_intermed` from the data set. However, we do note that the variable `intermediary` may provide information for the prediction of `logprice`, as *Figure 8* indicates that the median sale price for paintings where an intermediary is involved is noticeably higher than the median sale price for paintings where an intermediary is not involved. While the variability is quite high for both the "No" and "Yes" levels, the boxplot where an intermediary is not involved does not exhibit significant skew, while the boxplot where an intermediary is involved exhibits left skew.
```{r, warning=FALSE, echo = FALSE,fig.margin=TRUE,fig.width=6.5,fig.height=3.5,fig.cap="Painting Price \n and Material Category"}
#unique(paintings_train$materialCat)
materialCat_df <- select(df, materialCat, logprice) %>% filter(df$materialCat != "")
materialCat_df$materialCat <- reorder(materialCat_df$materialCat, materialCat_df$materialCat, function(x) -length(x))
MaterialCat_hist <- ggplot(data = materialCat_df, aes(x = materialCat)) + geom_histogram(stat = "count", fill = "slategray3") + labs(x = "Material Category", y = "Counts")
MaterialCat_boxplot <- ggplot(data = materialCat_df, aes(x = materialCat, y = logprice, fill = materialCat)) + geom_boxplot(show.legend = F) + labs(x = "Material Category", y = "Log Price (Livres)")
grid.arrange(MaterialCat_hist, MaterialCat_boxplot, ncol = 2)
```
We now look at information pertaining to painting material. We observe that there are initially 3 variables in the data set that pertain to painting material: `material`, `materialCat`, and `mat`. The levels of `material` are in French, and the English translations are precisely the levels of the variable `materialCat`. Additionally, we see that the variable `mat` is comprised of more levels (17, excluding "blank" and "n/a") than the variable `materialCat`, and thus is not included in our data frame (restriction of levels < 15). Let us determine if the variable `materialCat` lends information for painting price.
From *Figure 9*, we observe that the material category with the greatest number of observations is canvas, and the material category with the least number of observations is copper. However, the boxplot indicates that paintings with copper material maintain higher mean sale prices than paintings with canvas material; this may give evidence to the statement that "shortage of supply creates high prices for artworks".
Finally, we determine that `year` should be a categorical variable in the data set. While time variables can be either quantitative or qualitative, it is best practice to consider `year` as a categorical variable: the year 1764, for example, is not an explicit measurement of 1,764 units: it is an indicator of the year of sale for a given painting. The range of `year` is (1764, 1780), which creates a factor variable with 17 levels. Given this, we opt to generate a new variable, `YearFactor`, with 6 levels:
Level 1: 1764, 1765, 1766
Level 2: 1767, 1768, 1769
Level 3: 1770, 1771, 1772
Level 4: 1773, 1774, 1775
Level 5: 1776, 1777
Level 6: 1778, 1779, 1780
This level determination, while not perfectly equal, maintains $n > 100$ observations in each level. Overall, we feel that potentially important time trends may be lost if the levels are split homogenously (resulting in year breaks), and so we opt for simple level grouping.
```{r}
year1 <- c(1764, 1765, 1766)
year2 <- c(1767, 1768, 1769)
year3 <- c(1770, 1771, 1772)
year4 <- c(1173, 1774, 1775)
year5 <- c(1776, 1777)
year6 <- c(1778, 1779, 1780)
df <- df %>%
mutate(YearFactor = ifelse(year %in% year1, 1, ifelse(year %in% year2, 2, ifelse(year %in% year3, 3, ifelse(year %in% year4, 4, ifelse(year %in% year5, 5, 6))))))
df$YearFactor <- as.factor(df$YearFactor)
```
```{r, warning=FALSE, echo = FALSE,fig.margin=TRUE,fig.width=6.5,fig.height=3.5,fig.cap="Transformation of Year to Group Factor"}
see_year <- ggplot(data = df, aes(x = year)) + geom_histogram(stat = "count", fill = "slategray3") + labs(x = "Year", y = "Count")
yearcount <- ggplot(data = df, aes(x = YearFactor)) + geom_histogram(stat = "count", fill = "slategray3") + labs(x = "YearFactor", y = "Count")
grid.arrange(see_year, yearcount, ncol = 2)
```
## Identification of Important Variables for the Prediction of Painting Price.
```{r, echo = FALSE}
#Omit selected variables.
df_chr2 <- df_chr %>%
select(- c("authorstyle", "type_intermed", "winningbiddertype"))
#options(knitr.kable.NA = '')
#landscape(kable(summary(df_chr2), caption = "Summary of Character Variables"))
```
```{r, echo = FALSE}
#FACTORS; modify Interm to add Unknown level.
intermna <- addNA(df_fct$Interm)
levels(intermna) <- c(levels(df_fct$Interm), "Unknown")
df_fct$Interm <- intermna
df_fct2 <- df_fct %>% mutate(Interm = intermna) %>% select(-count)
#options(knitr.kable.NA = '')
#landscape(kable(summary(df_fct2), caption = "Summary of Binary Factor Variables"))
```
```{r cache=T, echo = FALSE}
#CONTINUOUS; omit Height_in, Width_in, Surface_Rect, Diam_in, Surface_Rnd, Surface
#Select Surface
#Impute Surface
tempData <- mice(df_ctn[c("logprice", "Surface", "position", "year", "nfigures")], m = 5, maxit = 50, meth='pmm', seed = 521, printFlag = F)
df_ctn2 <- complete(tempData, 1)
year1 <- c(1764, 1765, 1766)
year2 <- c(1767, 1768, 1769)
year3 <- c(1770, 1771, 1772)
year4 <- c(1173, 1774, 1775)
year5 <- c(1776, 1777)
year6 <- c(1778, 1779, 1780)
df_ctn2 <- df_ctn2 %>%
mutate(YearFactor = ifelse(year %in% year1, 1, ifelse(year %in% year2, 2, ifelse(year %in% year3, 3, ifelse(year %in% year4, 4, ifelse(year %in% year5, 5, 6))))))
df_ctn2$YearFactor <- as.factor(df_ctn2$YearFactor)
#tempData$loggedEvents
#options(knitr.kable.NA = '')
#landscape(kable(summary(df_ctn2), caption = "Summary of Continuous Numeric Variables"))
```
```{r, echo = FALSE}
df2 <- cbind(df_chr2, df_fct2, df_ctn2)
```
A boxplot matrix of selected variables of character type for subsequent model specification:
```{r, cache = T, echo = FALSE, warning=FALSE,fig.width=10,fig.height=10}
## chr_vars
dealer_boxplot <- ggplot(data = df2, aes(x = dealer, y = logprice, fill = dealer)) + geom_boxplot(show.legend = F)
origin_boxplot <- ggplot(data = df2, aes(x = origin_author, y = logprice, fill = origin_author)) + geom_boxplot(show.legend = F)
originC_boxplot <- ggplot(data = df2, aes(x = origin_cat, y = logprice, fill = origin_cat)) + geom_boxplot(show.legend = F)
school_boxplot <- ggplot(data = df2, aes(x = school_pntg, y = logprice, fill = school_pntg)) + geom_boxplot(show.legend = F)
endbuyer_boxplot <- ggplot(data = df2, aes(x = endbuyer, y = logprice, fill = endbuyer)) + geom_boxplot(show.legend = F)
Shape_boxplot <- ggplot(data = df2, aes(x = Shape, y = logprice, fill = Shape)) + geom_boxplot(show.legend = F)
MaterialC_boxplot <- ggplot(data = df2, aes(x = materialCat, y = logprice, fill = materialCat)) + geom_boxplot(show.legend = F)
grid.arrange(grobs = list(dealer_boxplot, origin_boxplot, originC_boxplot, school_boxplot, endbuyer_boxplot, Shape_boxplot, MaterialC_boxplot), ncol = 2, bottom = "Boxplot of Character type predictors")
```
We note that different levels of `dealer` appear to have different medians of sale prices, with dealer "R" maintaining a higher median sale price than other dealers. We also note that paintings with Spanish author, origin classification, and school of painting appear to have noticeably higher median sale prices than other authors, origin classifications, and schools of painting (however, we know that there are limited observations pertaining to Spanish author and origin classifications in the data set, so this may not be a robust indication). Overall, all plots indicate trends within the variables that may be important for prediction of the auction price of paintings.
A boxplot matrix of selected variables of binary factor type for subsequent model specification:
```{r, warning = FALSE, cache = T, echo = FALSE, fig.width=10,fig.height=10}
## fct_vars
df2 %>%
select(logprice, names(df_fct2)) %>%
filter(df2$Interm != "Unknown") %>%
gather(key = "key", value = "value", -logprice) %>%
ggplot(aes(x = value, y = logprice, fill = value)) +
facet_wrap(~ key) +
geom_boxplot(show.legend = F) +
labs(caption = "Summary Matrix for Binary Factor Variables")
```
As expected, observations that equal 0 for all binary variables do not contribute information for the auction price of paintings. We note that the variables `lrgfont`, if a dealer devotes an additional paragraph (always written in a larger font size) about a given painting in a catalogue, `Interm`, if an intermediary is involved in the transaction of a painting, and `prevcoll`, if the previous owner of a given painting is mentioned, all have higher medians and higher price ranges with less variability than the other included variables. We also note that the variable `history`, if a description includes elements of history painting, appears to be associated with a lower median price on average.
A scatterplot matrix of the selected variables of continuous numeric type for subsequent model specification:
```{r, echo = FALSE,cache = T, echo = FALSE,warning=FALSE, fig.width=6.5,fig.height=3.5}
## ctn_vars
#Impute a value close to 1 for apparent outliers in the position variable.
df2$position[df2$position > 1] <- 0.99
df2$position <- as.numeric(df2$position)
df2 %>%
select(logprice, Surface, position, nfigures) %>%
gather(key = "key", value = "value", -logprice) %>%
ggplot(aes(x = value, y = logprice)) +
geom_point(alpha = 0.2, col = "darkblue", size = 0.5) +
facet_wrap(~ key, scales = "free", ncol = 2) + theme(plot.title = element_text(hjust = 0.5)) +
labs(x = "", caption = "Scatter Plot Matrix for Continuous Numerical Variables")
```
The variable `nfigures` refers to the number of figures portrayed in a given painting, if specified. Here, we observe that many paintings do not include any specified figures, and the prices for these paintings fall along the entire range of `logprice`. There may be a slight positive trend for paintings that do include figures. Given that this is a count variable with many zeroes, it is not appropriate to transform; previous research has shown that log-transformed count data generally performs poorly in model specification^[see O’Hara and Kotze, https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/j.2041-210X.2010.00021.x].
Continuing, we observe that the plot for `position` is a null plot with no trend. The plot for `Surface` indicates that there may be an association between the surface of a painting in squared inches and the price. Given the large range of the variable with several orders of magnitude, `Surface` should likely be log-transformed.
To further validate the transformation of `Surface`, we use the "powerTransform" method. The “powerTransform” function in R considers transformations of all variables simultaneously: both the explanatory variables and the selected response variable. This method operates under the idea that if the normality of the joint distribution of (Y, X) is improved, the normality of the conditional distribution of (Y|X) is improved. The output of the function shows the exact lambda value to which each variable should be respectively exponentiated. This makes for a quite confusing model that would be difficult to interpret. So, we consider the output values by the following rules:
- If an output value is close to 1, there is not strong evidence that a variable transformation is required.
- If an output value is close to 0.5, there is evidence that a square root transformation of the variable may be required.
- If an output is close to 0, there is evidence that a log transformation of the variable may be required.
```{r}
num_subset <- select(df2, logprice, Surface) %>%
mutate(logprice = (logprice + 0.01)) %>%
mutate(Surface = (Surface + 1))
```
```{r, warning=FALSE, echo = FALSE,fig.margin=TRUE,fig.width=6.5,fig.height=3.5,fig.cap="Results of Power Transform", cache=TRUE}
p_transform <- car::powerTransform(num_subset, family = "bcPower")
knitr::kable(p_transform$lambda, col.names = c("Suggest Order"),
caption = "Power Transformation")
```
From the results of the "powerTransform" method, we conclude that `logprice` does not need to be further transformed (as expected, given that this variable has already been log-transformed) and `Surface` should be log-transformed.
```{r}
df2 <- df2 %>%
mutate(Surface = log(Surface + 1))
#df2$year <- NULL
#summary(df2$Surface)
```
```{r, warning=FALSE, echo = FALSE,fig.margin=TRUE,fig.width=6.5,fig.height=3.5,fig.cap="Surface of Painting in Squared Inches, Log Transformation"}
Surface_histogram <- hist(df2$Surface, breaks = 15,
col = "slategray3", xlab = "log(Surface)", main = NULL)
xfit <- seq(min(df2$Surface), max(df2$Surface), length = 35)
yfit <- dnorm(xfit, mean = mean(df2$Surface), sd = sd(df2$Surface))
yfit <- yfit * diff(Surface_histogram$mids[1:2]) * length(df2$Surface)
lines(xfit, yfit, col = "darkblue", lwd = 2)
```
To further analyze potentially important predictor variables for `logprice`, we generate a random forest model. From the associated variable importance plot, we observe that the 10 variables resulting in the greatest increase in MSE are `YearFactor`, `Surface`, `dealer`, `lrgfont`, `position`, `endbuyer`, `origin_author`, `materialCat`, `paired`, and `finished`.
```{r, warning=FALSE, echo = FALSE}
set.seed(521)
objF <- randomForest(logprice ~ ., data = df2, importance = T)
#How to hide this plot?
Imp <- importance(objF)
```
```{r, fig.width=6.5,fig.height=3.5, fig.cap="Variable Importance based on RandomForest"}
Imp <- as.data.frame(Imp)
Imp$varnames <- rownames(Imp) # row names to column
colnames(Imp)[1] <- "IncreaseMSE"
rownames(Imp) <- NULL
ggplot(Imp, aes(x=reorder(varnames, IncreaseMSE), y= IncreaseMSE)) +
geom_point(col = "darkblue") +
geom_segment(aes(x=varnames,xend=varnames,y=0,yend=IncreaseMSE), col = "lightcyan4") +
ylab("Increase MSE, Percentage") +
xlab("Variable Name") +
theme(axis.text.y = element_text(size = 5)) +
coord_flip()
```
# Discussion of Preliminary Model Part I.
The model we specified in Part I:
`logprice` ~ `year` + `Surface` + `nfigures` + `engraved` + `prevcoll` + `paired` + `finished` + `relig` + `lands_sc` + `portrait` + `materialCat` +
`year:finished` + `year:lrgfont` + `Surface:artistliving`
For specification of this model, we used Akaike information criterion (AIC) for initial variable selection. The AIC is designed to select the model that produces a probability distribution with the least variability from the true population distribution^[referenced from “Akaike Information Criterion”, available at https://www.sciencedirect.com/topics/medicine-and-dentistry/akaike-information-criterion]. While the AIC may result in a fuller model than the Bayesian information criterion (BIC) - which penalizes model complexity more heavily - the AIC criterion may lead to higher predictive power. We then relied on Bayesian model averaging (BMA), which averages over models in a model class by posterior model probability to encompass the model uncertainty inherent in the variable selection problem^[referenced from "Package BMA", available at https://cran.r-project.org/web/packages/BMA/BMA.pdf], to extract the most important variables for use in our linear model. We extracted variables by obtaining the Highest Probability Model (HPM). Our resulting model explained approximately 40% of the variation in the training data (which we considered to be rather low, given the number of variables included in the model), and maintained coverage and RMSE statistics that were not better than the null model.
To improve upon our initial model, we now treat `year` as a factor variable and include `YearFactor` (please refer to EDA for a comprehensive review of this variable) in model specification instead of `year`. Furthermore, we log-transform `Surface`. Proper treatment and transformation of these variables should improve our model.
Given that `logprice` is nearly normally distributed, we do not see an immediate need to diverge from linear regression. Thus, we will again use AIC and BMA for variable selection. However, we will extract variables through the Best Predictive Model (BPM) instead of the HPM, as the BPM concludes with predictions that are closest to the Bayesian model averaging under squared error loss. Additionally, we will include more diagnostic plots to assess our model, and further analyze potential interaction terms. Then, we will consider more flexible modelling methods as needed.
# Development and Assessment of Model.
With our initial modelling results and improved EDA, we decide to further explore Bayesian model averaging.
To begin modeling, we use the “bas.lm” function to conduct Bayesian adaptive sampling for Bayesian model averaging and variable selection in linear models, via sampling without replacement from a posterior distribution on models ^[referenced from “bas.lm”, available at https://www.rdocumentation.org/packages/BAS/versions/1.5.3/topics/bas.lm]. We select the Bayesian information criterion (BIC) for the prior distributions of the coefficients in the regression (approximation to the Bayes factor for large samples), and assume the model prior distribution to be the uniform distribution. Selected sampling method is Markov Chain Monte Carlo (MCMC). We choose these priors because we do not have specific information that will inform our priors, and we want to generate a model with relatively high predictive power. We specify a full model where:
- `YearFactor` is included and `year` is excluded,
- `figures` (binary) is excluded given its high association with `nfigures` (number of figures in a given painting, if specified)
- `origin_cat` and `school_pntg` are excluded to avoid multicollinearity issues with similar variable `origin_author`
```{r, echo = FALSE, cache=TRUE}
library(BAS)
#Set seed to ensure results are reproducible.
set.seed(523)
#Fit the model using Bayesian linear regression.
bma_painting <- bas.lm(logprice ~ . -year -figures -origin_cat -school_pntg, data = df2,
prior = "BIC",
modelprior = uniform(), method = "MCMC")
```
```{r,fig.margin=TRUE, echo = FALSE, cache=TRUE,fig.width=6.5,fig.height=3.5,fig.cap="Covergence Plot of BMA"}
diagnostics(bma_painting, type = "pip", col = "dodgerblue4", pch = 16, cex = 1.5)
```
The plot above indicates if the posterior inclusion probability has converged under the Markov Chain Monte Carlo method. The posterior inclusion probability is the sum of all posterior probabilities associated with the models which includes a certain explanatory variable^[referenced from “What’s the meaning of a posterior inclusion probability (PIP) in Bayesian?”, available at https://www.animalgenome.org/edu/concepts/PPI.php]. From the plot, we observe that all of the points fall on the theoretical convergence line, indicating that the number of MCMC iterations is sufficient for the data in Bayesian model averaging and do not need to be increased.
Next, we plot the marginal inclusion probability and model space:
```{r,echo = FALSE,fig.width=5.5,fig.height=3.5}
plot(bma_painting, which = 4, ask = FALSE, caption = "", sub.caption = "", col.in = "darkturquoise", col.ex = "black", lwd = 1, cex.lab = 0.4)
```
```{r,echo = FALSE,fig.width=5.5,fig.height=3.5}
image(bma_painting, rotate = TRUE, cex.axis = 0.3)
```
Here, explanatory variables that significantly contribute to the prediction of auction price for a given painting - that is, explanatory variables with high marginal inclusion probabilities - are highlighted in blue. From the plot, we observe that the intercept (by default), `dealer`, `origin_author`, `diff_origin`, `artistliving`, `Interm`, `engraved`, `prevcoll`, `paired`, `finished`, `lrgfont`, `lands_sc`, `portrait`, `still_life`, `Surface`, and `YearFactor` all have marginal inclusion probabilities greater than 0.5. The model space visualization provides corroboration for the previous results.
The "bas.lm" algorithm leads to a hierarchical model that represents the full posterior uncertainty after viewing the data^[definition referenced from “An Introduction to Bayesian Thinking: A Companion to the Statistics with R Course”, available at https://statswithr.github.io/book/stochastic-explorations-using-mcmc.html#r-demo-on-bas-package]. We now want to define and generate a concrete model, namely, the best predictive model (BPM). The BPM concludes with predictions that are closest to the Bayesian model averaging under squared error loss. After generating the BPM model, we output the names of the explanatory variables included in the model. These variables are: intercept (by default), `dealer`, `origin_author`, `diff_origin`, `artistliving`, `Interm`, `engraved`, `prevcoll`, `paired`, `finished`, `lrgfont`, `lands_sc`, `portrait`, `still_life`, `other`, `Surface`, and `YearFactor`. This generally agrees with the Bayesian model averaging.
```{r}
set.seed(521)
BPM_prediction <- predict(bma_painting, estimator = "BPM", se.fit = TRUE)
#variable.names(BPM_prediction)
```
From this step, we fit a linear model with all variables identified by BPM, with additional variables identified in BMA that we feel may be important. We then use the Akaike information criterion (AIC) for further variable selection. Using this more parsimonious model, we fit a model with all possible two-way interactions to capture important interaction trends that are prevalent within the model and again use AIC to determine which variables and two-way interactions contribute significant information for the prediction of auction price of a given painting.
This results in a model that is quite overfit. Thus, we individually consider which interaction terms appear to be important. For all interactions involving levels where there are not sufficient numbers of observations, the resulting coefficient estimates are coerced to "NA". We do not include these interaction terms. Overall, the model summary indicates that the following interaction terms may be important: `dealer:difforigin`, `dealer:artistliving`, `dealer:paired`, `dealer:finished`, `materialCat:finished`, `prevcoll:finished`, `paired:lrgfont`, and `paired:YearFactor`. To briefly analyze these interactions, we generate a series of mosaic plots. A mosaic plot allows for identification of interactions between two or more categorical variables. The widths of the plot boxes correspond to the number of observations that comprise each level of the variable on the x-axis, while the heights of the plot boxes correspond to the number of observations that comprise each level of the variable on the y-axis. Overall, each plot indicates to some extent that there may potentially be an interaction effect, and we select to include all terms in subsequent model specification.
```{r, cache=TRUE,echo = FALSE,fig.width=5,fig.height=5,fig.cap="Mosaic plot"}
m1 <- ggplot(data = df2) +
geom_mosaic(aes(x = product(diff_origin, dealer), fill=diff_origin), na.rm=TRUE) + scale_x_productlist(name = "Dealer") + scale_y_productlist(name = "Diff. Author Origin: No, Yes") + scale_fill_manual(values=c("orange", "dodgerblue3")) + theme(legend.position = "none") + theme(plot.title = element_text(hjust = 0.5)) + theme(axis.text.x = element_text(angle = 90))
m2 <- ggplot(data = df2) +
geom_mosaic(aes(x = product(artistliving, dealer), fill=artistliving), na.rm=TRUE) + scale_x_productlist(name = "Dealer") + scale_y_productlist(name = "Artist Living: No, Yes") + scale_fill_manual(values=c("orange", "dodgerblue3")) + theme(legend.position = "none") + theme(plot.title = element_text(hjust = 0.5)) + theme(axis.text.x = element_text(angle = 90))
m3 <- ggplot(data = df2) +
geom_mosaic(aes(x = product(paired, dealer), fill=paired), na.rm=TRUE) + scale_x_productlist(name = "Dealer") + scale_y_productlist(name = "Paired: No, Yes") + scale_fill_manual(values=c("orange", "dodgerblue3")) + theme(legend.position = "none") + theme(plot.title = element_text(hjust = 0.5)) + theme(axis.text.x = element_text(angle = 90))
m4 <- ggplot(data = df2) +
geom_mosaic(aes(x = product(finished, dealer), fill=finished), na.rm=TRUE) + scale_x_productlist(name = "Dealer") + scale_y_productlist(name = "Finished: No, Yes") + scale_fill_manual(values=c("orange", "dodgerblue3")) + theme(legend.position = "none") + theme(plot.title = element_text(hjust = 0.5)) + theme(axis.text.x = element_text(angle = 90))
m5 <- ggplot(data = df2) +
geom_mosaic(aes(x = product(finished, materialCat), fill=finished), na.rm=TRUE) + scale_x_productlist(name = "Material Category") + scale_y_productlist(name = "Finished: No, Yes") + scale_fill_manual(values=c("orange", "dodgerblue3")) + theme(legend.position = "none") + theme(plot.title = element_text(hjust = 0.5)) + theme(axis.text.x = element_text(angle = 90))
m6 <- ggplot(data = df2) +
geom_mosaic(aes(x = product(finished, prevcoll), fill=finished), na.rm=TRUE) + scale_x_productlist(name = "Previous Owner Mentioned") + scale_y_productlist(name = "Finished: No, Yes") + scale_fill_manual(values=c("orange", "dodgerblue3")) + theme(legend.position = "none") + theme(plot.title = element_text(hjust = 0.5)) + theme(axis.text.x = element_text(angle = 90))
m7 <- ggplot(data = df2) +
geom_mosaic(aes(x = product(paired, lrgfont), fill=paired), na.rm=TRUE) + scale_x_productlist(name = "Lrgfont: No, Yes") + scale_y_productlist(name = "Paired: No, Yes") + scale_fill_manual(values=c("orange", "dodgerblue3")) + theme(legend.position = "none") + theme(plot.title = element_text(hjust = 0.5)) + theme(axis.text.x = element_text(angle = 90))
m8 <- ggplot(data = df2) +
geom_mosaic(aes(x = product(paired, YearFactor), fill=paired), na.rm=TRUE) + scale_x_productlist(name = "Year Factor") + scale_y_productlist(name = "Paired: No, Yes") + scale_fill_manual(values=c("orange", "dodgerblue3")) + theme(legend.position = "none") + theme(plot.title = element_text(hjust = 0.5)) + theme(axis.text.x = element_text(angle = 90))
grid.arrange(m1, m2, m3, m4, m5, m6, m7, m8, ncol = 3)
```
```{r}
#options(max.print=1000000)
#summary(AIC_res)
set.seed(523)
result_model2 <- lm(logprice ~ dealer + origin_author + diff_origin + artistliving + Interm + materialCat + engraved + prevcoll + paired + finished + lrgfont + lands_sc + portrait + still_life + Surface + YearFactor + dealer:diff_origin + dealer:artistliving + dealer:paired + dealer:finished + materialCat:finished + prevcoll:finished + paired:lrgfont + paired:YearFactor, data = df2)
AIC2 <- step(result_model2, k = 2, trace = FALSE)
#summary(AIC2)
```
After fitting the model, we determine that all included variables and terms contribute to the prediction of the auction price of a given painting. Performing an ANOVA test, the specified model is statistically significant at the $\alpha$ = 0.05 level and the results indicate that the model with all eight identified interaction terms is preferred to a more parsinomious model.
## Model Diagnostics.
```{r,fig.width=15,fig.height=10}
par(mfrow = c(2,2))
plot(AIC2, ask = F)
```
***Constant variability of residuals. ***
We observe that the fitted values form a horizontal line that very closely conforms to the residual = 0 line. While we note the presence of potential outliers, the plot indicates that the assumption of constant variability of residuals is met. We also note that this plot is improved in comparison to the "Residuals vs Fitted" plot for our initial model.
***Nearly normal residuals.***
To determine if the model has nearly normal residuals, we generate a normal probability plot. In the plot, the data are plotted by residuals generated from a theoretical normal distribution^[referenced from “Normal Probability Plot”, available at https://www.itl.nist.gov/div898/handbook/eda/section3/normprpl.htm]. The plot for the data follows a precise linear trend, and is improved from the "Normal Q-Q" plot for our initial model.
***Homoscedasticity.***
The “Scale-Location” plot is used to verify the assumption of equal variance in linear regression. If the assumption is met, the fitted values - plotted on the x axis - fall along a horizontal line with equal scatter. Here, we observe that the fitted values exhibit more equal scatter across the plot, forming a general horizontal band. Overall, the assumption of equal variance is met.
***Leverage and influential points.***
The “Residuals vs Leverage” plot is used to determine the presence of observations with high leverage using Cook’s distance. The Cook’s distance values are represented by red dashed lines, and observations that fall outside of the lines are considered to be observations with high leverage. From the plot above, we observe that no observations included in the model fit fall outside of the Cook’s distances, and the trend line very closely follows the horizontal standardized residual = 0 line. While observations 81, 114, and 770 are highlighted as observations with potentially high leverage relative to the data, the plot does not strongly indicate the presence of any potentially influential points.
## Discussion of how prediction intervals obtained
For linear model, it is very convenient to get the prediction intervals for new test data, using `predict.lm(obj, newdata = testdata, interval = "prediction")`
## Model testing
To test the model, we apply 5-folds cross validation on training data and see if the model has generalization error or still needs to be improved.
```{r, warning=FALSE}
trainres <- NULL
testres <- NULL
for (i in 1:5) {
index = 1:300 + 300*(i-1)
trainset = df2[-index, ]
testset = df2[index, ]
model <- lm(logprice ~ dealer + origin_author + diff_origin +
artistliving + Interm + materialCat + engraved + prevcoll +
paired + finished + lrgfont + lands_sc + portrait + still_life +
Surface + YearFactor + dealer:diff_origin + dealer:artistliving +
dealer:paired + dealer:finished + materialCat:finished +
prevcoll:finished + paired:lrgfont + paired:YearFactor,
data = trainset)
trainpred = exp(predict(model, interval = "prediction"))
testpred = exp(predict(model, newdata = testset, interval = "prediction"))
testerror = exp(testset$logprice) - exp(predict(model, testset))
trainerror = exp(trainset$logprice) - exp(predict(model))
coverage.sjj = function(y, pred) {
if (!all(c("lwr", "upr") %in% colnames(pred) )) return(0)
cov = mean((pred[,"lwr"] < y) & (pred[,"upr"] > y))
if (is.na(cov)) cov = 0
return(cov)
}
trainres <- rbind(trainres, data.frame(list(Bias = mean(trainerror),
Coverage = coverage.sjj(exp(trainset$logprice), trainpred),
maxDeviation = max(abs(trainerror)),
MeanAbsDeviation = mean(abs(trainerror)),
RMSE = sqrt(mean(trainerror^2)))))
testres <- rbind(testres, data.frame(list(Bias = mean(testerror),
Coverage = coverage.sjj(exp(testset$logprice), testpred),
maxDeviation = max(abs(testerror)),
MeanAbsDeviation = mean(abs(testerror)),
RMSE = sqrt(mean(testerror^2)))))
}
knitr::kable(rbind(apply(trainres,2,mean),
apply(testres,2,mean)),
digits = 5,
caption = "Average statistics under cross validation")
```
According to the summary table, first line is the evaluation metrics on training folds and second is on test folds. We observe that model achieves quite similar results on training folds or test folds, indicating that there does not exist overfitting issue. The coverage rate is satisfying. And when we continue to see how this model perform on test data, it does a rather good job actually, achieving above 95% coverage rate and around 1200 RMSE.
## Variables
Specific summary of this model is:
```{r, echo=FALSE}
star <- function(x) {
if (x < 0.001) return ("***")
if (x < 0.01) return ("**")
if (x < 0.1) return ("*")
if (x < 0.5) return (".")
return ("")
}
data.frame(summary(AIC2)$coef) %>%
mutate(Signifance = sapply(.[, 4], star)) %>%
cbind(confint(AIC2)) %>%
select(c(1,2,6:7, 5)) %>%
knitr::kable(digits = 5, caption = "Coefficients and Confidence Intervals",
format = "markdown")
```
From the summary table of variable estimates and confidence intervals, we find out that almost all the predictors are statistically signifcant in terms of 0.05 level.
```{marginfigure, echo=TRUE}
*Additional statistics:*
Residual standard error: 1.137 on 1447 degrees of freedom
Multiple R-squared: 0.6608
Adjusted R-squared: 0.6486
F-statistic: 54.2 on 52 and 1447 DF, p-value: < 2.2e-16
```
With interaction terms included and increase of number of factor levels, it does not make much sense to talk about the interpretation of one single predictor as it is closed related to other predictors indicated by the model. We still could observe important variable or variable combinations that make a painting expensive. For example, pictures with large font introduced in the subject is expected to be 219.17% more expensive than those not. Also, pictures with type R dealer and not living artist is expected to be 14.67% more expensive than those with other type dealer and artist still live.
##
***To pursue a model with lowest RMSE on test set, we have developed another model which we think is also very interesting and meaningful to included in the report.***
# Random Forest & Linear Regression Model.
Now the question is: can we improve the model?
```{r}
train=paintings_train %>%
mutate(year=as.factor(year)) %>%
mutate(Surface=ifelse(is.na(Surface),0,Surface)) %>%
mutate(Interm=ifelse(is.na(Interm),0,Interm)) %>%
mutate(logS=log(Surface+1)) %>% mutate(logS_n0=logS>0) %>%
mutate(lognf=log(nfigures+1)) %>% mutate(lognf_n0=lognf>0) %>%
filter(origin_author!='A',school_pntg!='A',winningbiddertype!='DB') %>%
mutate_if(is.character, as.factor)
treeFctr=c('year','dealer','origin_author','origin_cat','school_pntg','endbuyer','Shape','materialCat','winningbiddertype')
treeBnry=c('engraved','prevcoll','paired','figures','finished','lrgfont','lands_sc','lands_ment','arch','othgenre','portrait','still_life','discauth','history','pastorale','diff_origin','Interm')
```
To improve the performance of our model, we now consider a two-part model: first, we introduce the tree-based method of random forests to fit an approximate value, and then fit the residual using linear regression. This implementation is similar to boosting, where trees are grown sequentially, using information from previously fit trees.
Overall, the approach is intuitive: we first classify a given painting in a large class, and then account for differences based on painting features.
Here, we note that for the variable `nfigures`, most observations are 0 with the remaining observations sparsely distributed over a large range (please refer to EDA for graph). Thus, we will log-transform this count variable for subsequent model specification.
After log-transformations, we find that there are linear relationships for both log(Surface) and log(nfigures) on range where the value is greater than 0. Our objective is to fit a linear specification for observations with values greater than 0, and fit a point estimation for observations with values equal to 0.
We use a such a method: (Take log-Surface as an example)
1. Create an indicator 'logS_n0', which equals 1 if Surface does not equal 0, and equals 0 otherwise.
2. Use 'logS_n0 + logS' in the model formula. Thus, the final model could be:
$$fit \mid (Surface>0) \space\space= b_1+k*log\_S$$
$$fit \mid (Surface = 0) \space\space= b_0$$
where $b_0\neq b_1$.
## Random Forest.
First, we fit a random forest using relevant predictors (selected from previous linear modelling) and analyze fit:
```{r,echo=F, cache=TRUE,fig.width=5.5,fig.height=3.5}
set.seed(523)
rf.formula = as.formula(paste0(c('logprice~','logS_n0+lognf_n0+lognf+logS+',paste0(c(treeFctr,treeBnry),collapse = '+'))))
rf=randomForest(rf.formula,data=na.omit(train %>% select(logprice,treeBnry,treeFctr,logS_n0,lognf_n0,lognf,logS)))
pred.rf=predict(rf,newdata=train)
ggplot()+
geom_point(aes(x=train$logprice, y=pred.rf),
size = 0.8, alpha = 0.5) +
geom_abline(slope = 1,intercept = 0) +
labs(x='True logprice',y='Fitted logprice',
caption = "Fitted logprice vs True logprice")
```
In this plot, we find that when the true value is low (<5), the model tends to overestimate. When the true value is high (>5), it will underestimate. Thus, it is clear that there is a pattern between residuals and `logprice`. Analyzing further, we have:
```{r,fig.width=5.5,fig.height=3.5}
ggplot()+
geom_point(aes(x=train$logprice, y=train$logprice-pred.rf),
size = 0.8, alpha = 0.5)+
geom_hline(yintercept = 0)+
labs(x='True logprice',y='Residual = true - fitted',
caption = "Residual vs True logprice")
```
There is a clear linear relationship between the residuals of the random forest model and the true values. So, we consider fitting the residuals using linear regression in the next step.
## Linear Regression for Residual.
We now fit a linear model with the predictor variables and significant interactions identified in previous modelling efforts.
Interactions included here are:
- `dealer:diff_origin`
- `engraved:prevcoll`
- `prevcoll:finished`
- `paired:lrgfont`
- `paired:year`
- `materialCat:finished`
```{r,echo=F, warning = FALSE}
rf.se.fl = as.formula(paste0(c('se~','logS_n0+lognf_n0+ dealer:diff_origin + engraved:prevcoll + prevcoll:finished + paired:lrgfont + paired:year + materialCat:finished+',paste0(c(treeFctr,treeBnry),collapse = '+'))))
lm.se=lm(rf.se.fl,data=na.omit(train %>% select(logprice,treeBnry,treeFctr,logS_n0,lognf_n0,lognf,logS) %>%
mutate(se=logprice-pred.rf)))
se.rf=predict(lm.se,newdata=train)
```
How well we fitted residual:
```{r, fig.width=6.5,fig.height=3.5}
ggplot()+geom_point(aes(x=se.rf,y=lm.se$model$se))+geom_abline(slope = 1,intercept = 0)+
labs(x='fitted residual',y='true residual')
```
## Prediction Interval
Our final prediction is of the form:
$$\hat{y}=\hat{y_{rf}}+\hat{residual_{rf}}$$
Our assumption of the model is:
$$y=\hat{y_{rf}}+\hat{residual_{rf}}+\epsilon$$
We can estimate $Var(\epsilon)$ by $$\hat{Var(\epsilon)}=Var(y-\hat{y})$$
And we can obtain prediction interval of $\hat{residual_{rf}}$ from the linear model.
Unfortunately, we cannot obtain a prediction interval of $\hat{y_{rf}}$. Instead, what we could do is to give an under-estimated prediction interval based on $Var(\epsilon)$ and $\hat{residual_{rf}}$.
Sepcifically, we assume that:
$$y \sim Normal(\hat{y},Var(\epsilon)+Var(\hat{residual_{rf}}))$$
Thus the prediction interval is:
$$\hat{y}\pm 1.96*\sqrt{Var(\epsilon)+Var(\hat{residual_{rf}})}$$
```{r,test_prediction, warning = FALSE}
test=paintings_test %>%
mutate(year=as.factor(year)) %>%
mutate(Surface=ifelse(is.na(Surface),0,Surface)) %>%
mutate(Interm=ifelse(is.na(Interm),0,Interm)) %>%
mutate(logS=log(Surface+1)) %>% mutate(logS_n0=logS>0) %>%
mutate(lognf=log(nfigures+1)) %>% mutate(lognf_n0=lognf>0) %>%
mutate(Shape=ifelse(Shape=='octogon','octagon',Shape)) %>%
mutate(winningbiddertype=ifelse(winningbiddertype=='EB','EBC',winningbiddertype)) %>%
mutate_if(is.character, as.factor) %>%
select(logprice,treeBnry,treeFctr,logS_n0,lognf_n0,lognf,logS)
pred.rf.test=predict(rf, newdata=test)
se.rf.test=predict(lm.se,newdata = test,interval = "pred")
y.pred=pred.rf+se.rf
episq=var(y.pred-train$logprice)
sd=sqrt(((se.rf.test[,'upr']-se.rf.test[,'lwr'])/3.92)^2+episq)
pred.log.test=list(fit=pred.rf.test+se.rf.test[,'fit']
,upr=pred.rf.test+se.rf.test[,'fit']+1.96*sd
,lwr=pred.rf.test+se.rf.test[,'fit']-1.96*sd
) %>% as.data.frame()
predictions = as.data.frame(
exp(pred.log.test))
save(predictions, file="predict-test.Rdata")
```
```{r,validate_prediction, warning=FALSE}
load("paintings_validation.Rdata")
valid=paintings_validation %>%
mutate(year=as.factor(year)) %>%
mutate(Surface=ifelse(is.na(Surface),0,Surface)) %>%
mutate(Interm=ifelse(is.na(Interm),0,Interm)) %>%
mutate(logS=log(Surface+1)) %>% mutate(logS_n0=logS>0) %>%
mutate(lognf=log(nfigures+1)) %>% mutate(lognf_n0=lognf>0) %>%
mutate(Shape=ifelse(Shape=='octogon','octagon',Shape)) %>%
mutate(winningbiddertype=ifelse(winningbiddertype=='EB','EBC',winningbiddertype)) %>%
mutate(winningbiddertype=ifelse(winningbiddertype=='DB','D',winningbiddertype)) %>%
mutate(origin_author=ifelse(origin_author=='A','F',origin_author)) %>%
mutate_if(is.character, as.factor) %>%
select(logprice,treeBnry,treeFctr,logS_n0,lognf_n0,lognf,logS)
levels(valid$Shape)=levels(train$Shape)
pred.rf.valid=predict(rf, newdata=valid)
se.rf.valid=predict(lm.se,newdata = valid,interval = "pred")
y.pred=pred.rf+se.rf
episq=var(y.pred-train$logprice)
sd=sqrt(((se.rf.valid[,'upr']-se.rf.valid[,'lwr'])/3.92)^2+episq)
pred.log.valid=list(fit=pred.rf.valid+se.rf.valid[,'fit']
,upr=pred.rf.valid+se.rf.valid[,'fit']+1.96*sd
,lwr=pred.rf.valid+se.rf.valid[,'fit']-1.96*sd
) %>% as.data.frame()
predictions = as.data.frame(
exp(pred.log.valid))
save(predictions, file="predict-validation.Rdata")
```
## Model evaluation
```{r}
## cross validation
coverage = function(y, pred) {
if (!all(c("lwr", "upr") %in% colnames(pred) )) return(0)
mean((pred[,"lwr"] < y) & (pred[,"upr"] > y))
}
cross_validation=function(K=10){
ntrain=nrow(train)
rindex=sample(1:ntrain)
ngrp=ceiling(ntrain/K)
rindex=lapply(1:K,function(i){
rindex[((i-1)*ngrp+1):(min(i*ngrp,ntrain))]
})
estimations=matrix(NA,K,6)
for(i in 1:K){
temp_train=train[-rindex[[i]],]
temp_test =train[rindex[[i]],]
temp_test = temp_test %>%
filter(year %in% temp_train$year) %>%
filter(dealer %in% temp_train$dealer) %>%
filter(origin_author %in% temp_train$origin_author) %>%
filter(origin_cat %in% temp_train$origin_cat) %>%
filter(school_pntg %in% temp_train$school_pntg) %>%
filter(endbuyer %in% temp_train$endbuyer) %>%
filter(Shape %in% temp_train$Shape) %>%
filter(materialCat %in% temp_train$materialCat) %>%
filter(winningbiddertype %in% temp_train$winningbiddertype)
rf=randomForest(rf.formula,data=na.omit(temp_train %>%
select(logprice,treeBnry,treeFctr,logS_n0,lognf_n0,lognf,logS)))
pred.rf=predict(rf,newdata=temp_train)
rf.se.fl = as.formula(paste0(c('se~','logS_n0+lognf_n0+ dealer:diff_origin + engraved:prevcoll + prevcoll:finished + paired:lrgfont + paired:year + materialCat:finished+',paste0(c(treeFctr,treeBnry),collapse = '+'))))
lm.se=lm(rf.se.fl,data=na.omit(temp_train %>%
select(logprice,treeBnry,treeFctr,logS_n0,lognf_n0,lognf,logS) %>%
mutate(se=logprice-pred.rf)))
se.rf=predict(lm.se,newdata=temp_train)
pred.rf.test=predict(rf, newdata=temp_test)
se.rf.test=predict(lm.se,newdata = temp_test,interval = "pred")
y.pred=pred.rf+se.rf
episq=var(y.pred-temp_train$logprice)
sd=sqrt(((se.rf.test[,'upr']-se.rf.test[,'lwr'])/3.92)^2+episq)
pred.log.test=list(fit=pred.rf.test+se.rf.test[,'fit']
,upr=pred.rf.test+se.rf.test[,'fit']+1.96*sd
,lwr=pred.rf.test+se.rf.test[,'fit']-1.96*sd
) %>% as.data.frame()
predictions = as.data.frame(
exp(pred.log.test))
error = exp(temp_test$logprice) - predictions[, "fit"]
Bias = mean(error)
Coverage = coverage(exp(temp_test$logprice), predictions)
maxDeviation = max(abs(error))
MeanAbsDeviation = mean(abs(error))
RMSE= sqrt(mean(error^2))
estimations[i,] = c(Bias, Coverage, maxDeviation, MeanAbsDeviation, RMSE, nrow(temp_test))
}
apply(estimations,2,mean)
}
```
##
***Random Forest***
```{r, fig.width=6.5,fig.height=3.5, fig.cap="Variable Importance based on RandomForest"}
Imp <- importance(rf)
Imp <- as.data.frame(Imp)
Imp$varnames <- rownames(Imp) # row names to column
colnames(Imp)[1] <- "IncreaseMSE"
rownames(Imp) <- NULL
ggplot(Imp, aes(x=reorder(varnames, IncreaseMSE), y= IncreaseMSE)) +
geom_point(col = "darkblue") +
geom_segment(aes(x=varnames,xend=varnames,y=0,yend=IncreaseMSE), col = "lightcyan4") +
ylab("Increase MSE, Percentage") +
xlab("Variable Name") +
theme(axis.text.y = element_text(size = 5)) +
coord_flip()
```
From importance plot of random forest model, we see that year, Surface, winning bidder type, large font, end buyer, dealer and original author are important in decision making. Disparate to what we observed in EDA, 0/1 factor predictors and nfigures do not appear to play a relatively important role in the prediction of auction price for a given painting.
##
***Linear Regression for Residual.***
```{r}
sumry.se=summary(lm.se)
#anova(lm.se)
sumry.tb=data.frame(R=sqrt(sumry.se$r.squared),R2=sumry.se$r.squared,R2_adj=sumry.se$adj.r.squared,std_err=sumry.se$sigma) %>% round(2)
```
Summary table:
```{r}
knitr::kable(sumry.tb, col.names =c("$R$", "$R^2$","$R^2_{adj}$","$\\sigma^2$"))
```
Objectively, we should conclude that this linear regression on residual is not significant. There are only 5 significant variables at the $\alpha=0.05$ level. Furthermore, the model does not explain more than 10% of total variation.
However, we include the regression in our model because there is a distinct difference in performance on the test data: with this linear regression step, RMSE on test data could be under 1000; RMSE could be greater than 1200 if we omit this step.
Besides the discussion of whether to include or omit this step in our model, we also tried variable selection with AIC. Although this does result in increased significant predictors, performance on test data is worse (with RMSE greater than 1200). Thus, we decide to utilize the full linear model in this step.
##
***Performance***
Coverage on training data (in livre):
```{r,train_prediction, warning = FALSE}
pred.rf.train=predict(rf, newdata=train)
se.rf.train=predict(lm.se,newdata = train,interval = "pred")
y.pred=pred.rf+se.rf
episq=var(y.pred-train$logprice)
sd=sqrt(((se.rf.train[,'upr']-se.rf.train[,'lwr'])/3.92)^2+episq)
pred.log.train=list(fit=pred.rf.train+se.rf.train[,'fit']
,upr=pred.rf.train+se.rf.train[,'fit']+1.96*sd
,lwr=pred.rf.train+se.rf.train[,'fit']-1.96*sd
) %>% as.data.frame()
predictions = as.data.frame(
exp(pred.log.train))
save(predictions, file="predict-train.Rdata")
```
```{r, fig.width=5.5, fig.height=3.5}
cover.train=data.frame(y=train$logprice,fit=pred.log.train$fit,lwr=pred.log.train$lwr,upr=pred.log.train$upr) %>% arrange(fit)
ggplot(data=exp(cover.train))+geom_ribbon(aes(x=fit,ymin=lwr,ymax=upr),fill='blue',alpha=0.5)+
geom_point(aes(x=fit,y=y))+labs(x='fitted price',y='true price')+geom_abline(slope=1,intercept = 0) +
labs(caption = "Coverage plot on training data")
```
We can see that the prediction interval covers most of the true values, and overall provides a decent fit on points with large values.
```{r, fig.width=5.5, fig.height=3.5, fig.margin = TRUE, fig.cap="Residual plot on training data"}
resi.train=data.frame(residual=train$logprice-pred.log.train$fit,fit=pred.log.train$fit)
ggplot(data=resi.train)+
geom_point(aes(x=fit,y=residual))+labs(x='fitted logprice',y='residual')+geom_hline(yintercept = 0)
```