-
Notifications
You must be signed in to change notification settings - Fork 4
/
Copy path03-data.Rmd
1312 lines (928 loc) · 50.9 KB
/
03-data.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
# (PART) Introduction to R {-}
# Data Structures {#data}
There are different data types that are commonly used in R among which the most important ones are the following:
- **Numeric** (or double): these are used to store real numbers. Examples: -4, 12.4532, 6.
- **Integer**: examples: 2L, 12L.
- **Logical** (or boolean): examples: `TRUE`, `FALSE`.
- **Character**: examples: `"a"`, `"Bonjour"`.
In R there are five types of data structures in which elements can be stored. A data structure is said to *homogeneous* if it only contains elements of the same type (for example it only contains character or only numeric values) and *heterogenous* if it contains elements of more than one type. The five types of data structures are commonly summarized in a table similar to the one below [see e.g. @wickham2014advanced]:
Table: (\#tab:ds) Five most common types of data structures used in R [@wickham2014advanced].
Dimension Homogenous Heterogeneous
------------- ------------- ----------------
1 Vector List
2 Matrix Data Frame
n Array
Consider a simple data set of the top five male single tennis players presented below:
Table: (\#tab:tennis) Five best male single tennis players as ranked by ATP (09-30-2019).
| Name | Date of Birth | Place of Birth | Place of Residence | ATP Ranking | Prize Money | Win Percentage | Grand Slam Wins | Pro Since |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Novak Djokovic | 22 May 1987 | Belgrade, Serbia | Monte Carlo, Monaco | 1 | 135,259,120 | 82.7 | 16 | 2003 |
| Rafael Nadal | 3 June 1986 | Manacor, Spain | Manacor, Spain | 2 | 115,178,858 | 83.1 | 19 | 2001 |
| Roger Federer | 8 August 1981 | Basel, Switzerland | Wollerau, Switzerland | 3 | 126,840,700 | 82.2 | 20 | 1998 |
| Daniil Medvedev | 11 February 1996 | Moscow, Russia | Monte Carlo, Monaco | 4 | 8,171,483 | 63.6 | 0 | 2014 |
| Dominic Thiem | 3 September 1993 | Wiener Neustadt, Austria | Lichtenwörth, Austria | 5 | 18,588,662 | 64.3 | 0 | 2011 |
Notice that this data set contains a variety of data types; in the next sections we will use this data to illustrate how to work with five common data structures.
## Vectors
A vector has three important properties:
```{block2, note-text, type='rmdimportant'}
- The **Type** of objects contained in the vector. The function `typeof()` returns a description of the type of objects in a vector.
- The **Length** of a vector indicates the number of elements in the vector. This information can be obtained using the function `length()`.
- **Attributes** are metadata attached to a vector. The functions `attr()` and `attributes()` can be used to store and retrieve attributes (more details can be found in Section \@ref(vectattr)).
```
`c()` is a generic function that combines arguments to form a vector. All arguments are coerced to a common type (which is the type of the returned value) and all attributes except names are removed. For example, consider the number of grand slams won by the five players considered in the eighth column of Table \@ref(tab:tennis):
```{r}
grand_slam_win <- c(16, 19, 20, 0, 0)
```
To display the values stored in `grand_slam_win` we could simply enter the following in the R console:
```{r}
grand_slam_win
```
Alternatively, we could have created and displayed the value by using `()` around the definition of the object itself as follows:
```{r}
(grand_slam_win <- c(16, 19, 20, 0, 0))
```
Various forms of "nested concatenation" can be used to define vectors. For example, we could also define `grand_slam_win` as
```{r}
(grand_slam_win <- c(16, c(19, 20, c(0, c(0)))))
```
This approach is often used to assemble vectors in various ways.
It is also possible to define vectors with characters, for example we could define a vector with the player names as follows:
```{r}
(players <- c("Novak Djokovic", "Rafael Nadal", "Roger Federer",
"Daniil Medvedev", "Dominic Thiem"))
```
### Type
We can evaluate the kind or type of elements that are stored in a vector using the function `typeof()`. For example, for the vectors we just created we obtain:
```{r}
typeof(grand_slam_win)
typeof(players)
```
This is a little surprising since all the elements in `grand_slam_win` are integers and it would seem natural to expect this as an output of the function `typeof()`. This is because R considers any number as a "double" by default, except when adding the suffix `L` after an integer. For example:
```{r}
typeof(1)
typeof(1L)
```
Therefore, we could express `grand_slam_win` as follows:
```{r}
(grand_slam_win_int <- c(16L, 19L, 20L, 0L, 0L))
typeof(grand_slam_win_int)
```
In general, the difference between the two is relatively unimportant.
### Coercion
As indicated earlier, a vector has a homogeneous data structure meaning that it can only contain a single type among all the data types. Therefore, when more than one data type is provided, R will *coerce* the data into a "shared" type. To identify this "shared" type we can use this simple rule:
\begin{equation*}
\text{logical} < \text{integer} < \text{numeric} < \text{character},
\end{equation*}
which simply means that if a vector has more than one data type, the "shared" type will be that of the "largest" type according to the progression shown above. For example:
```{r}
# Logical + numeric
(mix_logic_int <- c(TRUE, 12, 0.5))
typeof(mix_logic_int)
# Numeric + character
(mix_int_char <- c(14.3, "Hi"))
typeof(mix_int_char)
```
### Subsetting
Naturally, it is possible to "subset" the values of in our vector in many ways. Essentially, there are four main ways to subset a vector. Here we'll only discuss the first three:
- **Positive Index**: We can *access* or *subset* the $i$-th element of a vector by simply using `grand_slam_win[i]` where $i$ is a positive number between 1 and the length of the vector.
```{r}
# Accessing the first element
grand_slam_win[1]
# Accessing the third and first value
grand_slam_win[c(3, 1)]
# Duplicated indices yield duplicated values
grand_slam_win[c(1, 1, 2, 2, 3, 4)]
```
- **Negative Index**: We *remove* elements in a vector using negative indices:
```{r}
# Removing the second observation
grand_slam_win[-2]
# Removing the first and fourth observations
grand_slam_win[c(-1, -4)]
```
- **Logical Indices**: Another useful approach is based on *logical* operators:
```{r}
# Access the first and fourth observations
grand_slam_win[c(TRUE, FALSE, FALSE, TRUE, FALSE)]
```
```{block2, type='rmdcaution'}
Note that it is not permitted to "mix" positive and negative indices. For example, `grand_slam_win[c(-1, 2)]` would produce an error message.
```
### Attributes {#vectattr}
Let's suppose that we conduct an experiment under specific conditions, say a date and place that are stored as attributes of the object containing the results of this experiment. Indeed, objects can have arbitrary additional attributes that are used to store metadata on the object of interest. For example:
```{r}
attr(grand_slam_win, "date") <- "09-30-2019"
attr(grand_slam_win, "type") <- "Men, Singles"
```
To display the vector with its attributes
```{r}
grand_slam_win
```
To only display the attributes we can use
```{r}
attributes(grand_slam_win)
```
It is also possible to extract a specific attribute
```{r}
attr(grand_slam_win, "date")
```
### Adding Labels {#addlab}
In some cases, it is useful to characterize vector elements with labels. For example, we could define the vector `grand_slam_win` and associate the name of each corresponding athlete as a label, i.e.
```{r}
(grand_slam_win <- c("Novak Djokovic" = 16, "Rafael Nadal" = 18,
"Roger Federer" = 20, "Daniil Medvedev" = 0,
"Dominic Thiem" = 0))
```
The main advantage of this approach is that the number of grand slams won can now be referred to by the player's name. For example:
```{r}
grand_slam_win["Novak Djokovic"]
grand_slam_win[c("Novak Djokovic","Roger Federer")]
```
All labels (athlete names in this case) can be obtained with the function `names()`, i.e.
```{r}
names(grand_slam_win)
```
### Working with Dates
When working with dates it is useful to treat them as real dates rather than character strings that *look* like dates (to a human) but don't otherwise *behave* like dates. For example, consider a vector of three dates: `c("03-21-2015", "12-13-2011", "06-27-2008")`.
The `sort()` function returns the elements of a vector in ascending order, but since these dates are actually just character strings that *look* like dates (to a human), R sorts them in alphanumeric order (for characters) rather than chronological order (for dates):
```{r}
# The `sort()` function sorts elements in a vector in ascending order
sort(c("03-21-2015", "12-31-2011", "06-27-2008", "01-01-2012"))
```
Converting the character strings to "yyyy-mm-dd" would solve our sorting problem, but perhaps we also want to calculate the number of days between two events that are several months or years apart.
The `as.Date()` function is one straight-forward method for converting character strings into dates that can be used as such. The typical syntax is of the form:
```{r, eval = FALSE}
as.Dates(<vector of dates>, format = <your format>)
```
Considering the dates of birth presented in Table \@ref(tab:tennis) we can save them in an appropriate format using:
```{r}
(players_dob <- as.Date(c("22 May 1987", "3 Jun 1986", "8 Aug 1981",
"11 Feb 1996", "3 Sep 1993"),
format = "%d %b %Y"))
```
Note the syntax of `format = "%d %b %Y"`. The following table shows common format elements for use with the `as.Date()` function:
Table: (\#tab:dateFormats) Common date formatting elements for use with `as.Date()` reproduced from statmethods.net.
Symbol Meaning Example
------- -------------------------- -------------
%d day as a number (0-31) 01-31
%a abbreviated weekday Mon
%A unabbreviated weekday Monday
%m month (00-12) 00-12
%b abbreviated month Jan
%B unabbreviated month January
%y 2-digit year 07
%Y 4-digit year 2007
There are many advantages to using the `as.Date()` format (in addition to proper sorting). For example, the subtraction between two dates becomes more meaningful and yields the difference in days between them. As an example, the number of days between Rafael Nadal's and Andy Murray's dates of birth can be obtained as
```{r}
players_dob[1] - players_dob[2]
```
In addition, subsetting becomes also more intuitive and, for example, to find the players born after 1 January 1986 we can simply run:
```{r}
players[players_dob > "1986-01-01"]
```
There are many other reasons for using this format (or other date formats). A more detailed discussion on this topic can, for example, be found in [Cole Beck's notes](http://biostat.mc.vanderbilt.edu/wiki/pub/Main/ColeBeck/datestimes.pdf).
### Useful Functions with Vectors
The reason for extracting or creating vectors often lies in the need to collect information from them. For this purpose, a series of useful functions are available that allow to extract information or arrange the vector elements in a certain way which can be of interest to the user. Among the most commonly used functions we can find the following ones
`length()` `sum()` `mean()` `order()` and `sort()`
whose name is self-explanatory in most cases. For example we have
```{r}
length(grand_slam_win)
sum(grand_slam_win)
mean(grand_slam_win)
```
To sort the players by number of grand slam wins, we could use the function `order()` which returns the *position* of the elements of a vector sorted in an ascending order,
```{r}
order(grand_slam_win)
```
Therefore, we can sort the players in ascending order of wins as follows
```{r}
players[order(grand_slam_win)]
```
which implies that Roger Federer won most grand slams. Another related function is `sort()` which simply sorts the elements of a vector in an ascending manner. For example,
```{r}
sort(grand_slam_win)
```
which is compact version of
```{r}
grand_slam_win[order(grand_slam_win)]
```
It is also possible to use the functions `sort()` and `order()` with characters and dates. For example, to sort the players' names alphabetically (by first name) we can use:
```{r}
sort(players)
```
Similarly, we can sort players by age (oldest first)
```{r}
players[order(players_dob)]
```
or in an reversed manner (oldest last):
```{r}
players[order(players_dob, decreasing = TRUE)]
```
There are of course many other useful functions that allow to deal with vectors which we will not mention in this section but can be found in a variety of references [see e.g. @wickham2014advanced].
### Creating sequences
When using R for statistical programming and data analysis it is very common to create sequences of numbers. Here are three common ways used to create such sequences:
- `from:to`: This method is quite intuitive and very compact. For example:
```{r}
1:3
(x <- 3:1)
(y <- -1:-4)
(z <- 1.3:3)
```
- `seq_len(n)`: This function provides a simple way to generate a sequence from 1 to an arbitrary number `n`. In general, `1:n` and `seq_len(n)` are equivalent with the notable exceptions where `n = 0` and `n < 0`. The reason for these exceptions will become clear in Section \@ref(forloop). Let's see a few examples:
```{r}
n <- 3
1:n
seq_len(n)
n <- 0
1:n
seq_len(n)
```
- `seq(a, b, by/length.out = d)`: This function can be used to create more "complex" sequences. It can either be used to create a sequence from `a` to `b` by increments of `d` (using the option `by`) or of a total length of `d` (using the option `length.out`). A few examples:
```{r}
(x <- seq(1, 2.8, by = 0.4))
(y <- seq(1, 2.8, length.out = 6))
```
This can be combined with the `rep()` function to create vectors with repeated values or sequences, for example:
```{r}
rep(c(1,2), times = 3, each = 1)
rep(c(1,2), times = 1, each = 3)
rep(c(1,2), times = 2, each = 2)
```
where the option `times` allows to specify how many times the object needs to be repeated and `each` regulates how many times each element in the object is repeated.
It is also possible to generate sequences of dates using the function `seq()`. For example, to generate a sequence of 10 dates between the dates of birth of Andy Murray and Rafael Nadal we can use
```{r}
seq(from = players_dob[1], to = players_dob[2], length.out = 10)
```
Similarly, we can create a sequence between the two dates by increments of one week (backwards)
```{r}
seq(players_dob[1], players_dob[2], by = "-1 week")
```
or by increments of one month (forwards)
```{r}
seq(players_dob[2], players_dob[1], by = "1 month")
```
### Example: Apple Stock Price
Suppose that someone is interested in analyzing the behavior of Apple's stock price over the last three months. The first thing needed to perform such analysis is to recover (automatically) today's date. In R, this can be obtained as follows
```{r, cache = TRUE}
(today <- Sys.Date())
```
Once this is done, we can obtain the date which is exactly three months ago by using
```{r, cache = TRUE}
(three_months_ago <- seq(today, length = 2, by = "-3 months")[2])
```
With this information, we can now download Apple's stock price and represent these stocks through a candlestick chart which summarizes information on daily opening and closing prices as well as minimum and maximum prices. These charts are often used with the hope of detecting trading patterns over a given period of time.
```{r candleAAPL, message = FALSE, fig.height = 5, fig.width = 6, fig.align = "center", warning = FALSE, results = 'hide', fig.cap = "Candlestick chart for Apple's stock price for the last three months.", cache=TRUE}
library(quantmod)
getSymbols("AAPL", from = three_months_ago, to = today)
candleChart(AAPL, theme = 'white', type = 'candles')
```
Using the price contained in the object we downloaded (i.e. `AAPL`), we can compute Apple's arithmetic returns which are defined as follows
\begin{equation}
R_t = \frac{S_t - S_{t-1}}{S_{t-1}},
(\#eq:returns)
\end{equation}
where $R_t$ are the returns at time *t* and $S_t$ is the stock price. This is implemented in the function `ClCl()` within the `quantmod` package. For example, we can create a vector of returns as follows
```{r, cache=TRUE}
AAPL_returns <- na.omit(ClCl(AAPL))
```
where `na.omit()` is used to remove missing values in the stock prices vector since, if we have $n+1$ stock prices, we will only have $n$ returns and therefore the first return cannot be computed. We can now compute the mean and median of the returns over the considered period.
```{r, cache=TRUE}
mean(AAPL_returns)
median(AAPL_returns)
```
However, a statistic that is of particular interest to financial operators is the Excess Kurtosis which, for a random variable that we denote as $X$, can be defined as
\begin{equation}
\text{Kurt} = \frac{{\mathbb{E}}\left[\left(X - \mathbb{E}[X]\right)^4\right]}{\left({\mathbb{E}}\left[\left(X - \mathbb{E}[X]\right)^2\right]\right)^2} - 3.
\end{equation}
The reason for defining this statistic as *Excess* Kurtosis lies in the fact that the standardized kurtosis is compared to that of a Gaussian distribution (whose kurtosis is equal to 3) which has exponentially decaying tails. Consequently, if the Excess Kurtosis is positive, this implies that the distribution has heavier tails than a Gaussian and therefore has higher probabilities of extreme events occurring. To understand why the Excess Kurtosis is equal to 0 for a Gaussian distribution click on the button below:
<button id="hidebutton1">Excess Kurtosis Derivation</button>
<div id="hideclass1">
```{block2, type='rmdnote'}
Assuming $X$ to be Gaussian, we have $X \sim \mathcal{N}(\mu, \sigma^2)$. Then, we define $Z$ as $Z \sim \mathcal{N}(0,1)$ and $Y$ as $Y = Z^2 \sim \chi^2_1$. Remember that a random variable following a $\chi^2_1$ distribution has the following properties: $\text{Var}[Y] = 2$ and $\mathbb{E}[Y] = 1$. Using these definitions and properties, we obtain:
\begin{equation}
\begin{aligned}
\text{Kurt} &= \frac{{\mathbb{E}}\left[\left(X - \mathbb{E}[X]\right)^4\right]}{\left({\mathbb{E}}\left[\left(X - \mathbb{E}[X]\right)^2\right]\right)^2} - 3 = \frac{{\mathbb{E}}\left[\left(X - \mathbb{E}[X]\right)^4\right]}{\left[\text{Var} \left(X\right)\right]^2} - 3 = \frac{{\mathbb{E}}\left[\left(X - \mathbb{E}[X]\right)^4\right]}{\sigma^4} - 3\\
&= {\mathbb{E}}\left[\left(\frac{X - \mathbb{E}[X]}{\sigma}\right)^4\right] - 3 = \mathbb{E}\left[Z^4\right] - 3 = \text{Var}[Z^2] + \mathbb{E}^2\left[Z^2\right] - 3\\
&= \text{Var}[Y] + \mathbb{E}^2\left[Y\right] - 3 = 0.
\end{aligned}
\end{equation}
This implies that the theoretical value of the excess Kurtosis for a normally distributed random variable is $0$.
```
</div>
\newline
Given this statistic, it is useful to compute this on the observed data and for this purpose a common estimator of the excess Kurtosis is
\begin{equation}
k = \frac{\frac{1}{n} \sum_{t = 1}^{n} \left(R_t -\bar{R}\right)^4}{\left(\frac{1}{n} \sum_{t = 1}^{n} \left(R_t -\bar{R}\right)^2 \right)^2} - 3,
\end{equation}
where $\bar{R}$ denotes the sample average of the returns, i.e.
\begin{equation*}
\bar{R} = \frac{1}{n} \sum_{i = 1}^n R_i.
\end{equation*}
In R, this can simply be done as follows:
```{r, cache=TRUE}
mu <- mean(AAPL_returns)
(k <- mean((AAPL_returns - mu)^4)/(mean((AAPL_returns - mu)^2))^2 - 3)
```
Therefore, we observe an estimated Excess Kurtosis of `r round(k,2)` which is quite high and tends to indicate that the returns have heavier tails than the normal distribution. In Chapter \@ref(control), we will revisit this example and investigate if there is **enough evidence** to conclude that Apple's stock has Excess Kurtosis larger than zero.
## Matrices
Matrices are a common data structure in R which have two dimensions to store multiple vectors of the same length combined as a unified object. The `matrix()` function can be used to create a matrix from a vector:
```{r}
(mat <- matrix(1:12, ncol = 4, nrow = 3))
```
Notice that the first argument to the function is a vector (in this case the sequence 1 to 12) which is then transformed into a matrix with four columns (`ncol = 4`) and three rows (`nrow = 3`).
```{block2, type='rmdtip'}
By default, the vectors are transformed into matrices by placing the elements by column (i.e. top to the bottom of ech column in sequence until all columns are full). If you wish to fill the matrix by row, all you need to do is specify the argument `byrow = T`.
```
```{r}
# Compare with the matrix above
matrix(1:12, ncol = 4, nrow = 3, byrow = T)
```
```{block2, type='rmdwarning'}
Usually the length of the vector (i.e. number of elements in the vector) is the result of the multiplication between the number of columns and number of rows. What happens if the vector has fewer elements for the same matrix dimension? What happens if the vector has more elements?
```
It is often the case that we already have equi-dimensional vectors available and we wish to bundle them together as matrix. In these cases, two useful functions are `cbind()` to combine vectors as vertical **c**olumns side-by-side, and `rbind()` to combine vectors as horizontal **r**ows. An example of `cbind()` is shown here:
```{r}
players <- c("Novak Djokovic", "Rafael Nadal", "Roger Federer",
"Daniil Medvedev", "Dominic Thiem")
grand_slam_win <- c(16, 19, 20, 0, 0)
win_percentage <- c(82.7, 83.1, 82.2, 63.6, 64.3)
(mat <- cbind(grand_slam_win, win_percentage))
```
The result in this case is a $5 \times 2$ matrix (with `rbind()` it would have been a $2 \times 5$ matrix). Once the matrix is defined, we can assign names to its rows and columns by using `rownames()` and `colnames()`, respectively. Of course, the number of names must match the corresponding matrix dimension. In the following example, each row corresponds to a specific player (thereby using the `players` vector) and each column corresponds to a specific statistic of the players.
```{r}
rownames(mat) <- players
colnames(mat) <- c("GS win", "Win rate")
mat
```
### Subsetting
As with vectors, it is possible to subset the elements of a matrix. Since matrices are two-dimensional data structures, it is necessary to specify the position of the elements of interest in both dimensions. For this purpose we can use `[ , ]` immediately following the named matrix. Note the use of `,` within the square brackets in order to specify both row and column position of desired elements within the matrix (e.g. `matrixName[row, column]`). Consider the following examples:
```{r}
# Subset players "Dominic Thiem" and "Roger Federer" in matrix named "mat"
mat[c("Dominic Thiem", "Roger Federer"), ]
# Subset includes row 1 and 3 for all columns
mat[c(1, 3), ]
# Subset includes all rows for column 2
mat[ , 2]
# Subset includes rows 1, 2, & 3 for column 1
mat[1:3, 1]
```
It can be noticed that, when a space is left blank before or after the comma, this means that respectively all the rows or all the columns are considered.
### Matrix Operators in R
As with vectors, there are some useful functions that can be used with matrices. A first example is the function `dim()` that allows to determine the dimension of a matrix. For example, consider the following $4 \times 2$ matrix
\begin{equation*}
\mathbf{A} = \left[
\begin{matrix}
1 & 5\\
2 & 6\\
3 & 7\\
4 & 8
\end{matrix}
\right]
\end{equation*}
which can be created in R as follows:
```{r}
(A <- matrix(1:8, 4, 2))
```
Therefore, we expect `dim(A)` to return the vector `c(4, 2)`. Indeed, we have
```{r}
dim(A)
```
Next, we consider the function `t()` that allows transpose a matrix. For example, $\mathbf{A}^T$ is equal to:
\begin{equation*}
\mathbf{A}^T = \left[
\begin{matrix}
1 & 2 & 3 & 4\\
5 & 6 & 7 & 8
\end{matrix}
\right],
\end{equation*}
which is a $2 \times 4$ matrix. In R, we achieve this as follows
```{r}
(At <- t(A))
dim(At)
```
Aside from playing with matrix dimensions, matrix algebraic operations have specific commands. For example, the operator `%*%` is used in R to denote matrix multiplication while, as opposed to scalar objects, the regular product operator `*` performs the element by element product (or Hadamard product) when applied to matrices. For example, consider the following matrix product:
\begin{equation*}
\mathbf{B} = \mathbf{A}^T \mathbf{A} = \left[
\begin{matrix}
30 & 70\\
70 & 174
\end{matrix}
\right],
\end{equation*}
which can be done in R as follows:
```{r}
(B <- At %*% A)
```
Other common matrix operations include finding the determinant of a matrix and finding its inverse. These are often used, for example, when computing the likelihood function for a variable following a Gaussian distribution or when simulating time series or spatial data. The functions that perform these operations are `det()` and `solve()` that respectively find the determinant and the inverse of a matrix (which necessarily has to be square). The function `det()` can be used to compute the determinant of a square matrix. In the case of a $2 \times 2$ matrix, there exists a simple solution for the determinant which is
\begin{equation*}
\text{det} \left( \mathbf{D} \right) = \text{det} \left( \left[
\begin{matrix}
d_1 & d_2\\
d_3 & d_4
\end{matrix}
\right] \right) = d_1 d_4 - d_2 d_3.
\end{equation*}
Consider the matrix $\mathbf{B}$, we have
\begin{equation*}
\text{det} \left( \mathbf{B}\right) = 30 \cdot 174 - 70^2 = 320.
\end{equation*}
In R, we can simply do
```{r}
det(B)
```
The function `solve()` is also an important function when working with matrices as it allows to inverse a matrix. It is worth remembering that a square matrix that is not invertible (i.e. $\mathbf{A}^{-1}$ doesn't exist) is called *singular* and the determinant offers a way to "check" if this is the case for a given matrix. Indeed, a square matrix is singular if and only if its determinant is 0. Therefore, in the case of $\mathbf{B}$, we should be able to compute its inverse. As for the determinant, there exists a formula to compute the inverse of $2 \times 2$ matrices, i.e.
\begin{equation*}
\mathbf{D}^{-1} = \left[
\begin{matrix}
d_1 & d_2\\
d_3 & d_4
\end{matrix}
\right]^{-1} = \frac{1}{\text{det}\left( \mathbf{D} \right)} \left[
\begin{matrix}
\phantom{-}d_4 & -d_2\\
-d_3 & \phantom{-}d_1
\end{matrix}
\right].
\end{equation*}
Considering the matrix $\mathbf{B}$, we obtain
\begin{equation*}
\mathbf{B}^{-1} = \left[
\begin{matrix}
30 & 70\\
70 & 174
\end{matrix}
\right]^{-1} = \frac{1}{320}\left[
\begin{matrix}
\phantom{-}174 & -70\\
-70 & \phantom{-}30
\end{matrix}
\right] = \left[
\begin{matrix}
\phantom{-}0.54375 & -0.21875\\
-0.21875 & \phantom{-}0.09375
\end{matrix}
\right]
\end{equation*}
```{r}
(B_inv <- solve(B))
```
Finally, we can verify that
\begin{equation*}
\mathbf{G} = \mathbf{B} \mathbf{B}^{-1},
\end{equation*}
should be equal to the identity matrix,
```{r}
(G <- B %*% B_inv)
```
The result is of course extremely close but $\mathbf{G}$ is not exactly equal to the identity matrix due to rounding and other numerical errors.
Another function of interest is the function `diag()` that can be used to extract the diagonal of a matrix. For example, we have
\begin{equation*}
\text{diag} \left( \mathbf{B} \right) = \left[30 \;\; 174\right],
\end{equation*}
which can be done in R as follows:
```{r}
diag(B)
```
Therefore, the function `diag()` computes the trace of matrix (i.e. the sum of the diagonal elements). For example,
\begin{equation*}
\text{tr} \left( \mathbf{B} \right) = 204,
\end{equation*}
or in R:
```{r}
sum(diag(B))
```
Another use of the function `diag()` is to create diagonal matrices. Indeed, if the argument of this function is a vector, its behavior is the following:
\begin{equation*}
\text{diag} \left(\left[a_1 \;\; a_2 \;\; \cdots \;\; a_n\right]\right) = \left[
\begin{matrix}
a_1 & 0 & \cdots & 0 \\
0 & a_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & a_n
\end{matrix}
\right].
\end{equation*}
Therefore, this provides a simple way of creating an identity matrix by combining the functions `diag()` and `rep()` (discussed in the previous section) as follows:
```{r}
n <- 4
(ident <- diag(rep(1, n)))
```
### Example: Summary Statistics with Matrix Notation
A simple example of the operations we discussed in the previous section is given by many common statistics that can be re-expressed using matrix notation. As an example, we will consider three common statistics that are the sample mean, variance and covariance. Let us consider the following two samples of size $n$
\begin{equation*}
\begin{aligned}
\mathbf{x} &= \left[x_1 \;\; x_2 \; \;\cdots \;\; x_n\right]^T\\
\mathbf{y} &= \left[y_1 \;\;\; y_2 \; \;\;\cdots \;\;\; y_n\right]^T.
\end{aligned}
\end{equation*}
The sample mean of $\mathbf{x}$ is
\begin{equation*}
\bar{x} = \frac{1}{n} \sum_{i = 1}^{n} x_i,
\end{equation*}
and its sample variance is
\begin{equation*}
s_x^2 = \frac{1}{n} \sum_{i = 1}^n \left(x_i - \bar{x}\right)^2.
\end{equation*}
The sample covariance between $\mathbf{x}$ and $\mathbf{y}$ is
\begin{equation*}
s_{x,y} = \frac{1}{n} \sum_{i = 1}^n \left(X_i - \bar{x}\right) \left(Y_i - \bar{y}\right),
\end{equation*}
where $\bar{y}$ denotes the sample mean of $\mathbf{y}$.
Consider the sample mean, this statistic can be expressed in matrix notation as follows
\begin{equation*}
\bar{x} = \frac{1}{n} \sum_{i = 1}^{n} x_i = \frac{1}{n} \mathbf{x}^T \mathbf{1},
\end{equation*}
where $\mathbf{1}$ is a column vector of $n$ ones.
\begin{equation*}
\begin{aligned}
s_x^2 &= \frac{1}{n} \sum_{i = 1}^n \left(x_i - \bar{x}\right)^2 = \frac{1}{n} \sum_{i = 1}^n x_i^2 - \bar{x}^2 = \frac{1}{n} \mathbf{x}^T \mathbf{x} - \bar{x}^2\\
&= \frac{1}{n} \mathbf{x}^T \mathbf{x} - \left(\frac{1}{n} \mathbf{x}^T \mathbf{1}\right)^2 = \frac{1}{n} \left(\mathbf{x}^T \mathbf{x} - \frac{1}{n} \mathbf{x}^T \mathbf{1} \mathbf{1}^T \mathbf{x}\right)\\
&= \frac{1}{n}\mathbf{x}^T \left( \mathbf{I} - \frac{1}{n} \mathbf{1} \mathbf{1}^T \right) \mathbf{x} = \frac{1}{n}\mathbf{x}^T \mathbf{H} \mathbf{x},
\end{aligned}
\end{equation*}
where $\mathbf{H} = \mathbf{I} - \frac{1}{n} \mathbf{1} \mathbf{1}^T$. This matrix is often called the *centering* matrix. Similarly, for the sample covariance we obtain
\begin{equation*}
\begin{aligned}
s_{x,y} &= \frac{1}{n} \sum_{i = 1}^n \left(x_i - \bar{x}\right) \left(y_i - \bar{y}\right) = \frac{1}{n}\mathbf{x}^T \mathbf{H} \mathbf{y}.
\end{aligned}
\end{equation*}
In the code below we verify the validity of these results through a simple simulated example where we compare the values of the three statistics based on the different formulas discussed above.
```{r demo_cov_summary}
# Sample size
n <- 100
# Simulate random numbers from a zero mean normal distribution with
# variance equal to 4.
x <- rnorm(n, 0, sqrt(4))
# Simulate random numbers from normal distribution with mean 3 and
# variance equal to 1.
y <- rnorm(n, 3, 1)
# Note that x and y are independent.
# Sample mean
one <- rep(1, n)
x_bar <- 1/n*sum(x)
x_bar_mat <- 1/n*t(x)%*%one
# Sample variance of x
H <- diag(rep(1, n)) - 1/n * one %*% t(one)
s_x <- 1/n * sum((x - x_bar)^2)
s_x_mat <- 1/n*t(x) %*% H %*% x
# Sample covariance
y_bar <- 1/n*sum(y)
s_xy <- 1/n*sum((x - x_bar)*(y - y_bar))
s_xy_mat <- 1/n*t(x) %*% H %*% y
```
To compare, let's construct a matrix of all the results that we calculated.
```{r}
cp_matrix <- matrix(c(x_bar, x_bar_mat, s_x, s_x_mat, s_xy, s_xy_mat), ncol = 2, byrow = T)
row.names(cp_matrix) <- c("Sample Mean", "Sample Variance", "Sample Covariance")
colnames(cp_matrix) <- c("Scalar", "Matrix")
cp_matrix
```
Therefore, using the previously obtained results we can construct the following *empirical* covariance matrix
\begin{equation}
\widehat{\text{Cov}}(X, Y) = \left[
\begin{matrix}
s_x^2 & s_{x,y} \\
s_{x,y} & s_y^2
\end{matrix}
\right].
\end{equation}
In R, this can be done as
```{r demo_cov_summary2}
# Sample variance of y
s_y_mat <- 1/n*t(y) %*% H %*% y
# Covariance matrix
(V <- matrix(c(s_x_mat, s_xy_mat, s_xy_mat, s_y_mat), 2, 2))
```
This result can now be compared to
```{r}
cov(cbind(x, y))
```
We can see that the results are slightly different from what we expected. This is because the calculation of `cov()` within the default R `stats` package is based on an unbiased estimator which is not the one we used. To obtain the same result, we can go back to our estimation by calculating via the below method
```{r}
(n-1)/n*cov(cbind(x, y))
```
### Example: Portfolio Optimization
Suppose that you are interested in investing your money in two stocks, say Apple and Netflix. However, you are wondering how much of each stock you should buy. To make it simple let us assume that you will invest $\omega W$ in Apple (AAPL) and $(1-\omega) W$ in Netflix (NFLX), where $W$ denotes the amount of money you would like to invest, and $\omega \in [0, \, 1]$ dictates the proportion allocated to each investment. Let $R_A$ and $R_N$ denote, respectively, the daily return (see Equation \@ref(eq:returns) if you don't remember what returns are) of Apple and Netflix. To make things simple we assume that the returns $R_A$ and $R_N$ are jointly normally distributed so we can write
\[
\mathbf{R} \stackrel{iid}{\sim} \mathcal{N} \left(\boldsymbol{\mu}, \boldsymbol{\Sigma}\right),
\]
where
\[
\mathbf{R} = \left[
\begin{matrix}
R_A\\
R_N
\end{matrix}
\right], \;\;\;\; \boldsymbol{\mu} = \left[
\begin{matrix}
\mu_A\\
\mu_N
\end{matrix}
\right], \;\; \text{and}\;\; \boldsymbol{\Sigma} = \left[
\begin{matrix}
\sigma_A^2 & \sigma_{AN}\\
\sigma_{AN} & \sigma_N^2
\end{matrix}
\right].
\]
Using these assumptions, the classical portfolio optimization problem, which would allow you to determine a "good" value of $\omega$, can be stated as follows: Given our initial wealth $W$, the investment problem is to decide how much should be invested in our first asset (Apple) and in our second asset (Netflix). As mentioned earlier, we assume that $\omega \in [0,\,1]$, implying that it is only possible to buy the two assets (i.e. *short positions* are not allowed). Therefore, to solve our investment problem we must choose the value of $\omega$ which would maximize your personal *utility*. Informally speaking, the utility represents a measurement of the "usefulness" that is obtained from your investment. To consider a simple case, we will assume that you are a particularly *risk averse* individual (i.e. you are looking for an investment that is as sure and safe as possible) and as such you want to pick the value of $\omega$ providing you with the smallest investment risk. We can then determine the optimal value of $\omega$ as follows. First, we construct a new random variable, say $Y$, which denotes the return of your portfolio. Since you invest $\omega W$ in Apple and $(1 - \omega) W$ in Netflix, it is easy to see that
\[Y = \left[\omega R_A + (1 - \omega) R_N \right]W.\]
In general the risk and variance of an investment are two different things but in our case we assume that the return $\mathbf{R}$ is normally distributed and in this case the variance can perfectly characterize the risk of our investment. Therefore, we can define the risk of our investment as the following function of $\omega$
\[f(\omega) = \text{Risk}(Y) = \text{Var} (Y) = \left[\omega^2 \sigma_A^2 + (1 - \omega)^2 \sigma_N^2 + 2 \omega (1 - \omega) \sigma_{AN}\right]W^2.\]
The function $f(\omega)$ is minimized for the value $\omega^*$ which is given by
\begin{equation}
\omega^* = \frac{\sigma^2_N - \sigma_{AN}}{\sigma^2_A + \sigma^2_N - 2\sigma_{AN}}.
(\#eq:omega)
\end{equation}
If you are interested in understanding how Equation \@ref(eq:omega) was obtained, click on the button below:
<button id="hidebutton2">Minimum Variance Portfolio Derivation</button>
<div id="hideclass2">
```{block2, type='rmdnote'}
To obtain $\omega^*$ we first differentiate the function $f(\omega)$ with respect to $\omega$, which gives
\begin{equation}
\frac{\partial}{\partial \omega} \, f(\omega) = \left[2 \omega \sigma_A^2 - 2 (1 - \omega) \sigma_N^2 + 2 (1 - 2\omega) \sigma_{AN}\right]W^2
\end{equation}
Then, we define $\omega^*$ as
\begin{equation}
\omega^* \; : \frac{\partial}{\partial \omega} \, f(\omega) = 0.
\end{equation}
Therefore, we obtain
\begin{equation}
\omega^* \left(\sigma_A^2 + \sigma_N^2 - 2 \sigma_{AN}\right) = \sigma_N^2 - \sigma_{AN},
\end{equation}
which simplifies to
\begin{equation}
\omega^* = \frac{\sigma_N^2 - \sigma_{AN}}{\sigma_A^2 + \sigma_N^2 - 2 \sigma_{AN}}.
\end{equation}
Finally, we verify that our result is a minimum by considering the second derivative, i.e.
\begin{equation}
\frac{\partial^2}{\partial \omega^2} \, f(\omega) = 2W\left[\sigma_A^2 + \sigma_N^2 - 2\sigma_{AN} \sigma_{AN}\right].
\end{equation}
Since $W$ is strictly positive, all that remains to conclude our derivation is to verify that $\sigma_A^2 + \sigma_N^2 - 2\sigma_{AN} \sigma_{AN} \leq 0$. In fact, this is a well known inequality which is a direct consequence of the Tchebychev inequality. However, here is a simpler argument to understand why this is the case. Indeed, it is obvious that $\text{Var}(R_A - R_N) \leq 0$ and we also have $\text{Var}(R_A - R_N) = \sigma_A^2 + \sigma_N^2 - 2\sigma_{AN}$. Thus, we obtain
\[\text{Var}(R_A - R_N) = \sigma_A^2 + \sigma_N^2 - 2\sigma_{AN} \leq 0,\]
which verifies that our result is a minimum.
```
</div>
\newline
Using \@ref(eq:omega), we obtain that the expected value and variance of our investment are given by
\[
\begin{aligned}
\text{Expected Value Investment} &=\left[\omega^* \mu_A + (1 - \omega^*) \mu_N\right] W,\\
\text{Variance Investment} &= \left[(\omega^*)^2 \sigma_A^2 + (1 - \omega^*)^2 \sigma_N^2 + 2 \omega^* (1 - \omega^*) \sigma_{AN}\right]W^2.
\end{aligned}
\]
In practice, the vector $\boldsymbol{\mu}$ and the matrix $\boldsymbol{\Sigma}$ are unknown and must be estimated using historical data. In the code below, we compute the optimal value of $\omega$ based on \@ref(eq:omega) by using the last five years of historical data for the two stocks considered. For simplicity we set $W$ to one but note that $W$ plays no role in determining $\omega^*$.
```{r downloadaaplandamaz, message = FALSE, warning = FALSE, results = 'hide', cache = TRUE}
# Load quantmod
library(quantmod)
# Download data
today <- Sys.Date()
five_years_ago <- seq(today, length = 2, by = "-5 year")[2]
getSymbols("AAPL", from = five_years_ago, to = today)
getSymbols("NFLX", from = five_years_ago, to = today)
# Compute returns
Ra <- na.omit(ClCl(AAPL))
Rn <- na.omit(ClCl(NFLX))
# Estimation of mu and Sigma
Sigma <- cov(cbind(Ra, Rn))
mu <- c(mean(Ra), mean(Rn))
# Compute omega^*
omega_star <- (Sigma[2, 2] - Sigma[1, 2])/(Sigma[1, 1] + Sigma[2, 2] - 2*Sigma[1, 2])
# Compute investment expected value and variance
mu_investment <- omega_star*mu[1] + (1 - omega_star)*mu[2]
var_investment <- omega_star^2*Sigma[1,1] + (1 - omega_star)^2*Sigma[2,2] +
2*omega_star*(1 - omega_star)*Sigma[1,2]
```
From this code, we obtain $\omega^* \approx$ `r round(100*omega_star, 2)`%. In the table below we compare the empirical expected values and variances of the stocks as well as those of our investment:
```{r tableprtexample2, eval = FALSE}
investment_summary <- matrix(NA, 2, 3)
dimnames(investment_summary)[[1]] <- c("Expected value", "Variance")
dimnames(investment_summary)[[2]] <- c("Apple", "Netflix", "Investment")
investment_summary[1, ] <- c(mu, mu_investment)
investment_summary[2, ] <- c(diag(Sigma), var_investment)
knitr::kable(investment_summary)
```
```{r tableprtexample, echo = FALSE}