-
Notifications
You must be signed in to change notification settings - Fork 39
/
Text-Distance-Algo.Rmd
205 lines (154 loc) · 11 KB
/
Text-Distance-Algo.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
---
title: "lalala"
author: "Eric He"
date: "August 30, 2017"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r}
library("quanteda")
library("dplyr")
library("purrr")
library("stringr")
library("readtext")
library("reshape2")
library("magrittr")
library("ggplot2")
library("gridExtra")
masterIndex <- read.csv("masterIndex.csv")
masterIndex$filing %<>% as.character
tickers <- readLines("tickers.txt") # use unique(masterIndex) if we wish to scale this across multiple years, or keep tickers.txt updated
StopWordsList <- readLines("StopWordsList.txt")
sections <- c("1", "1A", "3", "4", "7", "8", "9", "9A")
section_extractor <- function(statement, section){
name <- statement$doc_id
pattern <- paste0("(?i)°Item ", section, "[^\\w|\\d]", ".*°")
section_hits <- str_extract_all(statement, pattern, simplify=TRUE)
if (is_empty(section_hits) == TRUE){
empty_vec <- "empty"
names(empty_vec) <- paste(name, section, sep = "_")
print(paste("No hits for section", section, "of filing", name))
return(empty_vec)
}
word_counts <- map_int(section_hits, ntoken)
max_hit <- which(word_counts == max(word_counts))
max_filing <- section_hits[[max_hit[length(max_hit)]]]
if (max(word_counts) < 250 & str_detect(max_filing, pattern = "(?i)(incorporated by reference)|(incorporated herein by reference)") == TRUE){
empty_vec <- "empty"
names(empty_vec) <- paste(name, section, sep = "_")
print(paste("Section", section, "of filing", name, "incorporates by reference its information"))
return(empty_vec)
}
names(max_filing) <- paste(name, section, sep = "_")
return(max_filing)
}
section_dfm <- function(statements_list, section, min_words, tf){
map(statements_list, section_extractor, section=section) %>%
map(corpus) %>%
reduce(`+`) %>%
dfm(tolower=TRUE, remove=StopWordsList, remove_punct=TRUE) %>%
dfm_subset(., rowSums(.) > min_words) %>%
when(tf==TRUE ~ tf(., scheme="log"),
~ .)
}
filing_dfm <- function(sections, filings_list, min_words, tf){
map(sections, section_dfm, statements_list=filings_list, min_words=min_words, tf=tf)
}
dist_parser <- function(distObj, section){
melted_frame <- as.matrix(distObj) %>%
{. * upper.tri(.)} %>%
melt(varnames = c("previous_filing", "filing"), value.name = paste0("sec", section, "dist"))
melted_frame$previous_filing %<>% str_extract(pattern = ".*?(?=\\.)")
melted_frame$filing %<>% str_extract(pattern = ".*?(?=\\.)")
return(melted_frame)
}
filing_similarity <- function(dfm_list, method){
map(dfm_list, textstat_simil, method=method) %>%
map(dist_parser)}
index_filing_filterer <- function(ticker, index){
filter(index, TICKER == ticker) %>%
arrange(DATE_FILED) %>%
pull(filing)
}
file_path <- "parsed/"
file_type <- ".txt"
distance_returns_calculator <- function(the_ticker){
file_names <- index_filing_filterer(the_ticker, masterIndex)
if (length(file_names) <= 1){
empty_list <- data_frame()
print(paste("Only one filing available for ticker", the_ticker))
return(empty_list)
}
file_locations <- paste0(file_path, file_names, file_type)
filings_list <- map(file_locations, readtext)
similarity_list <- map(sections, section_dfm, statements_list=filings_list, min_words=10, tf=TRUE) %>%
map(textstat_simil, method="cosine") %>%
map2(sections, dist_parser) %>%
reduce(left_join, by = c("previous_filing", "filing"))
prev_current_mapping <- data_frame(previous_filing = file_names[-length(file_names)], filing = file_names[-1])
distance_returns_df <- left_join(prev_current_mapping, similarity_list, by = c("previous_filing", "filing"))
print(paste("Successfully mapped distance scores to financial returns for ticker", the_ticker))
return(distance_returns_df)}
distance_df <- map(tickers, distance_returns_calculator) %>%
reduce(rbind)
masterIndex %<>% left_join(distance_df, by = "filing")
write.csv(masterIndex, file = "index_distance.csv", row.names = FALSE)
```
Summary statistics
```{r}
map(distance_df, summary)
```
```{r eval = FALSE}
$previous_filing
Length Class Mode
16001 character character
$filing
Length Class Mode
16001 character character
$sec1dist
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.0368 0.8723 0.9239 0.8794 0.9555 1.0000 874
$sec1Adist
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.0593 0.9244 0.9568 0.9205 0.9768 1.0000 2224
$sec3dist
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.0281 0.7404 0.9529 0.8323 1.0000 1.0000 2675
$sec4dist
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.074 0.909 0.917 0.905 0.954 1.000 10332
$sec7dist
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.0581 0.8157 0.8637 0.8229 0.8961 1.0000 927
$sec8dist
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.0535 0.8464 0.8846 0.8697 0.9568 1.0000 1286
$sec9dist
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.077 0.909 0.973 0.892 1.000 1.000 14197
$sec9Adist
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.1617 0.9086 0.9711 0.9217 0.9834 1.0000 1338
```
As expected, the majority of the master index's first year does not have any distance information.
Many of the extremely distant filings (with distance scores less than 0.2) tend to be due to parsing errors. However, it is difficult to distinguish these from the relatively few, genuinely drastic changes which constitute highly relevant data.
Filing 17468, filed by Sandisk Corp (SNDK) on 2016-02-12, had a parsing error in which several mentions of Item 3 in Item 1A were tagged as sections. The actual Item 3 referred to legal proceedings in its note, and had a low word count, allowing the longer, mistagged mentions of Item 3 to take precedence. Meanwhile, its previous filing, filing 11883 filed on 2015-02-10, was parsed correctly.
The same is true of Items 1 and 7 in filings 11 and 6045, DISH Network Corp's (DISH) 2013-02-20 and 2014-02-21 filings, respectively. Item 1 is tagged 5 times each in both filings, and Item 7 is tagged a total of 33 times in the 2014 filing and 31 times in 2013!
Another cause of low similarity scores is when the item is either not present, or the item is not tagged at all. Filing 6050, by Northwest Pipe Co (NWPX) in 2014-03-17, failed to tag Item 7 due to an HTML number/name wedged in between the 'I' and 'tem' in "Item 7" , while Northwest's other filings (filing 16, 2013-03-18, and filing 11899, 2015-03-16) contain it. Thus, the similarity scores for both filing 6050 and 11899 are extremely (and wrongly) very low.
Still another cause of low similarity scores occurs when a company which has previously "incorporated by reference" the relevant text of the filing item by referring the reader to another document, decides to put the text directly into the filing the next year. Filing 11915, filed by PG&E (PCG) in 2015-02-10, has a similarity score of 0.1291636 with its predecessor, filing 9160 in 2014-02-11 for section 7. Here is the text of Item 7, filing 9160:
"Management's Discussion and Analysis of Financial Condition and Results of Operations A discussion of PG E Corporation's and the Utility s consolidated financial condition and results of operations is set forth under the heading Management's Discussion and Analysis of Financial Condition and Results of Operations as well as the Glossary in the 2013 Annual Report, which **discussion is incorporated herein by reference**."
However, the actual Item 7 text is present in filing 11915, with a total of 17103 words. Sections 1A and 8 of the filing also experience this change in reporting practice, leading to very low similarity scores for all 3 sections, even though the scores would probably be very high if we were able to look at the reference text.
Extremely high similarity scores also raise some problems, although in general they are technically correct.
Very high similarity scores on or about 1 represent no changes in the text between years. They tend to be boiler plate language. A common similarity score seen in the dataset is 0.9166667, which corresponds to having one change in word out of 12 words. For example, filings 42 and 6072 of United Continental Holdings (UAL) in Items 4:
"°Item 4. MINE SAFETY DISCLOSURES. Not applicable. 28 Table of Contents PART II"
"°Item 4. MINE SAFETY DISCLOSURES. Not applicable. 24 Table of Contents PART II"
"Of", which is a stop word, is removed during the processing for a total of 12 words. The only difference is the the page numbers, 28 and 24. A solution would be to remove numbers during the distance analysis.
This situation is very common for Items 3 and 4.
Item 3 discusses Legal Proceedings, and small companies with no legal proceedings generally have nothing to report here. For example, Oceanfirst Financial Corp (OCFC) had no legal proceedings to report in its 2015-03-13 and 2016-03-15 filings, filings 11913 and 17497, respectively. Here is the boiler plate:
"°Item 3. Legal Proceedings The Company and the Bank are not involved in any pending legal proceedings other than routine legal proceedings occurring in the ordinary course of business. Such other routine legal proceedings in the aggregate are believed by management to be immaterial to the Company s financial condition or results of operations."
Item 4 is a specific section about mine safety disclosures, which is not relevant for the majority of companies.
We cannot eliminate boiler plate language, especially not for legal proceedings, because changes from the boiler language is very important. For example, filing 67 by Jakks Pacific Inc. (JAKK) reports no relevant legal proceedings in its 2013-03-15 filing, but discusses a new lawsuit in its 2014-03-17 filing, leading to a relatively low similarity score of 0.3348315 between the two filings. A problem creating lots of false positives is the incorporation by reference issue discussed previously. A possible solution for incorporation by reference would be to remove hits with less than x (250 according to McDonald) words and which contain variations of the phrase "incorporate by reference".
High similarity scores are also seen in filings which use incorporation by reference, as discussed earlier.
Another niche situation occurs when the company has filed more than one of the same report at the same time. This occurs when a company is a holding company, and its subsidiaries are required to file reports with the SEC as well. The oil and gas company, Pacific Gas and Electric, has a parent company called PG&E Corporation. They are both traded under the same ticker (PCG), but are identified by the SEC as two different entities and thus are identified twice. Thus, half the comparisons of the PG&E filings return the maximum similarity score of 1 for each section of the filing. Some holding companies have multiple subsidiaries, all filing the exact same report and all traded under the exact same ticker. Luckily, the SEC text file names for the reports are the same, allowing us to throw out file names which are duplicates during the preprocessing stage.