forked from kayleealexander/TidyText
-
Notifications
You must be signed in to change notification settings - Fork 0
/
TidyText.Rmd
595 lines (442 loc) · 25.2 KB
/
TidyText.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
---
title: "Introduction to TidyText"
author: "Kaylee Alexander, Duke University"
date: '2019-18-04'
output:
html_notebook: default
github_document: default
---
[RStudio](https://www.rstudio.com/) is a free and open-source integrated development environment (IDE) for R, a programming language for statistical computing and graphics. Working in RStudio, you can easy install R packages, such as the Tidyverse packages for data science. These packages (i.e. dplyr, tidyr, ggplot2, etc.) facilitate data cleaning, analysis and visualization. R can also be used to analyze non-quantitative data, such as texts. In this workshop we will use the **[tidytext](https://www.tidytextmining.com/)** package (as well as others) to analyze and visualize written works, such as novels and song lyrics.
###1. Getting Started
To begin, we need to open RStudio and create a new script.
* Go to **File** → **New File** → **R Script**
For this workshop, you will be using the following packages:
1. [dplyr](https://dplyr.tidyverse.org/)
2. [tidytext](https://cran.r-project.org/web/packages/tidytext/vignettes/tidytext.html)
3. [gutenbergr](https://cran.r-project.org/web/packages/gutenbergr/vignettes/intro.html)
4. [ggplot2](https://ggplot2.tidyverse.org/)
5. [ggthemes](https://www.rdocumentation.org/packages/ggthemes/versions/3.5.0)
6. [stringr](https://cran.r-project.org/web/packages/stringr/vignettes/stringr.html)
7. [tidyr](https://tidyr.tidyverse.org/)
If you have not worked with these packages in RStudio before, you will need to install them. To install these packages in RStudio:
* click select the **Packages** tab in the bottom right window
* click **Install**
* type in the names of the packages above (separated by a space or a comma)
* click **Install**
###2. Manually Loading Text & Creating a Data Frame
Text can be entered into RStudio manually and transformed into a data frame for analysis. You can separate multiple lines of text by separating each line with a comma ( , ). Go to your script window and type in the following code chunk:
```{r}
text <- c("The time has come,' the Walrus said,",
"To talk of many things:",
"Of shoes — and ships — and sealing-wax —",
"Of cabbages — and kings —",
"And why the sea is boiling hot —",
"And whether pigs have wings.'")
text
```
Next, you will need to transform the value **text** into a **tibble** (data frame), which will allow you to perform text analysis. For this, you will use the **dplyr** package that you installed earlier. To use an installed package, you need to first load it into your R session. To load **dplyr**, run the following:
```{r}
library(dplyr)
```
Now you can create a data frame from the value you previously created. Run the following:
```{r}
text_df <- data_frame(line = 1:6, text = text)
text_df
```
###3. Text Tokenization
In order to perform more analysis on this text, we want to convert the data frame so that we have one token (in this case, a word) per document per row, a process called **tokenization**. Tokenizing text will retain the line number, remove punctuation, and default all words to lowercase characters.^[If you would like to retain the original case of the text, you can include an additional argument (to_lower = FALSE).] For this, we will be using the **tidytext** package.
Load the **tidytext** package into the session, as we did with **dplyr**.
```{r echo=TRUE}
library(tidytext)
```
Then, run the following to tokenize:
```{r echo=TRUE}
text_df %>%
unnest_tokens(word, text)
```
You can replace the data in your environment with this new tibble by running the following:
```{r echo=TRUE}
text_df <- text_df %>%
unnest_tokens(word, text)
text_df
```
This is the format in which we can begin to analyze texts using R. In the next section, we will work with a larger text, and begin to perform some basic analyses and visualizations.
###4. Analyzing Mary Shelley’s *Frankenstein* (1831)
Project Gutenberg is a free database of eBooks, predominately texts for which the U.S. copyright has expired. These ebooks are available in a variety of formats, including Plain Text UTF-8, which we can easily use for text analysis in R. Each e-book has its own unique e-book number, which we can use with the package **gutenbergr** to automatically load the text into an RStudio environment. In this section, we will load and analyze the text of Mary Shelley’s *Frankenstein* (e-book # 42324).
Begin by loading the **gutenbergr** package into the session, as we did with **dplyr** and **tidytext**.
```{r echo=TRUE}
library(gutenbergr)
```
Then, run the following:
```{r}
frankenstein <- gutenberg_download(c(42324))
frankenstein
```
If you look at the first few rows, you see that this also includes the novel’s front matter (i.e. title page, publishing info, introduction, etc.). If you were to examine the last rows, you would also see it includes additional publishing information and the transcriber’s notes. Since we’re only interested in analyzing Shelley’s novel, it would be useful to remove these rows.
Run the following to remove the front matter (rows 1-237) and back matter (rows 7620-7631):
```{r}
frankenstein <- frankenstein[-c(1:237, 7620:7631), ]
frankenstein
```
Now we’re ready to tokenize. During this process, we can add an additional argument to order the words according to the word count. Run the following to create a new tibble with the tokenized text, ordered according to word count:
```{r}
tidy_frankenstein <- frankenstein %>%
unnest_tokens(word, text) %>%
count(word, sort = TRUE)
tidy_frankenstein
```
<span style="color:red">**Question 1: Which two (2) words appear most frequently in** ***Frankenstein*****? How many times do they each appear?**</span>
You may have noticed that these top used words don’t tell you too much about the text of Frankenstein. These types of words are called **stop words** (i.e. 'the', 'of', 'to', etc.). We can remove these types of words from our analysis by using the **anti_join** function. Luckily, you won’t have to come up with a whole list of these words yourself; RStudio includes a preloaded package with a data frame, **stop_words**, which contains English stop words compiled from three different lexicons.^[Datasets of stop words for other languages are also available online and can be loaded into R as needed.]
To remove the stop words from the **tidy_frankenstein tibble**, you must first load the **stop_words** data frame:
```{r}
data("stop_words")
```
Now, you can remove the stop words from your data frame:
```{r}
tidy_frankenstein <- tidy_frankenstein %>%
anti_join(stop_words)
tidy_frankenstein
```
<span style="color:red">**Question 2: Now, which two (2) words appear most frequently? How many times do they each appear?**</span>
###5. Visualizing Mary Shelley's *Frankenstein* (1831)
We are now ready to plot the text of Frankenstein. To do this we will use the packages **ggplot2** and **ggthemes**. These packages are used to create data graphics, according to the Grammar of Graphics. This allows you to easily manipulate the design of your graphs, and to create the most visually appealing data visualizations in R.
First, we are going to plot the most commonly used words in Frankenstein. To do this, first load ggplot2 and ggthemes into your session.
```{r echo=TRUE}
library(ggplot2)
library(ggthemes)
```
Next, run the following to plot the words in *Frankenstein* that appear more than 50 times:
```{r}
tidy_frankenstein %>%
filter(n > 50) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_col(fill = "darkred") +
theme_fivethirtyeight() +
xlab(NULL) +
ylab("Word Count") +
coord_flip() +
ggtitle("Word Usage in Frankenstein")
```
This code chunk first calls the dataset that we would like to plot, then filters the dataset to only words that appear more than 50 times in the novel and orders the data according to the number of mentions. Then it calls **ggplot**, and plots **word** on the **x-axis** and **n** (word count) on the **y-axis**. **Geom_col** tells R to use a bar chart with the value of the *y* variable (here, *n*), while the function **fill = “darkred”** colors in the bars.^[Additional codes for colors can be found [here](http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf).] **theme_fivethirtyeight** is a pre-set design from the package [ggthemes](https://www.rdocumentation.org/packages/ggthemes/versions/3.5.0), which we then manipulate with the subsequent arguments to adjust the x-axis labels and y-axis labels, change the graph into a horizontal format, and add an appropriate title.
<span style="color:red">**Question 3: How many words appear more than 50 times in** ***Frankenstein*****?**</span>
###6. Sentiment Lexicons
One type of textual analysis that can be conducted using R is sentiment analysis in which the emotions of a given text or set of texts is examined. To conduct sentiment analysis, we can use sentiment lexicons. There are three lexicons that can be used for general sentiment analysis on English-language texts: **AFINN** (Finn Årup Nielsen), **bing** (Bing Liu et. Al) and **nrc** (Saif Mohammad and Peter Turney). These lexicons assign scores for positive and negative sentiment, and also emotions such as joy, anger, fear, etc.
* **nrc** categorizes words as positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise and trust.
* **bing** categorizes words as positive or negative.
* **AFINN** assigns words a numeric score between -5 and 5, where the negative scores indicate negative sentiment and positive scores indicate positive sentiment.
All three of these lexicons are included in the **sentiments** dataset, and **tidytext** provides the function **get_sentiments( )**, which can be used to call specific lexicons. To call the **bing** lexicon, for example, run the following:
```{r}
get_sentiments("bing")
```
Now, call the **AFINN** sentiment lexicon.
```{r}
get_sentiments("afinn")
```
<span style="color:red">**Question 4: What sentiment schore is assigned to "abhorrent"?**</span>
Finally, call the **nrc** lexicon.
```{r}
get_sentiments("nrc")
```
<span style="color:red">**Question 5: What sentiments are associated with "abandon"?**</span>
These sentiment lexicons can be joined with your tokenized text(s) to conduct sentiment analysis by using the **inner_join** function from the **dplyr** package.
Run the following code chunk to create a new data frame connecting the **bing** sentiment lexicon to the **tidy_frankenstein** tibble you created earlier.
```{r}
frankenstein_bing <- tidy_frankenstein %>%
inner_join(get_sentiments("bing"))
frankenstein_bing
```
<span style="color:red">**Question 6: Which three (3) words from the bing lexicon appear most frequently in** ***Frankenstein*****? How many times do they each appear and what sentiment is associated with each one?**</span>
Now, let's use **ggplot2** to produce a horizontal bar chart showing positive and negative word usage in *Frankenstein* usiong the Bing et al. sentiment lexicon.
```{r}
frankenstein_bing %>%
filter(n > 25) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill=sentiment)) +
theme_fivethirtyeight() +
geom_col() +
xlab(NULL) +
coord_flip() +
ylab("Word Count") +
ggtitle("Word Usage in Frankenstein", subtitle = "Sentiment Analysis Using Bing et al.")
```
We can also use the **nrc** sentiment lexicon to get a better insight into how specific emotions play a role in a given text by filtering the data and joining it to your tokenized text. To see how fear appears in Frankenstein, run the following code chunks:
```{r}
nrc_fear <- get_sentiments("nrc") %>%
filter(sentiment == "fear")
nrc_fear
```
```{r}
tidy_frankenstein %>%
inner_join(nrc_fear)
```
<span style="color:red">**Question 7: How many words associated with fear appear in** ***Frankenstein*****? Which fear word appears most frequently?**</span>
Repeat this process looking at the words associated with **disgust**.
```{r}
nrc_disgust <- get_sentiments("nrc") %>%
filter(sentiment == "disgust")
nrc_disgust
```
```{r}
tidy_frankenstein %>%
inner_join(nrc_disgust)
```
<span style="color:red">**Question 8: How many words associated with disgust appear in** ***Frankenstein*****? Which disgust word appears most frequently?**</span>
###7. Comparing Sentiment Lexicons
In performing sentiment analysis, it might be useful to compare how the use of a given lexicon impacts the analysis. In this section we will explore how the **AFINN**, **bing**, and **nrc** lexicons compare when we examine the narrative arc of *Frankenstein.* First, we will divide the text of *Frankenstein* into its narrative sections^[If you’ve not read Frankenstein, the first parts of the book are organized as a series of written letters, while the remainder of the book is organized into chapters.] using the package **stringr**. We will then use the **tidyr** package to bind the sections to the different sentiment lexicons while manipulating the nrc and bing sentiment lexicons to match the numeric scoring used by the AFINN lexicon.
First, load the **stringr** and **tidyr** packages into your session.
```{r}
library(stringr)
library(tidyr)
```
Then, run the following code chunk to section off the text of *Frankenstein* according to each letter or chapter.
```{r}
frankenstein_sections <- frankenstein %>%
mutate(section =
cumsum(str_detect(text,regex("^chapter|letter [\\divxlc]",
ignore_case = TRUE)))) %>%
ungroup()%>%
unnest_tokens(word, text) %>%
anti_join(stop_words)
frankenstein_sections
```
Now, run the following to create a tibble summarizing the total sentiment score (according to the AFINN lexicon) per section of *Frankenstein*.
```{r}
frankenstein_afinn <- frankenstein_sections %>%
inner_join(get_sentiments("afinn")) %>%
group_by(index = section) %>%
summarize(sentiment = sum(score)) %>%
mutate(method = "AFINN")
frankenstein_afinn
```
Next, create a tibble joining the **bing** and **nrc** lexicons to **frankenstein_sections**, and converting the binary values "positive" and "negative" to a numeric score comparable to the **AFINN** lexicon.
```{r}
frankenstein_bingnrc <- bind_rows(frankenstein_sections %>%
inner_join(get_sentiments("bing")) %>%
mutate(method = "Bing et al."),
frankenstein_sections %>%
inner_join(get_sentiments("nrc") %>%
filter(sentiment %in%
c("positive",
"negative"))) %>%
mutate(method = "NRC")) %>%
count(method, index = section, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
frankenstein_bingnrc
```
Now we can plot the **frankenstein_afinn** and **frankenstein_bingnrc** tibbles together using **ggplot2**.
```{r}
bind_rows(frankenstein_afinn,
frankenstein_bingnrc) %>%
ggplot(aes(index, sentiment, fill = method)) +
geom_col(show.legend = FALSE) +
facet_wrap(~method, ncol = 1, scales = "free_y") +
geom_smooth() +
theme_fivethirtyeight() +
xlab("Index") +
ylab("Sentiment Score") +
ggtitle("Sentiment in Frankenstein per Chapter", subtitle = "Comparing Lexicons")
```
<span style="color:red">**Question 9: Which lexicon gives the final chapter of** ***Frankenstein*** **the most negative score?**</span>
<span style="color:red">**Question 10: Which lexicon is shows the most positive reading of** ***Frankenstein*****?**</span>
Discrepancies among sentiment scores can be due to the ways in which sentiment is categorized in or scored, as well as which words appear or do not appear in each lexicon. To understand these lexicons further, you can count the number of positive and negative words included in each. To look further into **nrc**, run the following:
```{r}
get_sentiments("nrc") %>%
filter(sentiment %in% c("positive",
"negative")) %>%
count(sentiment)
```
Now, examine **bing**.
```{r}
get_sentiments("bing") %>%
count(sentiment)
```
<span style="color:red">**Question 11: Which lexicon includes more negative words? How many negative words does it include**</span>
It is also useful to know how much each word contributes to the overall sentiment of the book. To do this, we can create and plot a tibble joining a given lexicon to the text and creating a variable *n* to indicate the word count.
To create the tibble, run the following code chunk:
```{r}
frank_bingcounts <- frankenstein_sections %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
frank_bingcounts
```
To plot:
```{r}
frank_bingcounts %>%
group_by(sentiment) %>%
top_n(20) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
coord_flip() +
theme_fivethirtyeight() +
ggtitle("Words' Contribution to Sentiment in Frankenstein", subtitle = "Using the Bing et. al Lexicon")
```
Now, let's plot the contrubution to sentiment that positive and negative words in *Frankenstein* have using the nrc lexicon. (Note: you will have to include a filter to only include the categories “positive” and “negative.”)
Create the tibble:
```{r}
frank_nrccounts <- frankenstein_sections %>%
inner_join(get_sentiments("nrc")) %>%
filter(sentiment %in% c("positive",
"negative")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
frank_nrccounts
```
Plot:
```{r}
frank_nrccounts %>%
group_by(sentiment) %>%
top_n(20) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
coord_flip() +
theme_fivethirtyeight() +
ggtitle("Words' Contribution to Sentiment in Frankenstein", subtitle = "Using the NRC Lexicon")
```
###8. Analyzing Queen Lyrics
In this section we will experiment with using sentiment analysis on song lyrics. To begin, we will import the Queen.csv file containing all of the lyrics to songs by the band Queen.
First, load the **Queen.csv** file into your environment.
```{r}
queen <- read.csv("https://raw.githubusercontent.com/kayleealexander/TidyText/master/Queen.csv", stringsAsFactors = FALSE)
queen
```
Next, we will need to tokenize the lyrics, and group them per song.
```{r}
queen <- queen %>%
group_by(song) %>%
ungroup()%>%
unnest_tokens(word, text)
queen
```
Now let’s count the most common words in Queen lyrics. Create a tibble for words in Queen lyrics, sorted by word count and with stop words removed.
```{r}
tidy_queen <- queen %>%
count(word, sort = TRUE) %>%
anti_join(stop_words)
tidy_queen
```
<span style="color:red">**Question 12: Which two (2) words appear most commonly in Queen lyrics? How often are they used?**</span>
Song lyrics often include words such as “yeah” and “hey,” which you might want to exclude from your analysis.^[Sometimes it is useful to first visualize the most frequently occurring words in your dataset to identify these types of words that you might want to exclude. This is just a selection of some that appeared in this dataset.] You can create a custom stop words list by running the following:
```{r}
lyric_stop_words <- bind_rows(data_frame(word =
c("yeah","baby","hey", "la","oooh",
"ah","ooh"),
lexicon = c("custom")), stop_words)
lyric_stop_words
```
Use the **anti_join()** function to remove your custom stop words from **tidy_queen**.
```{r}
tidy_queen <- tidy_queen %>%
anti_join(lyric_stop_words)
tidy_queen
```
Now, use **ggplot2** to plot the most common words in Queen lyrics that are used more than 60 times.
```{r}
tidy_queen %>%
filter(n > 60) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n)) +
theme_fivethirtyeight() +
geom_col() +
xlab(NULL) +
coord_flip() +
ggtitle("Word Frequency in Queen Lyrics")
```
###9. Analyzing Sentiment in Queen Lyrics
Now that we have an overview of the data, we can begin to apply some sentiment analysis. Use the **inner_join()** function to **create a new tibble** connecting the **queen** dataframe to both the **AFINN** lexicon *and* your existing **tidy_queen** dataframe.
```{r}
queen_afinn <- queen %>%
inner_join(get_sentiments("afinn")) %>%
inner_join(tidy_queen)
queen_afinn
```
Now, plot this new tibble using **ggplot2** to create a horizontal bar chart showing all words in Queen lyrics occurring more than 30 times, ordered on word count and filled according to their AFINN sentiment score.
```{r}
queen_afinn %>%
filter(n > 30) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, fill=score)) +
geom_bar() +
xlab(NULL) +
coord_flip() +
theme_fivethirtyeight() +
ggtitle("Word Frequency and Sentiment in Queen Songs", subtitle = "Using the AFINN Lexicon")
```
Now, let's create a data frame summarizing the AFINN sentiment score per song.
```{r}
queen_songs <- queen_afinn %>%
group_by(index = song) %>%
summarize(sentiment = sum(score))
```
Then, plot this new data frame using **ggplot2**.
```{r}
queen_songs %>%
mutate(index = reorder(index, sentiment)) %>%
ggplot(aes(index, sentiment, fill = sentiment)) +
geom_col() +
theme_fivethirtyeight() +
xlab(NULL) +
ylab("Total Sentiment Score") +
ylim(-100,200) +
theme(axis.text.x=element_blank(),
axis.line.y = element_line(),
panel.grid = element_blank()) +
ggtitle("Songs by Queen",
subtitle = "Sentiment Analysis Using the AFINN Lexicon")
```
We now see each song represented by a bar along the *x-axis*, with the *y-axis* value representing the total AFINN score per Queen song (sum of the scores given for each word in the song). It is difficult, however, to see which bars represent which songs. Let's explore the data to get some insight:
* Click on **queen_songs** in the list of **Data** in your **Global Environment**.
* This pulls up a window showing your data frame.
* Sort the songs according to their sentiment score by clicking the arrows in the upper right corner of the cell with the column name sentiment.
You can now see the songs ordered according to their sum sentiment score. We can use this view to get a better insight into which songs in our visualization are represented in which areas of the graph. We can then add annotations to the graph to show the highest and lowest scoring songs, as well as the more neutral scoring songs to help the viewer understand better what the graph might represent.
Order the songs according to their sentiment scores by clicking on the arrows in the upper right corner of the column name.
<span style="color:red">**Question 13: Which song received a sentiment score of twelve (12)?**</span>
Annotations need to be added to the graph by specifying the coordinates at which you would like them to appear. You can adjust the coordinates easily by playing with the values in order to get the text situated at the precise point you’d like it to. To add annotations (text and arrows) for the highest, lowest, and most neutral Queen songs to your visualization, run the following:
```{r}
queen_songs %>%
mutate(index = reorder(index, sentiment)) %>%
ggplot(aes(index, sentiment, fill = sentiment)) +
geom_col() +
theme_fivethirtyeight() +
xlab(NULL) +
ylab("Total Sentiment Score") +
ylim(-100,200) +
theme(axis.text.x=element_blank(),
axis.line.y = element_line(),
panel.grid = element_blank()) +
ggtitle("Songs by Queen",
subtitle = "Sentiment Analysis Using the AFINN Lexicon") +
annotate("text", label = "''How Funny Love Is''",
size = 3, color = "black", fontface = 2,
x = 160, y = 150) +
annotate("segment", x = 178, xend = 190, y = 150, yend = 150,
colour = "red",
size=.5, arrow=arrow(length = unit(.1, "inches"))) +
annotate("text", label = "''Liar''",
size = 3, color = "black", fontface = 2,
x = 18, y = -78) +
annotate("segment", x = 13, xend = 1, y = -78, yend = -78,
colour = "red",
size=.5, arrow=arrow(length = unit(.1, "inches"))) +
annotate("text", label = "''Some Day One Day''",
size = 3, color = "black", fontface = 2,
x = 82, y = 53) +
annotate("segment", x = 83, xend = 83, y = 45, yend = 1,
colour = "red",
size=.5, arrow=arrow(length = unit(.1, "inches"))) +
annotate("text", label = "''A Kind of Magic''",
size = 3, color = "black", fontface = 2,
x = 77, y = -50) +
annotate("segment", x = 77, xend = 77, y = -45, yend = -1,
colour = "red",
size=.5, arrow=arrow(length = unit(.1, "inches")))
```
The visualization has now been reproduced with 8 annotations (4 arrows, 4 texts). The size, style, and position of each annotation can be adjusted by changing the values within the parenthesis for **size**, **color**, **fontface**, **x**, **xend**, **y**, and **yend**. Feel free to adjust these settings, or add additional annotations to the chart based on the data frame.