---
output: github_document
always_allow_html: true
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r setup, include=FALSE}
knitr::opts_chunk$set(
echo = TRUE,
comment = "#",
collapse = TRUE,
fig.path = "man/figures/README-",
fig.width = 8,
fig.height = 5
)
```
# sentopics
<!-- badges: start -->
[CRAN](https://CRAN.R-project.org/package=sentopics)
[Codecov](https://app.codecov.io/gh/odelmarcelle/sentopics)
[R-CMD-check](https://github.com/odelmarcelle/sentopics/actions/workflows/R-CMD-check.yaml)
<!-- badges: end -->
## Installation
A stable version of `sentopics` is available on CRAN:
```{r eval = FALSE}
install.packages("sentopics")
```
The latest development version can be installed from GitHub:
``` {r eval = FALSE}
devtools::install_github("odelmarcelle/sentopics")
```
The development version requires the appropriate tools to compile C++ and Fortran source code.
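Before installing from GitHub, it can help to check that a compiler toolchain is visible to R. The chunk below is a minimal sketch that assumes the `pkgbuild` package is installed (it is not part of `sentopics`). It only tests that R can compile a small piece of native code and does not specifically confirm that a Fortran compiler is present.

```{r eval = FALSE}
# Assumption: pkgbuild is installed (it usually comes along with devtools).
# Returns TRUE if a working build toolchain is found;
# debug = TRUE prints diagnostics when the check fails.
pkgbuild::has_build_tools(debug = TRUE)
```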
## Basic usage
Using a sample of press conferences from the European Central Bank, an LDA model is easily created from a list of tokenized texts. See https://quanteda.io for details on `tokens` input objects and pre-processing functions.
``` {r}
library("sentopics")
print(ECB_press_conferences_tokens, 2)
set.seed(123)
lda <- LDA(ECB_press_conferences_tokens, K = 3, alpha = .1)
lda <- fit(lda, 100)
lda
```
There are various ways to extract results from the model: you can either access the estimated mixtures directly from the `lda` object or use helper functions.
```{r paged.print=FALSE}
# The document-topic distributions
head(lda$theta)
# The document-topic distributions in a 'long' format, optionally with metadata
head(melt(lda, include_docvars = FALSE))
# The most probable words per topic
topWords(lda, output = "matrix")
```
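To make the difference between a wide document-topic matrix and a 'long' table concrete, here is a small base R sketch. The `theta_toy` matrix is a made-up example, not output from the model above, and the reshaping only illustrates the idea of a long format. It is not how `melt()` is implemented, and the column names are chosen for illustration.

```{r}
# Hypothetical mixtures for 3 documents and 3 topics (illustrative values only)
theta_toy <- matrix(
  c(0.7, 0.2, 0.1,
    0.1, 0.8, 0.1,
    0.3, 0.3, 0.4),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("doc", 1:3), paste0("topic", 1:3))
)

# Each row is a probability distribution over topics, so it sums to one
stopifnot(all(abs(rowSums(theta_toy) - 1) < 1e-8))

# The most probable topic of each document
dominant <- colnames(theta_toy)[max.col(theta_toy, ties.method = "first")]
names(dominant) <- rownames(theta_toy)
dominant

# The same information in a 'long' format: one row per document-topic pair
long <- as.data.frame(as.table(theta_toy), responseName = "prob")
names(long) <- c("doc", "topic", "prob")
head(long)
```

The long layout is usually more convenient for plotting or for joining with document-level metadata, while the wide matrix is handier for row-wise computations.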
Two visualizations are also implemented: `plot_topWords()` displays the most probable words and `plot()` summarizes the topic proportions and their top words.
```{r plot-lda-show, eval = FALSE}
plot(lda)
```
```{r plot-lda, warning=FALSE, fig.align='center', fig.width = 5, echo = FALSE}
plot(lda) |> plotly::layout(width = 500, height = 500)
```
After date and sentiment metadata have been properly incorporated (if they are not already present in the `tokens` input), the time series functions make it possible to study the evolution of topic proportions and their related sentiment.
```{r series, message=FALSE, out.width="100%"}
sentopics_date(lda) |> head(2)
sentopics_sentiment(lda) |> head(2)
proportion_topics(lda, period = "month") |> head(2)
plot_sentiment_breakdown(lda, period = "quarter", rolling_window = 3)
```
## Advanced usage
Refer to the package vignettes for a more extensive introduction to its features. If you installed the development version from GitHub, the vignettes are not built by default, so you may have to build them locally.
```{r, eval = FALSE}
vignette("Basic_usage", package = "sentopics")
vignette("Topical_time_series", package = "sentopics")
```
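If the calls above report that no vignette was found, one possible fix is to reinstall from GitHub and ask for the vignettes to be built. This is a sketch that assumes `devtools` is installed, along with the packages the vignettes need to knit. The build can take a while because the vignette code is executed.

```{r eval = FALSE}
# Reinstall the development version and build its vignettes
# (build_vignettes defaults to FALSE when installing from GitHub)
devtools::install_github("odelmarcelle/sentopics", build_vignettes = TRUE)

# List the vignettes that are now available for the package
vignette(package = "sentopics")
```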