README.Rmd

---
output: github_document
always_allow_html: true
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  echo = TRUE,
  comment = "#",
  collapse = TRUE,
  fig.path = "man/figures/README-",
  fig.width = 8,
  fig.height = 5
  )
```

# sentopics

<!-- badges: start -->
[![CRAN Version](https://www.r-pkg.org/badges/version/sentopics)](https://CRAN.R-project.org/package=sentopics)
[![Codecov test coverage](https://codecov.io/gh/odelmarcelle/sentopics/branch/master/graph/badge.svg?token=V6M82L4ZCX)](https://app.codecov.io/gh/odelmarcelle/sentopics)
[![R-CMD-check](https://github.com/odelmarcelle/sentopics/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/odelmarcelle/sentopics/actions/workflows/R-CMD-check.yaml)
<!-- badges: end -->

## Installation

A stable version `sentopics` is available on CRAN:

```{r eval = FALSE}
install.packages("sentopics")
```

The latest development version can be installed from GitHub:

``` {r eval = FALSE}
devtools::install_github("odelmarcelle/sentopics") 
```

The development version requires the appropriate tools to compile C++ and Fortran source code.

## Basic usage

Using a sample of press conferences from the European Central Bank, an LDA model is easily created from a list of tokenized texts. See https://quanteda.io for details on `tokens` input objects and pre-processing functions.

``` {r}
library("sentopics")
print(ECB_press_conferences_tokens, 2)
set.seed(123)
lda <- LDA(ECB_press_conferences_tokens, K = 3, alpha = .1)
lda <- fit(lda, 100)
lda
```

There are various way to extract results from the model: it is either possible to directly access the estimated mixtures from the `lda` object or to use some helper functions.

```{r paged.print=FALSE}
# The document-topic distributions
head(lda$theta) 
# The document-topic in a 'long' format & optionally with meta-data
head(melt(lda, include_docvars = FALSE))
# The most probable words per topic
topWords(lda, output = "matrix") 
```

Two visualization are also implemented: `plot_topWords()` display the most probable words and `plot()` summarize the topic proportions and their top words.
```{r plot-lda-show, eval = FALSE}
plot(lda)
```
```{r plot-lda, warning=FALSE, fig.align='center', fig.width = 5, echo = FALSE}
plot(lda) |> plotly::layout(width = 500, height = 500)
```


After properly incorporating date and sentiment metadata data (if they are not already present in the `tokens` input), time series functions allows to study the evolution of topic proportions and related sentiment.

```{r series, message=FALSE, out.width="100%"}
sentopics_date(lda)  |> head(2)
sentopics_sentiment(lda) |> head(2)
proportion_topics(lda, period = "month") |> head(2)
plot_sentiment_breakdown(lda, period = "quarter", rolling_window = 3)
```

## Advanced usage

Feel free to refer to the vignettes of the package for a more extensive introduction to the features of the package. Because the package is not yet on CRAN, you'll have to build the vignettes locally.

```{r, eval = FALSE}
vignette("Basic_usage", package = "sentopics")
vignette("Topical_time_series", package = "sentopics")
```