---
title: "Speeding up R"
date: "2023-06-28"
author: "Stuart Lacy"
execute: 
  cache: true
  keep-md: true
format: 
  revealjs:
    smaller: false
    slide-number: c/t
    show-slide-number: all
    scrollable: true
    theme: default
    navigation-mode: linear
    width: 1280
    height: 700
    embed-resources: true
---

## Introduction

  - Focus on techniques for speeding up analysis of tabular data
  - Subjects:
    - Vectorization
    - Joins
    - `Rcpp`
    - `data.table`
    - Alternative backends
  
```{r setup, echo=F}
library(tidyverse)
library(knitr)
library(kableExtra)
options(dplyr.summarise.inform = FALSE)
```

```{r pprint}
pprint <- function(df, n=NULL, font_size=14) {
  df_name <- as.character(match.call())[2]
  if (is.null(n)) {
    n <- nrow(df)
  }
  df |> 
      head(n) |> 
      kable(format="html",
            caption=sprintf("%s: Rows 1 - %d out of %d", 
            df_name, n, nrow(df)), 
            escape=TRUE) |>
    kable_styling(font_size=font_size)
}
```

    
# Vectorization

## Vectorization Concept

  > For loops in R are slow - many StackOverflow posts
  
  - In general, if you're using a `for` loop (or a `sapply` variant), then your code could be sped up by using a `vectorised` function
  - Definition: `f(x[i]) = f(x)[i]` for $i \in \{1, \dots, N\}$
  
. . .

  - `sqrt(c(4, 9, 16)) = 2, 3, 4`, therefore `sqrt` is vectorised
  - Using vectorised functions often results in **cleaner code** with less chance for bugs
  - There are a lot of vectorised functions available in the standard library
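  - A quick way to sanity-check the definition above, as a sketch only: the helper name `is_vectorised_on` is made up for illustration, and passing the check on one input doesn't prove a function is vectorised in general

```r
# Hypothetical helper: does f(x)[i] equal f(x[i]) for every i of this input?
is_vectorised_on <- function(f, x) {
  whole <- f(x)
  # A vectorised function returns one value per input element
  if (length(whole) != length(x)) {
    return(FALSE)
  }
  elementwise <- vapply(seq_along(x), function(i) f(x[i]), whole[1])
  isTRUE(all.equal(whole, elementwise))
}

is_vectorised_on(sqrt, c(4, 9, 16))  # TRUE
is_vectorised_on(sum, c(4, 9, 16))   # FALSE: sum() collapses to a single value
```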
  
## Standard library vectorised functions {.smaller}

```{r df-creation, echo=FALSE}
df <- data.frame(x=runif(10, min=3, max=8), y=rnorm(10), z=sin(1:10), xy="")
df$z[c(3, 8)] <- NA
```

:::: {.columns}

::: {.column width="50%"}

  - Non-vectorised

```r
for (i in 1:nrow(df)) {
  # New column based on 2 others
  if (df$x[i] > 5 && df$y[i] < 0) {
    df$y[i] <- 5
  } else {
    df$y[i] <- 0
  }
  
  # Replace NAs with error code
  if (is.na(df$z[i])) {
    df$z[i] <- 9999 
  }
  
  # String concatenate columns
  df$xy[i] <- paste(df$x[i], df$y[i], sep="_")
}
  
# Distance between every row
dists <- matrix(nrow=nrow(df), ncol=nrow(df))
for (i in 1:nrow(df)) {
  for (j in 1:nrow(df)) {
    if (j != i) {
      dists[i, j] <- sqrt((df$x[i] - df$x[j])**2 + (df$y[i] - df$y[j])**2)
    }
  }
}
```

:::

::: {.column width="50%"}

  - Vectorised

```r

# New column based on 2 others
df$y <- ifelse(df$x > 5 & df$y < 0, 5, 0)


# Replace NAs with error code
df$z[is.na(df$z)] <- 9999


# Concatenate columns
df$xy <- paste(df$x, df$y, sep="_")


# Distance between every row
dist(df[, c('x', 'y')])
```

:::

::::
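
  - One difference worth knowing: `dist()` returns a compact `dist` object (lower triangle only) rather than the full matrix built by the loop. A small sketch on made-up points (not the `df` from the slide) suggesting the two agree once converted with `as.matrix()`:

```r
# Toy data, invented for illustration
pts <- data.frame(x = c(0, 3, 0), y = c(0, 4, 1))

# Loop version, same logic as the non-vectorised column
n <- nrow(pts)
dists_loop <- matrix(NA_real_, nrow = n, ncol = n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    if (j != i) {
      dists_loop[i, j] <- sqrt((pts$x[i] - pts$x[j])^2 + (pts$y[i] - pts$y[j])^2)
    }
  }
}

# Vectorised version: as.matrix() expands the dist object to an n x n matrix
dists_vec <- as.matrix(dist(pts[, c("x", "y")]))

# The loop leaves the diagonal as NA, whereas dist() implies 0
diag(dists_loop) <- 0
all.equal(dists_loop, unname(dists_vec))  # TRUE
```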

## Worked example

  - Example taken from [Jim Hester's blog post](https://www.jimhester.com/post/2018-04-12-vectorize/)
  
  > Given a path some/path/abc/001.txt, create a fast function to return abc_001.txt
  
. . .
  
  - First attempt works on a single path at a time, separating it by `/` and concatenating the last directory and filename
  - It doesn't work for a vector input (only one result comes back for the two paths below); is there an easy way to vectorise it?
  
```{r string-1, echo=TRUE}
example_1 <- function(path) {
  path_list <- str_split(path, "/") %>% unlist()
  paste(path_list[length(path_list) - 1], path_list[length(path_list)], sep = "_")
}
example_1(c("foo/bar/car/001.txt", "har/far/lar/002.txt"))
```

## Version 1 - `Vectorize`

  - `Vectorize` takes a function that works on a single element, and returns a vectorised version - job done!

. . .

  - However, it just uses `mapply` under the hood and **isn't quicker**, it's mostly just syntactic sugar

```{r string-1-vec, echo=TRUE}
example_1_vectorised <- Vectorize(example_1)  # This returns a *function*
example_1_vectorised(c("foo/bar/car/001.txt", "har/far/lar/002.txt"))
```

## Version 2

  - Want to replace this implicit for loop with inbuilt vectorised functions
  - ✅ `str_split` is vectorised, returning a list over the input entries
  - ✅ `paste` is also vectorised
  - ❌ Need to use a for loop (`sapply`) to grab the last dir and filename from each entry
  - Overall have reduced the computation done inside the for loop

```{r string-2, echo=TRUE}
example_2 <- function(paths) {
  path_list <- str_split(paths, "/")
  last_two <- sapply(path_list, tail, 2)
  paste(last_two[1, ], last_two[2, ], sep="_")
}
example_2(c("foo/bar/car/001.txt", "har/far/lar/002.txt"))
```

## Version 3

  - We can't directly replace this for loop with a single vectorised function, have to take another approach
  - `dirname('foo/bar/dog.txt') = foo/bar`
  - `basename('foo/bar/dog.txt') = dog.txt`
  - Combining these can give us our entire functionality in 4 inbuilt vectorised function calls!

```{r string-3, echo=TRUE}
example_3 <- function(paths) {
  paste(basename(dirname(paths)), basename(paths), sep="_")
}
example_3(c("foo/bar/car/001.txt", "har/far/lar/002.txt"))
```

## Comparison

  - The `microbenchmark` library makes it easy to time snippets of code
  - The `Vectorize` version isn't doing anything different from manually looping through with `sapply`

```{r string-comp, echo=TRUE}
library(microbenchmark)
# Construct 200 paths (2 example paths, each repeated 100 times)
paths <- rep(c("some/path/abc/001.txt", "another/directory/xyz/002.txt"), 100)

res <- microbenchmark(
  example_1_vectorised(paths),
  sapply(paths, example_1),
  example_2(paths),
  example_3(paths)
)

summary(res)[c("expr", "median")]
```

## Conclusions {.smaller}

  > In general, if you're using a `for` loop (or a `sapply` variant), then your code could be sped up by using a `vectorised` function - Me (7 slides ago)

  - This wasn't fully correct, as vectorised functions can have for loops under the hood and will thus still be slow
  - The difference between a function vectorised with `Vectorize` and an inbuilt function like `basename` is that the latter still has a for loop, **but it is written in C/C++ rather than R**
  
. . .
  
  > In general, if you're using a `for` loop (or a `sapply` variant), then your code could be sped up by using a for loop written in C/C++, preferably part of the standard library - Me (now)
  
  - Later on will demonstrate how to write our own C++ functions

# DataFrames & Joining

## Basic DataFrame operations {.smaller}

  - Fortunately working with `data.frame`s and the `tidyverse` core verbs pushes you towards using vectorised functions
  - `group_by() |> summarise()` is both quicker and more legible than manually looping over the groups and combining the results
  - `filter(f(x))` assumes that `f()` is vectorised and returns a boolean `TRUE/FALSE` for every row
  - `mutate(newcol=f(oldcol))` assumes `f()` is vectorised and returns a value per row

. . .

  - **Caution**: you can run into errors or unexpected behaviour when using non-vectorised functions

:::: {.columns}

::: {.column width="50%"}
  
```{r string-error, echo=TRUE}
# Non-vectorised version didn't Error, 
# but gave an unexpected result
data.frame(path=paths) |>
  mutate(path_clean1 = example_1(path),
         path_clean2 = example_3(path)) |>
  head()
```

:::

::: {.column width="50%"}

```{r string-ifelse, echo=TRUE, error=TRUE}
# This function isn't vectorised due to the if/else statement
# Solution: Use ifelse() instead
replace_both_NA_9999 <- function(x, y) {
  if (is.na(x) && is.na(y)) {
    return(9999)
  } else {
    return(0)
  }
}

data.frame(a=c(5, 3, NA, 2, NA), 
           b=c(NA, 2, NA, 1, 9)) |>
  mutate(c = replace_both_NA_9999(a, b))
```

:::

::::
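
  - The `ifelse()` fix mentioned in the code comment could look like this (a sketch; the `_vec` function name is made up here):

```r
# Vectorised rewrite: `&` and ifelse() both work element-wise,
# unlike `&&` and if/else which expect a single TRUE/FALSE
replace_both_NA_9999_vec <- function(x, y) {
  ifelse(is.na(x) & is.na(y), 9999, 0)
}

replace_both_NA_9999_vec(c(5, 3, NA, 2, NA), c(NA, 2, NA, 1, 9))
# 0 0 9999 0 0
```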

## Joining {.smaller}

  - Linking 2 datasets together using the `join` family of functions is an integral part of data analysis
  - `join` functions are also highly efficient and can be useful in a number of situations, even when we don't have 2 separate datasets
  - `inner_join` links two dataframes together based on a column in common; when the keys in the 'right' table are unique, the number of rows equals the number of rows in the 'left' table that have a matching row in the 'right' table
  
```{r join-setup}
df_1 <- data.frame(group=c('a', 'b', 'c'), value1=c(1, 2, 3)) 
df_2 <- data.frame(group=c('b', 'c', 'd'), value2=c(4, 5, 6))
```

:::: {.columns}

::: {.column width="30%"}

```{r join-display, echo=FALSE}
pprint(df_1, font_size = 18)
```


::: 

::: {.column width="30%"}

```{r join-display-2, echo=FALSE}
pprint(df_2, font_size = 18)
```

:::

::: {.column width="40%"}

```{r join-display-3, echo=TRUE}
joined <- df_1 |> inner_join(df_2, by="group")
```

```{r join-display-4}
pprint(joined, font_size = 18)
```

:::

::::

## Example usage: inner join instead of `ifelse` {.smaller}
  
  - Can think of `inner_join` as being able to both `filter` and `mutate` new columns
  - Example: apply different per-group scaling factor to 300,000 measurements from 3 groups
  - On one joining column `join` isn't much quicker, but it's far more legible and scales well to both having more groups in the joining column, and additional joining columns

:::: {.columns}
  
::: {.column width="30%"}

```{r join-setup-2, echo=FALSE}
n_per_group <- 1e5
df <- data.frame(group=rep(c('a', 'b', 'c'), each=n_per_group),
                 time = rep(seq.POSIXt(from=as_datetime("2020-03-05"), 
                                       by="1 min", 
                                       length.out=n_per_group),
                            3),
                 value = rnorm(n_per_group * 3))
```

```{r join-display-5}
pprint(df, 5, font_size=15)
```

:::

::: {.column width="30%"}

```{r join-scales}
scales <- data.frame(group=c('a', 'b', 'c'),
                     scale=c(2, 7.8, 9))
```

```{r join-scales-2}
pprint(scales, font_size=15)
```

:::

::: {.column width="30%"}

```{r join-scales-3}
joined <- df |> inner_join(scales, by="group")
```

```{r join-scales-4}
pprint(joined, 5, font_size=15)
```

:::

::::

. . .

```{r join-scales-comp, echo=TRUE}
f_join <- function() {
  df |> inner_join(scales, by="group")
}

f_ifelse <- function() {
  df |>
    mutate(scale = ifelse(group == 'a', 2,
                          ifelse(group == 'b', 7.8,
                                 ifelse(group == 'c', 9, NA))))
  
}

res <- microbenchmark(f_join(), f_ifelse(), times=10)
summary(res)[c("expr", "median")]
```
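
  - To illustrate the point about additional joining columns, a minimal sketch with an invented second key (`sensor`) and invented scale values: the join stays a one-liner where the nested `ifelse` would need one branch per combination

```r
library(dplyr)

# Toy data, made up for illustration
measurements <- data.frame(group  = c("a", "a", "b", "b"),
                           sensor = c("s1", "s2", "s1", "s2"),
                           value  = c(1, 1, 1, 1))

# Lookup table: one row per (group, sensor) combination
scales2 <- data.frame(group  = c("a", "a", "b", "b"),
                      sensor = c("s1", "s2", "s1", "s2"),
                      scale  = c(2, 3, 7.8, 9))

scaled <- measurements |>
  inner_join(scales2, by = c("group", "sensor")) |>
  mutate(scaled_value = value * scale)

scaled$scaled_value  # 2.0 3.0 7.8 9.0
```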

## `left_join`

  - A `left_join` returns **all rows** in the left table, but only those in the right that match the condition
  - For rows in the left table without a match in the right table, the columns coming from the right table are filled with `NA`
  
```{r join-inner-1}
df1 <- data.frame(group=c('a', 'b', 'c', 'd'), val1 = seq(4))
df2 <- data.frame(group=c('a', 'b', 'c'), val2 = seq(3)**2)
```
  
:::: {.columns}

::: {.column width="15%"}
  
```{r join-left-1, echo=TRUE}
df1
```

:::

::: {.column width="15%"}

```{r join-left-2, echo=TRUE}
df2
```

:::

::: {.column width="35%"}

```{r join-left-3, echo=TRUE}
df1 |> 
  left_join(df2, by="group")
```

:::

::: {.column width="35%"}

```{r join-left-4, echo=TRUE}
df1 |> 
  inner_join(df2, by="group")
```

:::

::::

## Example usage: filling gaps with `left_join`

  - Very useful if want to be aware of missing values
  - Useful for filling gaps in non-uniformly sampled time-series so can count missingness or interpolate

:::: {.columns}

::: {.column width="30%"}

```{r join-left-5, echo=FALSE}
df <- data.frame(date=as_date(c('2020-01-01', '2020-01-03', '2020-01-05')),
                  measurement = rnorm(3))
```

```{r join-left-6, echo=TRUE}
df
```

:::

::: {.column width="20%"}

```{r join-left-7, echo=FALSE}
all_times <- data.frame(date = seq.Date(from=min(df$date), to=max(df$date), by=1))
```

```{r join-left-8, echo=TRUE}
all_times
```

:::

::: {.column width="50%"}

```{r join-left-9, echo=TRUE}
all_times |> left_join(df, by="date")
```

:::

::::
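
  - Once the gaps are explicit they can be counted or interpolated. A sketch on toy values (fixed numbers rather than the `rnorm` draws above; linear interpolation via base R's `approx()` is just one possible choice):

```r
library(dplyr)

# Toy data mirroring the slide: measurements on 3 out of 5 days
measured <- data.frame(date = as.Date(c("2020-01-01", "2020-01-03", "2020-01-05")),
                       measurement = c(1.2, 0.7, 1.9))
all_days <- data.frame(date = seq(min(measured$date), max(measured$date), by = "day"))

filled <- all_days |> left_join(measured, by = "date")

# Days without a measurement are now NA, so they can be counted
sum(is.na(filled$measurement))  # 2

# Linear interpolation between the known points
filled$interpolated <- approx(x = as.numeric(measured$date),
                              y = measured$measurement,
                              xout = as.numeric(filled$date))$y
filled$interpolated  # 1.20 0.95 0.70 1.30 1.90
```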

## Interval joins {.smaller}

  - Joins aren't limited to joining on equal values, can also join on **intervals** or **closest value**
  - Example: Have measurements from every day in 2020, but want to limit analysis to 5 specific weeks
  
```{r join-intervals-1}
df_interval <- data.frame(time = seq.Date(from=as_date("2020-01-01"), to=as_date("2020-12-31"), by=1),
                 measurement = rnorm(366))
weeks <- data.frame(week_group = c('a', 'b', 'c', 'd', 'e'),
                    week_start = as_date(c("2020-02-14", "2020-03-17", "2020-05-08", "2020-09-20", "2020-11-13")),
                    week_end = as_date(c("2020-02-21", "2020-03-24", "2020-05-15", "2020-09-27", "2020-11-20")))
```

:::: {.columns}

::: {.column width="20%"}

```{r join-intervals-2, echo=FALSE}
pprint(df_interval, 10, font_size=15)
```

:::

::: {.column width="25%"}

```{r join-intervals-3, echo=FALSE}
pprint(weeks, font_size=15)
```

:::

::: {.column width="50%"}

```{r join-intervals-4, echo=TRUE}
joined <- df_interval |>
  inner_join(weeks, 
             by=join_by(time >= week_start, time < week_end))
```

```{r join-intervals-5}
pprint(joined, 10, font_size=15)
```

:::

::::
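
  - The **closest value** case isn't shown above; a minimal sketch using `closest()` inside `join_by()` (available in `dplyr` 1.1.0 onwards), on invented readings and calibration times:

```r
library(dplyr)

# Toy data, made up for illustration
readings <- data.frame(time = as.Date(c("2020-01-02", "2020-01-10")),
                       reading = c(10, 20))
calibrations <- data.frame(cal_time = as.Date(c("2020-01-01", "2020-01-05", "2020-01-09")),
                           offset = c(0.1, 0.2, 0.3))

# For each reading, keep only the most recent calibration at or before it
matched <- readings |>
  left_join(calibrations, by = join_by(closest(time >= cal_time)))

matched$offset  # 0.1 0.3
```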

## Benchmark {.smaller}

  - On only 366 rows with 5 groups the interval join is 10x as fast as nested `ifelse`, will scale better, and is more understandable

```{r join-intervals-6, echo=TRUE}
f_intervaljoin <- function() {
  df_interval |>
    inner_join(weeks, by=join_by(time >= week_start, time < week_end))
}

f_ifelse <- function() {
  df_interval |>
    mutate(week_group = ifelse(time >= as_date("2020-02-14") & time < as_date("2020-02-21"),
                               'a',
                               ifelse(time >= as_date("2020-03-17") & time < as_date("2020-03-24"),
                                      'b',
                                      ifelse(time >= as_date("2020-05-08") & time < as_date("2020-05-15"),
                                             'c',
                                             ifelse(time >= as_date("2020-09-20") & time < as_date("2020-09-27"),
                                                    'd',
                                                    ifelse(time >= as_date("2020-11-13") & time < as_date("2020-11-20"),
                                                           'e', 
                                                           NA)))))) |>
    filter(!is.na(week_group))
}

res <- microbenchmark(f_intervaljoin(), f_ifelse(), times=10)
summary(res)[c("expr", "median")]
```

# Different backends

## Example dataset {.smaller}

  - What if we're using fast functions but still experiencing slow performance due to dataset's *size*?
  - Example dataset: Companies House data containing 5 million rows ([440MB archive download](http://download.companieshouse.gov.uk/en_output.html), extracts to 2.4GB) of all companies incorporated in the UK since 1856
  - Using first million rows as an example
  
```{r company-house-1, echo=TRUE}
df <- read_csv("BasicCompanyDataAsOneFile-2023-05-01.csv", n_max=1e6, show_col_types=FALSE)
df$IncorporationDate <- as_date(df$IncorporationDate, format="%d/%m/%Y")
dim(df)
```

```{r company-house-2, echo=TRUE}
df |> 
  select(CompanyName, RegAddress.PostTown, IncorporationDate, SICCode.SicText_1) |>
  head()
```

## Question 1: How many companies have the same name?

  - Will use several basic research questions to have some 'real-world' analysis code to benchmark
  - How many companies have the same name?

```{r company-house-3, echo=T}
df |> 
  count(CompanyName) |> 
  filter(n > 1) |>
  nrow()
```

## Question 2: What York postcode has the most businesses?

  - Want to find the 5 postcodes with most businesses being created in York
  - Need to do some string manipulation to extract the first part of the `YOXX YYY` postcode format

```{r company-house-4, echo=TRUE}
df |> 
  filter(RegAddress.PostTown == 'YORK') |> 
  mutate(postcode = word(RegAddress.PostCode, 1, sep=" ")) |>
  count(postcode) |>
  arrange(desc(n)) |>
  head(5)
```

## Question 3: Classifications {.smaller}

  - Companies can be assigned up to 4 classifications from a list of 1,042 options
  - Do classifications tend to cluster together? I.e. is the average number of classifications a company has related to the first classification?
  - Slightly tenuous example but wanted to demonstrate pivoting + joining!
  - Only want to look at classifications that are used by at least 10 companies (`inner_join` to filter)
  - Multiple classifications are stored in 4 **wide columns** that are NA when unused - easier to count the number of non-null column entries in **long** format

```{r company-house-5}
df |> 
  select(CompanyName, SICCode.SicText_1, SICCode.SicText_2, SICCode.SicText_3, SICCode.SicText_4) |> 
  head()
```

## Question 3: Classifications (code) {.smaller}

```{r company-house-8, echo=TRUE}
# 755 rows containing the SIC codes that at least 10 companies have
# Only 1 column, SICCode.SicText_1
sic_10_companies <- df |> 
                count(SICCode.SicText_1) |>
                filter(n >= 10) |>
                select(SICCode.SicText_1)

df |>
  # Could do a filter to restrict to these SIC codes, but it's actually quicker to use an inner join
  inner_join(sic_10_companies, by="SICCode.SicText_1") |>
  select(CompanyNumber, SICCode.SicText_1, SICCode.SicText_2, SICCode.SicText_3, SICCode.SicText_4) |> 
  mutate(first_classification = SICCode.SicText_1) |>
  # Pivoting to make it easier to count how many non-NULL classifications each company has
  pivot_longer(c(SICCode.SicText_1, SICCode.SicText_2, SICCode.SicText_3, SICCode.SicText_4)) |>
  filter(!is.na(value)) |>
  # Count how many classifications each company has
  count(CompanyNumber, first_classification) |>
  # Calculate the average number per the first classification
  group_by(first_classification) |>
  summarise(mean_classifications = mean(n, na.rm=T)) |>
  arrange(desc(mean_classifications)) |>
  head()
```
  
# data.table

## Introduction {.smaller}

  - `data.table` is an alternative to data.frame/tibble that is optimised for speed and low memory usage
  - The trade-off is that its API is a bit/lot less user friendly

```{r data-table-1, echo=TRUE, message=FALSE}
library(data.table)
dt <- fread("BasicCompanyDataAsOneFile-2023-05-01.csv", nrows=1e6)         # fread is the equivalent of read.csv
dt[, IncorporationDate := as_date(IncorporationDate, format="%d/%m/%Y") ]  # Overwrites the column by *reference* (no copy of dt)
dim(dt)
```

```{r data-table-2, echo=TRUE}
# Display rows 1-5 and the specified columns
dt[1:5, .(CompanyName, RegAddress.PostTown, IncorporationDate, SICCode.SicText_1)]
```

## Counting number of companies with the same name

  - Generally, `dt[i, j, by]` means: for data table `dt`, filter on rows `i`, create and/or select columns `j`, and group by `by`
  - `data.table` operations don't use the pipe (`|>` or `%>%`), so you can either chain together `[]` or create intermediate variables
  - A `data.table` also has `data.frame` as a class so standard functions work on it, they just won't benefit from the speed-up
  - `.N` is the number of rows in each group, the equivalent of `count` / `n()`

```{r data-table-3, echo=T}
nrow( dt[ , .N, by=.(CompanyName) ][ N > 1 ] )
```
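
  - The three slots are easier to see on a toy table (values invented for illustration):

```r
library(data.table)

# Toy table, made up for illustration
toy <- data.table(group = c("a", "a", "b", "b", "b"),
                  value = c(1, 5, 2, 4, 6))

# i: keep rows with value > 1
# j: compute the row count and the mean
# by: do so separately for each group
res <- toy[value > 1, .(n = .N, mean_value = mean(value)), by = group]
res
# group a: n = 1, mean_value = 5
# group b: n = 3, mean_value = 4
```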

## York Postcodes with most business

  - In this example it's easier to create an intermediate variable than use a one-liner
  - `.SD` ("Subset of Data") is the current subset of the table, containing the columns listed in `.SDcols` (all by default), so `head(.SD, 5)` returns its first 5 rows

```{r data-table-4, echo=TRUE}
postcodes <- dt[ RegAddress.PostTown == 'YORK', .(postcode = word(RegAddress.PostCode, 1))][, .N, by=postcode]
postcodes[order(-postcodes$N), head(.SD, 5)]

# Alternative one-liner
#setorder(dt[ RegAddress.PostTown == 'YORK', .(postcode = word(RegAddress.PostCode, 1))][, .N, by=postcode], -N)[, head(.SD, 5)]
```

## Number of classifications {.smaller}

  - Joins are less intuitive. `x[y]` is equal to `left_join(y, x)`, **NOT** `inner_join(x, y)`
  - `melt` is equivalent to `pivot_longer` and IMO less intuitive
  - Intermediate variables everywhere!

```{r data-table-5, echo=TRUE}
sic_10_companies_dt <- dt[, .N, by=.(SICCode.SicText_1)][ N >= 10, .(SICCode.SicText_1) ]
dt_companies_wide <- dt[ sic_10_companies_dt,  # This is a join!
                         .(CompanyNumber, 
                           first_classification = SICCode.SicText_1,
                           SICCode.SicText_1,
                           SICCode.SicText_2,
                           SICCode.SicText_3,
                           SICCode.SicText_4),
                          on=.(SICCode.SicText_1)]
dt_companies_long <- melt(dt_companies_wide, id.vars=c('CompanyNumber', 'first_classification'))
dt_companies_mean <- dt_companies_long[ value != '',  # Removes the rows from unused (empty) SIC columns
                                        .N, 
                                        by=.(CompanyNumber, first_classification)][, 
                                                                                   .(mean_classifications = mean(N, na.rm=T)), 
                                                                                   by=.(first_classification)]
head(dt_companies_mean[ order(mean_classifications, decreasing = TRUE)])
```
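
  - The `x[y]` join semantics on two toy tables, as a sketch (tables invented for illustration; `nomatch = NULL` is the `data.table` way to get inner-join behaviour):

```r
library(data.table)

# Toy tables, made up for illustration
x <- data.table(group = c("a", "b", "c"), val_x = 1:3)
y <- data.table(group = c("b", "c", "d"), val_y = 4:6)

# x[y]: one row per row of y; rows of y without a match get NA in x's columns
left_like <- x[y, on = "group"]
left_like
# groups b, c, d; val_x is NA for d

# Dropping unmatched rows gives the inner join
inner_like <- x[y, on = "group", nomatch = NULL]
inner_like
# groups b and c only
```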

## Speed comparison with tidyverse

```{r data-table-6}
f_read_tidyverse <- function() {
  read_csv("BasicCompanyDataAsOneFile-2023-05-01.csv", n_max=1e6, show_col_types=FALSE)
}
f_read_datatable <- function() {
  fread("BasicCompanyDataAsOneFile-2023-05-01.csv", nrows=1e6)
}

f_count_companies_tidyverse <- function() {
  df |> 
    count(CompanyName) |> 
    filter(n > 1) |>
    nrow()
}

f_count_companies_datatable <- function() {
  nrow( dt[ , .N, by=.(CompanyName) ][ N > 1 ] )
}

f_postcode_tidyverse <- function() {
  df |> 
    filter(RegAddress.PostTown == 'YORK') |> 
    mutate(postcode = word(RegAddress.PostCode, 1)) |>
    count(postcode) |>
    arrange(desc(n)) |>
    head(5)
}

f_postcode_datatable <- function() {
  setorder(dt[ RegAddress.PostTown == 'YORK', .(postcode = word(RegAddress.PostCode, 1))][, .N, by=postcode], -N)[, head(.SD, 5)]
}

f_postcode_datatable_2 <- function() {
  postcodes <- dt[ RegAddress.PostTown == 'YORK', .(postcode = word(RegAddress.PostCode, 1))][, .N, by=postcode]
  postcodes[order(-postcodes$N), head(.SD, 5)]
}

f_sic_tidyverse <- function() {
  sic_10_companies <- df |> 
                  count(SICCode.SicText_1) |>
                  filter(n >= 10) |>
                  select(SICCode.SicText_1)
  
  df |>
    select(CompanyNumber, SICCode.SicText_1, SICCode.SicText_2, SICCode.SicText_3, SICCode.SicText_4) |> 
    inner_join(sic_10_companies, by="SICCode.SicText_1") |>
    mutate(first_classification = SICCode.SicText_1) |>
    pivot_longer(c(SICCode.SicText_1, SICCode.SicText_2, SICCode.SicText_3, SICCode.SicText_4)) |>
    filter(!is.na(value)) |>
    count(CompanyNumber, first_classification) |>
    group_by(first_classification) |>
    summarise(mean_classifications = mean(n, na.rm=T)) |>
    arrange(desc(mean_classifications))
}

f_sic_datatable <- function() {
  sic_10_companies_dt <- dt[, .N, by=.(SICCode.SicText_1)][ N >= 10, .(SICCode.SicText_1) ]
  dt_companies_wide <- dt[ sic_10_companies_dt,
                           .(CompanyNumber, 
                             first_classification = SICCode.SicText_1,
                             SICCode.SicText_1,
                             SICCode.SicText_2,
                             SICCode.SicText_3,
                             SICCode.SicText_4),
                            on=.(SICCode.SicText_1)]
  dt_companies_long <- melt(dt_companies_wide, id.vars=c('CompanyNumber', 'first_classification'))
  dt_companies_mean <- dt_companies_long[ value != '', .N, by=.(CompanyNumber, first_classification)][, .(mean_classifications = mean(N, na.rm=T)), by=.(first_classification)]
  dt_companies_mean[ order(mean_classifications, decreasing = TRUE)]
}
```

```{r data-table-comparison-benchmark}
res_read <- microbenchmark(f_read_tidyverse(), f_read_datatable(), times=1)
res_count <- microbenchmark(f_count_companies_tidyverse(), f_count_companies_datatable(), times=1)
res_postcode <- microbenchmark(f_postcode_tidyverse(), f_postcode_datatable(), times=3)
res_sic <- microbenchmark(f_sic_tidyverse(), f_sic_datatable(), times=3)
```

```{r data-table-comparison-results}
results <- list(
  "ReadingCSV"=res_read,
  "CountCompanies"=res_count,
  "Postcode"=res_postcode,
  "Categories"=res_sic
) |>
  map(function(x) tibble(expr=x$expr, time=x$time)) |>
  list_rbind(names_to="benchmark") |>
  mutate(library = gsub(".+_.+_", "", expr),
         library = gsub("\\(\\)", "", library)) |>
  group_by(benchmark, library) |>
  summarise(time = median(time)) |>
  ungroup() |>
  mutate(benchmark = factor(benchmark, levels=c("ReadingCSV", "CountCompanies", "Postcode", "Categories"),
                            labels=c("Reading CSV", "Duplicate companies", "Postcodes", "Classifications")),
         time = time / 1e9) 

stats <- results |> 
  pivot_wider(names_from="library", values_from="time") |>
  mutate(reference = tidyverse) |>
  pivot_longer(c(datatable, tidyverse), names_to="library") |>
  mutate(speedup = reference / value,
         speedup_label = sprintf("%.1fx", speedup)) |>
  group_by(benchmark) |>
  mutate(ypos = value) |>
  ungroup()

results |>
  ggplot(aes(x=library, y=time, fill=library)) +
    geom_col() +
    facet_wrap(~benchmark, scales="free_y") +
    theme_minimal() +
    scale_fill_brewer("Library", palette="Dark2") +
    guides(fill="none") +
    labs(y="Time (s)", x="") +
    geom_text(aes(label=speedup_label, y=ypos), data=stats) +
    theme(
      axis.text = element_text(size=16),
      legend.position = "bottom",
      axis.title = element_text(size=18),
      legend.text = element_text(size=16),
      legend.title = element_text(size=18),
      strip.text = element_text(size=18)
    )
```

# `tidytable` and `dtplyr`

## `tidytable`: introduction {.smaller}

:::: {.columns}

::: {.column width="40%"}

  - `tidytable` is a drop-in replacement for common tidyverse functions that under the hood work on a `data.table` object
  - So (in theory!) you get the speed of `data.table` but the user friendly API of the `tidyverse`
  - Just load the library then all subsequent calls to `mutate`, `inner_join`, `count`, `select`, `filter` etc... will use the `tidytable` versions that work on a `data.table`
  - **Beware**: not all functions have been ported over, and loading the library masks the `dplyr`, `tidyr`, `purrr` functions of the same name
  - There's a lag between changes to `tidyverse` being reflected in `tidytable`
  
:::
  
::: {.column width="60%"}

```{.r}
library(tidytable)
# Here we explicitly create tidytable from a regular data.frame
# But passing a regular data.frame or data.table into any tidytable function
# will implicitly change it to be a tidytable object
dtt <- as_tidytable(df)

dtt |> 
    count(SICCode.SicText_1) |>
    filter(n >= 10) |>
    select(SICCode.SicText_1) 
```

```{r, echo=FALSE}
# Dodgy way of showing output from running code without having to load tidytable and wreck namespace
dtt <- tidytable::as_tidytable(df)

dtt |> 
    tidytable::count(SICCode.SicText_1) |>
    tidytable::filter(n >= 10) |>
    tidytable::select(SICCode.SicText_1) 
```

```{r tidytable-1-hidden, echo=FALSE, fig.pos=""}
dtt <- tidytable::as_tidytable(df)

# NB: In interactive work would use library(tidytable)
# Being explicit here to not ruin namespace
f_count_companies_tidytable <- function() {
  dtt |> 
    tidytable::count(CompanyName) |> 
    tidytable::filter(n > 1) |>
    nrow()
}

f_postcode_tidytable <- function() {
  dtt |> 
    tidytable::filter(RegAddress.PostTown == 'YORK') |> 
    tidytable::mutate(postcode = word(RegAddress.PostCode, 1)) |>
    tidytable::count(postcode) |>
    tidytable::arrange(desc(n)) |>
    head(5)
}

f_sic_tidytable <- function() {
  sic_10_companies <- dtt |> 
                  tidytable::count(SICCode.SicText_1) |>
                  tidytable::filter(n >= 10) |>
                  tidytable::select(SICCode.SicText_1)
  
  dtt |>
    tidytable::select(CompanyNumber, SICCode.SicText_1, SICCode.SicText_2, SICCode.SicText_3, SICCode.SicText_4) |> 
    tidytable::inner_join(sic_10_companies, by="SICCode.SicText_1") |>
    tidytable::mutate(first_classification = SICCode.SicText_1) |>
    tidytable::pivot_longer(c(SICCode.SicText_1, SICCode.SicText_2, SICCode.SicText_3, SICCode.SicText_4)) |>
    tidytable::filter(!is.na(value)) |>
    tidytable::count(CompanyNumber, first_classification) |>
    tidytable::group_by(first_classification) |>
    tidytable::summarise(mean_classifications = mean(n, na.rm=T)) |>
    tidytable::arrange(desc(mean_classifications))
}
```

:::

::::
  
## `dtplyr`: introduction {.smaller}

:::: {.columns}

::: {.column width="40%"}

  - An alternative `data.table` wrapper is `dtplyr` (developed by RStudio team)
  - Works differently to `tidytable`: it sequentially builds up the equivalent `data.table` query, but only executes the code when you **explicitly** request it (using `collect()` or `as.data.frame/table()`)
  - Loading the package **doesn't** affect your environment
  - Has less coverage than `tidytable`
  
:::

::: {.column width="60%"}
  
```{r dtplyr-1, echo=TRUE}
library(dtplyr)

# dtplyr operates on `lazy data.tables` which are only created by this function
dtp <- lazy_dt(df)

dtp |> 
    count(SICCode.SicText_1) |>
    filter(n >= 10) |>
    select(SICCode.SicText_1) 
```

:::

::::

## `dtplyr`: usage {.smaller}

:::: {.columns}

::: {.column width="45%"}

  - Can view the generated `data.table` query (subtly different to the one I manually wrote)

```{r dtplyr-2, echo=TRUE}
dtp |> 
    count(SICCode.SicText_1) |>
    filter(n >= 10) |>
    select(SICCode.SicText_1) |>
    show_query()
```

:::

::: {.column width="45%"}

  - Run `collect()` to execute it and return a `tibble`

```{r dtplyr-3, echo=TRUE}
dtp |> 
    count(SICCode.SicText_1) |>
    filter(n >= 10) |>
    select(SICCode.SicText_1) |>
    collect() |> 
    head()
```

:::

::::

## `dtplyr`: chaining queries {.smaller}

  - `dtplyr` queries that haven't yet been executed with `collect()` can be used in joins

```{r dtplyr-4, echo=TRUE}
# NB: this returns a datatable QUERY, not a dataset itself
sic_10_companies_dtp <- dtp |> 
    count(SICCode.SicText_1) |>
    filter(n >= 10) |>
    select(SICCode.SicText_1) 

# Can join that query into the middle of another query to return another query
results_dtp <- dtp |>
  inner_join(sic_10_companies_dtp, by="SICCode.SicText_1") |>
  select(CompanyNumber, SICCode.SicText_1, SICCode.SicText_2, SICCode.SicText_3, SICCode.SicText_4) |> 
  mutate(first_classification = SICCode.SicText_1) |>
  pivot_longer(c(SICCode.SicText_1, SICCode.SicText_2, SICCode.SicText_3, SICCode.SicText_4)) |>
  filter(!is.na(value)) |>
  count(CompanyNumber, first_classification) |>
  group_by(first_classification) |>
  summarise(mean_classifications = mean(n, na.rm=T)) |>
  arrange(desc(mean_classifications))

# Finally execute the full query
results_dtp |>
  collect() |>
  head()
```


## Benchmark

```{r data-table-all-1, echo=FALSE}
f_count_companies_dtplyr <- function() {
  dtp |> 
    count(CompanyName) |> 
    filter(n > 1) |>
    collect() |>
    nrow()
}

f_postcode_dtplyr <- function() {
  dtp |> 
    filter(RegAddress.PostTown == 'YORK') |> 
    mutate(postcode = word(RegAddress.PostCode, 1)) |>
    count(postcode) |>
    arrange(desc(n)) |>
    head(5) |>
    collect()
}

f_sic_dtplyr <- function() {
  sic_10_companies_dtp <- dtp |> 
                  count(SICCode.SicText_1) |>
                  filter(n >= 10) |>
                  select(SICCode.SicText_1)
  
  dtp |>
    select(CompanyNumber, SICCode.SicText_1, SICCode.SicText_2, SICCode.SicText_3, SICCode.SicText_4) |> 
    inner_join(sic_10_companies_dtp, by="SICCode.SicText_1") |>
    mutate(first_classification = SICCode.SicText_1) |>
    pivot_longer(c(SICCode.SicText_1, SICCode.SicText_2, SICCode.SicText_3, SICCode.SicText_4)) |>
    filter(!is.na(value)) |>
    count(CompanyNumber, first_classification) |>
    group_by(first_classification) |>
    summarise(mean_classifications = mean(n, na.rm=T)) |>
    arrange(desc(mean_classifications)) |>
    collect()
}

```

```{r data-table-all-2}
res_count_2 <- microbenchmark(
  f_count_companies_tidyverse(), 
  f_count_companies_datatable(),
  f_count_companies_tidytable(),
  f_count_companies_dtplyr(),
  times=3)
res_postcode_2 <- microbenchmark(
  f_postcode_tidyverse(),
  f_postcode_datatable(),
  f_postcode_tidytable(),
  f_postcode_dtplyr(), times=3)
res_sic_2 <- microbenchmark(
  f_sic_tidyverse(),
  f_sic_datatable(),
  f_sic_tidytable(),
  f_sic_dtplyr(), times=3)
```

```{r data-table-all-plot}
results_2 <- list(
  "CountCompanies"=res_count_2,
  "Postcode"=res_postcode_2,
  "Categories"=res_sic_2
) |>
  map(function(x) tibble(expr=x$expr, time=x$time)) |>
  list_rbind(names_to="benchmark") |>
  mutate(library = gsub(".+_.+_", "", expr),
         library = gsub("\\(\\)", "", library)) |>
  group_by(benchmark, library) |>
  summarise(time = median(time)) |>
  ungroup() |>
  mutate(benchmark = factor(benchmark, levels=c("ReadingCSV", "CountCompanies", "Postcode", "Categories"),
                            labels=c("Reading CSV", "Duplicate companies", "Postcodes", "Classifications")),
         time = time / 1e9) 

stats_2 <- results_2 |> 
  pivot_wider(names_from="library", values_from="time") |>
  mutate(reference = tidyverse) |>
  pivot_longer(c(datatable, tidyverse, tidytable, dtplyr), names_to="library") |>
  mutate(speedup = reference / value,
         speedup_label = sprintf("%.1fx", speedup)) |>
  group_by(benchmark) |>
  mutate(ypos = value) |>
  ungroup()

results_2 |>
  ggplot(aes(x=library, y=time, fill=library)) +
    geom_col() +
    facet_wrap(~benchmark, scales="free_y") +
    theme_minimal() +
    scale_fill_brewer("Library", palette="Dark2") +
    guides(fill="none") +
    labs(y="Time (s)", x="") +
    geom_text(aes(label=speedup_label, y=ypos), data=stats_2) +
    theme(
      axis.text.y = element_text(size=16),
      axis.text.x = element_text(size=14, angle=45, hjust=1, vjust=1),
      legend.position = "bottom",
      axis.title = element_text(size=18),
      legend.text = element_text(size=16),
      legend.title = element_text(size=18),
      strip.text = element_text(size=18))
```

# Embedded databases

## Introduction {.smaller}

  - All the options so far require reading the full dataset into memory, which isn't viable if we have **larger than memory data**
  - Embedded relational databases are stored on disk and only read into memory as needed

. . .

  - Will look at 2 variants:
    - `SQLite` (designed for 'traditional' DB applications)
    - `duckdb` (optimised for analysis)
  - They are queried with SQL (Structured Query Language, a programming language in its own right), but fortunately in R we can keep using our `tidyverse` functions, just as with `dtplyr`, rather than learn a new language

## Interfacing with SQLite in R

  - Connect to the DB using `dbConnect()` from `library(DBI)`
  - DBs are organised into tables (can think of a table as a CSV file)
  - `dbWriteTable()` will write a dataframe to the DB
  
```{.r}
library(DBI)  # General database library
library(RSQLite)
# The first argument is the database driver, the second the database file
# If data.sql doesn't exist, it will be created
con_sql <- dbConnect(SQLite(), "data.sql")
dbWriteTable(con_sql, "data_1e6", df)
```
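
The same steps can be sketched end to end on a throwaway database. This is a minimal example under stated assumptions: it uses an in-memory database (`":memory:"`) and a small made-up data frame (`toy` and its columns are illustrative, not the dataset used in this talk), then lists the tables and queries one with raw SQL.

```r
library(DBI)
library(RSQLite)

# ":memory:" creates a temporary in-memory database rather than a file on disk
con <- dbConnect(SQLite(), ":memory:")

# Hypothetical toy data standing in for the real dataset
toy <- data.frame(town = c("YORK", "YORK", "LEEDS"), staff = c(3L, 5L, 2L))
dbWriteTable(con, "toy", toy)

# Names of the tables currently stored in the database
tables <- dbListTables(con)

# Raw SQL can be sent directly; the result comes back as a data frame
york <- dbGetQuery(con, "SELECT town, staff FROM toy WHERE town = 'YORK'")

dbDisconnect(con)
```
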

```{r sqlite-1, echo=FALSE}
# Again, hacky way of not running the dbWriteTable each time document is compiled
library(DBI)  # General database library
library(RSQLite)
# The first argument is the database driver
# If data.sql doesn't exist, it will be created
con_sql <- dbConnect(SQLite(), "data.sql")
```

## SQLite usage {.smaller}

:::: {.columns}

::: {.column width="45%"}

  - Can view the generated SQL query with code *identical* to the `dtplyr` version, except that the source data comes from `tbl()`

```{r sqlite-2, echo=TRUE}
# tbl requires a connection and a table name to read data from
tbl(con_sql, "data_1e6") |> 
  filter(RegAddress.PostTown == 'YORK') |> 
  mutate(postcode = word(RegAddress.PostCode, 1)) |>
  count(postcode) |>
  arrange(desc(n)) |>
  head(5) |>
  show_query()
```

:::

::: {.column width="45%"}

  - Running the query **errors** because the developers haven't translated `word` (from `stringr`) into SQL yet
  - Solution: use the similar function `substr` from `base`, which has been translated (an approximation: it takes the first 4 characters rather than the first word)
  - This is more likely to happen the more niche a function is

```{r sqlite-3, echo=TRUE}
tbl(con_sql, "data_1e6") |> 
  filter(RegAddress.PostTown == 'YORK') |> 
  mutate(postcode = substr(RegAddress.PostCode, 1, 4)) |>
  count(postcode) |>
  arrange(desc(n)) |>
  head(5) |>
  collect()
```

:::

::::
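
When a function has no SQL translation, one possible workaround (an alternative to the `substr` swap, sketched here on made-up data) is to run the translatable steps in the database, `collect()` the reduced result, and apply the R-only function locally. The table `toy` and its contents are assumptions for illustration, and this only pays off when the collected result is small enough to fit in memory.

```r
library(DBI)
library(RSQLite)
library(dplyr)
library(stringr)

con <- dbConnect(SQLite(), ":memory:")

# Hypothetical toy data using the same column names as the slides
toy <- data.frame(
  RegAddress.PostTown = c("YORK", "YORK", "LEEDS"),
  RegAddress.PostCode = c("YO10 5DD", "YO1 7HH", "LS1 4AP")
)
dbWriteTable(con, "toy", toy)

york_postcodes <- tbl(con, "toy") |>
  filter(RegAddress.PostTown == "YORK") |>            # translated to SQL, runs in the DB
  select(RegAddress.PostCode) |>
  collect() |>                                         # pull the reduced result into R
  mutate(postcode = word(RegAddress.PostCode, 1)) |>   # plain R from here on
  count(postcode)

dbDisconnect(con)
```
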

## `duckdb`: introduction

  - Designed for **fast analytics** (column-oriented) whereas SQLite is designed for **transactions** (row-oriented)
  - Very new: first released in 2019 (SQLite's first release was 2000)
  - Can read directly from CSV or has its own database files like SQLite
  - Use the same `dbConnect()` function but passing in a different driver
  
```{.r}
library(duckdb)
con_dd <- dbConnect(duckdb(), "data.duckdb")
dbWriteTable(con_dd, "data_1e6", df)
```
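
Reading directly from a CSV can be sketched as below. This is an illustrative example, not the code used for the benchmarks: it writes a small made-up CSV to a temporary file and queries it in place with DuckDB's `read_csv_auto()` SQL function, without first loading the data into a table.

```r
library(DBI)
library(duckdb)

# Hypothetical toy CSV written to a temporary file
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(town = c("YORK", "YORK", "LEEDS"), staff = c(3, 5, 2)),
  csv_path,
  row.names = FALSE
)

# Calling duckdb() without a file path gives an in-memory database
con <- dbConnect(duckdb())

# The CSV is scanned directly by the query; no dbWriteTable() step is needed
res <- dbGetQuery(con, sprintf(
  "SELECT town, SUM(staff) AS total FROM read_csv_auto('%s') GROUP BY town ORDER BY town",
  csv_path
))

dbDisconnect(con, shutdown = TRUE)
```
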
  
```{r duckdb-1, echo=FALSE}
# Again don't want to create table each time report is compiled
library(duckdb)
con_dd <- dbConnect(duckdb(), "data.duckdb")
```

## `duckdb`: usage {.smaller}

:::: {.columns}

::: {.column width="50%"}

  - `duckdb` also uses SQL, albeit with subtle differences in the available functions

```{r duckdb-2, echo=TRUE}
tbl(con_dd, "data_1e6") |> 
  filter(RegAddress.PostTown == 'YORK') |> 
  mutate(postcode = substr(RegAddress.PostCode, 1, 4)) |>
  count(postcode) |>
  arrange(desc(n)) |>
  head(50) |>
  show_query()
```

:::

::: {.column width="50%"}

  - `word` is also not ported to duckdb so again use the `substr` version
  - Code is again identical to both `SQLite` and `dtplyr`
  
```{r duckdb-3, echo=TRUE}
tbl(con_dd, "data_1e6") |> 
  filter(RegAddress.PostTown == 'YORK') |> 
  mutate(postcode = substr(RegAddress.PostCode, 1, 4)) |>
  count(postcode) |>
  arrange(desc(n)) |>
  head(50) |>
  collect()
```

:::

::::


## Overall benchmark {.smaller}

```{r db-comparison-1}
f_read_sqlite <- function() {
  dbConnect(SQLite(), "data.sql")
}

f_read_duckdb <- function() {
  con_dd <- dbConnect(duckdb(), "data.duckdb", read_only=TRUE)
}

f_count_sqlite <- function() {
  tbl(con_sql, "data_1e6") |> 
    count(CompanyName) |> 
    filter(n > 1) |>
    count() |>
    collect()
}

f_count_duckdb <- function() {
  tbl(con_dd, "data_1e6") |> 
    count(CompanyName) |> 
    filter(n > 1) |>
    count() |>
    collect()
}

f_postcode_sqlite <- function() {
  tbl(con_sql, "data_1e6") |> 
    filter(RegAddress.PostTown == 'YORK') |> 
    mutate(postcode = substr(RegAddress.PostCode, 1, 4)) |>
    count(postcode) |>
    arrange(desc(n)) |>
    head(5) |>
    collect()
}

f_postcode_duckdb <- function() {
  tbl(con_dd, "data_1e6") |> 
    filter(RegAddress.PostTown == 'YORK') |> 
    mutate(postcode = substr(RegAddress.PostCode, 1, 4)) |>
    count(postcode) |>
    arrange(desc(n)) |>
    head(5) |>
    collect()
}

f_sic_sqlite <- function() {
  sic_10_companies_sql <- tbl(con_sql, "data_1e6") |> 
                  count(SICCode.SicText_1) |>
                  filter(n >= 10) |>
                  select(SICCode.SicText_1)
  
  tbl(con_sql, "data_1e6") |>
    select(CompanyNumber, SICCode.SicText_1, SICCode.SicText_2, SICCode.SicText_3, SICCode.SicText_4) |> 
    inner_join(sic_10_companies_sql, by="SICCode.SicText_1") |>
    mutate(first_classification = SICCode.SicText_1) |>
    pivot_longer(c(SICCode.SicText_1, SICCode.SicText_2, SICCode.SicText_3, SICCode.SicText_4)) |>
    filter(!is.na(value)) |>
    count(CompanyNumber, first_classification) |>
    group_by(first_classification) |>
    summarise(mean_classifications = mean(n, na.rm=T)) |>
    arrange(desc(mean_classifications)) |>
    collect()
}

f_sic_duckdb <- function() {
  sic_10_companies_dd <- tbl(con_dd, "data_1e6") |> 
                  count(SICCode.SicText_1) |>
                  filter(n >= 10) |>
                  select(SICCode.SicText_1)
  
  tbl(con_dd, "data_1e6") |>
    select(CompanyNumber, SICCode.SicText_1, SICCode.SicText_2, SICCode.SicText_3, SICCode.SicText_4) |> 
    inner_join(sic_10_companies_dd, by="SICCode.SicText_1") |>
    mutate(first_classification = SICCode.SicText_1) |>
    pivot_longer(c(SICCode.SicText_1, SICCode.SicText_2, SICCode.SicText_3, SICCode.SicText_4)) |>
    filter(!is.na(value)) |>
    count(CompanyNumber, first_classification) |>
    group_by(first_classification) |>
    summarise(mean_classifications = mean(n, na.rm=T)) |>
    arrange(desc(mean_classifications)) |>
    collect()
}
```

```{r db-comparison-2}
tasks <- list(
  read = alist(
    tidyverse = f_read_tidyverse(),
    data.table = f_read_datatable(),
    tidytable = f_read_datatable(),
    dtplyr = f_read_datatable(),
    sqlite = f_read_sqlite(),
    duckdb = f_read_duckdb()
  ),
  count = alist(
    tidyverse = f_count_companies_tidyverse(),
    data.table = f_count_companies_datatable(),
    tidytable = f_count_companies_tidytable(),
    dtplyr = f_count_companies_dtplyr(),
    sqlite = f_count_sqlite(),
    duckdb = f_count_duckdb()
  ),
  postcode = alist(
    tidyverse = f_postcode_tidyverse(),
    data.table = f_postcode_datatable(),
    tidytable = f_postcode_tidytable(),
    dtplyr = f_postcode_dtplyr(),
    sqlite = f_postcode_sqlite(),
    duckdb = f_postcode_duckdb()
  ),
  sic = alist(
    tidyverse = f_sic_tidyverse(),
    data.table = f_sic_datatable(),
    tidytable = f_sic_tidytable(),
    dtplyr = f_sic_dtplyr(),
    sqlite = f_sic_sqlite(),
    duckdb = f_sic_duckdb()
  )
)
```

```{r db-comparison-3}
# Run tasks and save results
all_res <- map_dfr(tasks, function(task) {
  res <- microbenchmark(list=task, times=3)
  tibble(expr = res$expr, time_nano=res$time, time_milli=time_nano / 1e6)
}, .id="task")
```

:::: {.columns}

::: {.column width="60%"}

```{r}
backend_order <- c('tidyverse', 'sqlite', 'tidytable', 'dtplyr', 'data.table', 'duckdb')
```

```{r db-comparison-4}
all_res_plt <- all_res |>
  filter(task != 'read') |>
  mutate(
    task = factor(task,
                  levels=c("count", "postcode", "sic"),
                  labels=c("Duplicate companies", "Postcodes", "Classifications")),
    library = gsub(".+_.+_", "", expr),
    library = gsub("\\(\\)", "", library)) |>
  group_by(task, library) |>
  summarise(time = median(time_milli/1000, na.rm=T)) |>
  ungroup()

stats_all <- all_res_plt |> 
  pivot_wider(names_from="library", values_from="time") |>
  mutate(reference = tidyverse) |>
  pivot_longer(c(data.table, tidyverse, tidytable, dtplyr, sqlite, duckdb), names_to="library") |>
  mutate(speedup = reference / value,
         speedup_label = sprintf("%.1fx", speedup)) |>
  group_by(task) |>
  mutate(ypos = value) |>
  ungroup()

all_res_plt |>
  mutate(library = factor(library, levels=backend_order)) |>
  ggplot(aes(x=library, y=time, fill=library)) +
    geom_col() +
    geom_text(aes(label=speedup_label, y=ypos), size=5, data=stats_all) +
    facet_wrap(~task, scales="free_y") +
    scale_fill_brewer("Library", palette="Dark2") +
    guides(fill="none") +
    labs(x="", y="Time (s)") +
    theme_minimal() +
    theme(
      axis.text.y = element_text(size=16),
      axis.text.x = element_text(size=14, angle=45, hjust=1, vjust=1),
      legend.position = "bottom",
      axis.title = element_text(size=18),
      legend.text = element_text(size=16),
      legend.title = element_text(size=18),
      strip.text = element_text(size=18))
```

:::

::: {.column width="40%"}

  - `data.table` is the fastest! But it requires learning a new 'language'
  - All of the other options are still much faster than `tidyverse` and let you use the same code
  - `tidytable` is my personal sweet spot between ease of use and performance gains
  - `duckdb` and `sqlite` are also useful when data storage is a concern:
    - CSV: 2.4GB
    - SQLite: 1.9GB
    - Duckdb: 500MB

:::

::::

## Benchmark - all 5 million rows

```{r benchmark-5million}
fns <- list.files("benchmarks/results/", full.names = TRUE)
res_5mil <- map_dfr(setNames(fns, fns), readRDS, .id="fn") |>
  mutate(library = gsub("\\.rds", "", basename(fn)),
         library=gsub("datatable", "data\\.table", library)) |>
  pivot_longer(-c(fn, library), names_to="task", values_to="time") |>
  filter(task != 'read') |>
  mutate(
    task = factor(task,
                  levels=c("count", "postcode", "sic"),
                  labels=c("Duplicate companies", "Postcodes", "Classifications"))) |>
  select(-fn)

res_5mil <- res_5mil |>
  mutate(library = factor(library, levels=backend_order)) 

stats_5mil <- res_5mil |> 
  pivot_wider(names_from="library", values_from="time") |>
  mutate(reference = tidyverse) |>
  pivot_longer(c(data.table, tidyverse, tidytable, dtplyr, sqlite, duckdb), names_to="library") |>
  mutate(speedup = reference / value,
         speedup_label = sprintf("%.1fx", speedup)) |>
  group_by(task) |>
  mutate(ypos = value) |>
  ungroup()

res_5mil |>
  ggplot(aes(x=library, y=time, fill=library)) +
    geom_col() +
    geom_text(aes(label=speedup_label, y=ypos), size=5, data=stats_5mil) +
    facet_wrap(~task, scales="free_y") +
    scale_fill_brewer("Library", palette="Dark2") +
    guides(fill="none") +
    labs(x="", y="Time (s)") +
    theme_minimal() +
    theme(
      axis.text.y = element_text(size=16),
      axis.text.x = element_text(size=14, angle=45, hjust=1, vjust=1),
      legend.position = "bottom",
      axis.title = element_text(size=18),
      legend.text = element_text(size=16),
      legend.title = element_text(size=18),
      strip.text = element_text(size=18))
```

# Rcpp

## Introduction {.smaller}

  - Sometimes for loops are necessary:
    - No inbuilt vectorised solution
    - Recurrent algorithm such as stepping through time or space
    - Performance-critical code that needs more specialised data structures
  - `Rcpp`, which combines R and C++, to the rescue!
  
. . .

  - C++ is **compiled**, which makes it very fast, but it also takes more effort both to write programs in it and to interface them with R
  - `Rcpp` makes this process easy by providing:
    - A C++ library that contains similar data structures and functions to R
    - An R package that compiles C++ code and makes it easily accessible within R
  
## Basic example {.smaller}

  - Can use `Rcpp::cppFunction()` to write a C++ function as a string or `Rcpp::sourceCpp()` if it's in a separate file
  - Both methods do the same:
    - Compile C++ code
    - Create an R function that calls it
  - C++ differs from R in many ways; having to declare a type for every variable is probably the most notable

```{r rcpp-1, echo=TRUE}
library(Rcpp)
cppFunction("
double sumRcpp(NumericVector x) {
  // The function definition syntax is:
  // RETURN_TYPE functionName(INPUT_TYPE input, ...)
  int n = x.size();  // R objects have their own type (NumericVector) with useful attributes
  double total = 0;  // Need to instantiate variables before use
  for (int i = 0; i < n; i++) {  // C++ indexes start at 0
    total += x[i];
  }
  return total;      // Need to explicitly return values
}")
# The sumRcpp function is instantly available within R
sumRcpp(c(1, 2, 3))
```

```{r}
cppFunction("double sumRcpp8(IntegerVector x) {
  int n = x.size();  // IntegerVector input: doubles passed from R are coerced to integers
  double total = 0;  // Need to instantiate variables before use
  for (int i = 0; i < n; i++) {  // C++ indexes start at 0
    total += x[i];
  }
  return total;
}")
```

## Syntactic sugar - data structures {.smaller}

  - When calling an Rcpp function, the inputs are automatically converted from their R type into the specified C++ type
  - `NumericVector` is a special Rcpp data structure that represents a vector of floating-point numbers (doubles), so it will accept both `c(1.2, 2.4, 3.6)` and `c(1, 2, 3)`, but not `c('a', 'b', 'c')`
  - `IntegerVector` coerces floats into integers
  - `CharacterVector` also exists for strings, and `NumericMatrix` for 2D structures
  
. . .

  - C++ (like most languages but unlike R) differentiates between scalars and vectors
  - Standard C++ scalar data types can be used in Rcpp functions: `int`, `double`, `char` etc...
  - Can also use any other C++ data structure (e.g. STL or from libraries)
  - `wrap()` is an Rcpp function that converts back from C++ to R, useful when returning at the end of a function!

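The conversions described above can be sketched with two small functions. Both `asInts` and `cumsumStl` are hypothetical names written for illustration: the first shows doubles being coerced on the way into an `IntegerVector`, the second uses a standard C++ container (`std::vector`) internally and converts it back to an R vector with `wrap()`.

```r
library(Rcpp)

# IntegerVector input: doubles passed from R are coerced (truncated) to integers
cppFunction("
IntegerVector asInts(IntegerVector x) {
  return x;
}")

# A std::vector is filled inside the function, then wrap() converts it for R
cppFunction("
NumericVector cumsumStl(NumericVector x) {
  std::vector<double> out(x.size());
  double running = 0;
  for (int i = 0; i < x.size(); i++) {
    running += x[i];
    out[i] = running;
  }
  return wrap(out);
}")

ints <- asInts(c(1.2, 2.7))
running_total <- cumsumStl(c(1, 2, 3))
```
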
## Syntactic sugar - functions

  - There's no penalty to using for loops in C++ so they are very common
  - But to save typing boilerplate code, Rcpp provides 'syntactic sugar' functions that operate on the R-specific data types
  - Examples: `mean`, `log`, `exp`, `sin`, `any`, `all`
  - The for loop wasn't necessary!

```{r rcpp-4, echo=TRUE}
cppFunction("double sumRcppsugar(NumericVector x) {
  return sum(x);
}")
sumRcppsugar(c(1, 2, 3))
```
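
Sugar also covers vectorised arithmetic, so several steps can be combined without writing a loop. The sketch below is an illustrative example (the function name `standardise` is an assumption, not from the talk): it centres and scales a vector using the sugar functions `mean` and `sd`.

```r
library(Rcpp)

cppFunction("
NumericVector standardise(NumericVector x) {
  // mean() and sd() are Rcpp sugar functions operating on the whole vector
  double mu = mean(x);
  double sigma = sd(x);
  // Vector-scalar arithmetic is also sugar: no explicit loop is needed
  return (x - mu) / sigma;
}")

z <- standardise(c(1, 2, 3))
```
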

## `sum` benchmarks {.smaller}

:::: {.columns}

::: {.column width="40%"}

```{r echo=TRUE}
sumR <- function(x) {
  total <- 0
  for (i in seq_along(x)) {
    total <- total + x[i]
  }
  total
}
```

  - The Rcpp implementations are around 30x faster than the R version
  - The syntactic sugar version is the same speed as the for loop
  - The inbuilt `sum` is highly optimised
  
:::

::: {.column width="60%"}

```{r}
input <- rnorm(1e6)
autoplot(microbenchmark(sum(input), sumRcpp(input), sumRcppsugar(input), sumR(input))) +
  theme_minimal() +
  theme(
      axis.text = element_text(size=16),
      legend.position = "bottom",
      axis.title = element_text(size=18),
      legend.text = element_text(size=16),
      legend.title = element_text(size=18),
      strip.text = element_text(size=18)
  )
```

:::

::::

## Real world example: Kalman Filter {.smaller}

```{r rcpp-5}
n <- 1000
noisy_sine <- sin(seq(1, 50, length.out=n)) + rnorm(n, mean=0, sd=0.5)
```

:::: {.columns}

::: {.column width="50%"}

```{r rcpp-6, echo=T}
#| code-line-numbers: 8-15
kf_r <- function(y, m, Q=0.5, H=0.5) {
  n <- length(y)
  alpha <- array(NA, dim=c(n+1, m))
  P <- array(NA, dim=c(m, m, n+1))
  alpha[1, ] <- 0  # Initialise with zero mean and high variance
  P[, , 1] <- 1e3
  Z <- array(1, dim=c(n, m)) 
  for (i in 1:n) {
    P_updated <- P[, , i] + Q
    # Calculate kalman gain
    K <- P_updated %*% t(Z[i, ]) %*% solve(Z[i, ] %*% P_updated %*% t(Z[i, ]) + H)
    # Update state and covariance
    alpha[i+1, ] <- alpha[i, ] + K %*% (y[i] - Z[i, ] %*% alpha[i, ])
    P[, , i+1] <- (diag(m) - K %*% Z[i, ]) %*% P_updated
  }
  list(alpha=alpha, P=P)
}
```

:::

::: {.column width="50%"}

  - The Kalman Filter is an algorithm that estimates unobserved parameters in a noisy system
  - It is recursive: the estimate at time $t$ depends only on the estimate at time $t-1$ and the new observation, so it can't be vectorised and is a good candidate for Rcpp
  - The R implementation is a straightforward series of matrix operations
  
```{r rcpp-7}
kf_out <- kf_r(noisy_sine, 1, Q=0.01, H=0.25)
tibble(Raw = noisy_sine, Filtered=kf_out$alpha[2:1001]) |>
  mutate(id=row_number()) |>
  pivot_longer(-id) |>
  mutate(name = factor(name, levels=c("Raw", "Filtered"))) |>
  ggplot(aes(x=id, y=value, colour=name, alpha=name)) +
    geom_line() +
    scale_colour_manual("", values=c("grey", "steelblue")) +
    scale_alpha_manual("", values=c(0.8, 1)) +
    labs(x="", y="") +
    guides(alpha="none") +
    theme_minimal() +
    theme(
      axis.text = element_text(size=16),
      legend.position = "bottom",
      axis.title = element_text(size=18),
      legend.text = element_text(size=16),
      legend.title = element_text(size=18),
      strip.text = element_text(size=18)
    )
```

:::


::::
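
For the univariate case used here (`m = 1`, with every entry of `Z` equal to 1) the matrix operations collapse to scalar arithmetic, which makes the recursion easier to follow. `kf_scalar` below is a hypothetical helper written for illustration under that assumption; it is not part of the original code.

```r
kf_scalar <- function(y, Q = 0.5, H = 0.5) {
  n <- length(y)
  alpha <- numeric(n + 1)  # State estimates, starting from zero mean
  P <- numeric(n + 1)      # State variances
  P[1] <- 1e3              # High initial variance
  for (i in seq_len(n)) {
    P_updated <- P[i] + Q
    # Kalman gain: how much weight to give the new observation
    K <- P_updated / (P_updated + H)
    # Move the estimate towards the observation by a fraction K
    alpha[i + 1] <- alpha[i] + K * (y[i] - alpha[i])
    P[i + 1] <- (1 - K) * P_updated
  }
  list(alpha = alpha, P = P)
}

# A constant signal: the filtered estimate should settle on the true value
kf_const <- kf_scalar(rep(2, 200), Q = 0.01, H = 0.25)
```
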


## Kalman Filter - Rcpp implementation {.smaller}

:::: {.columns}

::: {.column width="60%"}

```{r rcpp-8, echo=T}
cppFunction("
NumericVector kf_rcpp(arma::vec y, int m, double Q=0.5, double H=0.5) {
  int n = y.n_rows;
  
  arma::mat alpha(n+1, m, arma::fill::none);
  arma::cube P(m, m, n+1, arma::fill::none);
  arma::mat Z(n, m, arma::fill::ones);
  
  // Initialise with zero mean and high variance
  alpha.row(0).fill(0);
  P.slice(0).diag().fill(1000);
  
  // Run filter
  arma::mat P_updated(m, m);
  arma::mat K(m, m);
  for (int i=0; i<n; i++) {
    P_updated = P.slice(i) + Q;
    // Calculate kalman gain:
    K = P_updated * Z.row(i).t() * arma::inv(Z.row(i) * P_updated * Z.row(i).t() + H);
    // Update state and covariance
    alpha.row(i+1) = alpha.row(i) + K * (y[i] - Z.row(i) * alpha.row(i));
    P.slice(i+1) = (arma::eye(m, m) - K * Z.row(i)) * P_updated;
  }
  
  return wrap(alpha);  // This is crucial, converts the Armadillo matrix into an R NumericVector
}", depends="RcppArmadillo")
```

:::

::: {.column width="40%"}

  - Rcpp implementation is very similar, using the `RcppArmadillo` library only for access to 3D arrays (`arma::cube`)
  - Can then call `kf_r()` or `kf_rcpp()` with identical arguments (note that `kf_rcpp()` returns only `alpha`, not `P`)

:::

::::


## Benchmark

  - ~80x quicker in Rcpp!
  - Have been able to go from hourly to minutely time resolution
  - Core library function so worth investing the development time

```{r rcpp-9}
autoplot(microbenchmark(kf_r(noisy_sine, 1), kf_rcpp(noisy_sine, 1))) +
  theme_minimal() +
  theme(
      axis.text = element_text(size=16),
      legend.position = "bottom",
      axis.title = element_text(size=18),
      legend.text = element_text(size=16),
      legend.title = element_text(size=18),
      strip.text = element_text(size=18)
  )
```


## Parallelisation / Viking {.smaller}

  - Iterative jobs that don't fit within tabular data can be run as **parallelised** for loops, e.g.:
    - Fitting models
    - Network downloads/uploads
    - File processing
  - 'Small' jobs can be run locally using `parallel::mclapply` (Linux/macOS, as it relies on forking), `doParallel` and `foreach` (Windows), or `furrr` (all OS, Tidyverse, combines `future` and `purrr`)

. . .

  - For larger jobs (both duration of each iteration and number of iterations), **Viking** is very useful with [array jobs](https://wiki.york.ac.uk/display/RCS/VK4%29+Job+script+configuration#VK4\)Jobscriptconfiguration-Arrayjobs)
  - Viking is also useful for freeing up your PC when running a computationally intensive library (e.g. Stan, INLA, Keras). Optimising these is application-specific

. . . 

  - `future.batchtools` offers the ability to automatically create and submit Slurm job scripts from within R, allowing you to easily switch between running sequentially, local multi-core parallelisation, and independent processes via Slurm array jobs. **UNTESTED**
  
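A local parallelised loop can be sketched with the base `parallel` package. This is a minimal example under stated assumptions: it uses a socket cluster (`makeCluster()` with `parLapply()`), which works on every OS, rather than the options listed above, and the built-in `mtcars` data with a made-up task of fitting one model per group.

```r
library(parallel)

# One data frame per cylinder count; each model fit is independent of the others
groups <- split(mtcars, mtcars$cyl)

# Start 2 local worker processes (socket cluster)
cl <- makeCluster(2)

# Same interface as lapply(), but the iterations are spread across the workers
fits <- parLapply(cl, groups, function(d) coef(lm(mpg ~ wt, data = d)))

# Always release the workers when finished
stopCluster(cl)
```
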
## Thoughts {.smaller}

  - I tend to use all of these strategies with varying frequency:
    - Using inbuilt vectorised functions - daily
    - `data.table` in some form - weekly
    - Parallel jobs (Viking or local) - monthly
    - `Rcpp` - couple of times a year
  - As well as speeding up interactive work, faster code is especially useful when developing **packages**
  - As a result of researching for this talk, I'm going to use `duckdb` in future for some large datasets

## Resources {.smaller}

  - Vectorisation: [Chapter in Advanced R](https://adv-r.hadley.nz/perf-improve.html#vectorise)
  - Joining: [Tidyverse docs](https://dplyr.tidyverse.org/reference/mutate-joins.html), [interactive visual join viewer](https://joins.spathon.com/) (nb: 'outer' joins are called 'full' joins in R)
  - `data.table`: [vignette](https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html), [syntax comparison with tidyverse](https://wetlandscapes.com/blog/a-comparison-of-r-dialects/#joining-data-full-join)
  - `tidytable` vs `dtplyr`: [benchmarking](https://markfairbanks.github.io/tidytable/articles/speed_comparisons.html) between `tidytable`, `data.table`, `dtplyr`, `tidyverse`, and `pandas` (nb: from `tidytable` author)
  - SQLite: [tutorial](https://www.sqlitetutorial.net/)
  - `duckdb`: [official docs](https://duckdb.org/docs/api/r)
  - `Rcpp`: [chapter in Advanced R](https://adv-r.hadley.nz/rcpp.html), [Rcpp book](https://link.springer.com/book/10.1007/978-1-4614-6868-4) (thorough), [Rcpp for everyone book](https://teuder.github.io/rcpp4everyone_en/) (accessible)
  - Parallelisation: [chapter in R Programming for Data Science ebook](https://bookdown.org/rdpeng/rprogdatascience/parallel-computation.html#building-a-socket-cluster)
  - Viking: [wiki](https://wiki.york.ac.uk/display/RCS/Viking+-+University+of+York+Research+Computing+Cluster)
  
# Misc

## Worked example - regex {visibility="uncounted"}

  - The `basename` and `dirname` solution was faster than `regex`

```{r, echo=TRUE}
example_4 <- function(paths) {
  gsub(".+\\/+([[:alnum:]]+)\\/([[:alnum:]]+\\.[[:alnum:]]+)$", "\\1_\\2", paths)
}
example_4(c("foo/bar/car/001.txt", "har/far/lar/002.txt"))
```

```{r}
res <- microbenchmark(
  example_1_vectorised(paths),
  sapply(paths, example_1),
  example_2(paths),
  example_3(paths),
  example_4(paths)
)

summary(res)[c("expr", "median")]
```

## Case when {visibility="uncounted"}

  - No speed difference between `ifelse` and `case_when`

```{r, echo=T}
f_casewhen <- function() {
  df_interval |>
    mutate(week_group = case_when(
      time >= as_date("2020-02-14") & time < as_date("2020-02-21") ~ 'a',
      time >= as_date("2020-03-17") & time < as_date("2020-03-24") ~ 'b',
      time >= as_date("2020-05-08") & time < as_date("2020-05-15") ~ 'c',
      time >= as_date("2020-09-20") & time < as_date("2020-09-27") ~ 'd',
      time >= as_date("2020-11-13") & time < as_date("2020-11-20") ~ 'e',
      .default = NA_character_
    )) |>
      filter(!is.na(week_group))
}

res <- microbenchmark(f_intervaljoin(), f_ifelse(), f_casewhen(), times=10)
summary(res)[c("expr", "median")]
```

```{r db-disconnect}
dbDisconnect(con_sql)
dbDisconnect(con_dd, shutdown=TRUE)
```
  
## Filter vs inner join speed {visibility="uncounted"}

  - When limiting analysis to the classifications with at least 10 companies, it was quicker to reduce the main dataset with an `inner_join` than with `filter`

```{r}
f_filter <- function() {
  df |>
    filter(SICCode.SicText_1 %in% sic_10_companies$SICCode.SicText_1)
}

f_inner_join <- function() {
  df |>
    inner_join(sic_10_companies, by="SICCode.SicText_1") 
}

res_filter <- microbenchmark(
  f_filter(),
  f_inner_join(),
  times=3
)
summary(res_filter)[c("expr", "median")]
```