analysis.Rmd

---
title: "Dynamics of resistance genes concentrations after antimicrobial treatment"
output:
  html_document:
    toc: true
    toc_depth: 4
editor_options: 
  chunk_output_type: console
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Required packages

We need the following packages:

```{r}
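required_packages <- c("readxl",
                       "purrr",
                       "dplyr",
                       "stringr",
                       "tidyr",
                       "AMR",
                       "RColorBrewer",
                       "vioplot",  # used further down through vioplot::vioplot()
                       "magrittr") # loaded last so that its extract() masks tidyr's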
```

Installing those that are not already installed on the system:

```{r}
for (package in required_packages) {
  if (!package %in% rownames(installed.packages())) install.packages(package)
}
```

Loading the required packages:

```{r message = FALSE}
devnull <- lapply(required_packages, require, character.only = TRUE)
```

## Loading and fixing the data

### Antimicrobial use

Date and type of antimicrobial use per farm:

```{r}
amu <- read_excel("AMU_KietStudy_Marc.xlsx")
```

It is a data frame that looks like this:

```{r}
amu
```

### Antimicrobial classes

Antimicrobial class of the antimicrobial(s) against which each of the resistance
genes is active:

```{r}
classes <- read_excel("FarmvsClass.xlsx")
```

It's a data frame that looks like this:

```{r}
classes
```

Note that each resistance gene can be active against antimicrobials of more than
one antimicrobial class.

### Resistance genes concentrations

The names of the farms:

```{r}
(farms <- paste0("K", c(formatC(6:9, width = 2, flag = "0"), 11:26)))
```

Let's start by looking at the resistance gene concentrations in chickens only
(maybe we'll look at the rat data later on):

```{r}
genes <- farms %>%
  map(read_excel, path = "KietAnalysisData.xlsx") %>% 
  map(filter, Group.1 %in% c("Chicken-Control", "Chicken-Treatment")) %>% 
  map(select, -Group.1, -TotalLog2Value) %>% 
  setNames(farms)
```

In the Excel file, the gene concentrations of each farm (with control and
treatment experiments) are in a separate tab. This structure is mirrored in the
`genes` object, a list of data frames, each containing the gene concentrations
for one farm. As an example, the data for the first farm look like this:

```{r}
genes[[1]]
```

#### Checking and fixing the consistency of column names

Let's check that the names of the columns are the same for all the farms:

```{r}
genes %>%
  map_df(names) %>% 
  apply(1, unique) %>% 
  .[map_int(., length) > 1]
```

There is a problem with the `FarmID` variable, which is called `K09` in one of
the farms. Let's fix this. As the names of the variables seem OK in the first
farm:

```{r}
(correct_names <- names(genes[[1]]))
```

Let's use them as variables names for all the farms:

```{r}
genes %<>% map(setNames, correct_names)
```
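As an illustration of the check and the fix above, here is a minimal base-R
sketch on made-up data. The `toy` list and its column names are hypothetical
and only stand in for the real `genes` object:

```r
# Hypothetical toy data: two "farms", the second with a wrongly named column
toy <- list(K06 = data.frame(FarmID = "K06", geneA = 1),
            K09 = data.frame(K09    = "K09", geneA = 2))

# One row per column position, one column per farm
names_by_position <- do.call(cbind, lapply(toy, names))

# TRUE where the farms disagree on the name of a column
inconsistent <- apply(names_by_position, 1, function(z) length(unique(z)) > 1)

# Reusing the names of the first data frame for all of them
toy_fixed <- lapply(toy, setNames, names(toy[[1]]))
```

Here only the first column position is flagged, and after the fix both data
frames share the names of the first one.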

#### Dealing with missing "Before" measurements

Note also that not all farms have a "Before" measurement in the control group:

```{r}
(tmp <- genes %>%
  map(~ .x %>% filter(SamplingDay == "Before") %>% select(Group)) %>% 
  .[map_int(., nrow) < 2] %>% 
  unlist())
```

For the farms where the "Before" measurement is missing in the control group,
let's just use the "Before" measurement of the treatment group. For that, we
need the following function where `x` is a data frame for one farm:

```{r}
control_before <- function(x) {
  y <- filter(x, SamplingDay == "Before")
  y$Group <- "Control"
  y$Sample_Name <- NA
  rbind(x, y)
}
```

Let's now use this function on all the farms that need to be fixed:

```{r}
# the names of the farms that need to be fixed:
farms_to_fix <- tmp %>% 
  names() %>% 
  str_remove("\\.Group$") # escaping the dot, which otherwise matches any character

# the fixed data for these farms:
fixed_farms <- genes %>% 
  extract(farms_to_fix) %>% 
  map(control_before)

# updating the original data with the fixed data:
genes[farms_to_fix] <- fixed_farms
```
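To make the effect of this fix concrete, here is a small sketch on a made-up
farm, written in base R rather than with `dplyr`. The `toy_farm` data frame and
its values are hypothetical:

```r
# Hypothetical farm with a "Before" measurement in the treatment group only
toy_farm <- data.frame(Group       = c("Treatment", "Treatment", "Control"),
                       Sample_Name = c("s1", "s2", "s3"),
                       SamplingDay = c("Before", "After", "After"),
                       geneA       = c(10, 20, 15))

# Copying the "Before" row and relabelling it as a control measurement
before <- toy_farm[toy_farm$SamplingDay == "Before", ]
before$Group <- "Control"
before$Sample_Name <- NA # the copied row does not correspond to a real sample
toy_farm_fixed <- rbind(toy_farm, before)
```

The control group of `toy_farm_fixed` now has a "Before" row that carries the
same concentration as the treatment group.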

## Preparing the data for analyses

### Computing sums of resistance genes

Here we compute aggregates of resistance genes: the sum of all the resistance
genes, and the sum of the resistance genes per class of antimicrobials against
which they are active. Indeed, we will perform the analyses in 3 different
ways: per resistance gene, for all resistance genes together, and per class of
antimicrobials against which the resistance genes are active.

Let's start by retrieving the names of resistance genes:

```{r}
(resistance_genes <- setdiff(names(genes[[1]]),
                             c("FarmID", "Group", "Sample_Name", "SamplingDay")))
```

As a reminder, `classes` looks like this:

```{r}
classes
```

Let's add a category `All` to the `Antimicrobial_Class` variable. This category
will simply correspond to all the resistance genes:

```{r}
classes %<>% 
  bind_rows(data.frame(EvaGreen_Name = resistance_genes,
                       Antimicrobial_Class = "All"))
```

The antimicrobial classes against which the resistance genes are active now
look like this:

```{r}
(classes_names <- unique(classes$Antimicrobial_Class))
```

Note that the `Other` and `All` categories are not actually antimicrobial
classes per se. The following function calculates the sum of the gene
concentrations for a given class `am_class` in the data frame `gene_farm` of a
given farm:

```{r}
sum_by_class <- function(am_class, gene_farm) {
  classes %>%
    filter(Antimicrobial_Class == am_class) %>% 
    pull(EvaGreen_Name) %>% 
    extract(gene_farm, .) %>% 
    rowSums()
}
```

The following function uses the one above and adds as many variables as
antimicrobial classes (`r length(classes_names)`) to the data frame of a given
farm:

```{r}
add_sums_by_class <- function(gene_farm) {
  gene_farm %>%
    map_dfc(classes_names, sum_by_class, .) %>% 
    setNames(classes_names) %>% 
    bind_cols(gene_farm, .)
}
```

Let's use this function to add the sums of the genes concentrations to the data
frames of all the farms:

```{r message  = FALSE}
genes %<>% map(add_sums_by_class)
```
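The sums by class can be illustrated with a small base-R sketch. The gene
names, classes and concentrations below are made up, and `toy_sum_by_class()`
is a hypothetical stand-in for `sum_by_class()`. Note how `geneB`, which
belongs to two classes, contributes to both sums:

```r
# Hypothetical gene-to-class table: geneB is active against two classes
toy_classes <- data.frame(
  EvaGreen_Name       = c("geneA", "geneB", "geneB", "geneC"),
  Antimicrobial_Class = c("Tetracycline", "Tetracycline", "MLSB", "MLSB"),
  stringsAsFactors    = FALSE)

# Hypothetical concentrations: one row per sample, one column per gene
toy_conc <- data.frame(geneA = c(1, 2), geneB = c(10, 20), geneC = c(100, 200))

# Sum, for each sample, of the genes that belong to class am_class
toy_sum_by_class <- function(am_class) {
  sel <- toy_classes$EvaGreen_Name[toy_classes$Antimicrobial_Class == am_class]
  rowSums(toy_conc[sel])
}

# One column per class
toy_sums <- sapply(unique(toy_classes$Antimicrobial_Class), toy_sum_by_class)
```

For the first sample this gives 1 + 10 = 11 for `Tetracycline` and
10 + 100 = 110 for `MLSB`.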

### Antimicrobial classes for AMU

As a reminder, the data on AMU look like this:

```{r}
amu
```

In order to work with AMU by antimicrobial class, we need to generate the
antimicrobial class of each antimicrobial in the `AAI` column. For that, we can
use the `ab_group()` function of the `AMR` package. However, for consistency
with the `classes` data frame, we need a hash table that provides the
correspondence between the spelling of the antimicrobial classes returned by
`AMR::ab_group()` and the spelling of the antimicrobial classes used in our
`classes` data frame:

```{r}
#          AMR package:                  classes data frame:
hash <- c("Aminoglycosides"           = "Aminoglycoside",
          "Amphenicols"               = "Phenicol",
          "Antifungals/antimycotics"  = "Antifungals/antimycotics",
          "Beta-lactams/penicillins"  = "Beta-Lactam",
          "Cephalosporins (3rd gen.)" = "Beta-Lactam",
          "Macrolides/lincosamides"   = "MLSB",
          "Other antibacterials"      = "Other",
          "Polymyxins"                = "Polymyxin",
          "Quinolones"                = "Quinolone",
          "Tetracyclines"             = "Tetracycline",
          "Trimethoprims"             = "Sulfonamide")
```
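As a reminder of how such a named vector works as a lookup table, here is a
minimal sketch with a hypothetical, shortened table (`toy_hash`). Indexing by
name returns the corresponding values, and a name that is absent from the
table returns `NA`, which is what makes unmatched entries easy to spot
afterwards:

```r
# Hypothetical, shortened lookup table: names are keys, elements are values
toy_hash <- c("Tetracyclines" = "Tetracycline",
              "Polymyxins"    = "Polymyxin")

# Keys to look up, the last one being absent from the table
toy_groups <- c("Polymyxins", "Tetracyclines", "Unknown group")

# Looking up several keys at once; unname() drops the keys from the result
toy_lookup <- unname(toy_hash[toy_groups])
toy_lookup
```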

Let's generate the correspondence data frame:

```{r}
(amu_classes <- amu %>% 
  select(AAI) %>% 
  unique() %>% 
  mutate(AMR_package = map_chr(AAI, ab_group),
         classes_df  = hash[AMR_package]))

```

Let's check for missing values (given the warning we got in the above call):

```{r}
filter(amu_classes, if_any(everything(), is.na))
```

Let's fix it manually:

```{r}
amu_classes[amu_classes$AAI == "bromhexine", "classes_df"] <- "Mucoactive agent"
```

Finally, we can use `amu_classes` to make another hash table:

```{r}
hash <- with(amu_classes, setNames(classes_df, AAI))
```

And we can use this new hash table to add the antimicrobial class to the `amu`
data frame:

```{r}
(amu %<>% mutate(class = hash[AAI]))
```

Here, the categories used in the `class` variable of the `amu` data frame are
consistent with the categories used in the `Antimicrobial_Class` of the
`classes` data frame.

## Antimicrobial use

Let's first define a few utility functions. This function duplicates the first
row of a data frame `x` and appends it at the end of the data frame:

```{r}
duplicate_first <- function(x) {
  bind_rows(x, x[1, ])
}
```

This function removes the last row of a data frame `x`:

```{r}
remove_last <- function(x) {
  x[-nrow(x), ]
}
```

This function gets the last value of the `y` variable of a data frame `x`:

```{r}
get_last <- function(x) {
  unlist(x[nrow(x), "y"])
}
```

This function retrieves the sample collection times from the data frame `x` of
a farm, i.e. the numeric sampling days, to which day 0 is added:

```{r}
sampling_dates <- function(x) {
  x %>% 
    pull(SamplingDay) %>%
    unique() %>% 
    setdiff(c("Before", "After", "End")) %>% 
    c(0) %>% 
    as.integer() %>% 
    sort()
}
```
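The same steps can be followed on a made-up vector of sampling days. The
`toy_days` values are hypothetical. The labels `Before`, `After` and `End` are
dropped, day 0 is added, and the remaining values are converted to integers
and sorted:

```r
# Hypothetical SamplingDay column of one farm
toy_days <- c("Before", "After", "14", "7", "End", "7")

# Keeping the numeric days only, adding day 0, converting and sorting
numeric_days <- setdiff(unique(toy_days), c("Before", "After", "End"))
toy_dates <- sort(as.integer(c(numeric_days, 0)))
toy_dates
```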

This function adds the sampling collection time points to the figure of AMU,
using the data `data` that have been prepared for this figure:

```{r}
adding_sample_collection <- function(data) {
  f <- function(x) {
    if(nrow(x) > 1) {
      y <- x %>% 
        filter(c(AgeUse_Day[-1] > end[-length(end)], NA)) %>% 
        pull(end) %>% 
        first()
      if(is.na(y)) return(max(x$end))
      return(y)
    }
    pull(x, end)
  }
  
  g <- function(y) {
    data.frame(FarmID = rep(names(y), map(y, length)), x = unlist(y))
  }

  a <- map(genes, sampling_dates)
  
  b <- data %>% 
    group_by(FarmID) %>% 
    summarise(y1 = min(y) - 1,
              y2 = max(y) + 1)
  
  data %>% 
    group_by(FarmID) %>% 
    group_split() %>% 
    map(f) %>% 
    unlist() %>% 
    map2(a, ., `+`) %>% 
    g() %>% 
    left_join(b, "FarmID") %$%
    segments(x, y1, x, y2)
}
```

Preparing the data for plotting antimicrobial use:

```{r}
tmp <- amu %>% 
  filter(!class %in% c("Other", "Mucoactive agent", "Antifungals/antimycotics")) %>% 
  mutate(end = AgeUse_Day + DurationOfUse_days) %>% 
  group_by(FarmID) %>% 
  group_split() %>% 
  map(arrange, AgeUse_Day, end) %>% 
  map(duplicate_first) %>% 
  bind_rows() %>% 
  mutate(y = row_number()) %>% 
  group_by(FarmID) %>% 
  group_split()

last_values <- tmp %>% 
  map_int(get_last)

tmp %<>% 
  map(remove_last) %>% 
  bind_rows()

classes_names <- tmp %>%
  pull(class) %>%
  unique()

hash <- classes_names %>%
  length() %>% 
  brewer.pal("Set3") %>% 
  setNames(classes_names)

tmp %<>% mutate(color = hash[class])
```

Plotting the data:

```{r fig.width = 8}
x_max <- 135
lwd_val <- 2
y_val <- c(.16347245, .86858097)
opar <- par(plt = c(.05, .7, y_val))

plot(NA, xlim = c(-7, x_max), ylim = c(0, (max(tmp$y) + 1)),
     type = "n", xlab = "age of flock (days)", ylab = "", axes = FALSE,
     xaxs = "i", yaxs = "i")
box(bty = "o")
axis(1)
with(tmp, segments(AgeUse_Day, y, end, y, col = color, lwd = lwd_val))
abline(h = last_values, col = "grey")
abline(v = 0)
abline(v = seq(7, x_max, 7), col = "grey")

tmp %>% 
  group_by(FarmID) %>% 
  summarise(y = mean(y)) %$%
  text(-3.5, y, FarmID, cex = .5)

adding_sample_collection(tmp)

par(plt = c(.71, 1, y_val), new = TRUE)
plot(1:10, type = "n", axes = FALSE, ann = FALSE)
legend("center", legend = names(hash), col = hash, lwd = lwd_val, lty = 1, bty = "n")
par(opar)
```

Note that there is missing information for farm K24 here. 

## Options for visualizing gene concentrations

### Plotting the raw data per farm and gene

The time points:

```{r}
genes %>%
  map(pull, SamplingDay) %>% 
  unlist() %>% 
  unique()
```

Generating new time points and ordering the data chronologically:

```{r}
genes %<>% map(mutate,
               SamplingDay2 = as.integer(recode(SamplingDay,
                                                Before = "-7",
                                                After  = "0",
                                                End    = "120"))) %>% 
  map(arrange, Group, SamplingDay2) # making sure that data are arranged chronologically
```

A utility function for the `plot_gene_concentration()` function defined below.
This function adds the dots and lines to a plot, the first and last segments
being drawn with a different line type (`lty`):

```{r}
plot_data <- function(x, gene, col, lwd = 2, lty = 3) {
  lines2 <- function(...) lines(..., col = col, lwd = lwd)
  nrows <- nrow(x)
  x2 <- x[-c(1, nrows), ]
  x3 <- x[1:2, ]
  x4 <- x[(nrows - 1):nrows, ]
  points(x$SamplingDay2, x[[gene]], col = col, lwd = lwd)
  lines2(x2$SamplingDay2, x2[[gene]])
  lines2(x3$SamplingDay2, x3[[gene]], lty = lty)
  lines2(x4$SamplingDay2, x4[[gene]], lty = lty)
}
```

Another utility function for the `plot_gene_concentration()` function defined
below. This function plots the axes, their ranges and labels:

```{r}
plot_frame <- function(..., type = "n") {
  plot(type = type, xlim = c(-10, 120), axes = FALSE, ...)
  ats <- c(-7, seq(0, 120, 20))
  lbs <- ats
  lbs[1] <- "before"
  lbs[length(lbs)] <- "end"
  axis(1, ats, lbs)
  axis(2)
}
```

Let's now define colors that we will use throughout for the control and
treatment experiments:

```{r}
exp_col <- c(treatment = "red", control = "blue")
```

The following function uses the two previous functions to plot the gene
concentrations as a function of time for a given resistance gene `gene` in a
given farm `farm`, for both the control and treatment experiments:

```{r}
plot_gene_concentration <- function(farm, gene, text, col = exp_col) {
  farm_dataset <- genes[[farm]]
  
  plot_frame(farm_dataset$SamplingDay2, farm_dataset[[gene]],
             xlab = "time after treatment (days)", ylab = "gene concentration")

  farm_dataset %>%
    filter(Group == "Control") %>% 
    plot_data(gene, col["control"])
  
  farm_dataset %>%
    filter(Group == "Treatment") %>% 
    plot_data(gene, col["treatment"])

  mtext(text)
  abline(v = 0, lwd = 2)
}
```

The `text` argument is a character string to use as the title of the plot.
Plotting `aac3-liacde` for control and treatment experiments in farm K06:

```{r}
plot_gene_concentration(farms[1], resistance_genes[1], resistance_genes[1])
legend("right", legend = names(exp_col), col = exp_col, lty = 1, lwd = 2, bty = "n")
```

#### Plotting all the genes for a given farm

Number of columns we want and some graph tuning:

```{r}
ncols <- 4
plt_val <- c(.13, .92, .22, .9)
```

Plotting all the genes for the first farm:

```{r fig.height = ceiling(length(resistance_genes) / ncols) * (1 + 2 / 3), fig.width = 2.45 * ncols}
opar <- par(mfrow = c(ceiling(length(resistance_genes) / ncols), ncols), plt = plt_val)
walk(resistance_genes, ~ plot_gene_concentration(farms[1], .x, .x))
par(opar)
```

#### Plotting all the farms for a given gene

Plotting all the farms for the first gene:

```{r fig.height = ceiling(length(farms) / ncols) * (1 + 2 / 3), fig.width = 2.45 * ncols}
opar <- par(mfrow = c(ceiling(length(farms) / ncols), ncols), plt = plt_val)
walk(farms, ~ plot_gene_concentration(.x, resistance_genes[1], .x))
par(opar)
```

### Gathering data per gene

Here we want to plot on a single graph the data from all the farms (with control
and treatment experiments), with one line per experiment time series. For that,
we first need to standardize the genes concentration in order to ensure that
dynamics are comparable across experiments and farms. The standardization is
done by taking the `Before` measurement as a reference value.

#### Standardizing the data

The following function standardizes the data for all the resistance genes from
one given experiment (i.e. either control or treatment in one given farm):

```{r}
standardize_by_before_ <- function(x) {
  rbind(rep(1, length(resistance_genes)),
        sweep(as.matrix(x[x$SamplingDay != "Before", resistance_genes]), 2,
                 unlist(x[x$SamplingDay == "Before", resistance_genes]), `/`))
}
```

Here `x` is the subset of the data frame of gene concentrations of a given farm
that corresponds to either the control or the treatment experiment. The
function below uses the one above and standardizes the data for all the
resistance genes for both the control and treatment experiments of a given
farm:

```{r}
standardize_by_before <- function(x) {
  x[, resistance_genes] <- rbind(standardize_by_before_(filter(x, Group == "Control")),
                                 standardize_by_before_(filter(x, Group == "Treatment")))
  
  x
}
```

Now here `x` is the data frame of genes concentrations of a given farm. Let's
use this function to standardize the gene concentrations for all the farms:

```{r}
genes_standardized <- map(genes, standardize_by_before)
```
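To see what the standardization does, here is a small sketch on one made-up
experiment with two genes. The `toy_exp` data and the gene names are
hypothetical. Every measurement is divided by the `Before` value of the same
gene, so that the `Before` row itself becomes a row of 1s:

```r
# Hypothetical experiment: one row per time point, one column per gene
toy_exp <- data.frame(SamplingDay = c("Before", "After", "End"),
                      geneA       = c(2, 4, 1),
                      geneB       = c(10, 5, 20))
toy_genes <- c("geneA", "geneB")

# Reference values (the "Before" row) and the later measurements
reference <- unlist(toy_exp[toy_exp$SamplingDay == "Before", toy_genes])
later     <- as.matrix(toy_exp[toy_exp$SamplingDay != "Before", toy_genes])

# Dividing each column by its reference; the "Before" row is 1 by definition
toy_std <- rbind(rep(1, length(toy_genes)), sweep(later, 2, reference, `/`))
```

For `geneA` the standardized series is 1, 2, 0.5 and for `geneB` it is
1, 0.5, 2. As in the function above, this assumes that the `Before` row comes
first within each experiment.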

Let's now plot the concentrations of a given gene across the farms on a single
plot. Let's first define the following function that draws the layout of the
plot:

```{r}
plot_frame2 <- function(...) {
  plot_frame(..., xlab = "time (days)", ylab = "standardized genes concentrations")
}
```

The same one for a log-transformed y-axis (only the axis label differs):

```{r}
plot_frame2log <- function(...) {
  plot_frame(..., xlab = "time (days)", ylab = "log(standardized genes concentrations)")
}
```

Let's also define a function that plots vertical and horizontal lines and adds
a title:

```{r}
lines_title <- function(x, h = 0) {
  mtext(x)
  abline(h = h, lwd = 2)
  abline(v = 0, lwd = 2)
}
```

Let's now explore various ways to plot the data.

#### Experiments time series

Let's start with this transformation function, which applies a base-10
logarithm to the third column of the data frame `x`:

```{r}
transformation <- function(x) {
  mutate_at(x, 3, log10)
}
```

This function replaces infinity values in the third column of data frame `x` by
some specified value:

```{r}
replace_infinity <- function(x, infinity = -1000) {
  mutate_at(x, 3, ~ replace(.x, is.infinite(.x), infinity))
}
```

With the 4 functions above being defined, we can proceed and explore plotting:

```{r}
plot_gene_all_farms <- function(gene, infinity = -1000, h = 0, pf = plot_frame2) {
  tmp <- genes_standardized %>%
    map(extract, c("Group", "SamplingDay2", gene)) %>%
    map(transformation)

  tmp2 <- bind_rows(tmp)
  pf(tmp2[[2]], tmp2[[3]])
  
  tmp %>% 
    map(replace_infinity, infinity) %>% 
    map(~ split(.x, .x$Group)) %>% 
    unlist(recursive = FALSE) %>% 
    map(mutate, col = exp_col[(Group == "Control") + 1]) %>% 
    sample() %>% # we want to shuffle the treatments for unbiased visualization
    walk(~ plot_data(.x, gene, col = .x$col))
  
  lines_title(gene, h)
}
```

Note here that we transform the standardized gene concentrations. Depending on
the chosen transformation, this may generate infinite values (for example,
`log10(0)` is `-Inf`). For visualization purposes, we replace these values with
the value of the `infinity` argument of the function (`-1000` by default).
Let's try it on one resistance gene:

```{r}
plot_gene_all_farms(resistance_genes[1], pf = plot_frame2log)
legend("bottomright", legend = names(exp_col), col = exp_col, lty = 1, lwd = 2, bty = "n")
```
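The handling of infinite values can be checked on a made-up vector of
standardized concentrations. The `toy_std_conc` values are hypothetical. A
concentration of 0 becomes `-Inf` after the log transformation and is then
replaced by the chosen finite value:

```r
# Hypothetical standardized concentrations, one of them being zero
toy_std_conc <- c(1, 0.1, 0, 10)

# Log transformation: the zero becomes -Inf
toy_log <- log10(toy_std_conc)

# Replacing infinite values by a finite one, as replace_infinity() does
toy_log_finite <- replace(toy_log, is.infinite(toy_log), -1000)
```

Since the replacement value falls far below the range of the other points, it
is worth remembering that such points correspond to zero concentrations and
not to actual measurements.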

#### Boxplots

Let's try an alternative visualization, using the following function for
box-plots:

```{r}
boxplot2 <- function(x, eps, col, ...) {
  boxplot(x[[3]] ~ x[[2]], at =  unique(x[[2]]) + eps, add = TRUE, axes = FALSE,
          boxwex = 2.5, col = adjustcolor(col, .5), outline = FALSE, ...)
}
```

Here is the alternative visualization:

```{r}
plot_gene_all_farms <- function(gene, col = exp_col, eps = 1.5, h = 0,
                                transform = I, plot_frame = plot_frame2, plot_fct = boxplot2) {
  tmp <- genes_standardized %>% 
    bind_rows() %>% 
    filter(SamplingDay != "Before") %>% 
    select(c("Group", "SamplingDay2", gene)) %>% 
    transform()

  plot_frame(tmp[[2]], tmp[[3]])

  control <- tmp %>%
    filter(Group == "Control") %>% 
    arrange(SamplingDay2) # so that unique(SamplingDay2), used for `at` in boxplot2() and vioplot2(), is sorted
  points(jitter(control[[2]] - eps), control[[3]], col = col["control"])
  plot_fct(control, -eps, col = col["control"])
  
  treatment <- tmp %>%
    filter(Group == "Treatment") %>% 
    arrange(SamplingDay2) # so that unique(SamplingDay2), used for `at` in boxplot2() and vioplot2(), is sorted
  points(jitter(treatment[[2]] + eps), treatment[[3]], col = col["treatment"])
  plot_fct(treatment, eps, col = col["treatment"])
  
  lines_title(gene, h)
}
```

Let's try it:

```{r}
plot_gene_all_farms(resistance_genes[1], h = 1)
legend("topright", legend = names(exp_col), fill = adjustcolor(exp_col, .5), bty = "n")
```

But a log scale might be more appropriate:

```{r}
plot_gene_all_farms(resistance_genes[1], transform = transformation, plot_frame = plot_frame2log)
legend("bottomright", legend = names(exp_col), fill = adjustcolor(exp_col, .5), bty = "n")
```

#### Violin plots

Let's try a violin plot instead, by using this function:

```{r}
vioplot2 <- function(x, eps, color, ...) {
  vioplot::vioplot(x[[3]] ~ x[[2]], at =  unique(x[[2]]) + eps, add = TRUE,
                   axes = FALSE, fill = color, lineCol = color, border = color,
                   wex = 4, col = adjustcolor(color, .5), ...)
}
```

Let's try plotting the data with the violin plot:

```{r}
plot_gene_all_farms(resistance_genes[1], h = 1, plot_fct = vioplot2)
legend("topright", legend = names(exp_col), fill = adjustcolor(exp_col, .5), bty = "n")
```

### Differences between treatment and control

```{r}
plot_frame3 <- function(...) {
  plot_frame(..., xlab = "time (days)",
             ylab = "difference in standardized genes concentrations")
}
```

#### Experiment time series

Let's plot the difference between treatment and control for each farm. For that,
we need the following function:

```{r}
plot_differences <- function(gene) {
  tmp <- genes %>% 
    map(extract, c("SamplingDay2", "Group", gene)) %>% 
    map(pivot_wider, names_from = Group, values_from = 3) %>% 
    map(mutate, difference = Treatment - Control)

  tmp %>% 
    bind_rows() %$% 
    plot_frame3(SamplingDay2, difference)
  
  walk(tmp, with, lines(SamplingDay2, difference, col = "green", lwd = 2))

  lines_title(gene)
}
```

Let's try it:

```{r}
plot_differences(resistance_genes[1])
```
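The computation of the differences can be sketched on made-up data, here with
base R's `merge()` instead of `pivot_wider()`. The `toy_long` data frame is
hypothetical. The control and treatment measurements are put side by side for
each time point, and the difference is then taken:

```r
# Hypothetical long-format data for one farm and one gene
toy_long <- data.frame(SamplingDay2 = c(-7, 0, 120, -7, 0, 120),
                       Group        = rep(c("Control", "Treatment"), each = 3),
                       geneA        = c(1, 2, 3, 1, 5, 4))

# One row per time point, with control and treatment side by side
toy_wide <- merge(
  toy_long[toy_long$Group == "Control",   c("SamplingDay2", "geneA")],
  toy_long[toy_long$Group == "Treatment", c("SamplingDay2", "geneA")],
  by = "SamplingDay2", suffixes = c("_control", "_treatment"))

# Treatment minus control at each time point
toy_wide$difference <- toy_wide$geneA_treatment - toy_wide$geneA_control
```

Here the difference is 0 before treatment, 3 right after and 1 at the end.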

#### Boxplots

Let's consider this function for box-plots and violin plots:

```{r}
plot_differences <- function(gene, plot_fct = boxplot2) {
  
  tmp <- genes %>% 
    map(extract, c("SamplingDay2", "Group", gene)) %>% 
    map(pivot_wider, names_from = Group, values_from = 3) %>% 
    map(mutate, difference = Treatment - Control) %>% 
    bind_rows() %>% 
    arrange(SamplingDay2) %>% # for boxplot() and vioplot()
    select(Control, SamplingDay2, difference) # for boxplot2() and vioplot2()

  with(tmp, plot_frame3(jitter(SamplingDay2), difference, col = "green", type = "p"))
    
  plot_fct(tmp, 0, col = "green")

  lines_title(gene)
}
```

Let's try it:

```{r}
plot_differences(resistance_genes[1], boxplot2)
```

#### Violin plots

Let's look at the violin alternative:

```{r}
plot_differences(resistance_genes[1], vioplot2)
```

#### Quantiles

A tuning of the `polygon()` function:

```{r}
polygon2 <- function(x, y1, y2, ...) {
  polygon(c(x, rev(x)), c(y1, rev(y2)), ...)
}
```

A tuning of the `arrows()` function:

```{r}
arrows2 <- function(...) {
  arrows(length = .1, angle = 90, code = 3, ...)
}
```

Another option is to summarize the differences at each time point by their
quartiles:

```{r}
plot_differences_quantiles <- function(gene, data = genes,
                                       col_lines = "steelblue3", col_area = "lightblue") {
  tmp <- data %>% 
    map(extract, c("SamplingDay2", "Group", gene)) %>% 
    map(pivot_wider, names_from = Group, values_from = 3) %>% 
    map(mutate, difference = Treatment - Control) %>% 
    bind_rows() %>% 
    select(SamplingDay2, difference)

  with(tmp, plot_frame3(jitter(SamplingDay2), difference, col = col_area, type = "p"))

  x_val <- sort(unique(tmp$SamplingDay2))
  
  tmp %>%
    group_by(SamplingDay2) %>% 
    group_split() %>% 
    map(~ quantile(.x$difference, c(.25, .5, .75), na.rm = TRUE)) %>% 
    bind_rows() %>% 
    mutate(x_val = x_val) %>% 
    with({
      polygon2(x_val, `25%`, `75%`,
               col = adjustcolor(col_area, .2), border = col_lines, lty = 3)
      points(x_val, `50%`, lwd = 2, col = col_lines)
      arrows2(x_val, `25%`, x_val, `75%`, lwd = 2, col = col_lines)
      lines(x_val, `50%`, lty = 2, col = col_lines)
    })
  
  lines_title(gene)
}
```

Let's try it:

```{r}
plot_differences_quantiles(resistance_genes[1])
```
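The quartiles that make up the band and the median line can be computed as in
the following base-R sketch. The `toy_diff` data are made up, with 5 farms at
each of 2 time points:

```r
# Hypothetical differences: 5 farms at each of 2 time points
toy_diff <- data.frame(SamplingDay2 = rep(c(0, 7), each = 5),
                       difference   = c(1, 2, 3, 4, 5, 10, 20, 30, 40, 50))

# One row per time point, one column per quartile
toy_quartiles <- t(sapply(split(toy_diff$difference, toy_diff$SamplingDay2),
                          quantile, probs = c(.25, .5, .75), na.rm = TRUE))
toy_quartiles
```

With these values the quartiles are 2, 3 and 4 at day 0, and 20, 30 and 40 at
day 7.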

Let's plot all the genes:

```{r fig.height = ceiling(length(resistance_genes) / ncols) * (1 + 2 / 3), fig.width = 2.45 * ncols, warning = FALSE}
opar <- par(mfrow = c(ceiling(length(resistance_genes) / ncols), ncols), plt = plt_val)
walk(resistance_genes, plot_differences_quantiles)
par(opar)
```

## Visualization by age

## Visualization by groups of antimicrobials

* farm K06: Sulfonamide (+ MLSB)
* farm K07: 
* farm K08: Tetracycline + MLSB

farm K08: Tetracycline + MLSB: here we want

* a line for Tetracycline
* a line for MLSB
* a line for all the other ones

The following function calculates the differences in gene concentrations
between Treatment and Control, for each gene of an experiment and at each time
point. `x` is the data of this experiment.

```{r}
gene_conc_diff <- function(x) {
  x %<>%
    group_by(Group) %>% 
    group_split()
  
  tmp <- x[[1]][[1]]

  x %>%
    map(select, -Group) %>% 
    rev() %>% 
    do.call(`-`, .) %>% 
    mutate(SamplingDay2 = tmp)
}
```
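The subtraction at the heart of this function relies on the fact that two data
frames of the same shape can be subtracted element-wise. A minimal sketch with
made-up control and treatment data (all names and values are hypothetical):

```r
# Hypothetical control and treatment data, with rows in the same time order
toy_control   <- data.frame(SamplingDay2 = c(-7, 0), geneA = c(1, 2), geneB = c(3, 4))
toy_treatment <- data.frame(SamplingDay2 = c(-7, 0), geneA = c(2, 5), geneB = c(3, 1))

# Element-wise difference: the time column is subtracted too and becomes 0
toy_delta <- toy_treatment - toy_control

# Hence the need to restore the time column afterwards
toy_delta$SamplingDay2 <- toy_control$SamplingDay2
```

This only gives meaningful results if both groups have the same time points,
in the same order.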

The following function calculates the quantiles of gene concentrations across
genes, for each time point. The data frame `x` has time values in the column
`SamplingDay2`, followed by one column of gene concentrations per gene.

```{r}
gene_conc_diff_quant <- function(x) {
  tmp <- x$SamplingDay2

  x %>% 
    select(-SamplingDay2) %>% 
    apply(1, quantile, c(.25, .5, .75)) %>% 
    t() %>% 
    as.data.frame() %>% 
    mutate(SamplingDay2 = tmp)
}
```
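Contrary to the quartiles across farms computed earlier, the quartiles here
are taken across genes, i.e. row by row. A small sketch on made-up differences
for 3 genes at 2 time points (the `toy_diffs` data frame is hypothetical):

```r
# Hypothetical differences: one row per time point, one column per gene
toy_diffs <- data.frame(SamplingDay2 = c(0, 7),
                        geneA = c(1, 10), geneB = c(2, 20), geneC = c(3, 30))

# Quartiles across the genes of each row, then one row per time point
toy_q <- as.data.frame(t(apply(toy_diffs[, -1], 1, quantile, c(.25, .5, .75))))
toy_q$SamplingDay2 <- toy_diffs$SamplingDay2
```

The medians across genes are 2 at day 0 and 20 at day 7.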

The following function plots the data of some genes concentrations of an
experiment:

```{r}
plot_data <- function(x, col_lines = "steelblue3", col_area = "lightblue",
                      lwd_val = 2, add = FALSE, ...) {
  if (add) {
    plot_fct <- function(x, y, ...) {
      points(jitter(x), y, col = col_area, ...)
    }
  } else {
    plot_fct <- function(x, y, ...) {
      plot_frame3(jitter(x), y, col = col_area, type = "p", ...)
    }
  }
  
  x %>% 
    pivot_longer(-SamplingDay2) %$% 
    plot_fct(SamplingDay2, value, ...)

  x %>% 
    gene_conc_diff_quant() %>% 
    with({
      polygon2(SamplingDay2, `25%`, `75%`, col = adjustcolor(col_area, .2),
               border = col_lines, lty = 3)
      points(SamplingDay2, `50%`, lwd = lwd_val, col = col_lines)
      arrows2(SamplingDay2, `25%`, SamplingDay2, `75%`, lwd = lwd_val, col = col_lines)
      lines(SamplingDay2, `50%`, lty = 2, col = col_lines)
    })
}
```

The following function plots the data of the resistance genes that belong to a
set `ab` of antimicrobial classes in farm `farm`:

```{r}
plot_ab_set <- function(farm, ab, ...) {
  genes_set <- classes %>% 
    filter(Antimicrobial_Class %in% ab) %>% 
    pull(EvaGreen_Name)

  genes %>% 
    extract2(farm) %>% 
    extract(c("SamplingDay2", "Group", genes_set)) %>% 
    gene_conc_diff() %>% 
    plot_data(., ...)
}
```

Let's generate the vector that contains the names of the antimicrobial classes
used in the experiments:

```{r}
ab_classes <- classes %>% 
  extract2("Antimicrobial_Class") %>% 
  unique() %>% 
  setdiff(c("Multidrug", "Other", "All"))
```

The following function calculates the y-axis range we should choose for a given
farm:

```{r}
y_range <- function(farm) {
  genes_set <- classes %>% 
    filter(Antimicrobial_Class %in% ab_classes) %>% 
    pull(EvaGreen_Name)

  genes %>% 
    extract2(farm) %>% 
    extract(c("SamplingDay2", "Group", genes_set)) %>% 
    gene_conc_diff() %>% 
    select(-SamplingDay2) %>% 
    range()
}
```

### Plotting farm K08

Parameters:

```{r}
lwd_v <- 2
farm <- "K08"
ab1 <- "Tetracycline"
ab2 <- "MLSB"
```

Separated:

```{r}
plot_ab_set(farm, ab1,
   col_lines = "steelblue3", col_area = "lightblue", lwd_val = lwd_v, ylim = y_range(farm))

plot_ab_set(farm, ab2,
   col_lines = "indianred1", col_area = "darksalmon", lwd_val = lwd_v, add = TRUE)

plot_ab_set(farm, setdiff(ab_classes, c(ab1, ab2)),
   col_lines = "grey", col_area = "grey", lwd_val = lwd_v, add = TRUE)

abline(h = 0)

legend("bottomright", legend = c(ab1, ab2, "others"),
       fill = c("steelblue3" , "indianred1", "grey"), bty = "n")
```

Combined:

```{r}
plot_ab_set(farm, c(ab1, ab2),
   col_lines = "steelblue3", col_area = "lightblue", lwd_val = lwd_v, ylim = y_range(farm))

plot_ab_set(farm, setdiff(ab_classes, c(ab1, ab2)),
   col_lines = "grey", col_area = "grey", lwd_val = lwd_v, add = TRUE)

abline(h = 0)

legend("bottomright", legend = c(paste(ab1, ab2, sep = " and "), "others"),
       fill = c("steelblue3" , "grey"), bty = "n")
```

### Plotting farm K09

Parameters:

```{r}
lwd_v <- 2
farm <- "K09"
ab1 <- "Tetracycline"
ab2 <- "Polymyxin"
```

Separated:

```{r}
plot_ab_set(farm, ab1,
   col_lines = "steelblue3", col_area = "lightblue", lwd_val = lwd_v, ylim = y_range(farm))

plot_ab_set(farm, ab2,
   col_lines = "indianred1", col_area = "darksalmon", lwd_val = lwd_v, add = TRUE)

plot_ab_set(farm, setdiff(ab_classes, c(ab1, ab2)),
   col_lines = "grey", col_area = "grey", lwd_val = lwd_v, add = TRUE)

abline(h = 0)

legend("bottomright", legend = c(ab1, ab2, "others"),
       fill = c("steelblue3" , "indianred1", "grey"), bty = "n")
```

Combined:

```{r}
plot_ab_set(farm, c(ab1, ab2),
   col_lines = "steelblue3", col_area = "lightblue", lwd_val = lwd_v, ylim = y_range(farm))

plot_ab_set(farm, setdiff(ab_classes, c(ab1, ab2)),
   col_lines = "grey", col_area = "grey", lwd_val = lwd_v, add = TRUE)

abline(h = 0)

legend("bottomright", legend = c(paste(ab1, ab2, sep = " and "), "others"),
       fill = c("steelblue3" , "grey"), bty = "n")
```

### Plotting farm K11

Parameters:

```{r}
lwd_v <- 2
farm <- "K11"
ab1 <- "Beta-Lactam"
ab2 <- "Aminoglycoside"
```

Separated:

```{r}
plot_ab_set(farm, ab1,
   col_lines = "steelblue3", col_area = "lightblue", lwd_val = lwd_v, ylim = y_range(farm))

plot_ab_set(farm, ab2,
   col_lines = "indianred1", col_area = "darksalmon", lwd_val = lwd_v, add = TRUE)

plot_ab_set(farm, setdiff(ab_classes, c(ab1, ab2)),
   col_lines = "grey", col_area = "grey", lwd_val = lwd_v, add = TRUE)

abline(h = 0)

legend("top", legend = c(ab1, ab2, "others"),
       fill = c("steelblue3" , "indianred1", "grey"), bty = "n")
```

Combined:

```{r}
plot_ab_set(farm, c(ab1, ab2),
   col_lines = "steelblue3", col_area = "lightblue", lwd_val = lwd_v, ylim = y_range(farm))

plot_ab_set(farm, setdiff(ab_classes, c(ab1, ab2)),
   col_lines = "grey", col_area = "grey", lwd_val = lwd_v, add = TRUE)

abline(h = 0)

legend("top", legend = c(paste(ab1, ab2, sep = " and "), "others"),
       fill = c("steelblue3" , "grey"), bty = "n")
```

### Plotting farm K12

Parameters:

```{r}
lwd_v <- 2
farm <- "K12"
ab1 <- "Beta-Lactam"
ab2 <- "Aminoglycoside"
```

Separated:

```{r}
plot_ab_set(farm, ab1,
   col_lines = "steelblue3", col_area = "lightblue", lwd_val = lwd_v, ylim = y_range(farm))

plot_ab_set(farm, ab2,
   col_lines = "indianred1", col_area = "darksalmon", lwd_val = lwd_v, add = TRUE)

plot_ab_set(farm, setdiff(ab_classes, c(ab1, ab2)),
   col_lines = "grey", col_area = "grey", lwd_val = lwd_v, add = TRUE)

abline(h = 0)

legend("bottomright", legend = c(ab1, ab2, "others"),
       fill = c("steelblue3" , "indianred1", "grey"), bty = "n")
```

Combined:

```{r}
plot_ab_set(farm, c(ab1, ab2),
   col_lines = "steelblue3", col_area = "lightblue", lwd_val = lwd_v, ylim = y_range(farm))

plot_ab_set(farm, setdiff(ab_classes, c(ab1, ab2)),
   col_lines = "grey", col_area = "grey", lwd_val = lwd_v, add = TRUE)

abline(h = 0)

legend("bottomright", legend = c(paste(ab1, ab2, sep = " and "), "others"),
       fill = c("steelblue3" , "grey"), bty = "n")
```

### Plotting farm K13

Parameters:

```{r}
lwd_v <- 2
farm <- "K13"
ab1 <- "Beta-Lactam"
ab2 <- "Polymyxin"
ab3 <- "MLSB"
ab4 <- "Tetracycline"
```

Combined:

```{r}
plot_ab_set(farm, c(ab1, ab2, ab3, ab4),
   col_lines = "steelblue3", col_area = "lightblue", lwd_val = lwd_v, ylim = y_range(farm))

plot_ab_set(farm, setdiff(ab_classes, c(ab1, ab2, ab3, ab4)),
   col_lines = "grey", col_area = "grey", lwd_val = lwd_v, add = TRUE)

abline(h = 0)

legend("bottomright",
       legend = c(paste(paste(ab1, ab2, ab3, sep = ", "), ab4, sep = " and "), "others"),
       fill = c("steelblue3" , "grey"), bty = "n")
```

### Plotting farm K19

Parameters:

```{r}
lwd_v <- 2
farm <- "K19"
ab1 <- "Quinolone"
```

Separated:

```{r}
plot_ab_set(farm, ab1,
   col_lines = "steelblue3", col_area = "lightblue", lwd_val = lwd_v, ylim = y_range(farm))

plot_ab_set(farm, setdiff(ab_classes, ab1),
   col_lines = "grey", col_area = "grey", lwd_val = lwd_v, add = TRUE)

abline(h = 0)

legend("bottomright", legend = c(ab1, "others"),
       fill = c("steelblue3" , "grey"), bty = "n")
```

### Plotting farm K22

Parameters:

```{r}
lwd_v <- 2
farm <- "K22"
ab1 <- "Tetracycline"
ab2 <- "Polymyxin"
```

Separated:

```{r}
plot_ab_set(farm, ab1,
   col_lines = "steelblue3", col_area = "lightblue", lwd_val = lwd_v, ylim = y_range(farm))

plot_ab_set(farm, ab2,
   col_lines = "indianred1", col_area = "darksalmon", lwd_val = lwd_v, add = TRUE)

plot_ab_set(farm, setdiff(ab_classes, c(ab1, ab2)),
   col_lines = "grey", col_area = "grey", lwd_val = lwd_v, add = TRUE)

abline(h = 0)

legend("topright", legend = c(ab1, ab2, "others"),
       fill = c("steelblue3" , "indianred1", "grey"), bty = "n")
```

Combined:

```{r}
plot_ab_set(farm, c(ab1, ab2),
   col_lines = "steelblue3", col_area = "lightblue", lwd_val = lwd_v, ylim = y_range(farm))

plot_ab_set(farm, setdiff(ab_classes, c(ab1, ab2)),
   col_lines = "grey", col_area = "grey", lwd_val = lwd_v, add = TRUE)

abline(h = 0)

legend("topright", legend = c(paste(ab1, ab2, sep = " and "), "others"),
       fill = c("steelblue3" , "grey"), bty = "n")
```

### Plotting farm K25

Parameters:

```{r}
lwd_v <- 2
farm <- "K25"
ab1 <- "Quinolone"
```

Separated:

```{r}
plot_ab_set(farm, ab1,
   col_lines = "steelblue3", col_area = "lightblue", lwd_val = lwd_v, ylim = y_range(farm))

plot_ab_set(farm, setdiff(ab_classes, ab1),
   col_lines = "grey", col_area = "grey", lwd_val = lwd_v, add = TRUE)

abline(h = 0)

legend("bottomright", legend = c(ab1, "others"),
       fill = c("steelblue3" , "grey"), bty = "n")
```

### Plotting farm K26

Parameters:

```{r}
lwd_v <- 2
farm <- "K26"
ab1 <- "Tetracycline"
```

Separated:

```{r}
plot_ab_set(farm, ab1,
   col_lines = "steelblue3", col_area = "lightblue", lwd_val = lwd_v, ylim = y_range(farm))

plot_ab_set(farm, setdiff(ab_classes, ab1),
   col_lines = "grey", col_area = "grey", lwd_val = lwd_v, add = TRUE)

abline(h = 0)

legend("bottomright", legend = c(ab1, "others"),
       fill = c("steelblue3" , "grey"), bty = "n")
```

## Distances between genes

### Calculating distances using differences

The following function calculates the difference between treatment and control
for a given gene `gene` for all the farms:

```{r}
calc_differences <- function(gene) {
  genes %>% 
    map(magrittr::extract, c("SamplingDay2", "Group", gene)) %>% # not tidyr::extract()
    map(pivot_wider, names_from = Group, values_from = 3) %>% 
    map(mutate, difference = Treatment - Control) %>% 
    bind_rows(.id = "farm") %>% 
    select(-Control, -Treatment) %>% 
    setNames(c("farm", "SamplingDay2", gene))
}
```
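
As a minimal sketch of what this function does within a single farm, here is the
same reshaping applied to a made-up data frame. The gene name `geneA` and the
values are hypothetical and only serve the illustration:

```{r}
library(dplyr)
library(tidyr)

# hypothetical concentrations of one gene in one farm:
toy <- tibble(SamplingDay2 = c(1, 1, 2, 2),
              Group        = c("Control", "Treatment", "Control", "Treatment"),
              geneA        = c(0.5, 1.5, 1.0, 0.25))

toy_diff <- toy %>% 
  pivot_wider(names_from = Group, values_from = geneA) %>% # one column per group
  mutate(difference = Treatment - Control) %>%             # treatment minus control
  select(SamplingDay2, geneA = difference)                 # name the result after the gene

toy_diff
```

Each sampling day ends up with one value per gene, which is what allows the
per-gene tables to be joined by farm and sampling day afterwards.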

Calculating the differences between treatment and control for each gene, across
farms:

```{r}
differences <- resistance_genes %>% 
  map(calc_differences) %>% 
  reduce(left_join, by = c("farm", "SamplingDay2"))
```

Let's do a principal component analysis (PCA) on these differences:

```{r}
pca <- differences %>% 
  filter(if_all(.fns = Negate(is.na))) %>% 
  select(-farm, -SamplingDay2) %>% 
  t() %>% 
  as.data.frame() %>% 
  select(where(~ any(. != 0))) %>% 
  prcomp(scale. = TRUE)
```
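
The transposition is what makes the genes, rather than the sampling points, the
observations of the PCA. A minimal sketch on simulated data (the matrix
`toy_mat` and its gene names are hypothetical) shows the orientation and the
removal of all-zero columns before scaling:

```{r}
library(dplyr)

set.seed(1)
# hypothetical differences: 6 farm-by-day rows and 4 genes in columns
toy_mat <- matrix(rnorm(24), nrow = 6,
                  dimnames = list(NULL, paste0("gene", 1:4)))
toy_mat[3, ] <- 0 # a sampling point where no gene differs

toy_df <- toy_mat %>% 
  t() %>%                           # genes become rows
  as.data.frame() %>% 
  select(where(~ any(. != 0)))      # drop the all-zero sampling point

toy_pca <- prcomp(toy_df, scale. = TRUE)

rownames(toy_pca$x)                 # one row of coordinates per gene
toy_var <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)
round(100 * toy_var)                # percentage of variance per component
```

Without the `select()` step, `prcomp(scale. = TRUE)` would stop on the all-zero
column because a column with zero variance cannot be rescaled.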

Preparing the PCA data to plot:

```{r}
# a hash table that converts a gene name into an antimicrobial class name:
hash1 <- classes %>% 
  filter(Antimicrobial_Class != "All") %>% 
  unique() %$%
  setNames(Antimicrobial_Class, EvaGreen_Name)

# a hash table that converts a class name into a color:
hash2 <- setNames(brewer.pal(11, "Set3"), unique(hash1))

# PC1 and PC2 coordinates of the PCA:
xs <- pca$x
xs %<>% 
  as.data.frame() %>% 
  select(PC1, PC2) %>% 
  mutate(class = hash1[rownames(xs)],
         color = hash2[class])

# adding the percentage of variance as metadata:
variance <- pca$sdev^2
variance <- variance / sum(variance)
attr(xs, "percvar") <- round(100 * variance[1:2])
```
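
The two "hash tables" above are plain named vectors indexed by name. Two
behaviours of this kind of lookup are worth keeping in mind: a name that is
absent returns `NA`, and a name that appears several times returns its first
entry only, which matters if a gene is listed under several classes. A small
illustration with made-up gene names and colours:

```{r}
# hypothetical gene-to-class lookup; geneA is listed under two classes:
toy_hash1 <- c(geneA = "Tetracycline", geneA = "MLSB", geneB = "Quinolone")
# hypothetical class-to-colour lookup:
toy_hash2 <- c(Tetracycline = "red", MLSB = "green", Quinolone = "blue")

toy_class <- toy_hash1[c("geneB", "geneA", "geneC")] # geneC is not in the table
toy_color <- toy_hash2[toy_class]

unname(toy_class) # geneA resolves to its first entry, geneC to NA
unname(toy_color)
```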

A function that plots the data points of the first two components of the PCA,
highlighting a given class of antimicrobials:

```{r}
pca_biplot <- function(abclass, xs) {
  percvar <- attr(xs, "percvar")
  with(xs, plot(PC1, PC2, main = abclass,
                xlab = paste0("PC1 (", percvar[1], "%)"),
                ylab = paste0("PC2 (", percvar[2], "%)")))
  xs %>% 
    filter(class == abclass) %$% 
    points(PC1, PC2, col = "red", pch = 19)
  
  abline(v = 0, lty = 2, col = "grey")
  abline(h = 0, lty = 2, col = "grey")
}
```

The number of genes per antimicrobial class:

```{r}
(ordered_classes <- classes %>% 
  filter(Antimicrobial_Class != "All") %>% 
  unique() %>% 
  group_by(Antimicrobial_Class) %>% 
  tally() %>% 
  arrange(desc(n)))
```

Keeping only the classes with more than 5 genes:

```{r}
classes_to_plot <- ordered_classes %>%
  filter(n > 5) %>% 
  pull(Antimicrobial_Class)
```

The number of columns we want in the panel of plots, and the coordinates of the
plot region (the `plt` graphical parameter):

```{r}
ncols <- 2
plt_val <- c(.13, .92, .2, .85)
```

Plotting the genes from the classes with the highest number of genes:

```{r fig.height = 1.5 * ceiling(length(classes_to_plot) / ncols) * (1 + 2 / 3), fig.width = 1.5 * 2.45 * ncols}
opar <- par(mfrow = c(ceiling(length(classes_to_plot) / ncols), ncols), plt = plt_val)
walk(classes_to_plot, pca_biplot, xs)
par(opar)
```

There is a group of genes that behave very similarly:

```{r}
(tmp <- filter(xs, PC1 < -3.8, PC2 > 5))
```

But that does not seem to be interesting:

```{r}
ncols <- 4
```

```{r fig.height = ceiling(nrow(tmp) / ncols) * (1 + 2 / 3), fig.width = 2.45 * ncols, warning = FALSE}
opar <- par(mfrow = c(ceiling(nrow(tmp) / ncols), ncols), plt = plt_val)
walk(rownames(tmp), plot_differences_quantiles)
par(opar)
```

### Calculating distances using relative variations of genes concentrations

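Here we reuse most of the code of the previous subsection. The function
`calc_relvar()` replaces `calc_differences()`: instead of computing the
difference between treatment and control, it standardizes (centers and scales)
the concentrations of each group within each farm: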

```{r}
calc_relvar <- function(gene) {
  genes %>% 
    map(magrittr::extract, c("SamplingDay2", "Group", gene)) %>% # not tidyr::extract()
    map(pivot_wider, names_from = Group, values_from = 3) %>% 
    map(mutate_at, c("Control", "Treatment"), scale) %>% 
    map(pivot_longer, -1, names_to = "group", values_to = gene) %>% 
    bind_rows(.id = "farm")
}
```

The function `pc1pc2()` packages part of the code of the previous subsection:

```{r}
pc1pc2 <- function(data) {
# the PCA:
  pca <- data %>% 
    mutate_all(~ replace(.x, is.nan(.x), 0)) %>% # because scaling may generate NaNs
    filter(if_all(.fns = Negate(is.na))) %>% 
    select(-farm, -SamplingDay2, -group) %>% 
    t() %>% 
    as.data.frame() %>% 
    select(where(~ any(. != 0))) %>% 
    prcomp(scale. = TRUE)

# PC1 and PC2 coordinates of the PCA:
  xs <- pca$x
  xs %<>% 
    as.data.frame() %>% 
    select(PC1, PC2) %>% 
    mutate(class = hash1[rownames(xs)],
           color = hash2[class])

# adding the percentage of variance as metadata:
  variance <- pca$sdev^2
  variance <- variance / sum(variance)
  attr(xs, "percvar") <- round(100 * variance[1:2])

# return value:
  xs
}
```
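
The replacement of `NaN` by 0 in `pc1pc2()` is needed because `scale()` divides
by the standard deviation, which is 0 when a gene has the same concentration at
every sampling day of a group. A minimal illustration on made-up vectors:

```{r}
toy_varying  <- as.numeric(scale(c(1, 2, 3))) # mean 2, standard deviation 1
toy_constant <- as.numeric(scale(c(2, 2, 2))) # standard deviation 0, hence 0 / 0

toy_varying
toy_constant
replace(toy_constant, is.nan(toy_constant), 0) # same fix as in pc1pc2()
```

Setting these values to 0 amounts to treating a constant series as showing no
relative variation.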

Computing the coordinates on the first 2 components of the PCA:

```{r}
xs2 <- resistance_genes %>% 
  map(calc_relvar) %>% 
  reduce(left_join, by = c("farm", "SamplingDay2", "group")) %>% 
  pc1pc2()
```

Plotting the genes from the classes with the highest number of genes:

```{r}
ncols <- 2
```

```{r fig.height = 1.5 * ceiling(length(classes_to_plot) / ncols) * (1 + 2 / 3), fig.width = 1.5 * 2.45 * ncols}
opar <- par(mfrow = c(ceiling(length(classes_to_plot) / ncols), ncols), plt = plt_val)
walk(classes_to_plot, pca_biplot, xs2)
par(opar)
```

There is a group of genes that behave very similarly:

```{r}
(tmp <- filter(xs2, PC1 < -11))
```

It is not really clear how similar they are:

```{r}
ncols <- 4
```

```{r fig.height = ceiling(nrow(tmp) / ncols) * (1 + 2 / 3), fig.width = 2.45 * ncols, warning = FALSE}
opar <- par(mfrow = c(ceiling(nrow(tmp) / ncols), ncols), plt = plt_val)
walk(rownames(tmp), plot_gene_all_farms, plot_frame = plot_frame2log)
par(opar)
```

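### Comparing the two

The PCA on the differences, highlighting in red the group of genes identified on
the differences and in blue the group identified on the relative variations: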

```{r}
percvar <- attr(xs, "percvar")
with(xs, plot(PC1, PC2,
              xlab = paste0("PC1 (", percvar[1], "%)"),
              ylab = paste0("PC2 (", percvar[2], "%)")))

abline(v = 0, lty = 2, col = "grey")
abline(h = 0, lty = 2, col = "grey")

xs %>% 
  tibble::rownames_to_column() %>% 
  filter(rowname %in% rownames(filter(xs, PC1 < -3.8, PC2 > 5))) %$%
  points(PC1, PC2, col = "red", pch = 19)

xs %>% 
  tibble::rownames_to_column() %>% 
  filter(rowname %in% rownames(filter(xs2, PC1 < -11))) %$%
  points(PC1, PC2, col = "blue", pch = 19)
```

And the opposite, with the same two groups of genes highlighted on the PCA on
the relative variations:

```{r}
percvar <- attr(xs2, "percvar")
with(xs2, plot(PC1, PC2,
              xlab = paste0("PC1 (", percvar[1], "%)"),
              ylab = paste0("PC2 (", percvar[2], "%)")))

abline(v = 0, lty = 2, col = "grey")
abline(h = 0, lty = 2, col = "grey")

xs2 %>% 
  tibble::rownames_to_column() %>% 
  filter(rowname %in% rownames(filter(xs, PC1 < -3.8, PC2 > 5))) %$%
  points(PC1, PC2, col = "red", pch = 19)

xs2 %>% 
  tibble::rownames_to_column() %>% 
  filter(rowname %in% rownames(filter(xs2, PC1 < -11))) %$%
  points(PC1, PC2, col = "blue", pch = 19)
```


### k-means

```{r eval = FALSE}
library(factoextra) # provides fviz_cluster()

tmp <- differences %>% 
  filter(if_all(.fns = Negate(is.na))) %>% 
  select(-farm, -SamplingDay2) %>% 
  t() %>% 
  as.data.frame() %>% 
  select(where(~ any(. != 0)))

km <- kmeans(tmp, centers = 11, iter.max = 1000, nstart = 50)
fviz_cluster(km, data = tmp)
```

```{r eval = FALSE}
library(factoextra)
library(NbClust)

fviz_nbclust(tmp, kmeans, method = "wss")
fviz_nbclust(tmp, kmeans, method = "silhouette")
fviz_nbclust(tmp, kmeans, method = "gap_stat")

# the data has to be passed to NbClust():
NbClust(data = tmp, distance = "euclidean",
        min.nc = 2, max.nc = 20, method = "kmeans")
```
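
For readers without `factoextra` installed, the idea behind the `"wss"` (elbow)
criterion can be sketched with base R only. The points below are simulated and
hypothetical, with two well-separated groups, so the total within-cluster sum of
squares should drop sharply between 1 and 2 clusters and flatten afterwards:

```{r}
set.seed(42)
# hypothetical data: two groups of 20 points in 2 dimensions
toy_pts <- rbind(matrix(rnorm(40, mean = 0, sd = .3), ncol = 2),
                 matrix(rnorm(40, mean = 5, sd = .3), ncol = 2))

# total within-cluster sum of squares for 1 to 5 clusters:
toy_wss <- sapply(1:5, function(k)
  kmeans(toy_pts, centers = k, nstart = 20)$tot.withinss)

plot(1:5, toy_wss, type = "b",
     xlab = "number of clusters", ylab = "total within-cluster sum of squares")
```

This is only an illustration of the criterion, not a statement about how many
clusters the gene data contain.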