Supervised.Rmd

---
title: "Introduction to supervised changepoint detection"
author: Toby Dylan Hocking, McGill University, Montreal, Canada
output: 
  html_document:
    toc: true
    toc_depth: 2
---

See the GitHub repository for links to source code and exercises:
https://github.com/tdhock/change-tutorial

Before executing the code in this tutorial Rmd, make sure to install
the required packages:

```{r packages}
packages.R <- if(file.exists("packages.R")){
  "packages.R"
}else{
  "https://raw.githubusercontent.com/tdhock/change-tutorial/master/packages.R"
}
source(packages.R)
options(width=100)
```
 
# What is the difference between unsupervised and supervised changepoint detection?

There are two features of a data set that make it possible to do a
supervised changepoint analysis:

* The data consist of **several sequences** with similar signal/noise
  patterns, which are treated as separate changepoint detection
  problems.
* Some data sequences have **labels** which indicate specific regions
  where the model should predict presence or absence of changepoints.
  
The goal of the supervised analysis is to find a model that learns
from the limited labeled data, and provides accurate changepoint
predictions for all of the data (a different number of changepoints
for each separate data sequence).

## Neuroblastoma data set

We will use the neuroblastoma data set to explore supervised
changepoint detection.

```{r neuroblastoma}
data(neuroblastoma, package="neuroblastoma")
str(neuroblastoma)
lapply(neuroblastoma, head, 10)
```

The neuroblastoma data set consists of two data.frames:

* profiles contains the noisy data (logratio), which is approximate
  DNA copy number (measured at chromosome, position), for a particular
  patient (profile.id). Every unique combination of (profile.id,
  chromosome) defines a separate multiple changepoint detection
  problem. The logratio measures how many copies of each part of the
  human genome are present in the patient's tumor. Normally there are
  two copies of each chromosome (logratio=0), but in cancer samples
  there can be gains (changes up) and losses (changes down) of
  specific regions or entire chromosomes.
* annotations contains the labels, which indicate presence
  (breakpoint) or absence (normal) of changepoints in specific regions
  (profile.id, chromosome, min, max) of the data.
  
These data come from children who were treated at the Institut Curie
(Paris, France). For six children we also have follow-up data on
whether they recovered (ok) or had a relapse, several years after
treatment:

```{r clinical}
followup <- data.frame(
  profile.id=paste(c(10, 8, 4, 6, 11, 1)),
  status=c("ok", "relapse", "relapse", "ok", "ok", "relapse"))
followup
```
  
## Unsupervised: no labels, model selection via theory

When there are no labels, the input for a changepoint analysis just
uses the noisy DNA copy number data in
`neuroblastoma$profiles$logratio`. We begin by selecting the profiles
of the six patients for which we have follow-up data.

```{r six}
rownames(followup) <- followup$profile.id
followup$status.profile <- with(followup, paste(status, profile.id))
some.ids <- rownames(followup)
library(data.table)
someProfiles <- function(all.profiles){
  some <- subset(all.profiles, paste(profile.id) %in% some.ids)
  status.profile <- followup[paste(some$profile.id), "status.profile"]
  some$status.profile <- ifelse(
    is.na(status.profile), paste(some$profile.id), status.profile)
  data.table(some)
}
neuroblastoma.dt <- data.table(neuroblastoma$profiles)
## neuroblastoma.dt[, logratio.norm := logratio/mad(logratio), by=list(
##   profile.id, chromosome)]
neuroblastoma.dt[, logratio.norm := logratio]
six.profiles <- someProfiles(neuroblastoma.dt)
```

For these six patients, the total number of separate changepoint
detection problems is 24 chromosomes x 6 patients = 144 problems. We
plot each problem below in a separate facet (panel). 

```{r plotProfiles, fig.width=10}
library(ggplot2)
gg.unsupervised <- ggplot()+
  ggtitle("unsupervised changepoint detection = only noisy data sequences")+
  theme(
    panel.spacing=grid::unit(0, "lines"),
    panel.border=element_rect(fill=NA, color="grey50")
  )+
  facet_grid(status.profile ~ chromosome, scales="free", space="free_x")+
  geom_point(aes(position/1e6, logratio),
             data=six.profiles,
             shape=1)+
  scale_x_continuous(
    "position on chromosome (mega bases)",
    breaks=c(100, 200))+
  scale_y_continuous(
    "logratio (approximate DNA copy number)",
    limits=c(-1,1)*1.1)
print(gg.unsupervised)
```

Note the facet titles on the right:

* The top three profiles are "ok" because those children
  completely recovered.
* The bottom three profiles are "relapse" because the
  neuroblastoma cancer came back, and required another treatment at
  the hospital.

**Question:** can you visually identify any differences between the
"ok" and "relapse" profiles?

How to choose the number of changepoints? In the unsupervised setting,
our only option is to use theoretical/statistical arguments, which
give information criteria such as Aikaike Information Criterion (AIC,
penalty=2) or Bayesian/Schwarz Information Criterion (BIC/SIC,
penalty=log n). More on this later.

## Supervised: learn a penalty function with minimal incorrect labels

In supervised changepoint detection, there are labels which indicate
presence and absence of changepoints in particular data subsets. For
the same set of 6 profiles, we superimpose the labels in the plot
below.

```{r plotLabels, fig.width=10}
six.labels <- someProfiles(neuroblastoma$annotations)
change.colors <- c(
  "1change"="#ff7d7d",
  breakpoint="#a445ee",
  normal="#f6c48f"
  )
gg.supervised <- gg.unsupervised+
  ggtitle(paste(
    "supervised changepoint detection = data + labels",
    "that indicate specific regions with or without changepoints"))+
  penaltyLearning::geom_tallrect(aes(
    xmin=min/1e6, xmax=max/1e6, fill=annotation),
    alpha=0.5,
    color=NA,
    data=six.labels)+
  scale_fill_manual("label", values=change.colors)+
  theme(legend.position="bottom")
print(gg.supervised)
```

**New function:** `geom_tallrect` is like `geom_rect` but the ymin is
always the bottom of the panel and the ymax is always the top of the
panel. These are useful for drawing labeled intervals of x values.

It is clear from this plot that the neuroblastoma data satisfy the two
criteria which are necessary for a supervised changepoint analysis:

* The data consist of **several sequences** with similar signal/noise
  patterns, which are treated as separate changepoint detection
  problems.
* Some data sequences have **labels** which indicate specific regions
  where the model should predict presence or absence of changepoints.
  
As we will see later in the tutorial, the labels can be used for
choosing the best model and parameters. The main idea is that the
changepoint model should be consistent with the labels:

* Every breakpoint/positive label is a region where the model should
  predict at least one changepoint. If the model predicts no
  changepoints in a region with a breakpoint label, that is considered
  a false negative.
* Every normal/negative label is a region where the model should
  predict no changepoints. If the model predicts one or more
  changepoints in a region with a normal label, that is considered a
  false positive.
* The overall goal is find a changepoint model that minimizes the
  number of incorrectly predicted labels, which is defined as the sum
  of false positives and false negatives.

## Label error of an unsupervised changepoint model

In this section we show that typical unsupervised changepoint models
result in prediction errors with respect to the labels in the
neuroblastoma data. We begin by using the `changepoint::cpt.mean`
function to compute an unsupervised changepoint model for each data
sequence. To compute the number of incorrect labels later, make sure
to save a description of the model in a "tidy" data.frame
(observations on rows and variables on columns). In the code below we

* fit the model inside `by=list(status.profile, chromosome)` which adds
  those two columns to the resulting `data.table`.
* add the `pen.name` column to identify the model.
* use the `changes` list column to save the changepoint positions (in
  the same units as the label positions: bases on the chromosome).

```{r unsupervised-models}
pen.name <- "SIC0"
(unsupervised.models <- six.profiles[, {
  fit.pelt <- changepoint::cpt.mean(
    logratio.norm, penalty=pen.name, method="PELT")
  end <- fit.pelt@cpts
  before.change <- end[-length(end)]
  after.change <- before.change+1L
  data.table(
    pen.name,
    pen.value=fit.pelt@pen.value,
    changes=list(
    as.integer((position[before.change]+position[after.change])/2)
    ))
}, by=list(profile.id, status.profile, chromosome)])
```

Next, we convert the list column `changes` to tall format (one row for
each changepoint).

```{r unsupervised-changes}
(unsupervised.changes <- unsupervised.models[, data.table(
  change=changes[[1]]
), by=list(profile.id, status.profile, chromosome, pen.name)])
```

**New function:** we now introduce the `penaltyLearning::labelError`
function, which computes the number of incorrect labels for a set of
predicted changepoints. Its inputs are 3 data.frames:

* `models` one row per changepoint model.
* `labels` one row for each label, with columns for label location and
  type.
* `changes` one row for each predicted changepoint, with a column for
  the predicted changepoint position.

```{r unsupervised-error-list}
unsupervised.error.list <- penaltyLearning::labelError(
  unsupervised.models, six.labels, unsupervised.changes,
  problem.vars=c("status.profile", "chromosome"),
  model.vars="pen.name",
  change.var="change")
```

The result of `labelError` is a list of two data tables:

* `model.errors` has one row per (sequence,model), with the total
  number of incorrect labels. We use these totals in the `ggtitle` of
  the plot below.
* `label.errors` has one row per (sequence,model,label), with the
  fp/fn status of each label. We use these to visualize false negative
  and false positive predictions using the `geom_tallrect` with
  `aes(linetype=status)` below.
  
```{r gg-unsupervised-bic, fig.width=10}
gg.supervised+
  theme(legend.box="horizontal")+
  unsupervised.error.list$model.errors[, ggtitle(paste0(
    "Unsupervised ", pen.name, " penalty has ",
    sum(errors),
    " incorrect labels: ",
    sum(fp), " FP + ",
    sum(fn), " FN"))]+
  geom_vline(aes(
    xintercept=change/1e6),
    color="green",
    size=1,
    linetype="dashed",
    data=unsupervised.changes)+
  geom_tallrect(aes(
    xmin=min/1e6, xmax=max/1e6, linetype=status),
    fill=NA,
    data=unsupervised.error.list$label.errors)+
  scale_linetype_manual("error type", values=c(
    correct=0,
    "false negative"=3,
    "false positive"=1))
```

Note that in the plot above we also plotted the predicted changepoints
using a green dashed `geom_vline`. Labels with inconsistent
changepoints are shown with a black border in the `geom_tallrect`,
with `scale_linetype_manual` to show the type of prediction error:

* **False positives** are labels with too many predicted changepoints,
  drawn with a solid black border (none here but we will see some
  later).
* **False negatives** are labels with too few predicted changepoints,
  drawn with a dotted black border.
* **Correct** labels with no border have an acceptable number of
  predicted changepoints.

**Remark:** the result above (BIC/SIC predicts too few changepoints,
resulting in false negatives) is specific to the neuroblastoma data
set. However, in general you can expect that unsupervised penalty
functions will commit changepoint detection errors which are visually
obvious. This is the main motivation of supervised changepoint
analysis: we can use labels to learn a penalty function with improved
changepoint detection accuracy.

**Exercise 1 during class:** try changing the `pen.name` variable 
to another theretically-motivated unsupervised penalty. Other options
from `help(cpt.mean)` are SIC (without a zero), MBIC, AIC, and
Hannan-Quinn. Can you find a penalty with improved changepoint
detection accuracy? Also try changing the `some.ids` variable (to some
profile IDs from 1 to 603) to see the changepoints and label errors
for another set of profiles.

## Creating labels via visual inspection

To quickly create labels for many data sequences, I recommend using a
GUI in which you can use the mouse to draw label rectangles on the
scatterplots. I wrote
[annotate_regions.py](https://pypi.python.org/pypi/annotate_regions/1.0)
for labeling the neuroblastoma data set, and
[SegAnnDB](https://github.com/tdhock/SegAnnDB) for labeling general
DNA copy number profile data. You may want to write another GUI for
your particular data -- make sure it supports visualizing the data
sets and interactive labeling.

Even if you don't have GUI software, you can always just plot the data
in R, and then write the labels in the R code. Begin by plotting one
data set by itself to see the details of the signal and noise.

```{r zoom}
zoom.profile <- 4     #Exercise: change this value!
zoom.chromosome <- 14 #Exercise: change this value!
zoom.pro <- neuroblastoma.dt[
  profile.id==zoom.profile & chromosome==zoom.chromosome]
zoom.gg <- ggplot()+
  geom_point(aes(position/1e6, logratio),
             data=zoom.pro,
             shape=1)+
  scale_y_continuous(
    "logratio (approximate DNA copy number)")
print(zoom.gg)
```

Based on your visual interpretation of the plot above, you can create
a set of labels. 

* For each region that contains at least one changepoint, create a
  "breakpoint" label (no predicted changes inside is a false
  negative).
* For each region that contains no changepoints, create a "normal"
  label (one or more predicted changes inside is a false positive).
* For each region that contains exactly one changepoint, create a
  "1change" label (no predicted changes inside is a false negative,
  and two or more is a false positive).

The region for each label can be as small or as large as you want.
Because the goal of learning is to minimize the number of incorrectly
predicted labels, it is important to make sure that each label is a
correct representation of the changepoint model you want. Avoid
labeling any regions for which you are not sure about the label. Save
your labels to a data.frame with the label type and positions, as
below.

```{r manual-label-fun}
label <- function(min, max, annotation){
  data.frame(
    profile.id=paste(zoom.profile),
    chromosome=paste(zoom.chromosome),
    min, max, annotation)
}
zoom.labels <- rbind(# Exercise: change these values!
  label(70e6, 90e6, "1change"), 
  label(20e6, 60e6, "normal"))
zoom.gg.lab <- zoom.gg+
  penaltyLearning::geom_tallrect(aes(
    xmin=min/1e6, xmax=max/1e6, fill=annotation),
    alpha=0.5,
    color=NA,
    data=zoom.labels)+
  scale_fill_manual("label", values=change.colors)
print(zoom.gg.lab)
```

The plot above shows the noisy data along with the labels. Note that
it is possible to have multiple labels per data sequence, even though
that is not the case for the neuroblastoma data set.

# A typical supervised changepoint analysis

In this section we explain how to perform a supervised changepoint
analysis on a labeled data set.

## Overview of supervised changepoint computations

A typical supervised changepoint analysis consists of the
following computations:

* [Changepoint models of increasing complexity](#changepoint-models-of-increasing-complexity).
  For each of several labeled segmentation problems (data sequences
  that are separate but have a similar signal/noise pattern), use your
  favorite changepoint detection package to compute a *sequence* of
  models of increasing complexity (say from 0 to 20 changepoints).
* [Label error](#label-error). Use `penaltyLearning::labelError` to
  compute the number of incorrect labels for each labeled segmentation
  problem and each changepoint model. The goal of learning is to
  minimize the number of incorrectly predicted labels.
* [Model selection](#model-selection). Use
  `penaltyLearning::modelSelection` to compute the exact path of
  models that will be selected for every possible non-negative penalty
  value.
* [Outputs](#outputs). Use `penaltyLearning::targetIntervals` to
  compute a target interval of log(penalty) values that predicts the
  minimum number of incorrect labels for each segmentation
  problem. Create a target interval matrix (one row for each
  segmentation problems, 2 columns) which can be used as the output in
  the supervised machine learning problem.
* [Inputs](#inputs). Compute a feature matrix (segmentation problems x
  features) using `penaltyLearning::featureMatrix`. Or do it yourself
  using simple statistics of each segmentation problem (quantiles,
  mean, number of data points, estimated variance, etc).
* [Learning and prediction](#learning-and-prediction). Use
  `survival::survreg` or `penaltyLearning::IntervalRegressionCV` to
  learn a penalty function.
* [Evaluation](#evaluation).  use `penaltyLearning::ROChange` to
  compute ROC curves using labels in a test set (that were not used
  for training). The AUC (area under the ROC curve) and the percent
  incorrect labels in the test set can be used to evaluate the
  prediction accuracy of your model.
  
The mathematical details are described in
[our ICML'13 paper, Learning Sparse Penalties for Change-point Detection using Max Margin Interval Regression](http://proceedings.mlr.press/v28/hocking13.html).

## Changepoint models of increasing complexity

To begin we will perform the supervised changepoint computations on
just one of the labeled neuroblastoma data sequences,

```{r zoom-gg-lab}
zoom.gg.lab <- ggplot()+
  geom_point(aes(position/1e6, logratio),
             data=zoom.pro,
             shape=1)+
  scale_y_continuous(
    "logratio (approximate DNA copy number)")+
  penaltyLearning::geom_tallrect(aes(
    xmin=min/1e6, xmax=max/1e6, fill=annotation),
    alpha=0.5,
    color=NA,
    data=zoom.labels)+
  scale_fill_manual("label", values=change.colors)
print(zoom.gg.lab)
```

We use the code below to compute a sequence of maximum likelihood
Gaussian models with $s\in\{1,\dots,7\}$ segments. In contrast, a
typical unsupervised analysis will include computation of *one*
changepoint model. In supervised analysis we need to compute *a
sequence of changepoint models of increasing complexity* in order to
determine which are consistent and inconsistent with the
labels. Remember the goal of the machine learning is to predict a
model with complexity consistent with the labels (minimize the number
of incorrectly predicted labels).

**New function:** `Segmentor3IsBack::Segmentor` computes maximum
likelihood segmentations from 1 to Kmax segments. Inputs are the data
sequence to segment, the likelihood (model=2 means normal/Gaussian),
and the maximum number of segments (Kmax). For a supervised
changepoint analysis, Kmax should be rather large, but how do you know
if it is large enough? You should check to make sure there is at least
one model with a false positive (a labeled region with too many
predicted changepoints), if that is possible (not possible for data
sequences with only breakpoint labels). These false positives in the
training labels/models are used to learn a penalty function that avoids
predicting too many changepoints. More details below in the
[Outputs](#outputs) section.

```{r zoomSegmentor}
max.segments <- 7
(fit <- Segmentor3IsBack::Segmentor(
  zoom.pro$logratio.norm, model=2, Kmax=max.segments))
```

The Segmentor `fit` object is an S4 class with slots

* `breaks` an integer matrix of end position of each segment.
* `parameters` a numeric matrix of the mean of each segment.

Next, we convert the model to tidy format (a `data.frame` with one row
per segment). The code below is specific to the `Segmentor` function;
if you use a different changepoint detection package/function, you
will have to write some other code to tidy the data. 

```{r zoom-segs}
zoom.segs.list <- list()
zoom.loss.vec <- rep(NA, max.segments)
for(n.segments in 1:max.segments){
  end <- fit@breaks[n.segments, 1:n.segments]
  data.before.change <- end[-n.segments]
  data.after.change <- data.before.change+1
  pos.before.change <- as.integer(
  (zoom.pro$position[data.before.change]+
   zoom.pro$position[data.after.change])/2)
  start <- c(1, data.after.change)
  chromStart <- c(zoom.pro$position[1], pos.before.change)
  chromEnd <- c(pos.before.change, max(zoom.pro$position))
  seg.mean.vec <- fit@parameters[n.segments, 1:n.segments]
  zoom.segs.list[[n.segments]] <- data.table(
    profile.id=paste(zoom.profile),
    chromosome=paste(zoom.chromosome),
    n.segments, # model complexity.
    start, # in data points.
    end,
    chromStart, # in bases on chromosome.
    chromEnd,
    mean=seg.mean.vec)
  data.mean.vec <- rep(seg.mean.vec, end-start+1)
  stopifnot(length(data.mean.vec)==nrow(zoom.pro))
  zoom.loss.vec[n.segments] <- sum((zoom.pro$logratio.norm-data.mean.vec)^2)
}
(zoom.segs <- do.call(rbind, zoom.segs.list))
```

Above we show the data.table of optimal segment means. Below we
compute the data.table of predicted changepoint positions, then plot
the models.

```{r zoom-changes}
(zoom.changes <- zoom.segs[1 < start, data.table(
  profile.id, chromosome, n.segments,
  changepoint=chromStart)])
zoom.gg.models <- zoom.gg.lab+
  theme(
    panel.spacing=grid::unit(0, "lines"),
    panel.border=element_rect(fill=NA, color="grey50")
  )+
  facet_grid(n.segments ~ .)+
  geom_vline(aes(
    xintercept=changepoint/1e6),
    data=zoom.changes,
    color="green",
    size=1,
    linetype="dashed")
print(zoom.gg.models)
```

The plot above shows a panel for each model from 1 to 7 segments (0 to
6 changepoints). Each changepoint is drawn using a green dashed
`geom_vline`.

For more details about the `Segmentor` function, read the references
below. To save time during class we will skip to the next section,
"Label error."

* Let there be $n$ separate data sequences to segment. For example in
  the previous section we plotted $n=144$ data sequences, each a
  separate changepoint detection problem.
* For any data sequence $i\in\{1,\dots,n\}$, let $d_i$ be the number
  of data points in data sequence $i$, and let $\mathbf z_i\in\mathbb
  R^{d_i}$ be the vector of data points to segment for that sequence.
* The Segment Neighborhood problem with $s$ segments for data sequence
  $i$ is to find the most likely $s-1$ changepoints:
$$
\operatorname{minimize}_{\mathbf m\in\mathbb R^{d_i}}
\sum_{j=1}^{d_i} \ell(z_{ij}, m_j) 
\text{ subject to}
\sum_{j=1}^{d_i-1} I(m_{j} \neq m_{j+1})=s-1.
$$

For the Normal homoscedastic model we have

* the optimization variable $\mathbf m = (m_1 \cdots m_{d_i})$ is a
  vector of $d_i$ real-valued segment mean variables.
* the loss function $\ell$ is the square loss $\ell(z,m)=(z-m)^2$.
* the indicator function $I$ is 1 for every change-point where
  $m_{j} \neq m_{j+1}$, and 0 otherwise.
* $\lambda>0$ is a non-negative penalty constant.
* the objective function is the total loss over all $d_i$ data points,
  which is convex.
* the constraint is that the mean $\mathbf m$ must have exactly $s-1$
  changes, which is non-convex.
* overall the optimization problem is non-convex, which means that
  gradient-based algorithms are not guaranteed to compute the global
  minimum. However the dynamic programming algorithms discussed below
  can be used to compute the global minimum in a reasonable amount of
  time.

A few references about the algorithm implemented in the `Segmentor` function:

* the "Segment Neighborhood" problem was originally described by
  [Auger and Lawrence (1989)](https://www.ncbi.nlm.nih.gov/pubmed/2706400). They
  proposed an $O(S_{\text{max}} d^2)$ algorithm for computing the most
  likely models with $s\in\{1,\dots,S_{\text{max}}\}$ segments, for
  $d$ data points. An R implementation of their algorithm is available
  as `changepoint::cpt.mean(method="SegNeigh")` but it is very slow
  for large $d$, due to the quadratic time complexity.
* [Rigaill (2010) discovered a Pruned Dynamic Programming Algorithm (PDPA)](https://arxiv.org/abs/1004.0887)
  which solves the same "Segment Neighborhood" problem. Its
  $O(S_{\text{max}}d\log d)$ time complexity makes it much faster for
  larger data sets. The original R implementation `cghseg:::segmeanCO`
  only works for the Gaussian/normal likelihood.
* [Cleynen et al (2014) generalized the PDPA to other loss/likelihood models](https://almob.biomedcentral.com/articles/10.1186/1748-7188-9-6). Their
  R function `Segmentor3IsBack::Segmentor` implements the
  $O(S_{\text{max}}d\log d)$ PDPA for Poisson, Normal homoscedastic
  (uniform variance -- change in mean), Negative Binomial, Normal
  Heteroscedastic (uniform mean -- change in variance), and
  Exponential likelihood models.
  
**Exercise after class:** instead of using `Segmentor` to compute a
sequence of models, try using
`changepoint::cpt.mean(penalty="CROPS")`,
[arXiv:1412.3617](https://arxiv.org/abs/1412.3617). Then re-do the
rest of the analyses below. Another option is
`cghseg:::segmeanCO`. Using any of these three packages you should be
able to get the same results.

## Label error

Next, we compute the number of incorrect labels for each model, using
the `penaltyLearning::labelError` function that we saw above. We then
plot the label errors.

```{r zoomlabelerror}
zoom.models <- data.table(
  profile.id=paste(zoom.profile),
  chromosome=paste(zoom.chromosome),
  loss=zoom.loss.vec,
  n.segments=as.numeric(1:max.segments))
zoom.error.list <- penaltyLearning::labelError(
  zoom.models,
  zoom.labels,
  zoom.changes,
  change.var="changepoint",
  problem.vars=c("profile.id", "chromosome"))
zoom.gg.models+
  penaltyLearning::geom_tallrect(aes(
    xmin=min/1e6,
    xmax=max/1e6,
    linetype=status),
    data=zoom.error.list$label.errors,
    fill=NA)+
  scale_linetype_manual("error type", values=c(
    correct=0,
    "false negative"=3,
    "false positive"=1))
```

It is clear from the plot above that each model has a different number
of labels which are correctly predicted. Models with too few
changepoints cause false negatives, and models with too many
changepoints cause false positives. However there is a subset of
models with minimal incorrect labels. The goal of a supervised
changepoint detection algorithm is to predict one of those models.

## Model selection

Rather than using a regression model to directly predict the
integer-valued number of segments/changepoints, we will instead learn
a regression model that predicts the real-valued log(penalty). The
model selection function is a mapping from penalty values to model
size (in segments/changepoints).

Notice that `zoom.models` contains a `loss` column, which is the sum
of squared residuals of each model. Below we use that to compute the
models that will be selected for every possible penalty
value. 

**New function:** In the code below we use the
`penaltyLearning::modelSelection` function, which takes as input a
`data.frame` with one row per changepoint model. There should be at
least the following columns (but there can be others):

* IDs for the data sequence (here, `profile.id` and `chromosome`).
* complexity, such as number of changepoints/segments. This column
  name can be specified via the `complexity` argument, as below.
* loss, such as the residual sum of squares or negative log
  likelihood. This should decrease as model complexity increases, and
  the column name can be specified via the `loss` argument (default is
  `"loss"` which is present in `zoom.models` so no need to specify in
  the code below).
  
The output is a data.frame with one row per model that is selected for
at least one penalty value. There are columns for the min/max penalty
that will select each model, as shown below.

```{r zoomSelection}
print(zoom.models)
(zoom.selection <- penaltyLearning::modelSelection(
  zoom.models, complexity="n.segments"))
ggplot()+
  geom_segment(aes(
    min.log.lambda, n.segments,
    xend=max.log.lambda, yend=n.segments),
    size=1,
    data=zoom.selection)+
  xlab("log(penalty) = log(lambda)")+
  scale_y_continuous(breaks=zoom.selection$n.segments)
```

We plot the model selection function in the figure above, which shows
that it is a piecewise constant non-increasing function. Intuitively,
the larger the penalty, the fewer the number of changepoints/segments (and vice
versa). More concretely, let $L_{i,s}$ be the loss (sum of squared
residuals) of the model with $s\in\{1,\dots,s_{\text{max}}\}$ segments
for data sequence $i$. The model selection function is defined for
every non-negative penalty $\lambda\geq 0$ as

$$
s_i^*(\lambda) = \text{arg min}_s L_{i,s} + \lambda s 
$$

## Outputs

The output in the machine learning problem is an interval of
log(penalty) values that achieves minimum incorrect labels. Remember that
we have computed `zoom.error.list$model.errors` which contains the
number of incorrect labels $e_i(s)\in\{0,1,2,\dots\}$ for this data
sequence $i$ and every number of segments $s$. We can thus compute the
label error as function of the penalty, 
$$
E_i(\lambda) = e_i[s^*_i(\lambda)].
$$

In R this computation amounts to a join with `zoom.selection` (which
came from `penaltyLearning::modelSelection`).

```{r zoomploterrors}
zoom.error.join <- zoom.error.list$model.errors[J(zoom.selection), on=list(
  profile.id, chromosome, n.segments, loss)]
zoom.errors.tall <- data.table::melt(
  zoom.error.join,
  measure.vars=c("n.segments", "errors"))
zoom.gg.errors <- ggplot()+
  geom_segment(aes(
    min.log.lambda, value,
    xend=max.log.lambda, yend=value),
    size=1,
    data=zoom.errors.tall)+
  theme_bw()+
  theme(panel.spacing=grid::unit(0, "lines"))+
  facet_grid(variable ~ ., scales="free")+
  scale_y_continuous("", breaks=0:max.segments)+
  xlab("log(penalty) = log(lambda)")
print(zoom.gg.errors)
```

The code above plots the number of incorrect labels $E_i(\lambda)$,
clearly showing that it is a piecewise constant function that takes
integer values (like the 01 loss in binary classification). The goal
of the machine learning algorithm is to provide predictions that
minimize this function (on test data).

Below we compute the output for the machine learning problem, a target
interval of penalty values that yield changepoint models with minimal
incorrect labels.

**New function:** the `penaltyLearning::targetIntervals` function
computes the upper and lower limits of the target interval, which is
the largest range of log(penalty) values that results in minimum
incorrectly predicted labels. Its input parameter is a data.frame with
one row per model (from `penaltyLearning::modelSelection`), with an
additional column `errors` for the number of incorrect labels. The
`problem.vars` argument indicates the data sequence ID columns. 
The function returns a data.table with one row per labeled data sequence,
with columns for the lower and upper bounds of the target interval
(`min.log.lambda`, `max.log.lambda`).
  
```{r zoomplottargets}
print(zoom.error.join)
(zoom.target <- penaltyLearning::targetIntervals(
  zoom.error.join,
  problem.vars=c("profile.id", "chromosome")))
zoom.target.tall <- data.table::melt(
  zoom.target,
  measure.vars=c("min.log.lambda", "max.log.lambda"),
  variable.name="limit")[is.finite(value)]
zoom.gg.errors+
  geom_text(aes(
    ifelse(limit=="min.log.lambda", value-1, value+1),
    errors+1,
    label=paste(
      "false", 
      ifelse(limit=="min.log.lambda", "positives", "negatives"),
      "\ntoo",
      ifelse(limit=="min.log.lambda", "many", "few"),
      "changes"),
    hjust=ifelse(
      limit=="min.log.lambda", 0, 1)),
    data=data.frame(zoom.target.tall, variable="errors"),
    vjust=-1)+
  geom_point(aes(
    value,
    errors,
    fill=limit),
    shape=21,
    size=4,
    data=data.frame(zoom.target.tall, variable="errors"))+
  scale_fill_manual("limit", values=c(
    min.log.lambda="black",
    max.log.lambda="white"))+
  theme(
    legend.position="bottom",
    legend.box="horizontal")
```

The plot above emphasizes the target interval, which is used as the
output in the machine learning problem:

* the black dot shows the lower limit of the target interval. If a
  smaller log(penalty) is predicted, then there are too many changepoints (false
  positives).
* the white dot shows the upper limit of the target interval. If a
  larger log(penalty) is predicted, then there are too few
  changepoints (false negatives).
  
The target interval is different for each labeled data sequence, as we
show below. We will compute the target interval for each of the
labeled data sequences in the six profiles from the beginning of the
tutorial. The first step is to compute changepoints and model
selection functions, which we do via a for loop over each data
sequence below.
  
```{r sixSegmentor}
six.problems.dt <- unique(six.profiles[, list(
  profile.id, chromosome, status.profile)])
setkey(six.profiles, profile.id, chromosome)
six.segs.list <- list()
six.selection.list <- list()
for(problem.i in 1:nrow(six.problems.dt)){
  meta <- six.problems.dt[problem.i,]
  pro <- six.profiles[meta]
  max.segments <- min(nrow(pro), 10)
  fit <- Segmentor3IsBack::Segmentor(
    pro$logratio.norm, model=2, Kmax=max.segments)
  rss.vec <- rep(NA, max.segments)
  for(n.segments in 1:max.segments){
    end <- fit@breaks[n.segments, 1:n.segments]
    seg.mean.vec <- fit@parameters[n.segments, 1:n.segments]
    data.before.change <- end[-n.segments]
    data.after.change <- data.before.change+1
    pos.before.change <- as.integer(
      (pro$position[data.before.change]+pro$position[data.after.change])/2)
    start <- c(1, data.after.change)
    chromStart <- c(pro$position[1], pos.before.change)
    chromEnd <- c(pos.before.change, max(pro$position))
    data.mean.vec <- rep(seg.mean.vec, end-start+1)
    rss.vec[n.segments] <- sum((pro$logratio.norm-data.mean.vec)^2)
    six.segs.list[[paste(problem.i, n.segments)]] <- data.table(
      meta,
      n.segments,
      start,
      end,
      chromStart,
      chromEnd,
      mean=seg.mean.vec)
  }
  loss.dt <- data.table(
    meta,
    n.segments=1:max.segments,
    loss=rss.vec)
  six.selection.list[[problem.i]] <- penaltyLearning::modelSelection(
    loss.dt, complexity="n.segments")
}
six.selection <- do.call(rbind, six.selection.list)
six.segs <- do.call(rbind, six.segs.list)
six.changes <- six.segs[1 < start]
```

Inside of each iteration of the loop above, we used

* `Segmentor3IsBack::Segmentor` to compute the optimal changepoint
  models, saving the tidy result data.frame of segments to an element
  of `six.segs.list`.
* `penaltyLearning::modelSelection` to compute the model selection
  functions, saving the result to an element of `six.selection.list`.

Below, we use 

* `penaltyLearning::labelError` to compute the number of incorrect
  labels for each segmentation model.
* `penaltyLearning::targetIntervals` to compute the target interval
  of penalty values for each labeled data sequence.

```{r sixPlotErrors}
six.error.list <- penaltyLearning::labelError(
  six.selection, six.labels, six.changes,
  problem.vars=c("profile.id", "chromosome"))
six.targets <- penaltyLearning::targetIntervals(
  six.error.list$model.errors,
  problem.vars=c("profile.id", "chromosome"))
ymax.err <- 1.2
six.targets.tall <- data.table::melt(
  six.targets,
  measure.vars=c("min.log.lambda", "max.log.lambda"),
  variable.name="limit",
  value.name="log.lambda")[is.finite(log.lambda)]
six.gg.errors <- ggplot()+
  theme_bw()+
  theme(panel.spacing=grid::unit(0, "lines"))+
  facet_grid(profile.id ~ chromosome, labeller=function(df){
    if("chromosome" %in% names(df))
      df$chromosome <- paste0("chr", df$chromosome)
    if("profile.id" %in% names(df))
      df$profile.id <- paste("profile", df$profile.id)
    df
  })+
  geom_segment(aes(
    min.log.lambda, errors,
    xend=max.log.lambda, yend=errors),
    size=1,
    data=six.error.list$model.errors)+
  geom_point(aes(
    log.lambda,
    errors,
    fill=limit),
    shape=21,
    size=3,
    alpha=0.5,
    data=six.targets.tall)+
  scale_fill_manual("limit", values=c(
    min.log.lambda="black",
    max.log.lambda="white"))+
  theme(
    legend.position="bottom",
    legend.box="horizontal")+
  scale_y_continuous(
    "incorrectly predicted labels",
    limits=c(0,ymax.err),
    breaks=c(0,1))+
  scale_x_continuous(
    "predicted log(penalty) = log(lambda)")
print(six.gg.errors)
```

The plot above shows the error function and target intervals for each
labeled data sequence. Each column of panels/facets represents a
different chromosome, and each row represents a different
profile/patient. It is clear from the plot above that a model with
minimal incorrect labels will predict a different penalty for each
data sequence.

**Question:** discuss with your neighbor for 1 minute, why is there
only one limit per data sequence? (only one target limit point per
panel in the plot above)

**Exercise 2 during class:** create a set of labels via visual
inspection, and visualize how changing the labels affects the
computation of the target interval.

## Inputs

To train a machine learning model, we first need to compute a vector
$\mathbf x_i\in\mathbb R^p$ of exactly $p$ features for each data sequence
$i$. 
We then learn a function $f(\mathbf x_i)=\log\lambda_i\in\mathbb R$. 
To compute a feature matrix, use the code below.

**New function:** the `penaltyLearning::featureMatrix` function
computes a feature matrix for some data sequences. Inputs are a
data.frame of data sequences (`six.profiles` below), along with column
names of IDs (`problem.vars`) and data (`data.var`). Output is a
numeric matrix with one row per data sequence and one column per
feature.

```{r sixFeatureMatrix}
print(six.profiles)
feature.mat <- penaltyLearning::featureMatrix(
  six.profiles,
  problem.vars=c("profile.id", "chromosome"),
  data.var="logratio")
str(feature.mat)
```

The feature matrix is 144 rows (one for each data sequence) by 365
columns/features. Which features should we use to predict log(penalty)
values? If you have a lot of labels, you can use
`penaltyLearning::IntervalRegressionCV` which uses L1-regularization
to automatically select relevant features -- more details in the next
section.

For now we will discuss the simpler unsupervised BIC/SIC
model, which corresponds to solving the following changepoint optimization problem.
$$
\operatorname{minimize}_{\mathbf m\in\mathbb R^{d_i}} 
\sum_{j=1}^{d_i}
\ell(z_{ij}, m_j) +
\underbrace{(\log d_i)}_{\lambda_i} 
\sum_{j=1}^{d_i-1} 
I(m_{j} \neq m_{j+1}).
$$
The BIC/SIC penalty is $\lambda_i =
\log d_i$, where $d_i$ is the number of data points to segment in
sequence $i$. In the changepoint package version 2.2 this is
`penalty="SIC0"`. 

In our framework, we can write the BIC/SIC criterion in terms of $p=1$
feature, $x_i = \log\log d_i$ -- this is the column `n.loglog` of the
feature matrix. The unsupervised BIC/SIC penalty function is thus a
linear model $\log\lambda_i = f(x_i)=w x_i + \beta$ with intercept
$\beta=0$ and slope $w=1$. 


```{r six-BIC-dt}
slope <- 1
intercept <- 0
bic.line.dt <- data.table(
  slope, intercept, model="BIC/SIC")
six.BIC.dt <- six.profiles[, list(
  feature=log(log(.N)),
  model="BIC/SIC"
), by=list(profile.id, chromosome)]
six.BIC.dt[, pred.log.lambda := feature*slope + intercept]
```

The plot below visualizes the target interval limits in this feature
space.

```{r sixLimitsPlot}
six.features.tall <- six.BIC.dt[six.targets.tall, on=list(
  profile.id, chromosome)]
six.gg.limits <- ggplot()+
  geom_point(aes(
    feature, log.lambda, fill=limit),
    shape=21,
    data=six.features.tall)+
  scale_fill_manual("limit", values=c(
    min.log.lambda="black",
    max.log.lambda="white"),
    breaks=c("max.log.lambda", "min.log.lambda"))+
  scale_x_continuous(
    "feature")+
  scale_y_continuous(
    "log(penalty)=log(lambda)")
print(six.gg.limits)
```

In the plot above, each point represents a finite limit of a target
interval. The goal of learning is to find a regression function which
is below the white dots (upper limits) and above the black dots (lower
limits). This is a machine learning problem called "regression with
left, right, and interval censored outputs" or "interval regression"
for short. Note that this is not the same problem as binary
classification, for which the vertical axis is a second feature. 

In this context we can plot the BIC/SIC model as a line with slope 1
and intercept 0, and we can visualize its errors using the vertical
line segments below.

```{r sixBIC}
six.features.targets <- six.BIC.dt[six.targets, on=list(profile.id, chromosome)]
six.features.targets[, residual := 0]
six.features.targets[{
  is.finite(min.log.lambda)|is.finite(max.log.lambda)
}, residual := {
  penaltyLearning::targetIntervalResidual(
    cbind(min.log.lambda, max.log.lambda), pred.log.lambda)
}]
model.colors <- c(
  constant="black",
  "BIC/SIC"="#1B9E77",#green
  survreg="#D95F02",#orange
  IntervalRegressionCV="#7570B3")#dark purple
six.gg.bic <- six.gg.limits+
  scale_color_manual(values=model.colors)+
  geom_abline(aes(
    slope=slope, intercept=intercept, color=model),
    size=1,
    data=bic.line.dt)+
  geom_segment(aes(
    feature, pred.log.lambda,
    xend=feature, yend=pred.log.lambda-residual,
    color=model),
    data=six.features.targets)
print(six.gg.bic)
```

**New function:** the `penaltyLearning::targetIntervalResidual`
function computes the residual of a predicted log(penalty) with
respect to a target interval (zero if prediction is inside the
interval, positive for penalty too large, negative for penalty too
small). The first argument is a 2-column numeric matrix of target
intervals, and the second argument is a numeric vector of predicted
log(penalty) values. The output is a numeric vector of residuals,
useful for showing the errors in regression plots such as the one above.

We can see from the plot above that there are 5 data sequences for
which the BIC/SIC predicts a penalty value outside of the target
interval (penalty too large = too few changepoints). It is clear that
the number of errors could be reduced by simply decreasing the
intercept (moving the line down). However there is no line in this
feature space that perfectly separates the white and black points.

We can also visualize the number of incorrect labels for the BIC/SIC
model by plotting the data, segmentation and labels below. Note that
this BIC/SIC model was computed using `Segmentor3IsBack::Segmentor`,
but it yields exactly the same model as `changepoint::cpt.mean`
(compare with previous figure above).

```{r figBIC, fig.width=10}
six.BIC.dt[, pred := pred.log.lambda]
six.BIC.selection <- data.table(six.selection)[six.BIC.dt, on=list(
  profile.id, chromosome,
  min.log.lambda < pred,
  max.log.lambda > pred)]
six.BIC.labels <- six.error.list$label.errors[six.BIC.selection, on=list(
  profile.id, chromosome, n.segments), nomatch=0L]
six.BIC.changes <- six.changes[six.BIC.selection, on=list(
  profile.id, chromosome, n.segments), nomatch=0L]
six.BIC.segs <- six.segs[six.BIC.selection, on=list(
  profile.id, chromosome, n.segments)]
gg.supervised+
  ggtitle("BIC/SIC model changepoints and label errors")+
  penaltyLearning::geom_tallrect(aes(
    xmin=min/1e6,
    xmax=max/1e6,
    linetype=status),
    fill=NA,
    data=six.BIC.labels)+
  scale_linetype_manual("error type", values=c(
    correct=0,
    "false negative"=3,
    "false positive"=1))+
  geom_vline(aes(
    xintercept=chromStart/1e6),
    data=six.BIC.changes,
    color="green",
    size=1,
    linetype="dashed")+
  theme(legend.box="horizontal")
```

**Question:** discuss with your neighbor for a minute, why are there
only four false negatives in this plot, but there are five predictions
outside the target interval in the previous plot with the regression
line? Hint: look at the plot of the error as a function of penalty.

Another way to visualize the predicted penalty values is with the dots
in the error function plot below.

```{r six-gg-errors-bic}
six.BIC.errors <- six.error.list$model.errors[six.BIC.selection, on=list(
  profile.id, chromosome, n.segments), nomatch=0L]
six.gg.errors.bic <- six.gg.errors+
  geom_point(aes(
    pred.log.lambda, errors, color=model),
    size=2,
    data=six.BIC.errors)+
  scale_color_manual(values=model.colors)
print(six.gg.errors.bic)
```

It is clear from the plot above that 

* Most of the BIC/SIC penalty values occur where the error function is
  zero. These are predictions which are correct with respect to the
  labeled regions.
* Some of the penalty values occur where the error function is
  non-zero. These are prediction errors with respect to the labels.
  
**Exercise after class:** can you think of any other features which
would be useful for predicting log(penalty) values? In other words,
what features of a data sequence should be related to the number of
changepoints we should predict? Some ideas are quantiles/mean/sd of
the observed data points. Change the definition of `feature` in
`six.BIC.dt` above, and re-do the plots in this section. Using your
feature, is it possible to learn an affine function with fewer errors
-- are the white and black dots linearly separable?

**Exercise after class:** In the changepoint package version 2.2
`penalty="SIC0"`means $\lambda_i = \log d_i$ (same as the BIC/SIC we
explored in this section). Verify that the predicted penalty values in
this section are the same, by adding a `geom_point` for predicted
log(penalty) values to the plot with the regression line above. Hint:
when we computed `changepoint::cpt.mean` in the first part of the
tutorial, we saved the predicted penalty in the `pen.value` column of
`unsupervised.models`. Join that data.table with
`six.features.targets` to get the predicted penalty for the labeled
data sequences.

**Exercise after class:** In the changepoint package version 2.2
`penalty="SIC"`means $\lambda_i = 2\log d_i$ (twice as much penalty as
the BIC/SIC in this section). What is the slope/intercept of this
model in the same feature space? Re-do the plots in this section. Is
the model more or less accurate? To check your work, use a
`geom_point` with predictions from `changepoint::cpt.mean`, as in the
previous exercise.

## Learning and prediction

In the previous section we explored the unsupervised BIC/SIC penalty,
which uses 1 feature, and does not learn any parameters using the
labels. In this section, we will explain two supervised penalty
learning methods that can be used to increase changepoint detection
accuracy.

### survreg (un-regularized interval regression)

We begin by using the same feature as the BIC/SIC penalty, but
relaxing the assumption that the intercept $\beta=0$ and slope
$w=1$. Instead of taking these parameter values for granted based on
the theoretical arguments of the BIC/SIC penalty, we can use the
`survival::survreg` function to optimize these parameters based on the
labels. The penalty function is
$\log\lambda_i = f( x_i) = \beta + w\log\log
d_i$ which is equivalent to solving the following penalized
changepoint detection problem, 
$$
\operatorname{minimize}_{\mathbf m\in\mathbb R^{d_i}}
\sum_{j=1}^{d_i}
\ell(z_{ij}, m_j) 
+\underbrace{e^\beta(\log d_i)^w }_{\lambda_i}
\sum_{j=1}^{d_i-1} 
I(m_{j} \neq m_{j+1}),
$$
where the intercept $\beta$ and the slope $w$ 
are defined as the solution to the following optimization
problem, 
$$
\operatorname{minimize}_{ 
\mathbf w\in\mathbb R^p,
\beta\in\mathbb R,
\sigma\in\mathbb R_+ 
}
\sum_{i=1}^n
-\log[ 
F(\overline y_i, \mathbf w^\intercal \mathbf x_i+\beta, \sigma)- 
F(\underline y_i, \mathbf w^\intercal \mathbf x_i+\beta, \sigma) 
],
$$
where

* the optimization variables are the coefficients or weight vector
  $\mathbf w\in\mathbb R^p$, the bias/intercept $\beta\in\mathbb R$,
  and the scale $\sigma\in\mathbb R_+$. In this section for the
  BIC/SIC penalty we have $p=1$ feature, but `survreg` can be used
  with more features.
* $(\underline y_i,\overline y_i)$ is the target interval of
  log(penalty) values.
* $\mathbf x_i\in\mathbb R^p$ is the feature vector for data sequence $i$.
* $\log\lambda_i= f(\mathbf x_i)=\mathbf w^\intercal \mathbf
  x_i+\beta$ is the predicted log(penalty) value for data sequence
  $i$. The `survreg` Normal model is $\log \lambda_i
  \sim N(\mathbf w^\intercal \mathbf x_i+\beta, \sigma^2)$.
* $F$ is the likelihood: a cumulative distribution function such as
  the normal, logistic, or Weibull. (specified using the `dist`
  argument) 
* The objective function to minimize is the total negative log
  likelihood, which is convex with respect to the weight vector
  $\mathbf w$ and the bias/intercept $\beta$, but not jointly convex
  with respect to the scale parameter $\sigma$. Thus overall the
  optimization problem is non-convex, meaning that the gradient-based
  algorithms that `survreg` uses are only guaranteed to reach a local
  minimum (not the globally most likely model). 

To perform a more fair comparison,
we will hold out chromosome 11 as a test set, and only train the model
using the other labeled chromosomes.

```{r fit-survreg-only-chr11}
library(survival)
train.dt <- six.features.targets[chromosome!=11]
```

**New function:** the `survival::survreg` function is used below with
three arguments: 

* a formula `output ~ inputs` as in other R modeling functions such as
  `lm` -- the difference is that the left/output must be a `Surv`
  object. When learning a penalty function for changepoint detection,
  we always use `type="interval2"` because all outputs are either
  left, right, or interval censored. (target intervals of log penalty
  values)
* the data.frame where the input/output variables can be found.
* the distribution to use, in this case normal/Gaussian.

```{r fit-survreg}
(fit.survreg <- survreg(
  Surv(min.log.lambda, max.log.lambda, type="interval2") ~ feature,
  train.dt, dist="gaussian"))
```

It is clear that the coefficients of the learned model are different
from the BIC/SIC model (slope=1, intercept=0). We can visualize the
regression line in the feature space via the code below:

```{r six-gg-survreg}
survreg.line.dt <- data.table(t(coef(fit.survreg)), model="survreg")
six.survreg.pred <- data.table(six.BIC.dt)
six.survreg.pred[, model := "survreg"]
six.survreg.pred[, pred.log.lambda := predict(fit.survreg, six.survreg.pred)]
six.survreg.res <- six.survreg.pred[six.features.targets, on=list(
  profile.id, chromosome)]
six.survreg.res[, residual := 0]
six.survreg.res[{
  is.finite(min.log.lambda)|is.finite(max.log.lambda)
}, residual := {
  penaltyLearning::targetIntervalResidual(
    cbind(min.log.lambda, max.log.lambda), pred.log.lambda)
}]
six.gg.survreg <- six.gg.bic+
  geom_abline(aes(
    slope=feature, intercept=`(Intercept)`, color=model),
    size=1,
    data=survreg.line.dt)+
  geom_segment(aes(
    feature, pred.log.lambda,
    xend=feature, yend=pred.log.lambda-residual,
    color=model),
    data=six.survreg.res)
print(six.gg.survreg)
```

It is clear from the plot above that the learned survreg penalty
reduces the number of incorrectly predicted target intervals from 5 to
2.

The plot below shows the predicted errors and the smooth surrogate
loss minimized by the gradient descent algorithm implemented by the
survreg function (gradient descent cannot minimize the non-convex
error function directly).

```{r six-gg-errors-survreg}
six.survreg.pred[, pred.log.penalty := pred.log.lambda]
six.survreg.selection <- data.table(six.selection)[six.survreg.pred, on=list(
  profile.id, chromosome,
  min.log.lambda < pred.log.lambda,
  max.log.lambda > pred.log.lambda)]
six.survreg.errors <- six.error.list$model.errors[six.survreg.selection, on=list(
  profile.id, chromosome, n.segments), nomatch=0L]
log.pen.vec <- six.error.list$model.errors[, seq(
  min(max.log.lambda), max(min.log.lambda), l=100)]
six.survreg.surrogate <- six.survreg.res[, {
  ## For censored observations, survreg finds a linear function that
  ## predicts mean values that maximize the normal CDF (pnorm) --
  ## equivalent to minimizing the negative log likelihood below.
  neg.log.lik <- -log(
    pnorm(max.log.lambda, log.pen.vec, fit.survreg$scale)-
    pnorm(min.log.lambda, log.pen.vec, fit.survreg$scale))
  data.table(
    model="survreg",
    log.penalty=log.pen.vec,
    surrogate.loss=neg.log.lik
  )[neg.log.lik < ymax.err]
}, by=list(profile.id, chromosome)]
six.gg.errors.survreg <- six.gg.errors.bic+
  geom_point(aes(
    pred.log.penalty, errors, color=model),
    size=2,
    data=six.survreg.errors)+
  geom_line(aes(
    log.penalty, surrogate.loss, color=model),
    data=six.survreg.surrogate)
print(six.gg.errors.survreg)
```

Note in the plot above:

* four `survreg` predicted penalties achieve 0 error where the BIC/SIC
  has 1 error. 
* one BIC/SIC prediction achieves 0 error where `survreg` has 1 error.
* the smooth surrogate loss is a good approximation of the non-convex
  error function.

We can also visualize the survreg model predictions by plotting the
data sequences via the code below.

```{r figsurvreg, fig.width=10}
six.survreg.labels <- six.error.list$label.errors[six.survreg.selection, on=list(
  profile.id, chromosome, n.segments), nomatch=0L]
six.survreg.changes <- six.changes[six.survreg.selection, on=list(
  profile.id, chromosome, n.segments), nomatch=0L]
six.survreg.segs <- six.segs[six.survreg.selection, on=list(
  profile.id, chromosome, n.segments)]
gg.supervised+
  ggtitle("survreg 1 feature model changepoints and label errors")+
  penaltyLearning::geom_tallrect(aes(
    xmin=min/1e6,
    xmax=max/1e6,
    linetype=status),
    fill=NA,
    data=six.survreg.labels)+
  scale_linetype_manual("error type", values=c(
    correct=0,
    "false negative"=3,
    "false positive"=1))+
  geom_vline(aes(
    xintercept=chromStart/1e6),
    data=six.survreg.changes,
    color="green",
    size=1,
    linetype="dashed")+
  theme(legend.box="horizontal")
```

The false positive and false negative predictions are evident in the
plot above.

**Exercise 3 during class:** the survreg model above assumed a
Gaussian distribution for the log(penalty) values. Would using another
distribution increase accuracy? Hint: try re-doing the analyses in
this section with the `survreg(dist="logistic")` model. For other
choices, see `help(survreg)`. Also try other input features from
`feature.mat`.

**Exercise after class:** for one of the neuroblastoma data sequences
plotted above, verify that `changepoint::cpt.mean(data.vec,
penalty="Manual", pen.value=PREDICTED.PENALTY, method="PELT")` and
`fpop::Fpop(data.vec, lambda=PREDICTED.PENALTY)` yield the same
changepoints. Fpop implements a "functional pruning optimal
partitioning" algorithm which is guaranteed to prune more sub-optimal
changepoints (i.e. be faster than) the PELT algorithm. Read
[On optimal multiple changepoint algorithms for large data](http://link.springer.com/article/10.1007/s11222-016-9636-3)
for more info.

**Exercise after class:** try re-doing the analyses in this section
using `penaltyLearning::IntervalRegressionUnregularized`, which learns
an un-regularized model by minimizing the squared hinge loss with
respect to the target interval limits. Is there any difference in
prediction accuracy?

### IntervalRegressionCV (L1-regularized interval regression)

In the previous sections, we assumed that the log(penalty) is an
affine function of a given single feature. 
We chose that feature based on the theoretical arguments
of the BIC/SIC penalty. In this section, we show
how a multi-variate affine penalty function can be learned using L1
regularization. This technique is great when you don't have a 
theoretical prior for the penalty. You can compute as many features
as you like (256 using `penaltyLearning::featureMatrix`), 
and the L1 regularization with ignore any irrelevant features
automatically (feature selection).

The penalty function we learn is $\log\lambda_i =
f(\mathbf x_i) = \beta + \mathbf w^\intercal \mathbf x_i$ 
which is equivalent to solving the
following penalized changepoint detection problem,
$$
\operatorname{minimize}_{\mathbf m\in\mathbb R^{d_i}} 
\sum_{j=1}^{d_i}
\ell(z_{ij}, m_j) +
\underbrace{\exp(\beta+\mathbf w^\intercal \mathbf x_i)}_{\lambda_i}
\sum_{j=1}^{d_i-1} 
I(m_{j} \neq m_{j+1}), 
$$ 
where the intercept $\beta$ and the weight vector $\mathbf w$ are defined
as the solution to the following optimization problem, 
$$
\operatorname{minimize}_{ \mathbf w\in\mathbb R^p, \beta\in\mathbb R }
\gamma ||\mathbf w||_1 +
\frac{1}{n}
\sum_{i=1}^n 
\phi(\mathbf w^\intercal \mathbf x_i+\beta-\underline y_i)+ 
\phi(\overline y_i-\mathbf w^\intercal \mathbf x_i-\beta) 
$$
where

* the data and optimization variables are the same as in the
  un-regularized interval regression problem described in the previous
  section (except there is no scale $\sigma$).
* the L1-norm $||\mathbf w||_1=\sum_{j=1}^p |w_j|$ is a regularization
  term which is used to avoid overfitting. It forces some of the $w_j$
  weights to be exactly zero (feature selection).
* the hyper-parameter $\gamma>0$ is the degree of L1-regularization
  (chosen via an internal cross-validation loop). This means that the
  model performs automatic feature selection. You can thus use as many
  features $p$ as you want, and the model will automatically ignore
  any which do not increase prediction accuracy.
* $\phi(x)=(x-1)^2I(x<1)$ is the squared hinge loss ($I$ is the
  indicator function: 1 if the argument is true, and 0 otherwise). 
* The objective function to minimize is the L1-norm of the weight
  vector plus the mean squared hinge loss (both are convex). Overall
  the optimization problem is convex, so the global minimum is
  guaranteed to be found in a reasonable amount of time using
  gradient-based algorithms.

This is a machine learning technique that
automatically selects features that result in good prediction
accuracy. First we create a feature and target matrix for the data in
the train set:

```{r train-feature-mat}
finite.feature.mat <- feature.mat[, apply(is.finite(feature.mat), 2, all)]
train.ids <- train.dt[, paste(profile.id, chromosome) ]
train.feature.mat <- finite.feature.mat[train.ids, ]
train.target.mat <- train.dt[, cbind(min.log.lambda, max.log.lambda)]
str(train.feature.mat)
```

The training set consists of 30 examples, each with 256
features. Before training, we register the multiprocess future plan
(the internal cross-validation loop to choose the regularization
parameter will be computed in parallel).

```{r require-future}
if(require(future)){
  options(mc.cores=parallel::detectCores()/2)
  plan(multiprocess)
}
```

We use `set.seed` below in order to make the results reproducible
(there is randomness in the fold assignment in the internal
cross-validation loop). Then, we use `IntervalRegressionCV` to fit the
regularized linear model.

**New function:** the `penaltyLearning::IntervalRegressionCV` function
computes an L1-regularized interval regression model. Its two main
inputs are a numeric input feature matrix (n observations x p features), and
a numeric output target interval matrix (n observations x 2). 
Its return value is a model fit
object of class `IntervalRegression`, which has `predict`, `print`,
`plot`, and `coef` methods.

```{r fit-cv}
set.seed(1)
any.finite <- apply(is.finite(train.target.mat), 1, any)
fit.cv <- penaltyLearning::IntervalRegressionCV(
  train.feature.mat[any.finite,], train.target.mat[any.finite,])
coef(fit.cv)
```

It is clear from the output above that even though the model was
trained using 256 possible input features, it has selected only a few
of them to use in the prediction function. We visualize the predicted
penalty values as dots in the error curve plots below:

```{r six-gg-errors-cv}
six.cv.pred <- data.table(six.survreg.pred)
six.cv.pred[, pred.log.penalty := {
  pred.feat.mat <- finite.feature.mat[paste(profile.id, chromosome), ]
  predict(fit.cv, pred.feat.mat)
}]
six.cv.pred[, model := "IntervalRegressionCV"]
six.cv.pred[, pred.log.lambda := pred.log.penalty]
six.cv.selection <- data.table(six.selection)[six.cv.pred, on=list(
  profile.id, chromosome,
  min.log.lambda < pred.log.lambda,
  max.log.lambda > pred.log.lambda)]
six.cv.errors <- six.error.list$model.errors[six.cv.selection, on=list(
  profile.id, chromosome, n.segments), nomatch=0L]
six.cv.surrogate <- six.survreg.res[, {
  ## IntervalRegressionCV minimizes the squared hinge loss.
  data.table(
    model="IntervalRegressionCV",
    log.penalty=log.pen.vec,
    surrogate.loss=penaltyLearning::squared.hinge(log.pen.vec-min.log.lambda)+
      penaltyLearning::squared.hinge(max.log.lambda-log.pen.vec)
  )[surrogate.loss < ymax.err]
}, by=list(profile.id, chromosome)]
six.gg.errors.survreg+
  geom_point(aes(
    pred.log.penalty, errors, color=model),
    size=2,
    data=six.cv.errors)+
  geom_line(aes(
    log.penalty, surrogate.loss, color=model),
    data=six.cv.surrogate)
```

The plot above also shows the surrogate loss that
`IntervalRegressionCV` minimizes using a gradient descent
algorithm. Note how it is very similar to the `survreg` surrogate loss
-- both are smooth functions which increase quadratically as the
predicted value goes past the target interval limit. 

Finally, we can also visualize the `IntervalRegressionCV` model in
terms of the data sequences, labels, and predicted segmentation:

```{r figcv, fig.width=10}
six.cv.labels <- six.error.list$label.errors[six.cv.selection, on=list(
  profile.id, chromosome, n.segments), nomatch=0L]
six.cv.changes <- six.changes[six.cv.selection, on=list(
  profile.id, chromosome, n.segments), nomatch=0L]
six.cv.segs <- six.segs[six.cv.selection, on=list(
  profile.id, chromosome, n.segments)]
gg.supervised+
  ggtitle("IntervalRegressionCV multi-feature model changepoints and label errors")+
  penaltyLearning::geom_tallrect(aes(
    xmin=min/1e6,
    xmax=max/1e6,
    linetype=status),
    fill=NA,
    data=six.cv.labels)+
  scale_linetype_manual("error type", values=c(
    correct=0,
    "false negative"=3,
    "false positive"=1))+
  geom_vline(aes(
    xintercept=chromStart/1e6),
    data=six.cv.changes,
    color="green",
    size=1,
    linetype="dashed")+
  theme(legend.box="horizontal")
```

The plot above clearly shows the advantage of the machine learning
approach: all labels are correctly predicted, including all labels on
the test set (chromosome 11).

**Exercise after class:** by default the `IntervalRegressionCV`
function chooses the regularization parameter with minimum error, as
estimated using an internal cross-validation loop. Another option is
to choose the least complex model within 1 standard deviation. Re-fit
the model using `IntervalRegressionCV(reg.type="1sd")`, view the error
as a function of model complexity via `plot(fit.cv)`, then re-do the
model prediction error plots. Is this model more or less accurate?
Why?  Hint: use `coef(fit.cv)` to see how many features are included
in the model.

### Other methods

**Exercise after class:** try one of these other interval regression
algorithms, and see if you can learn a more accurate penalty function.

* [iregnet](https://github.com/anujkhare/iregnet), elastic net
  regularized Accelerated Failure Time models.
* [mmit](https://github.com/aldro61/mmit), max margin interval trees.
* [trtf](https://r-forge.r-project.org/R/?group_id=1322)
  transformation trees and forests.

## Evaluation

Since we did not use the chromosome 11 labels to train any of the
models above, we can use those labels to evaluate the prediction
accuracy of the learned models. 

* In fact, we have already **qualitatively** evaluated the accuracy of
  the predictions by visual inspection of the plots in the last
  section. Qualitative evaluation via visual inspection is also
  possible in the context of an unlabeled unsupervised changepoint
  analysis.
* However because we have labels, we can **quantitatively** evaluate
  the prediction accuracy by computing the number of incorrect labels
  and AUC (area under the ROC curve). This is a major advantage of the
  supervised machine learning approach.
  
Below we create a list with an element for each of the three models we
want to compare. Then we use a `for` loop to compute ROC curves for
each model, with respect to the test data on chromosome 11.

**New function:** the `penaltyLearning::ROChange` function computes
Receiver Operating Characteristic curves for penalty function learning
problems. It inputs the `model.errors` data.table from
`penaltyLearning::labelError`, and a data.table with predicted
log(penalty) values in the `pred.log.lambda` column. It outputs a list
which describes the ROC curves. Below we use the `roc` element which
is a data.table with one row for each point on the ROC curve. We also
use the `thresholds` element, which is a data.table which has a row
with the error rates for the predicted threshold. For more info read
`help(ROChange)`.

```{r six-auc}
model.pred.list <- list(
  IntervalRegressionCV=six.cv.pred,
  "BIC/SIC"=six.BIC.dt,
  survreg=six.survreg.pred)
roc.list <- list()
auc.list <- list()
for(model in names(model.pred.list)){
  model.pred <- model.pred.list[[model]]
  model.roc <- penaltyLearning::ROChange(
    six.error.list$model.errors, model.pred[chromosome==11],
    problem.vars=c("profile.id", "chromosome"))
  roc.list[[model]] <- data.table(model, model.roc$roc)
  auc.list[[model]] <- with(model.roc, data.table(
    model, auc, thresholds[threshold=="predicted"]))
}
roc <- do.call(rbind, roc.list)
(auc <- do.call(rbind, auc.list))
```

The table above shows the error/accuracy of the three models with
respect to the labels in the test data (chromosome 11). It is clear
that the IntervalRegressionCV is slightly more accurate (higher auc
and lower errors). The ROC curves are shown below:

```{r six-gg-roc}
ggplot()+
  scale_color_manual(values=model.colors)+
  geom_path(aes(
    FPR, TPR, color=model, linetype=model),
    size=1,
    data=roc)+
  geom_point(aes(
    FPR, TPR, color=model),
    size=3,
    data=auc)+
  coord_equal()
```

However, with so few labels, and only one test set, it is
difficult/impossible to tell if there is any significant difference
between models. Below we use K-fold cross-validation to evaluate the
models:

* we assign each labeled data sequence to one of K folds.
* for each fold, we set aside the corresponding data in that fold as a
  test set.
* we then train each of the models using all of the other folds (not
  using the test fold at all during this training phase).
* after each model has been trained, we compute predicted penalty
  values for the test data, then accuracy metrics (percent incorrect
  labels and AUC).
* we then repeat the process for each of the folds, and check to see
  if any method is better on average over all of the test sets.
  
Since the neuroblastoma data set is relatively big (3418 labeled data
sequences), we have pre-computed the Segmentor models, which we load
using the code below. (to re-compute them on your computer, run
[Segmentor.models.R](https://github.com/tdhock/change-tutorial/blob/master/Segmentor.models.R))

```{r load-Segmentor-models}
if(!file.exists("Segmentor.models.RData")){
  download.file(
    "https://rcdata.nau.edu/genomic-ml/Segmentor.models.RData",
    "Segmentor.models.RData")
}
load("Segmentor.models.RData")
```

Before performing cross-validation, we can compute the error
functions, features and target intervals for all of the labeled data.

```{r labeled-feature-mat}
nb.selection <- Segmentor.models$loss[, {
  penaltyLearning::modelSelection(
    .SD, complexity="n.segments")
}, by=list(profile.id, chromosome)]
nb.error.list <- penaltyLearning::labelError(
  nb.selection,
  neuroblastoma$annotations,
  Segmentor.models$segs[1 < start],
  problem.vars=c("profile.id", "chromosome")
)
nb.target.dt <- penaltyLearning::targetIntervals(
  nb.error.list$model.errors,
  problem.vars=c("profile.id", "chromosome"))
labeled.profiles <- data.table(
  neuroblastoma$profiles)[nb.target.dt[, list(
    profile.id, chromosome)], on=list(
    profile.id, chromosome)]
labeled.feature.mat <- penaltyLearning::featureMatrix(
  labeled.profiles,
  problem.vars=c("profile.id", "chromosome"),
  data.var="logratio")
```

Rather than training on all features (which is possible but takes a
bit too long for this tutorial), we will just use the following subset:

```{r some-feature-names}
(some.feature.names <- c(
  colnames(labeled.feature.mat)[1:5],
  rownames(coef(fit.cv))[-1]))
```

Next we assign a fold to each target,

```{r nb-target-dt}
set.seed(1)
n.folds <- 5
nb.target.dt[, fold := sample(rep(1:n.folds, l=.N))]
```

Then we loop over test folds. In each iteration of the for loop, we
create `train.target.dt` and `train.features` by excluding data in the
test fold. We then train each model using these data, and compute
prediction accuracy for each model using `penaltyLearning::ROChange`.

```{r cv-auc-list}
cv.roc.list <- list()
cv.auc.list <- list()
for(test.fold in 1:n.folds){
  train.target.dt <- nb.target.dt[fold!=test.fold]
  train.id.vec <- train.target.dt[, paste(profile.id, chromosome)]
  train.features <- labeled.feature.mat[train.id.vec, ]
  train.target.mat <- train.target.dt[, cbind(min.log.lambda, max.log.lambda)]
  print(test.fold)
  set.seed(1)
  fit <- penaltyLearning::IntervalRegressionCV(
    train.features[, some.feature.names], train.target.mat)
  train.target.dt$feature <- train.features[, "n.loglog"]
  fit.survreg <- survreg(
    Surv(min.log.lambda, max.log.lambda, type="interval2") ~ feature,
    train.target.dt, dist="gaussian")
  ## compute predictions
  test.target.dt <- nb.target.dt[fold==test.fold]
  test.id.vec <- test.target.dt[, paste(profile.id, chromosome)]
  test.feature.vec <- labeled.feature.mat[test.id.vec, "n.loglog"]
  ## baseline constant prediction
  const.pen.vec <- seq(-4, 4, by=0.1)
  const.acc.vec <- sapply(const.pen.vec, function(pred.penalty){
    sum(train.target.mat[,1] < pred.penalty & pred.penalty < train.target.mat[,2])
  })
  best.constant <- const.pen.vec[which.max(const.acc.vec)]
  pred.list <- list(
    constant=rep(best.constant, length(test.feature.vec)),
    "BIC/SIC"=test.feature.vec,
    survreg=predict(fit.survreg, data.table(feature=test.feature.vec)),
    IntervalRegressionCV=predict(fit, labeled.feature.mat[test.id.vec,]))
  for(model in names(pred.list)){
    pred.dt <- data.table(
      test.target.dt,
      pred.log.lambda=as.numeric(pred.list[[model]]))
    pred.roc <- penaltyLearning::ROChange(
      nb.error.list$model.errors,
      pred.dt,
      problem.vars=c("profile.id", "chromosome"))
    cv.roc.list[[paste(test.fold, model)]] <- data.table(
      test.fold, model, pred.roc$roc)
    cv.auc.list[[paste(test.fold, model)]] <- with(pred.roc, data.table(
      test.fold, model, auc, thresholds[threshold=="predicted"]))
  }
}
cv.auc <- do.call(rbind, cv.auc.list)
cv.roc <- do.call(rbind, cv.roc.list)
```

Note that in the code above, we used `cv.roc.list` to save the ROC
curve for each model and test fold, and we used `cv.auc.list` to save
AUC. We plot the ROC curves below in order to visualize what kinds of
errors the models commit (false positives or false negatives).

```{r cv-roc-curves}
ggplot()+
  scale_color_manual(values=model.colors)+
  geom_path(aes(
    FPR, TPR, color=model, group=paste(model, test.fold)),
    data=cv.roc)+
  geom_point(aes(
    FPR, TPR, color=model),
    fill="white",
    shape=21,
    data=cv.auc)+
  coord_equal(xlim=c(0, 0.5), ylim=c(0.5, 1))
```

It is clear from the ROC curves above that the BIC/SIC model has a
relatively low true positive rate (high false negative rate -- many
breakpoint/positive labels with no predicted changepoints).

Below we compute and plot two accuracy metrics, AUC and percent
incorrectly predicted labels.

```{r cvAccuracyPlot}
cv.auc[, accuracy.percent := 100-error.percent]
cv.auc.tall <- melt(cv.auc, measure.vars=c("accuracy.percent", "auc"))
cv.means <- cv.auc.tall[, list(
  mean=mean(value),
  sd=sd(value)
), by=list(model, variable)]
levs <- cv.means[variable=="auc"][order(mean), model]
cv.auc.tall[, model.fac := factor(model, levs)]
ggplot()+
  theme_bw()+
  theme(panel.spacing=grid::unit(0, "lines"))+
  facet_grid(. ~ variable, scales="free")+
  scale_color_manual(values=model.colors)+
  guides(color="none")+
  geom_point(aes(
    value, model.fac, color=model),
    data=cv.auc.tall)+
  xlab("")+
  ylab("model")
```

It is clear from the plot above that survreg is significantly more
accurate than the BIC/SIC and constant penalties, 
and IntervalRegressionCV is slightly more
accurate than survreg. 

**Remark 1:** the accuracy rates above (BIC/SIC and constant penalties
predict 92% correct labels whereas survreg and IntervalRegressionCV
predict 97-98% correct labels) are specific to the neuroblastoma data
set. However, in general you can expect that unsupervised penalty
functions like BIC/SIC will be less accurate than supervised methods
such as IntervalRegressionCV. This is the main motivation of
supervised changepoint analysis: we can use labels to learn a penalty
function with improved changepoint detection accuracy.

**Remark 2:** To learn the most accurate model, it is best to use all
  training data (instead of holding out some of the labels as a test
  set). The
  [figure-neuroblastoma-predictions.R](https://github.com/tdhock/change-tutorial/blob/master/figure-neuroblastoma-predictions.R)
  script fits an IntervalRegressionCV model to the entire
  neuroblastoma data set, uses the model to predict changepoints for
  all data sequences (even those without labels), then
  [plots the predicted changepoints along with the train errors](figure-neuroblastoma-predictions.png). Some
  overfitting/extrapolation errors are evident. For example some chrY
  sequences have too many changes --- this could be fixed by adding some
  negative/normal labels on those sequences.

**Conclusion:** in **labeled data sequences with a common signal/noise
pattern**, supervised learning algorithms can be used to predict
penalty values and changepoints which are **more accurate than
unsupervised methods**. An advantage of the supervised learning
approach is that we can **quantitatively evaluate** the learned models
by counting the number of incorrectly predicted labels in a held-out
test set.

**Exercise after class:** is the logistic or Gaussian distribution a
  more accurate model for the log(penalty) values? In the code block
  above, add some code for computing the model predictions and error
  rates for `survreg(dist="logistic")`. Then re-do the plots in this
  section -- there should be 4 models instead of 3 in the last plot.