# (PART\*) Estimation {-}
# Maximum Likelihood
This is a brief refresher on <span class="emph">maximum likelihood estimation</span>, using a standard regression approach as an example, and it more or less assumes one hasn't tried to roll their own likelihood function in a programming environment before. Given the likelihood's role in Bayesian estimation and in statistics in general, and the ties between specific Bayesian results and the maximum likelihood estimates one typically comes across, one should be conceptually comfortable with basic likelihood estimation. The following is taken directly from [my document](https://m-clark.github.io/bayesian-basics/appendix.html#maximum-likelihood-review), with mostly just cleaned-up code and visualization. The TL;DR version can be viewed in the [Linear Regression][Linear Regression] chapter.
In the standard model setting, we attempt to find parameters $\theta$ that will maximize the probability of the data we actually observe. We start with an observed random target vector $y$ of $i = 1 \ldots N$ independent and identically distributed observations, and some data-generating process underlying it, $f(\cdot|\theta)$. We are interested in estimating the model parameter(s), $\theta$, that would make the data most likely to have occurred. The probability density function for $y_i$ given some particular estimate for the parameters can be noted as $f(y_i|\theta)$. The joint probability distribution of the (independent) observations given those parameters is the product of the individual densities, and is our *likelihood function*. We can write it out generally as:
$$\mathcal{L}(\theta) = \prod_{i=1}^N f(y_i|\theta)$$
Thus, the *likelihood* for one set of parameter estimates given a fixed set of data $y$ is equal to the probability of the data given those (fixed) estimates. Furthermore, we can compare the likelihood of one set, $\mathcal{L}(\theta_A)$, to that of another, $\mathcal{L}(\theta_B)$, and whichever produces the greater likelihood would be the preferred set of estimates. We can get a sense of this with the following visualization, based on a single parameter. The data are drawn from a Poisson distributed variable with mean $\theta=5$. We note that the calculated likelihood increases as we estimate values for $\theta$ closer to $5$, or more precisely, closer to whatever the mean observed value is for the data. However, with more and more data, the final ML estimate will converge on the true value.
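To make the comparison concrete before looking at the visualization, here is a minimal sketch using a small Poisson sample and two arbitrary candidate values for $\theta$; the chunk and object names here are just for illustration.

```{r ml-compare-candidates}
# a sample from a Poisson distribution with true mean 5
set.seed(123)
y_pois = rpois(100, lambda = 5)

# likelihood of the data under two candidate values for theta
L_A = prod(dpois(y_pois, lambda = 4))   # theta_A = 4
L_B = prod(dpois(y_pois, lambda = 5))   # theta_B = 5

c(L_A = L_A, L_B = L_B)  # with data centered near 5, L_B should be the larger
```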
```{r ml-illustration, echo=FALSE, out.width='500px', message=TRUE}
set.seed(1234)

y   = rpois(100000, lambda = 5)
mus = seq(3, 8, l = 100)

L = map_dbl(mus, function(mu) sum(dpois(y, lambda = mu, log = TRUE)))

message(glue::glue('Final estimate = ', round(mus[L == max(L)], 2)))

ggplot(data.frame(mus, L)) +
  geom_vline(aes(xintercept = 5), alpha = .5, lty = 2) +
  geom_hline(aes(yintercept = max(L)), alpha = .5, lty = 2) +
  geom_path(aes(x = mus, y = L), color = '#ff5500', lwd = 2) +
  labs(x = expression(theta), y = 'Likelihood') +
  visibly::theme_clean(center_axis_labels = TRUE) +
  theme(
    axis.ticks.y = element_blank(),
    axis.text.y  = element_blank(),
    axis.text.x  = element_text(color = 'gray10', size = 14),
    axis.title.x = element_text(color = 'gray10', size = 24)
  )

# ggsave('img/maxLikeNormalCompareLLforDifferentMeans.svg', bg='transparent')
```
For computational reasons, we instead work with the sum of the natural log probabilities, and so deal with the *log likelihood*:
$$\ln\mathcal{L}(\theta) = \sum_{i=1}^N \ln[f(y_i|\theta)]$$
Concretely, we calculate a log likelihood for each observation and then sum them for the total log likelihood for parameter(s) $\theta$.
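A quick sketch of why the log matters in practice: with even a moderately sized sample, the product of the individual densities underflows to zero, while the sum of the log densities is perfectly stable (the data here are arbitrary and just for demonstration).

```{r ml-why-logs}
set.seed(123)
y_demo = rnorm(1000, mean = 5, sd = 2)

prod(dnorm(y_demo, mean = 5, sd = 2))             # underflows to 0
sum(dnorm(y_demo, mean = 5, sd = 2, log = TRUE))  # the log likelihood, no such problem
```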
The likelihood function incorporates our assumption about the sampling distribution of the data given some estimate for the parameters. It can take on many forms and be notably complex depending on the model in question, but once specified, we can use any number of optimization approaches to find the parameter estimates that make the data most likely. As an example, for a normally distributed variable of interest, we can write the log likelihood as follows:
$$\ln\mathcal{L}(\theta) = \sum_{i=1}^N \ln\left[\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y_i-\mu)^2}{2\sigma^2}\right)\right]$$
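As a sanity check, we can verify that writing that formula out by hand gives the same result as <span class="func">dnorm</span> with `log = TRUE`; the values below are arbitrary.

```{r ml-norm-ll-check}
y_check     = c(2.5, 4.1, 5.8)
mu_check    = 5
sigma_check = 2

# the normal log likelihood written out by hand
by_hand = sum(log(1 / sqrt(2 * pi * sigma_check^2) *
                    exp(-(y_check - mu_check)^2 / (2 * sigma_check^2))))

# using the built-in density function
built_in = sum(dnorm(y_check, mean = mu_check, sd = sigma_check, log = TRUE))

c(by_hand = by_hand, built_in = built_in)
```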
### Example
In the following, we demonstrate the maximum likelihood approach to estimation for a simple setting incorporating a normal distribution, where we estimate the mean and variance/sd for a set of values $y$. First the data is created, and then we create the function that will compute the log likelihood. Using the built-in R distributions makes it fairly straightforward to create our own likelihood function and feed it into an optimization function to find the best parameters. We will set things up to work with the <span class="pack">bbmle</span> package, which has some nice summary functionality and other features. However, one should take a glance at <span class="func">optim</span> and the other underlying functions that do the work.
```{r ml-bblmDemo1}
# for replication
set.seed(1234)

# create the data
y = rnorm(1000, mean = 5, sd = 2)

# starting values for the optimization
starting_values = c(2, 1)

# the log likelihood function
simple_ll <- function(mu, sigma, verbose = TRUE) {
  ll = sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
  
  if (verbose)
    message(paste(mu, sigma, ll))
  
  -ll
}
```
The <span class="func">simple_ll</span> function takes the parameters as arguments, in this case called $\mu$ and $\sigma$, with starting points that will be set to `r starting_values[1]` and `r starting_values[2]` respectively. Only the line that computes `ll`, along with the final line returning its negative, is actually necessary; we use <span class="func">dnorm</span> to get the log density for each point and sum those values. Since the optimizer performs minimization by default, we return the negative of the sum so as to minimize the negative log likelihood, which is the same as maximizing the likelihood. The remaining bit of code just allows you to watch the estimates as the optimization procedure searches for the best values. That output is not shown here, but you'll see it in your console, and `trace = TRUE` provides similar information from the optimizer itself.
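Before optimizing, it can be instructive to evaluate the function at a couple of candidate values just to see what it returns; this is only a sanity check with arbitrary values, not part of the estimation itself.

```{r ml-simple-ll-check}
# negative log likelihood at a poor guess vs. near the true values
simple_ll(mu = 0, sigma = 1, verbose = FALSE)
simple_ll(mu = 5, sigma = 2, verbose = FALSE)  # notably smaller, i.e. more likely
```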
We are now ready to obtain maximum likelihood estimates for the parameters. For comparison, we will use <span class="pack">bbmle</span> due to its nice summary result, but you can use <span class="func">optim</span> as in the other demonstrations. For the <span class="func">mle2</span> function, we need the function we've created, plus other inputs related to that function or to the underlying optimizing function (by default <span class="func">optim</span>). In this case we use an optimization procedure that allows us to set a lower bound for $\sigma$. This isn't strictly necessary, but otherwise you would get warnings, and possibly a lack of convergence, if negative estimates for $\sigma$ were allowed.
```{r ml-bblmDemo2}
# using optim, and L-BFGS-B so as to constrain sigma to be positive by setting
# the lower bound at zero
mlnorm = bbmle::mle2(
  simple_ll,
  start  = list(mu = starting_values[1], sigma = starting_values[2]),
  method = "L-BFGS-B",
  lower  = c(sigma = 0),
  trace  = TRUE
)

mlnorm

# compare to an intercept only regression model
summary(lm(y ~ 1))
```
We can see that the ML estimates are the same as the `lm` estimates based on least squares, and that, given the sample size, they are close to the true values.
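Since <span class="func">optim</span> was mentioned earlier, here is a minimal sketch doing the same estimation with it directly; the parameter vector setup and bounds are choices made for this sketch rather than the only way to do it.

```{r ml-optim-direct}
optim_result = optim(
  par    = c(mu = 2, sigma = 1),       # starting values, same as above
  fn     = function(par) simple_ll(par['mu'], par['sigma'], verbose = FALSE),
  method = "L-BFGS-B",
  lower  = c(mu = -Inf, sigma = 1e-6)  # keep sigma positive
)

optim_result$par    # estimates, comparable to the mle2 and lm results above
optim_result$value  # the minimized negative log likelihood
```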
In terms of the parameters we estimate, instead of the single curve presented previously, in the typical case of two or more parameters we can think of a <span class="emph">likelihood surface</span> that represents the possible likelihood values for any particular set of estimates. Given some starting point, the optimization procedure then travels along the surface looking for a minimum/maximum point. For simpler settings such as this, we can visualize the likelihood surface and its minimum point. The optimizer travels along this surface until it finds a minimum (the surface plot is interactive, so feel free to adjust it). I also plot the path of the optimizer from a top-down view. The large dot represents the minimum of the negative log likelihood.
```{r ml-plotSurface, echo=FALSE}
mu    = seq(4, 6, length = 50)
sigma = seq(1.5, 3, length = 50)

llsurf = matrix(NA, length(mu), length(sigma))

for (i in 1:length(mu)) {
  for (j in 1:length(sigma)) {
    llsurf[i, j] = -sum(dnorm(y, mean = mu[i], sd = sigma[j], log = TRUE))
  }
}

rownames(llsurf) = mu
colnames(llsurf) = sigma

plotdat = read.csv('data/mle_est.csv')

pointdat = filter(plotdat, mu <= 6 & mu >= 4 & sigma <= 3 & sigma >= 1.5) %>%
  mutate(ll = -ll)

# because the plotly legend doesn't work correctly with surface plots, the
# color scale is dropped; it isn't needed anyway
# https://community.plot.ly/t/how-to-name-axis-and-show-legend-in-mesh3d-and-surface-3d-plots/1819
library(plotly)

llsurf_trans = t(llsurf)
palettes     = visibly::palettes

plot_ly(
  x = ~ mu,
  y = ~ sigma,
  z = ~ llsurf_trans,
  colors = viridis::plasma(500),
  type = 'surface',
  showscale = FALSE,
  width = 750
) %>%
  add_trace(
    x = ~ mu,
    y = ~ sigma,
    z = ~ ll,
    mode = 'markers',
    marker = list(color = palettes$orange, size = 3),
    type = "scatter3d",
    data = pointdat,
    name = NULL,
    showlegend = FALSE
  ) %>%
  add_trace(
    x = coef(mlnorm)[1],
    y = coef(mlnorm)[2],
    z = mlnorm@min,
    marker = list(color = palettes$orange$complementary, size = 10),
    type = "scatter3d",
    showlegend = FALSE
  ) %>%
  visibly::theme_plotly() %>%
  layout(
    scene = list(
      xaxis = list(
        title = 'mu',
        titlefont = list(size = 12),
        tickfont  = list(size = 10),
        dtick = .25
      ),
      yaxis = list(
        title = 'sigma',
        titlefont = list(size = 12),
        tickfont  = list(size = 10),
        dtick = .25
      ),
      zaxis = list(
        title = 'Neg. Log Lik ',
        titlefont = list(size = 8),
        tickfont  = list(size = 10),
        position = .1
      )
    ),
    paper_bgcolor = 'rgba(0,0,0,0)',
    plot_bgcolor  = 'rgba(0,0,0,0)'
  )

detach(package:plotly)
```
```{r ml-plotNormMLEpath, echo=FALSE, out.width='500px'}
plotdat$nll = -plotdat$ll

ggplot(aes(x = mu, y = sigma), data = plotdat) +
  geom_point(
    color = '#ff5500',
    size  = 4,
    alpha = .25,
    show.legend = FALSE,
    position = position_jitter(w = 0.2, h = 0.2)
  ) +
  scale_size_continuous(range = c(.1, 5)) +
  geom_point(
    col   = palettes$orange$complementary[2],
    data  = data.frame(t(coef(mlnorm))),
    size  = 10,
    alpha = .9
  ) +
  labs(
    x = expression(mu),
    y = expression(sigma),
    caption = "A bit of jitter was added to the points to better see what's going on."
  ) +
  theme(
    axis.ticks.y = element_blank(),
    axis.text.y  = element_blank(),
    axis.text.x  = element_text(color = 'gray10', size = 14),
    axis.title.x = element_text(color = 'gray10', size = 24),
    axis.title.y = element_text(color = 'gray10', size = 24)
  )

# ggsave('img/mlnorm.svg', bg='transparent')
```
Please note that there are many other considerations in optimization completely ignored here, but for our purposes and the intended audience, we do not want to lose sight of the forest for the trees. Next we move to a slightly more complicated regression example.
## Linear Model
In the standard regression context, our expected value for the target variable comes from the linear predictor, i.e. the weighted combination of our explanatory variables, and we estimate the regression weights/coefficients and possibly other relevant parameters. We can expand our previous example to the standard linear model without too much change. In this case we estimate a mean for each observation, but otherwise assume the variance is constant across observations. Again, we first construct some data so that we know exactly what to expect, and then write out the likelihood function with starting parameters. As we need to estimate our intercept and the coefficient for the $X$ predictor (collectively referred to as $\beta$), we can write the likelihood explicitly, as before:
$$\ln\mathcal{L}(\beta, \sigma^2) = \sum_{i=1}^N \ln\left[\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y_i-x_i^\top\beta)^2}{2\sigma^2}\right)\right]$$
```{r ml-bblmeReg}
# for replication
set.seed(1234)

# predictor
X = rnorm(1000)

# coefficients for intercept and predictor
beta = c(5, 2)

# add intercept to X and create y with some noise
y = cbind(1, X) %*% beta + rnorm(1000, sd = 2.5)

# the (negative) log likelihood function
regression_ll <- function(sigma = 1, Int = 0, b1 = 0) {
  coefs = c(Int, b1)
  mu = cbind(1, X) %*% coefs
  
  ll = -sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
  
  message(paste(sigma, Int, b1, ll))
  
  ll
}

mlopt = bbmle::mle2(regression_ll, method = "L-BFGS-B", lower = c(sigma = 0))

summary(mlopt)

# plot(profile(mlopt), absVal = FALSE)

modlm = lm(y ~ X)

summary(modlm)

-2 * logLik(modlm)
```
As before, our estimates and the final log likelihood value are about where they should be and reflect the `lm` output, as the OLS estimates are the maximum likelihood estimates. Visualization becomes more difficult beyond two parameters, but we can examine slices similar to the previous plot.
```{r ml-misc, echo=FALSE}
plotdat = read.csv('..//bayesian-basics/data//mleEstimates.csv')
colnames(plotdat) = c('sigma', 'Intercept', 'β', 'll')
plotdat$nll = -plotdat$ll

plotdat %>%
  select(-ll) %>%
  pivot_longer(-c(sigma, nll), names_to = 'variable') %>%
  ggplot() +
  geom_point(
    aes(x = sigma, y = value, size = nll),
    color = palettes$orange$orange,
    alpha = .15,
    show.legend = FALSE,
    position = position_jitter(w = 0.2, h = 0.2)
  ) +
  facet_wrap(~ variable) +
  scale_size_continuous(range = c(.5, 5)) +
  labs(x = expression(sigma)) +
  visibly::theme_clean(center_axis_labels = TRUE) +
  theme(
    axis.ticks.y = element_blank(),
    axis.text.x  = element_text(color = 'gray10', size = 14),
    axis.title.x = element_text(color = 'gray10', size = 24)
  )

# ggsave('img/mlreg.svg', bg='transparent')
```
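Related to the comparison with `lm` above, one small difference is worth noting: the maximum likelihood estimate of $\sigma$ uses $N$ in the denominator, while `lm`'s residual standard error uses $N - p$, so the two will differ slightly. A quick check with the objects created above:

```{r ml-sigma-denominator}
rss = sum(residuals(modlm)^2)

c(
  ml_sigma   = unname(coef(mlopt)['sigma']),  # from the mle2 result
  rss_over_n = sqrt(rss / nobs(modlm)),       # essentially the same value
  lm_sigma   = summary(modlm)$sigma           # sqrt(rss / (N - 2)), slightly larger
)
```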
To move to generalized linear models, very little about the process changes beyond the distribution assumed and the fact that we typically model a function of the expected value of the target variable (e.g. with a log link, $\ln(\mu)=X\beta$, so $\mu = e^{X\beta}$).
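As a quick sketch of how little changes, here is a Poisson regression version of the same approach compared to <span class="func">glm</span>; the simulated data and names are just for illustration.

```{r ml-poisson-reg}
set.seed(1234)

X_pois = rnorm(1000)
y_pois = rpois(1000, lambda = exp(.5 + .25 * X_pois))  # log link

# negative log likelihood for a Poisson regression
poisson_ll <- function(Int = 0, b1 = 0) {
  mu = exp(Int + b1 * X_pois)  # inverse link gives the mean
  -sum(dpois(y_pois, lambda = mu, log = TRUE))
}

ml_pois = bbmle::mle2(poisson_ll)

cbind(
  mle2 = coef(ml_pois),
  glm  = coef(glm(y_pois ~ X_pois, family = poisson))
)
```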
## Source
Original code available at:
https://m-clark.github.io/bayesian-basics/appendix.html#maximum-likelihood-review