Let $\Xton$ be independent and identically distributed (i.i.d)
random variables (r.v.). Let $\mu = \E[X]$ be the expectation and
$\sigma^2 = \V[X]$ the variance of these random variables.
We define the sample average of our $X_i$ as:
$$\bar{X}_n = \frac1n \sum_{i=1}^n X_i$$
Law of Large Numbers (LLN):
$$\tag{1.1} \bar{X}_n \conv{\P, a.s.}\mu$$
where $\P$ denotes convergence in probability and $a.s.$ convergence
almost surely.
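As a quick numerical illustration (not from the lecture; I use an Exponential sample with mean $\mu = 0.5$, but any distribution with a finite mean works), the sample average drifts toward $\mu$ as $n$ grows:

```python
# LLN sanity check: the sample average of i.i.d. Exponential draws with mean 0.5
# gets closer to 0.5 as the sample size grows.
import numpy as np

rng = np.random.default_rng(0)
mu = 0.5  # true expectation (scale parameter of the Exponential)

for n in (10, 1_000, 100_000):
    x = rng.exponential(scale=mu, size=n)  # i.i.d. sample X_1, ..., X_n
    print(n, abs(x.mean() - mu))           # |X_bar_n - mu| typically shrinks with n
```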
parametric if $\Theta \subseteq \R^d$ for some $d \in \N$,
non-parametric if $\Theta$ is infinite-dimensional,
semi-parametric if $\Theta = \Theta_1 \times \Theta_2$ where $\Theta_1 \subseteq \R^d$ is the parameter set we are interested in, and $\Theta_2$ is the nuisance parameter set.
The parameter $\theta$ is called identifiable if the mapping $\theta \mapsto \P_\theta$ is injective, i.e.:
If $\mathcal{I}$ is a confidence interval of level $\alpha$, it is also a confidence interval of level $\beta$ for all $\beta \leq \alpha$.
A confidence interval of level $\alpha$ means that if we repeat the experiment many times, the real parameter will lie in the (random) confidence interval with frequency at least $\alpha$. It does not mean that the real parameter lies in a given realized interval with probability $\alpha$: the real parameter is deterministic, so it either is or is not in that realized interval.
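A small simulation of this frequentist reading (my sketch; the Gaussian data, sample size and 95% target are arbitrary choices): over many repetitions, the random interval $\bar{X}_n \pm q\,\hat\sigma/\sqrt{n}$ covers the fixed true mean with frequency close to 95%.

```python
# Coverage check: the confidence interval is random, the true mean is fixed;
# the coverage frequency over repeated samples is what the level guarantees.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
mu_true, sigma, n = 3.0, 2.0, 200
q = norm.ppf(0.975)  # two-sided 95% standard normal quantile

n_rep, covered = 10_000, 0
for _ in range(n_rep):
    x = rng.normal(mu_true, sigma, size=n)
    half = q * x.std(ddof=1) / np.sqrt(n)  # half-width with estimated sigma
    covered += (x.mean() - half <= mu_true <= x.mean() + half)

print(covered / n_rep)  # close to 0.95
```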
Lecture 5: Delta Method and Confidence Intervals
The Delta Method. Let $(X_n)_n$ be a sequence of r.v. such that
for some $\theta \in \R, \sigma^2 > 0$. Let $g: \R \rightarrow \R$ be continuously differentiable at $\theta$. Then $(g(X_n))_n$ is asymptotically normal around $g(\theta)$ and:
for some statistic $T_n$ and threshold $c$. $T_n$ is called the test statistic and the rejection region is $R_\psi = \{T_n > c\}$.
We have two main types of tests:
If $H_0: \theta = \theta_0$ and $H_1: \theta \neq \theta_0$, it is a two-sided test,
If $H_0: \theta \leq \theta_0$ and $H_1: \theta > \theta_0$, it is a one-sided test.
Example: Let $X_1, \ldots, X_n \iid \mathcal{D}(\mu)$ where $\mu$ and $\sigma^2$ are the expectation and variance of $\mathcal{D}(\mu)$. Let us fix a candidate value $\mu_0$.
We want to test $H_0: \mu = \mu_0$ against $H_1: \mu \neq \mu_0$ with asymptotic level $\alpha$.
Let $\psi_\alpha = \one{|T_n| > q_{\alpha/2}}$.
Now, we want to test $H_0: \mu \leq \mu_0$ and $H_1: \mu > \mu_0$ with asymptotic level $\alpha$. Which value of $\mu \in \Theta_0$ should we consider?
The type 1 error is the function $\mu \mapsto \P_\mu[\psi(\Xton) = 1]$. To control the level, we need to consider the $\mu$ that maximises this expression over $\Theta_0$; clearly, this happens for $\mu = \mu_0$. Therefore, taking the worst case $\mu = \mu_0$ under $H_0$,
$$\P_{\mu_0}[T_n > q_\alpha] \conv{} \alpha$$
If instead $H_1: \mu < \mu_0$, the rejection region becomes $\{T_n < -q_\alpha\}$ and the quantity to control is $\P_{\mu_0}[T_n < -q_\alpha]$.
The (asymptotic) p-value of a test $\psi_\alpha$ is the smallest (asymptotic) level $\alpha$ at which $\psi_\alpha$ rejects $H_0$. The p-value is random and depends on the sample.
The golden rule: $H_0$ is rejected by $\psi_\alpha$ at any asymptotic level $\alpha \geq \pval(\Xton)$.
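A small sketch of the two-sided test and its asymptotic p-value (the data and $\mu_0$ below are made up for illustration); by the golden rule, rejecting at asymptotic level $\alpha$ is the same as having $\pval \leq \alpha$:

```python
# Two-sided asymptotic test of H0: mu = mu_0, with its p-value.
import numpy as np
from scipy.stats import norm

x = np.array([4.1, 5.3, 4.8, 6.0, 5.1, 4.4, 5.7, 5.0])  # hypothetical sample
mu_0 = 5.0

n = len(x)
t_n = np.sqrt(n) * (x.mean() - mu_0) / x.std(ddof=1)  # test statistic
p_value = 2 * (1 - norm.cdf(abs(t_n)))                # asymptotic p-value

alpha = 0.05
reject = abs(t_n) > norm.ppf(1 - alpha / 2)           # same decision as p_value <= alpha
print(t_n, p_value, reject)
```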
Unit 3: Methods of estimation
Lecture 8: Distance measures between distributions
Let $\statmodel$ be a statistical model, associated with a sample of i.i.d r.v. $\Xton$. We assume the model is well specified, i.e. $\exists \trth \in \Theta, X \sim \P_\trth$. $\trth$ is called the true parameter.
We want to find an estimator $\ethn$ such that $\P_{\ethn}$ is close to $\P_\trth$.
The total variation distance between two probability measures $\P_1$ and $\P_2$ is defined as:
The $\TV$ is symmetric: $\forall \P_1, \P_2, \TV(\P_1, \P_2) = \TV(\P_2, \P_1)$,
The $\TV$ is positive: $\forall \P_1, \P_2, 0 \leq \TV(\P_1, \P_2) \leq 1$,
The $\TV$ is definite: $\forall \P_1, \P_2, \TV(\P_1, \P_2) = 0 \implies \P_1 = \P_2$ almost everywhere,
The $\TV$ verifies the triangle inequality: $\forall \P_1, \P_2, \P_3, \TV(\P_1, \P_3) \leq \TV(\P_1, \P_2) + \TV(\P_2, \P_3)$.
Therefore, $\TV$ is a distance between probability distributions.
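Worked example (added for intuition), using the equivalent formula $\TV(\P_1, \P_2) = \frac12\sum_{x \in E}|p_1(x) - p_2(x)|$ for discrete distributions: for two Bernoulli distributions,

$$\TV\big(\text{Ber}(p), \text{Ber}(q)\big) = \frac12\Big(|p - q| + |(1-p) - (1-q)|\Big) = |p - q|$$

which indeed lies between $0$ and $1$ and vanishes only when $p = q$.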
Problem 1: $\TV$ cannot usefully compare a discrete and a continuous distribution: the total variation distance between them is always $1$, so it gives no meaningful measure of how close they are.
Problem 2: We cannot build an estimator $\theta \mapsto \widehat{\TV}(\P_\theta, \P_\trth)$ as we do not know $\trth$.
Hence we need another "distance" between distributions. The Kullback-Leibler (KL) divergence between two probability distributions $\P_1, \P_2$ is defined as:
$$\tag{3.4} \KL(\P_1, \P_2)= \begin{cases}\displaystyle\sum_{x\in E}p_1(x)\ln\frac{p_1(x)}{p_2(x)} & \text{if }E\text{ is discrete} \\ \displaystyle \int_Ef_1(x)\ln\frac{f_1(x)}{f_2(x)}dx & \text{if }E\text{ is continuous}\end{cases}$$
Properties of the $\KL$ divergence:
In general, $\KL(\P_1, \P_2) \neq \KL(\P_2, \P_1)$
The $\KL$ is positive: $\forall \P_1, \P_2, \KL(\P_1, \P_2) \geq 0$
The $\KL$ is definite: $\forall \P_1, \P_2, \KL(\P_1, \P_2) = 0 \implies \P_1 = \P_2$
In general, $\KL(\P_1, \P_3) \not\leq \KL(\P_1, \P_2) + \KL(\P_2, \P_3)$
$\KL$ is not a distance, it's a divergence. But we still have that $\trth$ is the only minimizer of $\theta \mapsto \KL(\P_\trth, \P_\theta)$.
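Worked example (added): for two Gaussians with the same variance, a direct computation gives

$$\KL\big(\Norm(\mu_1, \sigma^2), \Norm(\mu_2, \sigma^2)\big) = \E_{X \sim \Norm(\mu_1, \sigma^2)}\left[\frac{(X-\mu_2)^2 - (X-\mu_1)^2}{2\sigma^2}\right] = \frac{(\mu_1 - \mu_2)^2}{2\sigma^2}$$

which is $0$ if and only if $\mu_1 = \mu_2$, illustrating why $\trth$ is the unique minimizer of $\theta \mapsto \KL(\P_\trth, \P_\theta)$ in such a model.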
$$\tag{3.17} \CCov(A\XX + B) = \CCov(A\XX) = A\CCov(\XX)A^\top$$
The Multivariate Central Limit Theorem: let $\XX_1, \ldots, \XX_n \in \R^d$ be i.i.d copies of $\XX$ such that $\E[\XX] = \mmu$ and $\CCov(\XX) = \SSigma$. Then:
Let $\Xton$ be an i.i.d sample associated with a statistical model $\statmodel$, with $E \subseteq \R$ and $\Theta \subseteq \R^d$. The population moments are:
$$\tag{3.24} \forall 1 \leq k \leq d, m_k(\tth) = \E_\tth[X^k]$$
Let $\hat M = (\hat m_1, \ldots, \hat m_d)$. Let $\SSigma(\tth) = \CCov_\tth(X, X^2, \ldots, X^d)$. Let us assume $M^{-1}$ is continuously differentiable at $M(\tth)$.
We can generalize the method of moments to any well-chosen set of functions $g_1, \ldots, g_d : \R \rightarrow \R$, by defining $m_k(\tth) = \E_\tth[g_k(X)]$ and $\SSigma(\tth) = \CCov_\tth(g_1(X), \ldots, g_d(X))$.
The generalized method of moments yields, by applying the CLT and the Delta method:
The $MLE$ is more accurate than the method of moments, and it still gives reasonable results if the model is mis-specified; however, the method of moments is easier to compute, and the $MLE$ can sometimes be computationally intractable.
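A concrete method-of-moments computation (my sketch; the Gamma shape/scale model is my choice of example, not necessarily the lecture's): match the first two moments $\E[X] = k\theta$ and $\V[X] = k\theta^2$, then invert.

```python
# Method of moments for Gamma(shape=k, scale=theta): invert the first two moments.
import numpy as np

rng = np.random.default_rng(2)
k_true, theta_true = 3.0, 2.0
x = rng.gamma(shape=k_true, scale=theta_true, size=50_000)

m1_hat = x.mean()               # empirical first moment
m2_hat = (x ** 2).mean()        # empirical second moment
var_hat = m2_hat - m1_hat ** 2  # implied variance estimate

theta_hat = var_hat / m1_hat    # theta = V[X] / E[X]
k_hat = m1_hat / theta_hat      # k = E[X] / theta
print(k_hat, theta_hat)         # close to (3.0, 2.0)
```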
Lecture 12: M-Estimation
Let $\Xton \iid \P$ on a sample space $E \subseteq \R^d$.
The goal is to estimate some parameter $\true\mmu$ associated with $\P$. We find a function $\rho : E \times \mathcal{M} \rightarrow \R$, where $\mathcal{M}$ is the parameter set for $\true\mmu$, such that:
where $Z \sim \Norm(0, 1), V \sim \chi^2_d$ and $Z \perp V$.
Student's T test (one sample, two-sided): let $\Xton \iid \Norm(\mu, \sigma^2)$. We want to test $H_0: \mu = \mu_0 = 0$ against $H_1: \mu \neq 0$. We define the test statistic as:
Since, under $H_0$, $\sqrt{n}\frac{\bar{X}_n - \mu_0}{\sigma} \sim \Norm(0, 1)$ and $\frac{\tilde{S}_n}{\sigma^2} \sim \frac{1}{n-1}\chi^2_{n-1}$ are independent by Cochran's Theorem,
$$T_n \sim t_{n-1}$$
The non-asymptotic Student's test is therefore written, at level $\alpha$:
Student's T test (one sample, one-sided): if instead we have $H_0: \mu = \mu_0 = 0$ and $H_1: \mu > 0$, the test is written, at level $\alpha$:
$$\psi_\alpha = \one{T_n > q_\alpha(t_{n-1})}$$
Student's T test (two samples, two-sided): let $\Xton \iid \Norm(\mu_X, \sigma^2_X)$ and $Y_1, \ldots, Y_m \iid \Norm(\mu_Y, \sigma^2_Y)$. We want to test $H_0: \mu_X = \mu_Y$ against $H_1: \mu_X \neq \mu_Y$. The test statistic is written:
Non-asymptotic: it can be run on small samples, and it can also be seamlessly applied to large samples.
The samples must be Gaussian.
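A minimal sketch of the one-sample, two-sided test above (the data are made up; the SciPy call is only a cross-check of the hand computation):

```python
# One-sample, two-sided Student's t test.
import numpy as np
from scipy import stats

x = np.array([0.3, -0.1, 0.8, 0.4, -0.2, 0.6, 0.1, 0.5])  # hypothetical Gaussian sample
mu_0 = 0.0

n = len(x)
s_tilde = x.var(ddof=1)                                  # unbiased sample variance
t_n = np.sqrt(n) * (x.mean() - mu_0) / np.sqrt(s_tilde)  # test statistic
p_value = 2 * (1 - stats.t.cdf(abs(t_n), df=n - 1))      # exact (non-asymptotic) p-value

t_scipy, p_scipy = stats.ttest_1samp(x, popmean=mu_0)    # should match the values above
print(t_n, p_value, t_scipy, p_scipy)
```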
Lecture 14: Wald's Test, Likelihood Ratio Test, and Implicit Hypothesis Testing
Wald's Test: let $\Xton$ be an i.i.d sample associated with a statistical model $\statmodel$, where $\Theta \subseteq \R^d, d \geq 1$. Let $\trtth$ be the true parameter and $\tth_0 \in \Theta$. We want to test $H_0: \trtth = \tth_0$ against $H_1: \trtth \neq \tth_0$. Let $\etthn^{MLE}$ be the maximum likelihood estimator, assuming the conditions are satisfied. Under $H_0$, using Slutsky:
Wald's test is also valid for a one-sided test, but is less powerful.
Likelihood ratio test: let $r \in \lb 0, d \rb$. Let $\tth_0 = (\theta^0_{r+1}, \ldots, \theta^0_d)^\top \in \R^{d-r}$. Let us suppose the null hypothesis is given by:
Implicit testing: let $\gg: \R^d \rightarrow \R^k$ be continuously differentiable, with $k \leq d$. We want to test $H_0: \gg(\tth) = \zz$ against $H_1: \gg(\tth) \neq \zz$. By applying the Delta Method:
Lecture 15: Goodness of Fit Test for Discrete Distributions
Categorical distribution: let $E = \{a_1, \ldots, a_K\}$ be a finite space and $(\P_\pp)_{\pp \in \Delta_K}$ be the family of all distributions over $E$:
$$\tag{4.14}\forall k \in \lb 1, K \rb, \P_\pp(X = a_k) = p_k$$
Goodness of fit test: let us consider $\pp, \pp^0 \in \Delta_K$, and $\Xton \iid \P_\pp$. We want to test $H_0: \pp = \pp^0$ against $H_1: \pp \neq \pp^0$.
For example, we can test against the uniform distribution $\pp^0 = (1/K, \ldots, 1/K)^\top$.
We cannot apply Wald's test directly because of the constraint $\sum_kp_k = 1$. Under this constraint, the MLE is:
$$\tag{4.15}\forall k \in \lb 1, K \rb, \hat{p}_k = \frac{N_k}{n}$$
where $N_k = |\{i \in \lb 1, n \rb : X_i = a_k\}|$ is the number of occurrences of the $k$-th element in the sample.
where $\etth$ is the MLE estimator of $\tth$ given the data under $H_0$.
For example, let us test whether the data come from some binomial distribution $\mathcal{Binom}(N, p)$ where $p$ is unknown. The support has cardinality $N+1$, hence $K = N+1$, and the dimension of $\Theta$ is $d=1$; hence, the asymptotic distribution is $\chi^2_{K-1-d} = \chi^2_{N-1}$.
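A sketch of the resulting chi-squared goodness-of-fit test in the simpler fixed-$\pp^0$ case (my example, testing against the uniform distribution; the statistic $n\sum_k(\hat p_k - p^0_k)^2/p^0_k$ is asymptotically $\chi^2_{K-1}$ under $H_0$):

```python
# Chi-squared goodness-of-fit test of H0: p = p0 for a categorical sample.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(3)
K = 6
p0 = np.full(K, 1 / K)                     # H0: uniform over the K categories
x = rng.integers(low=0, high=K, size=600)  # hypothetical data (here actually uniform)

n = len(x)
counts = np.bincount(x, minlength=K)       # N_k, the category counts
p_hat = counts / n                         # MLE under the simplex constraint
t_n = n * np.sum((p_hat - p0) ** 2 / p0)

alpha = 0.05
print(t_n, t_n > chi2.ppf(1 - alpha, df=K - 1))  # reject H0 at asymptotic level alpha?
```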
Lecture 16: Goodness of Fit Test Continued: Kolmogorov-Smirnov test, Kolmogorov-Lilliefors test, Quantile-Quantile Plots
Let $\Xton$ be i.i.d random variables. The CDF of a random variable $X$ is defined as:
$$\forall t \in \R, F(t) = \P[X \leq t]$$
The empirical CDF of the sample $\Xton$ is defined as:
$$\tag{4.18}F_n(t) = \frac1n\sum_{i=1}^n\one{X_i \leq t} = \frac{|\{i \in \lb 1, n \rb : X_i \leq t\}|}{n}$$
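In code, (4.18) is just a sample proportion (tiny sketch with made-up numbers):

```python
# Empirical CDF: the fraction of sample points at or below t.
import numpy as np

x = np.array([2.0, 0.5, 1.5, 3.0, 1.0])  # hypothetical sample

def ecdf(t, sample=x):
    return np.mean(sample <= t)

print(ecdf(0.0), ecdf(1.5), ecdf(10.0))  # 0.0, 0.6, 1.0
```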
The Glivenko-Cantelli Theorem (Fundamental theorem of statistics) gives us:
where $\B$ is the Brownian bridge distribution over $[0, 1]$, and more importantly, a pivot distribution.
Kolmogorov-Smirnov test: let $\Xton$ be i.i.d random variables with unknown CDF $F$. Let $F^0$ be a continuous CDF. We want to test $H_0: F = F^0$ against $H_1: F \neq F^0$. Let $F_n$ be the empirical CDF of the sample. Then, under $H_0$:
Please be careful: some tables give the values for $\frac{T_n}{\sqrt{n}}$ instead of $T_n$.
We can compute $T_n$ with a formula, using the facts that $F^0$ is non-decreasing and $F_n$ is piecewise constant. Let us reorder our samples $X_{(1)} \leq X_{(2)} \leq \ldots \leq X_{(n)}$. Then:
Pivotal distribution: let us consider $U_i = F^0(X_i)$, with associated empirical CDF $G_n$. Under $H_0$, $U_1, \ldots, U_n \iid \Unif(0, 1)$ and
$$T_n = \sqrt{n}\sup_{x \in [0, 1]}|G_n(x) - x|$$
This justifies that $T_n$ is indeed a pivotal statistic: its distribution under $H_0$ does not depend on $F^0$.
To estimate the quantiles numerically, as long as we can generate random values along a uniform distribution, we can proceed as follows:
With $M$ large, simulate $M$ copies of $T_n$, $T_n^1, \ldots, T_n^M$,
Estimate the quantile $q_\alpha(T_n)$ by $\hat{q}_\alpha^M(T_n)$, the empirical $(1-\alpha)$-quantile of $T_n^1, \ldots, T_n^M$,
Apply the Kolmogorov-Smirnov test with $\delta_\alpha = \one{T_n > \hat{q}_\alpha^M(T_n)}$.
The $p$-value is then given by:
$$\pval \approx \frac{|\{m \in \lb 1, M \rb : T_n^m > T_n\}|}{M}$$
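A Monte Carlo sketch of this whole procedure (added; $n$, $M$ and the data set are arbitrary choices, and $F^0$ is taken to be the standard normal CDF):

```python
# Monte Carlo quantile and p-value for the Kolmogorov-Smirnov statistic.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
n, M, alpha = 50, 10_000, 0.05

def ks_statistic(u):
    """sqrt(n) * sup_x |G_n(x) - x| for a sample u in [0, 1], via the order statistics."""
    m = len(u)
    u = np.sort(u)
    i_over_m = np.arange(1, m + 1) / m
    return np.sqrt(m) * np.max(np.maximum(i_over_m - u, u - (i_over_m - 1 / m)))

# Steps 1-2: simulate M copies of T_n under H0 (uniform samples), take the empirical quantile.
t_sim = np.array([ks_statistic(rng.uniform(size=n)) for _ in range(M)])
q_hat = np.quantile(t_sim, 1 - alpha)

# Step 3: apply the test to one (hypothetical) data set, with F0 = standard normal CDF.
x = rng.normal(size=n)
t_obs = ks_statistic(norm.cdf(x))

print(t_obs > q_hat)           # delta_alpha: reject H0 at level alpha?
print(np.mean(t_sim > t_obs))  # Monte Carlo p-value
```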
Distances other than the $\sup$-distance can be used to measure the difference between two CDFs, for example the Cramér-von Mises and Anderson-Darling distances:
The Kolmogorov-Smirnov test is not valid against a family of distributions. For example, if we want to test whether $X$ has some Gaussian distribution (with unspecified parameters), we cannot simply plug the estimators $\hat\mu, \est{\sigma^2}$ into the Kolmogorov-Smirnov statistic.
Kolmogorov-Lilliefors test. However, for a Gaussian distribution,
Quantile-quantile (QQ) plots: an informal visual check of whether a sample distribution is likely close to a target distribution. Given a sample CDF $F_n$ and a target CDF $F$, we plot:
If the points are aligned along the line $y = x$, the two distributions are likely close to each other. There are four typical patterns of discrepancy, reading the plot from left to right (a plotting sketch follows the list):
Heavier tails: below the diagonal, then above.
Lighter tails: above the diagonal, then below.
Right-skewed: above, then below, then above the diagonal.
Left-skewed: below, then above, then below the diagonal.
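A minimal plotting sketch (added; the Student-$t$ sample is my choice, picked because its heavier tails produce the first pattern above):

```python
# Normal QQ plot: sorted (standardized) sample against standard normal quantiles.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

rng = np.random.default_rng(5)
x = rng.standard_t(df=3, size=500)     # hypothetical heavy-tailed sample

n = len(x)
z = np.sort((x - x.mean()) / x.std())  # standardized sample quantiles
theoretical = norm.ppf((np.arange(1, n + 1) - 0.5) / n)

plt.scatter(theoretical, z, s=8)
plt.axline((0, 0), slope=1)            # the y = x reference line
plt.xlabel("theoretical quantiles")
plt.ylabel("sample quantiles")
plt.show()
```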
Unit 5: Bayesian Statistics
Lecture 17: Introduction to Bayesian Statistics
Bayesian inference conceptually amounts to weighting the likelihood $L_n(\tth)$ by prior knowledge we might have about $\tth$.
Given a statistical model $\sstatmodel$, we technically model our parameter $\tth$ as if it were a random variable. We therefore define the prior distribution (PDF):
$$\pi(\tth)$$
Let $\Xton$ be the sample. We denote by $L_n(\Xton|\tth)$ the joint distribution of $\Xton$ conditioned on $\tth$, where $\tth \sim \pi$. This is exactly the likelihood from the frequentist approach.
Bayes' formula. The posterior distribution verifies:
We can often use an improper prior, i.e. a prior that is not a proper probability distribution (whose integral diverges), and still get a proper posterior. For example, the improper prior $\pi(\tth) = 1$ on $\Theta$ gives the likelihood as a posterior.
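Worked example (added): a Bernoulli experiment with a Beta prior. If $X_1, \ldots, X_n \iid \text{Ber}(p)$ conditionally on $p$, and $p \sim \text{Beta}(a, b)$, then Bayes' formula gives

$$\pi(p \mid \Xton) \propto \underbrace{p^{a-1}(1-p)^{b-1}}_{\text{prior}} \cdot \underbrace{p^{\sum_i X_i}(1-p)^{n - \sum_i X_i}}_{\text{likelihood}}, \quad\text{i.e.}\quad p \mid \Xton \sim \text{Beta}\Big(a + \sum_{i=1}^n X_i,\; b + n - \sum_{i=1}^n X_i\Big)$$

The prior is conjugate: the posterior stays in the Beta family, with the data only updating its parameters.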
Lecture 18: Jeffreys Prior and Bayesian Confidence Interval
where $I(\tth)$ is the Fisher information. This prior is invariant by reparameterization, which means that if we have $\eeta = \phi(\tth)$, then the same prior gives us a probability distribution for $\eeta$ verifying:
Bayesian confidence region. Let $\alpha \in (0, 1)$. A Bayesian confidence region with level $\alpha$ is a random subset $\mathcal{R} \subset \Theta$ depending on $\Xton$ (and the prior $\pi$) such that:
Bayesian confidence region and confidence interval are distinct notions.
The Bayesian framework can be used to estimate the true underlying parameter. In that case, it is used to build a new class of estimators, based on the posterior distribution.
The Bayes estimator (posterior mean) is defined as:
We can also consider different descriptions of the distribution, like the median, quantiles or the variance.
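Continuing the Beta-Bernoulli example from Lecture 17 (my sketch; the flat $\text{Beta}(1,1)$ prior and the data are made up), the posterior mean and a central posterior region are one-liners:

```python
# Bayes estimator (posterior mean) and a central 95% posterior region
# for the Beta-Bernoulli model.
import numpy as np
from scipy.stats import beta

x = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])  # hypothetical Bernoulli sample
a, b = 1.0, 1.0                               # flat prior Beta(1, 1)

a_post = a + x.sum()
b_post = b + len(x) - x.sum()

post_mean = a_post / (a_post + b_post)              # Bayes estimator
region = beta.ppf([0.025, 0.975], a_post, b_post)   # central 95% posterior mass

print(post_mean, region)
```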
Linear regression: trying to fit an arbitrary function to $\E[Y | X=x]$ is a nonparametric problem; therefore, we restrict ourselves to the tractable class of linear functions:
$$f: x \mapsto a + bx$$
Theoretical linear regression: let $X, Y$ be two random variables admitting second moments, such that $\V[X] > 0$. The theoretical linear regression of $Y$ on $X$ is the line $\true{a} + \true{b}x$ where
$$\tag{6.2}(\true a, \true b) = \argmin{(a, b) \in \R^2}\E\left[(Y - a - bX)^2\right]$$
Which gives:
$$\tag{6.3}\true b = \frac{\Cov(X, Y)}{\V[X]}, \quad \true a = \E[Y] - \true{b}\, \E[X]$$
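These expressions follow from setting the derivatives of the quadratic risk to zero (a step added for completeness):

$$\frac{\partial}{\partial a}\E\left[(Y - a - bX)^2\right] = -2\,\E[Y - a - bX] = 0, \qquad \frac{\partial}{\partial b}\E\left[(Y - a - bX)^2\right] = -2\,\E[X(Y - a - bX)] = 0$$

The first equation gives $a = \E[Y] - b\,\E[X]$; substituting it into the second gives $\Cov(X, Y) - b\,\V[X] = 0$, hence (6.3).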
Noise: we model the noise of $Y$ around the regression line by a random variable $\varepsilon = Y - \true a - \true b X$, such that:
Matrix form: we can rewrite these expressions. Let $\YY = (Y_1, \ldots, Y_n)^\top \in \R^n$, and $\eepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^\top$. Let
Bonferroni's test: if we want to test the significance of several coefficients at the same time, we cannot use the same level $\alpha$ for each individual test; we must use a stricter level for each of them. Let us consider $S \subseteq \{1, \ldots, p\}$. Let us consider
This test also works for implicit testing (for example, $\beta_1 \geq \beta_2$).
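The stricter per-test level works because of a union bound (added step): running each of the $K = |S|$ individual tests at level $\alpha / K$ keeps the probability of at least one false rejection under the global null at most $\alpha$:

$$\P\left[\bigcup_{j \in S}\{\text{test } j \text{ rejects}\}\right] \leq \sum_{j \in S}\P\left[\text{test } j \text{ rejects}\right] \leq K \cdot \frac{\alpha}{K} = \alpha$$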
Unit 7: Generalized Linear Models
Lecture 21: Introduction to Generalized Linear Models; Exponential Families
The assumptions of a linear regression are:
The noise is Gaussian: $Y | \XX = \xx \sim \Norm(\mu(\xx), \sigma^2)$,
The regression function is linear: $\mu(\xx) = \xx^\top \bbeta$.
We want to relax both of these assumptions for Generalized Linear Models, because some response variables do not fit in this framework (for example, a binary response $Y \in \{0, 1\}$). Instead, we assume:
Some distribution for the noise: $Y | \XX = \xx \sim \mathcal{D}$,
A link function $g$ applied to the mean: $g(\mu(\xx)) = \xx^\top \bbeta$.
Exponential family: a family of distributions $\{\P_\tth, \tth \in \Theta\}$, $\Theta \subseteq \R^k$, is said to be a $k$-parameter exponential family on $\R^d$ if there exist:
Lecture 22: GLM: Link Functions and the Canonical Link Function
We need to link our model back to the parameter of interest, $\bbeta$. For that, we need a link function between our linear predictor and the mean parameter $\mu$:
$$\tag{7.5}\XX^\top \bbeta = g(\mu(\XX))$$
We require $g$ to be monotone increasing and differentiable.
$g$ maps the domain of the parameter $\mu$ of the distribution to the entire real line (the range of $\XX^\top\bbeta$). For example:
For a linear model, $g$ is the identity.
For a Poisson distribution or an Exponential distribution ($\mu > 0$), we can use $g = \ln: (0, \infty) \rightarrow \R$.
For a Bernoulli distribution, we can use:
The logit function $g: \mu \mapsto \ln(\frac{\mu}{1 - \mu})$,
The probit function $g: \mu \mapsto \Psi^{-1}(\mu)$ ($\Psi$ being the standard normal CDF).
The logit is the natural choice; such a model is called logistic regression.
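A tiny sketch of these two links and their inverses (added; SciPy provides the standard normal CDF $\Psi$ and its inverse):

```python
# Logit and probit links: both map the Bernoulli mean mu in (0, 1) to the real line.
import numpy as np
from scipy.stats import norm

def logit(mu):
    return np.log(mu / (1 - mu))

def inv_logit(eta):          # the sigmoid, g^{-1} for the logit link
    return 1 / (1 + np.exp(-eta))

def probit(mu):
    return norm.ppf(mu)      # Psi^{-1}

def inv_probit(eta):
    return norm.cdf(eta)     # Psi

mu = np.array([0.1, 0.5, 0.9])
print(logit(mu), inv_logit(logit(mu)))    # round-trips back to mu
print(probit(mu), inv_probit(probit(mu)))
```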
The link $g$ mapping $\mu$ to the parameter $\theta$ is called the canonical link:
$$\tag{7.6} g_c(\mu) = \theta$$
As $\mu = b'(\theta)$,
$$\tag{7.7} g_c(\mu) = (b')^{-1}(\mu)$$
If $\phi > 0$, then since $\V[Y] = \phi\, b''(\theta)$ we have $b''(\theta) > 0$, so $b'$ is strictly increasing and $g_c = (b')^{-1}$ is also strictly increasing.
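Worked example (added): writing the Bernoulli distribution in canonical exponential form gives $b(\theta) = \ln(1 + e^\theta)$ (with $\phi = 1$), so

$$\mu = b'(\theta) = \frac{e^\theta}{1 + e^\theta} \quad\Longrightarrow\quad g_c(\mu) = (b')^{-1}(\mu) = \ln\frac{\mu}{1 - \mu}$$

i.e. the canonical link of the Bernoulli model is the logit, which is exactly logistic regression.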
Back to $\bbeta$: let us consider $(\XX_1, Y_1), \ldots, (\XX_n, Y_n) \in \R^{p+1}$ i.i.d, such that the PDF of $Y_i | \XX_i = \xx_i$ has density in the canonical exponential family:
Using the matrix notation $\YY = (Y_1, \ldots, Y_n)^\top$, $\X = (\XX_1, \ldots, \XX_n)^\top \in \R^{n\times p}$, the parameters $\theta_i$ are linked to $\bbeta$ via the following relations:
We can therefore apply our statistical tests (Wald's, likelihood ratio, …) to test hypotheses about our parameter (for example, the significance of some $\beta_i$).
Parameters: $a, b \in \R, a < b$ (usually, $a = 0$)
Support: $E = [a, b] \subset \R$
Probability density function (PDF):
$$\tag{A.6.1} f_{a, b}(x) = \frac1{b-a}\one{a \leq x \leq b}$$
Cumulative distribution function (CDF):
$$\tag{A.6.2} F_{a, b}(x) = \begin{cases}\displaystyle 0 & \text{if}\quad x < a \\ \displaystyle \frac{x-a}{b-a} & \text{if}\quad a \leq x \leq b \\ \displaystyle 1 & \text{if}\quad x > b \end{cases}$$
A probability space is a triplet $(\Omega, \mathcal{F}, P)$ where $\Omega$ is the set of possible outcomes, $\mathcal{F}$ is a set of subsets of $\Omega$ such that $(\Omega, \mathcal{F})$ is a measurable space (i.e. $\mathcal{F}$ is a $\sigma$-algebra), and $P: \mathcal{F} \rightarrow [0, 1]$ is a probability measure, i.e. it is countably additive and $P(\Omega) = 1$.
A random variable $X$ is a (measurable) function:
$$\tag{B.1.1}X: \Omega \rightarrow E$$
where $E$ is the value space of $X$; usually, $E \subseteq \R$. For any $S \subseteq E$, the probability of $X \in S$ is given by: