- 1. Types of response variable distributions
- 2. Types of predictors
- 3. Basic n-way ANOVA models
- 4. Tables and types of contrasts
- 5. Check the assumptions for the "good" model
In statistics we can find four main distributions:
The Gaussian (normal) distribution is a type of continuous probability distribution. It is defined by two statistics:
- Mean ($\mu$): the centre of the distribution (0 in the standard normal).
- Standard deviation ($\sigma$): a measure of how spread out our data are (1 in the standard normal).
Key points:
- A normal distribution is the proper term for a probability bell curve.
- Normal distributions are symmetrical, but not all symmetrical distributions are normal.
- In a normal distribution: mean ($\mu$) = median = mode
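A minimal R sketch of these properties, using simulated (hypothetical) data:

```r
# Simulate draws from a standard normal: mu = 0, sigma = 1
set.seed(42)
x <- rnorm(10000, mean = 0, sd = 1)

mean(x)    # ~0 (the centre, mu)
median(x)  # ~0; in a normal distribution mean = median = mode
sd(x)      # ~1; how spread out the data are
hist(x, breaks = 50, main = "Standard normal sample")  # symmetric bell curve
```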
The Poisson distribution has the following characteristics:
- Distribution of integers (usually counts)
- It cannot have negative values
- It is described by only one parameter, lambda ($\lambda$), so that the mean is equal to the variance of the distribution: $\mu = \sigma^2 = \lambda$
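A minimal R sketch of the mean = variance property, with an arbitrary (hypothetical) $\lambda$:

```r
# Simulate Poisson counts with lambda = 4
set.seed(42)
counts <- rpois(10000, lambda = 4)

min(counts)   # >= 0: integer counts, never negative
mean(counts)  # ~4
var(counts)   # ~4: mean and variance both estimate lambda
```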
The negative binomial distribution is more spread out than a Poisson (variance > mean). Its characteristics are:
- Distribution of integers (usually counts)
- It cannot have negative values
- It is described by two parameters:
  - Mean ($\mu$)
  - A dispersion parameter that inflates the variance (called `size` in R)

The variance is defined by:

$\sigma^2 = \mu + \frac{\mu^2}{\text{size}}$

When `size` is very large, $\mu^2/\text{size}$ tends to 0 and the variance ends up looking like that of a Poisson distribution ($\sigma^2 = \mu$).
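A minimal R sketch of this mean-variance relationship (the values of `mu` and `size` are arbitrary):

```r
# Simulate negative binomial counts with R's (mu, size) parameterisation
set.seed(42)
mu   <- 4
size <- 2
nb <- rnbinom(10000, mu = mu, size = size)

mean(nb)          # ~4
var(nb)           # ~ mu + mu^2/size = 4 + 16/2 = 12 (variance > mean)
mu + mu^2 / size  # theoretical variance

# With a very large size, mu^2/size tends to 0 and the variance approaches mu
var(rnbinom(10000, mu = mu, size = 1e6))  # ~4, Poisson-like
```

Note how the variance collapses to the mean as `size` grows, which is why the Poisson can be seen as a limiting case of the negative binomial.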
Depending on the nature of our response variable, we will work with different types of models:
| General Linear Models | Generalised Linear Models |
|---|---|
| Gaussian distribution | Poisson distribution |
| | Binomial distribution |
| | Negative binomial distribution |
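A minimal sketch of how this choice looks in R, using simulated (hypothetical) data and variable names:

```r
set.seed(1)
d <- data.frame(x = runif(100))
d$y_gauss  <- 2 + 3 * d$x + rnorm(100)                            # continuous response
d$y_count  <- rpois(100, lambda = exp(0.5 + d$x))                 # count response
d$y_binary <- rbinom(100, size = 1, prob = plogis(-1 + 2 * d$x))  # 0/1 response
d$y_over   <- rnbinom(100, mu = exp(0.5 + d$x), size = 2)         # overdispersed counts

m_lm   <- lm(y_gauss ~ x, data = d)                       # Gaussian -> general linear model
m_pois <- glm(y_count ~ x, family = poisson, data = d)    # counts -> Poisson GLM
m_bin  <- glm(y_binary ~ x, family = binomial, data = d)  # 0/1 -> binomial GLM

library(MASS)                         # glm.nb() lives in MASS
m_nb <- glm.nb(y_over ~ x, data = d)  # overdispersed counts -> negative binomial GLM
```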
AN ALTERNATIVE IS TO TRANSFORM THE RESPONSE VARIABLE AND APPLY GENERAL LINEAR MODELS
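A minimal sketch of this alternative, reusing the hypothetical data frame `d` from the previous sketch:

```r
# Log-transform a skewed count response, then fit an ordinary linear model
# (log1p = log(1 + y) is used so that zero counts remain defined)
d$y_log <- log1p(d$y_over)
m_trans <- lm(y_log ~ x, data = d)
summary(m_trans)  # coefficients are interpreted on the transformed scale
```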
In statistics, normality tests are used to determine if a data set is well-modeled by a normal distribution and to compute how likely it is for a random variable underlying the data set to be normally distributed.
The residuals of the model should follow a normal distribution, looking something like this:
[Insert a figure of the normal distribution here]
How can we explore normality in our residuals? (see the R sketch after this list)
- Visually:
  - Histogram
  - Q-Q plot (normal probability plot)
- Analytically / statistically:
  - Shapiro-Wilk test. The null hypothesis of this test is that the population is normally distributed. Thus:
    - If p-value < 0.05 ==> data NOT normally distributed
    - If p-value > 0.05 ==> normality cannot be rejected (data are compatible with a normal distribution)

    More info about the Shapiro test here
  - Kurtosis (K): tells us the height and sharpness of the central peak, relative to that of a standard bell curve. The null hypothesis of the associated test is that the population is normally distributed.
  - Skew: measures the asymmetry of the distribution around its mean.
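A minimal sketch of these checks on the residuals of a hypothetical model `m`:

```r
set.seed(1)
x <- runif(100)
y <- 2 + 3 * x + rnorm(100)
m <- lm(y ~ x)
r <- resid(m)

# Visual checks
hist(r, breaks = 20)  # should look roughly bell-shaped
qqnorm(r); qqline(r)  # points should fall close to the line

# Analytical check: Shapiro-Wilk (H0: normality)
shapiro.test(r)  # here p > 0.05 is expected, so normality is not rejected

# Skewness / kurtosis tests (moments package, assumed to be installed)
# library(moments)
# agostino.test(r)  # D'Agostino test of skewness
# anscombe.test(r)  # Anscombe-Glynn test of kurtosis
```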
- The variance of the residuals should be similar across model predictions (homoscedasticity).
- A random scatter pattern of points should appear when plotting residuals against fitted values, without drawing any geometric pattern (e.g., like many balls randomly distributed on a billiard table); see the sketch below.
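A minimal sketch, reusing the hypothetical model `m` from the previous sketch:

```r
# Residuals vs fitted values: a random cloud with constant spread is what we want
plot(fitted(m), resid(m), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Base R bundles the same diagnostic as the first plot of a fitted lm
plot(m, which = 1)
```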
The predictor variables were for a long time called "independent variables". This was an explicit recognition that they should be uncorrelated with each other.

If there is dependence between the predictor variables, we have COLLINEARITY.

Collinearity can be the result of:
- Predictor variables that are correlated by definition (e.g., altitude and temperature)
- A lack of homogeneity in sample size across the different factor levels
Problems we can find when the predictor variables are correlated:
- Variables cancel each other out
- Significance estimates are altered
- Effect sizes are altered
- No convergence between Type I and Type II sums of squares (SS) results
Exploring collinearity between predictor variables (see the sketch after this list):
- Correlation between variables
- VIF (Variance Inflation Factor) index
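A minimal sketch of both checks on simulated (hypothetical) predictors; `vif()` comes from the `car` package, which is assumed to be installed:

```r
set.seed(1)
alt  <- runif(100, 0, 3000)                  # altitude (m)
temp <- 25 - 0.006 * alt + rnorm(100, 0, 2)  # temperature, correlated with altitude
y    <- 1 + 0.002 * alt + 0.1 * temp + rnorm(100)

# 1. Correlation between predictors
cor(alt, temp)  # strongly negative -> collinearity warning sign

# 2. Variance Inflation Factor
m2 <- lm(y ~ alt + temp)
library(car)
vif(m2)  # values above ~5-10 are commonly taken as problematic
```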
AFTER ALL THIS EXPLORATION OF THE CANONICAL ASSUMPTIONS OF THE MODELS...
WE CAN NOW PROCEED TO ASSESS THEIR RESULTS
How to interpret the model results here