2/24/2021

High-level goals

Logistics

We'll be using four libraries:

  • library(tidyverse)
  • library(lme4)
  • library(lmtest)
  • library(ggridges)

Also, if you want to clone a copy of this tutorial locally on your computer, you can do so with git:

git clone https://github.com/seantrott/model_comparison.git

Introduction

Science involves building models of the world. Today, I'll be focusing on statistical models in particular.

A good statistical model should satisfy the following criteria:

  1. Accuracy: A model should describe data accurately and generate accurate predictions.
  2. Parsimony: All else being equal, we should prefer a simpler model with fewer assumptions (i.e., fewer parameters).

Thus, from this point of view, the goal of a model selection procedure is to identify the simplest model that best explains your data.

Model comparison: Background

We want to compare the explanatory power of two or more models. Which model best explains variance in \(Y\)?

Of course, a model with more parameters will almost always explain slightly more variance than a model with fewer parameters. So explaining variance in \(Y\) isn't our only goal: we also want a parsimonious model.

There are a number of different model comparison methods, each designed to assess, essentially, model accuracy and parsimony.

Likelihood ratio tests

Likelihood

The likelihood of a model, \(\mathcal{L}(M_{k})\), is a way of quantifying how well a model fits the data.

  • What is the likelihood that a given model (or hypothesis) generated a given dataset?

A likelihood ratio test compares the likelihoods of different models for the same data.

  • The basic intuition is that we should prefer the model under which the data are more likely.

Calculating likelihood

How do we calculate likelihood?

We use a likelihood function, \(\mathcal{L}(y | \theta, X)\), where:

  • \(X\) is our predictor matrix
  • \(\theta\) is our parameter set
  • \(y\) is our vector of observations

In maximum likelihood estimation (MLE), the goal is to (as the name implies) maximize this likelihood (or minimize the negative log-likelihood).

Comparison to OLS Regression

In OLS regression, we want to identify the "best-fitting" line, i.e., the line that minimizes the sum of squared residuals:

\(\sum{(\hat{y} - y)^2}\)

In essence, the sum of squared residuals measures the "distance" between the predictions from your model and the actual values in your data. OLS thus makes intuitive sense: with linear regression, we want to identify the best-fitting line, where the "best fit" is the line with the smallest deviations from the real data.

OLS Example

\(Y = 10 + 2X + Normal(\sigma = 20)\)

set.seed(1)
X = c(1:100)
Y = 10 + 2 * X + rnorm(n = length(X), sd = 20)

df_xy = data.frame(X, Y)

df_xy %>%
  ggplot(aes(x = X,
             y = Y)) +
  geom_point(alpha = .6) +
  theme_minimal()

OLS Example

We can measure the fit of each line (i.e., model) by calculating the residual sum of squares:

\(\sum{(\hat{y} - y)^2}\)

# The true regression line, plus three alternative candidate lines
y_real = 10 + 2*X
y1 = 20 + 1.5 * X
y2 = 5 + 3 * X
y3 = 40 + .5 * X

# Residual sum of squares for each candidate line
df_rss = data.frame(
  real = sum((Y - y_real)**2),
  m1 = sum((Y - y1)**2),
  m2 = sum((Y - y2)**2),
  m3 = sum((Y - y3)**2)) %>%
  pivot_longer(c(real, m1, m2, m3), names_to = "model", 
               values_to = "rss")

OLS Example
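
To compare the candidate lines, we can sort these RSS values (a small check of my own; the line built from the true parameters should come out lowest):

df_rss %>%
  arrange(rss)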

OLS vs. MLE

OLS is actually a special case of a more general approach to parameter estimation, maximum likelihood estimation (MLE): when the errors are assumed to be normally distributed, the OLS and MLE solutions coincide.

\(\mathcal{L}(\theta | x_1, x_2, ..., x_n) = f(x_1, x_2, ..., x_n | \theta)\)

In particular, we want to calculate the joint probability of all the data points, under the model's estimated parameters. We can write that as follows:

\(\mathcal{L}(\theta | X) = \prod\limits_{i=1}^N p(X_i | \theta)\)

Reformulating MLE

One issue with this formulation is that the product of the probability densities can quickly become either vanishingly small or extremely large, causing numerical underflow or overflow!

Thus, we typically rewrite this as the sum of log probabilities, i.e., the log likelihood:

\(log(\mathcal{L}(\theta | X)) = \sum\limits_{i=1}^N log(p(X_i | \theta))\)

But how exactly is \(p(X_i | \theta)\) calculated?

Back to basics: likelihood for distributions

Distributions

Let's think of a probability distribution as a generative process: for any possible value \(X_i\), a probability function assigns some probability density to that value.

This function can be used to:

  • generate data
  • evaluate the likelihood of a given data point under that distribution

Let's start with the likelihood of continuous, normal distributions.

Likelihood of continuous distributions

Let's simulate a normal distribution with parameters: \(\mu = 0, \sigma = 10\).

set.seed(1)
X = rnorm(n = 1000, mean = 0, sd = 10)
hist(X)

Likelihood of continuous distributions

The density function for some value \(x\) is defined as the following:

\(f(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}(\frac{x - \mu}{\sigma})^2}\)

But fortunately, we can also just use the dnorm function in R.
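
As a quick sanity check (my own aside), the manual formula and dnorm should agree; the values x = 5, mu = 0, and sigma = 10 below are arbitrary choices:

x = 5
mu = 0
sigma = 10

# Density from the formula above
(1 / (sigma * sqrt(2 * pi))) * exp(-0.5 * ((x - mu) / sigma)^2)

# Density from R's built-in function
dnorm(x, mean = mu, sd = sigma)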

Likelihood of continuous distributions

As you'd expect given the parameters of this normal distribution, values closer to 0 are more likely than those farther away (e.g., close to 20):

plot(X, dnorm(X, mean = 0, sd = 10))

Likelihood of continuous distributions

Each of those values of \(X\) is assigned a likelihood.

To find the likelihood of all our data, i.e., \(\mathcal{L}(\theta|X)\), we would have to multiply all of these probability values together, or, equivalently, sum their log probabilities. We can do this by first applying log to the output of the dnorm function, then using sum to add up all those logged probabilities:

sum(log(dnorm(X, mean = 0, sd = 10)))
## [1] -3756.581
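
This is also a good place to see why we bother with logs at all (my own aside): the raw product of 1000 densities underflows to zero in floating point, while the sum of log densities stays finite.

# Product of raw densities: numerically underflows to 0
prod(dnorm(X, mean = 0, sd = 10))

# Same quantity on the log scale: finite and usable
sum(dnorm(X, mean = 0, sd = 10, log = TRUE))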

Interim summary

Let's recap exactly what we just did:

  1. First, we generated a bunch of data by drawing samples from a normal distribution with parameters \(\mu = 0, \sigma = 10\).
  2. Then, for each of those data points, we calculated the probability density of that value under a normal distribution with parameters \(\mu = 0, \sigma = 10\).
  3. Finally, we summed all of those log probabilities.

That final step gives us the log likelihood, \(log(\mathcal{L})\), of our parameters, \(\theta\), given the data!

Likelihood for parameter estimation

Now imagine that we didn't know the parameters used to generate our dataset \(X\), and we wanted to estimate them. How would you go about this?

Likelihood for parameter estimation

One intuitive way to do this is to conduct a grid search of all the possible parameters.

Let's make this simple:

  • Assume we already know the standard deviation (\(\sigma\)) and we want to just estimate the mean (\(\mu\))
  • Consider just the integers between -50 and 50

df_likelihood = data.frame(
  mu = seq(from = -50, to = 50, length.out = 101)
)

Likelihood for parameter estimation

Now, for each possible parameter value, let's calculate the log likelihood of the observed data \(X\).

LL = function(mean, sd) {
  # Calculate log likelihood for dataset X
  sum(dnorm(x = X, mean = mean, sd = sd, log = TRUE))
}

# Apply log likelihood function for 
# each hypothesized value of mu (and fix SD at 10).
df_likelihood$ll <- mapply(LL, mean=df_likelihood$mu, sd = 10)

Likelihood for parameter estimation

Clearly, the best (highest) log likelihood value is obtained when \(\mu = 0\) (or at least a value close to 0).
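
A quick way to confirm this from the grid itself (a sketch, assuming df_likelihood from the previous slide is still defined):

# Which hypothesized value of mu maximizes the log likelihood?
df_likelihood %>%
  filter(ll == max(ll))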

Likelihood for parameter estimation

And of course, if we were trying to run this through an optimization algorithm, we'd typically want to minimize the negative log likelihood instead:
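
Here is a minimal sketch of that idea, using base R's optim() to minimize the negative log likelihood for both parameters at once (the starting values are arbitrary):

# Negative log likelihood as a function of a parameter vector c(mean, sd)
nll = function(par) {
  -sum(dnorm(x = X, mean = par[1], sd = par[2], log = TRUE))
}

# Minimize it; the estimates should land near the true values (mu = 0, sigma = 10)
optim(par = c(mean = 1, sd = 1), fn = nll,
      method = "L-BFGS-B", lower = c(-Inf, 0.001))$par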

Recap

This is good news! Remember that our original distribution was generated with \(\mu = 0\).

Thus, it makes sense that the data generated from that distribution would be most likely under a model with similar parameter values.

Now, let's apply these principles to evaluating the likelihood of statistical models.

Likelihood of statistical models

Models as generative

Models are generative.

A model's parameters entail certain probabilities for the possible values that our dependent variable could take on.

  • Model generates predictions
  • These predictions can be compared to actual values of our dependent variable.

What models "know"

Throughout this section, we can think about the parameters of a model in terms of what a model "knows".

We can think about this in terms of successive complexity:

  • A model that only knows about the dependent variable
  • A model that knows about the dependent variable and one covariate
  • A model that knows about the dependent variable and two covariates

And so on.

Example: predicting height

Imagine we want to make predictions about height. This is our distribution in question:
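
The code that generated df_height isn't shown on this slide, so here is a minimal sketch of one way to simulate a comparable dataset. The mean of 170 and SD of 10 are my own made-up parameters, so the exact numbers on later slides won't reproduce:

set.seed(1)
df_height = data.frame(
  height = rnorm(n = 100, mean = 170, sd = 10)
)
hist(df_height$height)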

Example: predicting height

One of the simplest predictions we could make would just be to predict the mean of this distribution.

This approach won't get us too far!

Example: predicting height

But what if in addition to knowing each person's height, we also knew their mother's height?

This would give us much more leverage.

# Simulate mothers' heights, centered on each child's height (plus noise)
df_height$m_height = rnorm(n = 100, df_height$height, sd = 10)

df_height %>%
  ggplot(aes(x = m_height,
             y = height)) +
  geom_point() +
  labs(x = "Mother's height",
       y = "Child's height") +
  theme_minimal()

Example: predicting height

Now, we can estimate additional parameters for our model.

  • Our predictions won't just be based on the mean height
  • We can incorporate information about the mother's height

Thus, each prediction is conditioned on the values of X, and we learn a slope relating X to Y.

Height: fitting a model

Let's fit a linear model to this data:

m = lm(data = df_height,
       height ~ m_height)
m$coefficients
## (Intercept)    m_height 
##  32.6854721   0.4678476

Height: generating predictions

We can use this model to generate predictions, \(\hat{Y}\).

df_height$y_pred = predict(m)

df_height %>%
  ggplot(aes(x = m_height,
             y = height)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(x = "Mother's height",
       y = "Actual height") +
  theme_minimal()

Evaluating predictions

We can also evaluate these predictions. One way to do that is to calculate the residual standard error: roughly, the typical size of the error between our predictions and the real values of Y.

rss = sum((df_height$y_pred - df_height$height)**2)
rse = sqrt(rss / (nrow(df_height) - 2))
rse
## [1] 6.588668

On average, we can expect a prediction error of about 6.59.
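
As a cross-check (my own aside), this matches the residual standard error R reports for the fitted model:

sigma(m)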

Relating this to likelihood

What does this have to do with likelihood?

Earlier, we evaluated the likelihood of different parameters of a normal distribution:

  • the mean of that distribution
  • the standard deviation of that distribution

For each unique X value, we now have a set of predicted values for Y, and an estimate of the average error around each prediction. These correspond, respectively, to the mean and standard deviation of many "mini-distributions" along values of X.

(Error is assumed to be constant: this is why homoscedasticity is an important assumption!!)

Many mini-distributions

For each actual value of Y, we can calculate the probability of that value occurring under a normal distribution with the corresponding mean and standard deviation, where:

\(\mu_i = \beta_1 * X_i + \beta_0\)

\(\sigma = \sqrt{\frac{\sum{(\hat{y} - y)^2}}{n-2}}\)

(Again: error is assumed to be constant!)

Calculating likelihood

This is pretty straightforward to calculate. As before, we sum the log probability densities of each value of Y, but here the distribution in question has a mean that varies with X: namely, the predicted value \(\hat{y}_i\) for each observation.

ll = sum(dnorm(x = df_height$height, 
               mean = df_height$y_pred, 
               sd = rse, 
               log = TRUE))
ll
## [1] -329.429
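
For comparison (my own aside), R's built-in logLik() gives a very similar, but not identical, value: it evaluates the likelihood at the maximum likelihood estimate of \(\sigma\) (which divides the RSS by \(n\)) rather than at the residual standard error (which divides by \(n - 2\)).

logLik(m)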

Visualization

This has all gotten a little abstract. Let's visualize what's going on for a couple points in particular:

Wrap-up: likelihood of statistical models

So far, we've learned:

  • Likelihood is just the probability of obtaining a set of data, under a particular distribution.
  • We can apply this to statistical models (like linear regression), by calculating the probability of each data point under distributions centered around the predicted value.

Now we can move onto likelihood ratio tests.

Likelihood ratio tests

Intuition

Given a choice between two models, we should prefer the model that best accounts for some set of data.

In particular, a LRT is typically used to select between two nested models, i.e., where one model is considered a "special case" of a larger model.

Formal description (pt. 1)

Intuitively, the likelihood ratio is a fraction expressing how much more likely the data are under one model vs. another model. Let's first define this ratio term:

\(\lambda = \frac{\mathcal{L}(M_{k-1})}{\mathcal{L}(M_{k})}\)

Because the smaller model is "nested" within the larger model, its maximized likelihood can never exceed that of the larger model. Thus, this ratio should be bounded between 0 and 1.

  • Values closer to 1 mean the "full" model is not so much better.
  • Values closer to 0 suggest a larger improvement between reduced and full model.

Formal description (pt. 2)

To calculate the full likelihood ratio (let's call this \(\Delta\)), there's just a few more steps:

\(\Delta = -2 * log(\frac{\mathcal{L}(M_{k-1})}{\mathcal{L}(M_{k})})\)

Remember:

  • a smaller ratio between the likelihoods indicates a greater improvement of the full model over the reduced.
  • thus, the log of this ratio should be more negative for smaller ratios (i.e., a bigger improvement).

Unpacking the LRT formula

First, let's set up a vector of possible values of lambda, bounded between 0 and 1.

lambdas = seq(.01, 1, by=.001)

We then take the log of these \(\lambda\) values:

log_lambdas = log(lambdas)

And finally, we calculate the values of \(\Delta\):

ll_ratio = -2 * log_lambdas
df_loglik_example = data.frame(lambdas = lambdas,
                               log_lambdas = log_lambdas,
                               lr = ll_ratio)

Unpacking the LRT formula

As you'd expect, larger lambdas yield log values closer to 0.

Unpacking the LRT formula

Ultimately, larger lambdas also correspond to smaller values of the test statistic \(\Delta\).

Final notes

Takeaway: The more a full model improves on a smaller, "special case" model, the larger the likelihood ratio statistic \(\Delta\) comparing those models will be.

Often we have the log likelihoods rather than raw likelihood, so instead we use:

\(\Delta = -2 * (log(\mathcal{L}(M_{k-1})) - log(\mathcal{L}(M_{k})))\)

Hypothesis testing with LRT

NHST

Many scientists use the paradigm of Null Hypothesis Significance Testing (NHST).

  • Obtain some test statistic
  • Calculate probability of obtaining that test statistic under some sampling distribution
  • If probability is sufficiently low, reject the null hypothesis

Examples: t-test, F-ratio, chi-squared statistic, etc.

Can we apply this to log likelihood ratio tests? Yes!

NHST and LRTs

Wilks' theorem states that for likelihood ratio tests:

As the sample size increases, the distribution of the test statistic (under the null hypothesis) approaches a chi-squared distribution with degrees of freedom equal to the difference in the number of parameters between the two models.

The chi-squared distribution is well-understood. So all we have to do is look up our LRT in the appropriate chi-squared distribution and obtain a p-value.

  • "Appropriate" meaning correct degrees of freedom, where this is equal to the difference in number of parameters between the models

If that p-value is sufficiently low, we can reject the null hypothesis that the reduced model accounts for the data just as well as the full model (i.e., that the additional parameters provide no real improvement).

Chi-squared distributions

To help illustrate this, let's consider several different chi-squared distributions with different degrees of freedom:
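
Here is a sketch of one way to draw these curves (the particular degrees of freedom shown are my own choices):

df_chisq = expand.grid(x = seq(.1, 20, by = .1),
                       df = c(1, 3, 5, 9)) %>%
  mutate(density = dchisq(x, df = df))

df_chisq %>%
  ggplot(aes(x = x,
             y = density,
             color = factor(df))) +
  geom_line() +
  labs(color = "DF") +
  theme_minimal()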

The centrality of degrees of freedom

Clearly, the degrees of freedom is very important!

The same chi-squared statistic is differentially likely under different degrees of freedom.

The centrality of degrees of freedom

Let's calculate the p-value for each of the log-likelihood ratios we simulated earlier, assuming DF = 1.

The centrality of degrees of freedom

Now let's assume DF = 3.
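
A sketch of both calculations (the column names p_df1 and p_df3 are my own):

df_loglik_example = df_loglik_example %>%
  mutate(p_df1 = pchisq(lr, df = 1, lower.tail = FALSE),
         p_df3 = pchisq(lr, df = 3, lower.tail = FALSE))

head(df_loglik_example)

For the same statistic, the p-value is larger under DF = 3 than under DF = 1, which is why getting the degrees of freedom right matters.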

Applying LRT

Let's apply what we've learned to the models we considered earlier when predicting height:

  • Model 1 (simple): predicting height from the mean height (i.e., just an intercept)
  • Model 2 (full): predicting height from the mother's height (i.e., both slope and intercept)

Is Model 2 significantly better than Model 1?

Applying LRT

First, let's calculate the log likelihood of the full model:

m = lm(data = df_height,
       height ~ m_height)
logLik(m)[1]
## [1] -329.4188

Applying LRT

Second, let's calculate the log likelihood of the reduced model:

m2 = lm(data = df_height,
        height ~ 1)

logLik(m2)[1]
## [1] -360.9135

Applying LRT

As expected, the log likelihood of the full model is higher (less negative) than that of the reduced model. But how much higher?

Let's calculate the log likelihood ratio:

delta = -2 * (logLik(m2)[1] - logLik(m)[1])
delta
## [1] 62.98936

Is this a "significant" improvement? How would we know?

Applying LRT

Our models also differ by one degree of freedom. The full model contains a freely varying intercept and a freely varying slope coefficient for the mother's height variable; the reduced model contains only a freely varying intercept.

So we can calculate the p-value for this LRT value by looking it up in a chi-squared distribution with the appropriate degrees of freedom:

pchisq(delta, 1, lower.tail = FALSE)
## [1] 2.078258e-15

Recap: LRT

The likelihood of a model is akin to calculating the probability of the data, given a particular model and its parameters.

We can compare the likelihood of models to determine whether adding certain parameters improves model fit.

See the tutorial (https://seantrott.github.io/model_comparison/) for an application to mixed effects models!

Akaike Information Criterion (AIC)

Akaike Information Criterion (AIC)

Another common model evaluation (and comparison) approach is the Akaike Information Criterion, or AIC (Akaike, 1974). The formula is as follows:

\(AIC = 2k - 2ln(L)\)

Where k is the number of parameters, and L is the maximized likelihood of the model.
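
As a concrete sketch, we can compute this by hand for the full height model m from the LRT section (note that R counts the residual variance as a parameter, so k = 3 here):

# AIC "by hand": 2k - 2 * log likelihood
2 * 3 - 2 * logLik(m)[1]

# R's built-in version, which should match
AIC(m)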

To understand this formula, we can walk through the philosophy underlying AIC.

NHST vs. Information theory

AIC is not a "statistical test".

  • There is no null hypothesis, only a variety of alternative plausible hypotheses.
  • This is in contrast to the LRT we explored above, in which we were "rejecting" some "null hypothesis".

AIC is not tied to any decision rule, whereas NHST is explicitly about decision thresholds.

KL-Divergence

But there is a deep connection between AIC and LRT.

AIC is rooted in Kullback-Leibler Divergence.

  • KL-D: the "distance" between two probability distributions
  • Sometimes viewed as "information loss" when some model is used to approximate "true reality"

Ideally, we want the model that loses the least amount of information about "reality" relative to other models under consideration.

But we don't know what true reality is!

  • Instead, we can calculate the divergence between each of our models and the best model in the dataset.

KL-Divergence (continued)

KL-information is expressed as follows:

\(I(f, g) =\int f(x)log(f(x))dx - \int f(x)log(g(x|\theta))dx\)

As Burnham & Anderson (2004) write, this can also be expressed as:

\(I(f, g) = E_{f}[log(f(x))] - E_{f}[log(g(x|\theta))]\)

Key insights:

  • That first term is constant across all our models, because "true reality" shouldn't change.
  • The second term can be estimated—albeit with some bias—using the maximum likelihood value (Akaike, 1974).
  • The bias in this estimation is approximately equal to the number of parameters in the model (Akaike, 1974)!

Recap: KL-D and AIC

Key takeaways:

  • We want to identify the model that minimizes the information loss about "reality"
  • A classic way to measure information loss is KL-D
  • Akaike figured out that likelihood is a (biased) metric of KL-D
  • That bias is approximately equal to the number of parameters in the model

Now let's walk through the formula in depth!

Walking through AIC

Recall that Akaike demonstrated the following equivalence:

\(\hat{E}(KL) = log(L) - K\)

Akaike then multiplied this result by \(-2\) "for historical reasons", giving us the final value:

\(AIC = -2*log(L) + 2k = 2k - 2*log(L)\)

Understanding AIC

A lower AIC is "better". Why?

  • Smaller negative log likelihood values indicate better model fit
  • AIC is essentially (twice) the negative log likelihood, plus a correction for bias (the 2k term)

Understanding AIC

Let's simulate some models with various likelihood values and numbers of parameters:

## Possible likelihood values
l = seq(.1, .99, length.out = 50)
## Possible number of parameters
k = seq(1, 100, length.out = 25)

## Expand into a "grid", with each 
# row corresponding to a particular model.
df_models = expand.grid(likelihood = l, 
                        parameters = k)

Understanding AIC

Now let's calculate AIC for each of these models.

## Formula to calculate AIC
aic_og = function(l, k) {
  aic = 2*k - 2*log(l)
  aic
}

## Calculate AIC
df_models = df_models %>%
  mutate(aic = aic_og(likelihood, parameters))

Visualizing AIC (pt. 1)

Now let's plot AIC by likelihood, and color each dot according to the number of parameters.
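
A sketch of one way to make this plot (the plotting choices are my own):

df_models %>%
  ggplot(aes(x = likelihood,
             y = aic,
             color = parameters)) +
  geom_point() +
  theme_minimal()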

Visualizing AIC (pt. 2)

We can also view AIC directly against the number of parameters.

Key takeaways

So far, we should have derived two key takeaways:

  • All else being equal, models with higher likelihoods have lower AIC values.
  • All else being equal, models with more parameters have higher AIC values.

Both things make sense. AIC directly reflects both accuracy and parsimony, the two goals of model selection!

Corrected AIC

If the number of parameters is quite large relative to the number of observations, you are advised to use the corrected AIC. This just adds another correction term:

\(AIC_c = AIC + \frac{2K(K + 1)}{n - K - 1}\)

The correction is scaled to the relative difference between the number of parameters and the number of observations.

  • as \(n \to \infty\), this term approaches \(0\)
  • as \(K \to n\), this correction term gets larger and larger

Corrected AIC

I'll plot out various values of n, K, and the resulting correction.
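
Here is a sketch of that simulation (the particular values of n and K are my own choices):

df_correction = expand.grid(n = c(20, 50, 100, 1000),
                            K = seq(1, 15)) %>%
  mutate(correction = (2 * K * (K + 1)) / (n - K - 1))

df_correction %>%
  ggplot(aes(x = K,
             y = correction,
             color = factor(n))) +
  geom_line() +
  labs(color = "n") +
  theme_minimal()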

Comparing AIC and corrected AIC

How do these measures compare?

Recap: AIC vs. corrected AIC

The largest deviations from the line of parity are for the smallest sample sizes (e.g., n = 20).

  • as K approaches n, this correction term gets larger and larger
  • thus, corrected AIC gets larger and larger

Burnham and Anderson (2004) suggest you should use corrected AIC unless you have 40x more observations than parameters.

From my perspective, there doesn't seem to be much harm in using corrected AIC. With large enough samples, the penalty term approaches 0 anyway.

AIC for model selection

Comparing AIC values

The whole point of this is model selection. How do we use AIC for this?

  • We've already established that a lower AIC reflects better model fit.
  • Unlike with log-likelihood ratio tests, the calculation for AIC explicitly incorporates the number of parameters

This makes it pretty straightforward: we can just rank our models from lowest to highest AIC.

AIC Delta

The "absolute" AIC value isn't particularly informative. To better evaluate models, people sometimes rescale AIC according to the best model:

\(\Delta_i = AIC_i - AIC_{min}\)

So for example, suppose our AIC values were as follows:

models = c(1000, 1002, 1004, 1006, 1008, 1010)
models
## [1] 1000 1002 1004 1006 1008 1010

We would rescale them in the following way:

models_rescaled = models - min(models)
models_rescaled
## [1]  0  2  4  6  8 10

Relative likelihood

We can calculate the relative likelihood of two models by simply exponentiating the Delta value:

\(\ell_i = \mathcal{L}(m_i | D) = exp(-\frac{1}{2}\Delta_i)\)

Let's calculate \(\ell_i\) for each of our rescaled models:

L = exp(-(1/2) * models_rescaled)
L
## [1] 1.000000000 0.367879441 0.135335283 0.049787068 0.018315639 0.006737947

Akaike weights

This also allows us to calculate the relative probability of each model, given the data and our set of candidate models, by normalizing each likelihood value:

\(p(m_i | D) = \frac{\ell_i}{\sum\limits_{j=1}^M\ell_j}\)

p = L / sum(L)
p
## [1] 0.633691323 0.233122010 0.085760795 0.031549633 0.011606461 0.004269779

Common pitfalls

Burnham and Anderson (2002) list several common pitfalls people encounter when using information-theoretic methods, including AIC. I'll summarize just a few below:

  • Information-theoretic methods are not a "test"
  • Use of AIC instead of corrected AIC
  • Incorrect estimation of number of parameters
  • Comparing AIC across datasets

This last one is especially important if some of your predictors contain missing data, such that different models will actually be fit to different subsets of the dataset. Not all software packages alert you when certain rows of data are ignored or omitted from a model, so this is something to be aware of.

Relationship to LRT

Clearly, AIC depends on likelihood, since it uses likelihood in its formula.

Further, both LRTs and AIC account in some way for the number of parameters in a model.

  • LRT does this by accounting for the difference in number of parameters across two models
  • AIC calculates this explicitly for a given model

And in fact, the connection is even deeper. For two nested models (a reduced model \(i\) and a full model \(j\)), the difference in AIC is the likelihood ratio test statistic plus a term correcting for the difference in number of parameters:

\(AIC_i - AIC_j = LRT + 2(K_i - K_j)\)
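
As a quick numerical check (assuming m, m2, and delta from the LRT section are still defined, with model \(i\) the reduced model m2 and model \(j\) the full model m):

# Difference in AIC between reduced and full model
AIC(m2) - AIC(m)

# LRT statistic plus twice the difference in parameter counts
delta + 2 * (attr(logLik(m2), "df") - attr(logLik(m), "df"))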

AIC and NHST revisited

Unlike LRTs, AIC is (as far as I know) not typically used in an NHST framework.

This sometimes feels frustrating—we crave binary decision rules: "Is it significant or not? Do I reject the null hypothesis or not?"

But the world is fuzzy and uncertain, so perhaps it's good to acquaint ourselves with thinking about the relative strength of evidence for different models, rather than approaching theory construction from the perspective of which models are "correct" vs. "incorrect".

Final notes