- What is model comparison?
- What's the conceptual intuition behind a model comparison?
- What are some of the different kinds of model comparison approaches?
Follow along: https://seantrott.github.io/model_comparison/
2/24/2021
We'll be using four libraries:
library(tidyverse)
library(lme4)
library(lmtest)
library(ggridges)
Also, if you want to clone a copy of this tutorial locally on your computer, you can do so with git:
git clone https://github.com/seantrott/model_comparison.git
Science involves building models of the world. Today, I'll be focusing on statistical models in particular.
A good statistical model should satisfy the following criteria: it should explain the data well (accuracy), and it should do so with as few parameters as necessary (parsimony).
Thus, from this point of view, the goal of a model selection procedure is to identify the simplest model that best explains your data.
We want to compare the explanatory power of two or more models. Which model best explains variance in \(Y\)?
Of course, a model with more parameters will almost always explain slightly more variance than a model with fewer. So reducing unexplained variance in \(Y\) isn't our only goal: we also want a parsimonious model.
There are a number of different model comparison methods, each designed to assess, essentially, model accuracy and parsimony.
The likelihood of a model, \(\mathcal{L}(M_{k})\), is a way of quantifying how well a model fits the data.
A likelihood ratio test compares the likelihoods of different models for the same data.
How do we calculate likelihood?
We use a likelihood function, \(\mathcal{L}(y | \theta, X)\), where \(y\) is the observed data (our dependent variable), \(\theta\) is a particular set of parameter values, and \(X\) is the set of predictors.
In maximum likelihood estimation (MLE), the goal is to (as the name implies) maximize this likelihood (or minimize the negative log-likelihood).
In OLS regression, we want to identify the "best-fitting" line, that minimizes the sum of squared residuals:
\(\sum{(\hat{y} - y)^2}\)
In essence, the sum of squared residuals measures the "distance" between the predictions from your model and the actual values in your data. OLS thus makes intuitive sense: with linear regression, we want to identify the best-fitting line, where "best fit" is the line with the smallest deviations in the real data.
\(Y = 10 + 2X + Normal(\sigma = 20)\)
set.seed(1)
X = c(1:100)
Y = 10 + 2 * X + rnorm(n = length(X), sd = 20)

df_xy = data.frame(X, Y)

df_xy %>%
  ggplot(aes(x = X, y = Y)) +
  geom_point(alpha = .6) +
  theme_minimal()
We can measure the fit of each line (i.e., model) by calculating the residual sum of squares:
\(\sum{(\hat{y} - y)^2}\)
y_real = 10 + 2 * X
y1 = 20 + 1.5 * X
y2 = 5 + 3 * X
y3 = 40 + .5 * X

df_rss = data.frame(
  real = sum((Y - y_real)**2),
  m1 = sum((Y - y1)**2),
  m2 = sum((Y - y2)**2),
  m3 = sum((Y - y3)**2)) %>%
  pivot_longer(c(real, m1, m2, m3), names_to = "model", values_to = "rss")
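Since the "real" line is the one that actually generated the data (up to noise), it should produce the smallest residual sum of squares. A quick check, assuming the df_rss data frame built above:

# Rank the candidate lines from best to worst fit
df_rss %>%
  arrange(rss)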
OLS is actually a derivation of a more general approach to parameter estimation: maximum likelihood estimation (MLE).
\(\mathcal{L}(\theta | x_1, x_2, ..., x_n) = f(x_1, x_2, ..., x_n | \theta)\)
In particular, we want to calculate the joint probability of each data point, under the model's estimated parameters. We can write that as follows:
\(\mathcal{L}(\theta | X) = \prod\limits_{i=1}^N p(X_i | \theta)\)
One issue with this formulation is that the probability densities for each data point can quickly become either very small or very large!
Thus, we typically rewrite this as the sum of log probabilities.
\(log(\mathcal{L}(\theta | X)) = \sum\limits_{i=1}^N log(p(X_i | \theta))\)
But how exactly is \(p(X_i | \theta)\) calculated?
Let's think of a probability distribution as a generative process: for any possible value \(X_i\), a probability function assigns some probability density to that value.
This function can be used to generate new values, or to evaluate how probable the values we've actually observed are.
Let's start with the likelihood of data from a continuous, normal distribution.
Let's simulate a normal distribution with parameters: \(\mu = 0, \sigma = 10\).
set.seed(1)
X = rnorm(n = 1000, mean = 0, sd = 10)
hist(X)
Let's simulate a normal distribution with parameters: \(\mu = 0, \sigma = 10\).
The density function for some value \(x\) is defined as the following:
\(f(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}(\frac{x - \mu}{\sigma})^2}\)
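To make the formula concrete, here's a minimal sketch of that density written out by hand as an R function (the name manual_dnorm is just illustrative):

# Normal probability density, written out from the formula above
manual_dnorm = function(x, mu, sigma) {
  (1 / (sigma * sqrt(2 * pi))) * exp(-.5 * ((x - mu) / sigma)^2)
}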
But fortunately, we can also just use the dnorm function in R. As you'd expect with the parameters of this normal distribution, values closer to 0 are more likely than those close to 20:
plot(X, dnorm(X, mean = 0, sd = 10))
Each of those values of \(X\) is assigned a likelihood.
To find the likelihood of all our data, i.e., \(\mathcal{L}(X|\theta)\), we would have to multiply each of these probability values, or sum their log probabilities. We can do this by first applying log to the output of the dnorm function, then using sum to add up all those logged probabilities:
sum(log(dnorm(X, mean = 0, sd = 10)))
## [1] -3756.581
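Equivalently, dnorm has a log argument that returns log densities directly (a form we'll rely on later), which avoids logging very small densities after the fact:

# Same log likelihood, using dnorm's built-in log argument
sum(dnorm(X, mean = 0, sd = 10, log = TRUE))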
Let's recap exactly what we just did:
That final step gives us the log likelihood, \(log(\mathcal{L})\), of our parameters, \(\theta\): the log probability of the data under those parameter values!
Now imagine that we didn't know the parameters used to generate our dataset \(X\), and we wanted to estimate them. How would you go about this?
One intuitive way to do this is to conduct a grid search of all the possible parameters.
Let's make this simple:
df_likelihood = data.frame( mu = seq(from = -50, to = 50, length.out = 101) )
Now, for each possible parameter value, let's calculate the log likelihood of the observed data \(X\).
LL = function(mean, sd) {
  # Calculate log likelihood for dataset X
  sum(dnorm(x = X, mean = mean, sd = sd, log = TRUE))
}

# Apply log likelihood function for
# each hypothesized value of mu (and fix SD at 10).
df_likelihood$ll <- mapply(LL, mean = df_likelihood$mu, sd = 10)
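A quick way to pull out the winning parameter value from the grid:

# Which hypothesized value of mu has the highest log likelihood?
df_likelihood$mu[which.max(df_likelihood$ll)]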
Clearly, the best (most likely) log likelihood value is obtained when \(\mu = 0\) (or at least very close to 0).
And of course, if we were trying to run this through an optimization algorithm, we'd want the negative log likelihood instead:
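As a minimal sketch of what that might look like, we could hand the negative log likelihood to a general-purpose optimizer like optim (the starting value and search bounds here are arbitrary):

# Negative log likelihood of X as a function of mu (SD fixed at 10)
negLL = function(mu) {
  -sum(dnorm(x = X, mean = mu, sd = 10, log = TRUE))
}

# Minimize the negative log likelihood over a single parameter
fit = optim(par = 20, fn = negLL, method = "Brent", lower = -50, upper = 50)
fit$par  # should land very close to the true mu of 0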
This is good news! Remember that our original distribution was generated with \(\mu = 0\).
Thus, it makes sense that the data generated from that distribution would be most likely under a model with similar parameter values.
Now, let's apply these principles to evaluating the likelihood of statistical models.
Models are generative.
A model's parameters entail certain probabilities for the possible values that our dependent variable could take on.
Throughout this section, we can think about the parameters of a model in terms of what a model "knows".
We can think about this in terms of successive complexity:
And so on.
Imagine we want to make predictions about height. This is our distribution in question:
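The simulation code for df_height isn't shown on this slide; a minimal sketch that would produce data with the same structure (the mean, SD, and seed here are hypothetical placeholders) might look like this:

# Hypothetical simulation of the height data used below
# (placeholder parameter values; only the structure matters)
set.seed(42)
df_height = data.frame(
  height = rnorm(n = 100, mean = 60, sd = 10)
)
hist(df_height$height)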
One of the simplest predictions we could make would be to just predict the mean of this distribution.
This approach won't get us too far!
But what if in addition to knowing each person's height, we also knew their mother's height?
This would give us much more leverage.
df_height$m_height = rnorm(n = 100, df_height$height, sd = 10)

df_height %>%
  ggplot(aes(x = m_height, y = height)) +
  geom_point() +
  labs(x = "Mother's height", y = "Child's height") +
  theme_minimal()
Now, we can estimate additional parameters for our model.
Thus, each prediction is conditioned on the values of X, and we learn a slope relating X to Y.
Let's fit a linear model to this data:
m = lm(data = df_height, height ~ m_height)
m$coefficients
## (Intercept) m_height ## 32.6854721 0.4678476
We can use this model to generate predictions, \(\hat{Y}\).
df_height$y_pred = predict(m)

df_height %>%
  ggplot(aes(x = m_height, y = height)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(x = "Mother's height", y = "Actual height") +
  theme_minimal()
## `geom_smooth()` using formula 'y ~ x'
We can also evaluate these predictions. One way to do that is to calculate the residual standard error, essentially the average error between our predictions and the real values of Y.
rss = sum((df_height$y_pred - df_height$height)**2)
rse = sqrt(rss / (nrow(df_height) - 2))
rse
## [1] 6.588668
On average, we can expect a prediction error of about 6.59.
What does this have to do with likelihood?
Earlier, we evaluated the likelihood of different parameters of a normal distribution:
For each unique X value, we now have a set of predicted values for Y, and an estimate of the average error around each prediction. These correspond, respectively, to the mean and standard deviation of many "mini-distributions" along values of X.
(Error is assumed to be constant: this is why homoscedasticity is an important assumption!!)
For each actual value of Y, we can calculate the probability of that value occurring under a normal distribution with the corresponding mean and standard deviation, where:
\(\mu_i = \beta_1 * X_i + \beta_0\)
\(\sigma = \sqrt{\frac{\sum{(\hat{y} - y)^2}}{n-2}}\)
(Again: error is assumed to be constant!)
This is pretty straightforward to calculate. As before, we sum the log probability densities of each value of Y under a given distribution, but here the distribution in question has a different mean for each observation: the value of Y predicted from X.
ll = sum(dnorm(x = df_height$height,
               mean = df_height$y_pred,
               sd = rse,
               log = TRUE))
ll
## [1] -329.429
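As a cross-check, R's built-in logLik function gives nearly the same value; the small difference arises because logLik uses the maximum likelihood estimate of the residual variance (RSS / n) rather than the RSS / (n - 2) estimate used above:

# Built-in log likelihood for the fitted model
logLik(m)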
This has all gotten a little abstract. Let's visualize what's going on for a couple points in particular:
So far, we've learned:
Now we can move onto likelihood ratio tests.
Given a choice between two models, we should prefer the model that best accounts for some set of data.
In particular, an LRT is typically used to select between two nested models, i.e., where one model is considered a "special case" of a larger model.
Intuitively, the likelihood ratio is a fraction expressing how much more likely the data are under one model vs. another model. Let's first define this ratio term:
\(\lambda = \frac{\mathcal{L}(M_{k-1})}{\mathcal{L}(M_{k})}\)
Because the smaller model is "nested" within the larger model, its likelihood should never be any larger. Thus, this ratio should be bounded between 0 and 1.
To calculate the full likelihood ratio (let's call this \(\Delta\)), there's just a few more steps:
\(\Delta = -2 *log(\frac{\mathcal{L}(M_{k-1})}{\mathcal{L}(M_{k})})\)
Remember:
First, let's set up a vector of possible values of lambda, bounded between 0 and 1.
lambdas = seq(.01, 1, by=.001)
We then take the log of these \(\lambda\) values:
log_lambdas = log(lambdas)
And finally, we calculate the values of \(\Delta\):
ll_ratio = -2 * log_lambdas

df_loglik_example = data.frame(lambdas = lambdas,
                               log_lambdas = log_lambdas,
                               lr = ll_ratio)
As you'd expect, larger lambdas yield log values closer to 0.
Ultimately, larger lambdas also correspond to smaller likelihood ratios \(\Delta\).
Takeaway: The better a full model is than a smaller, "special case" model, the larger the likelihood ratio between those models will be.
Often we have the log likelihoods rather than raw likelihood, so instead we use:
\(\Delta = -2 * (log(\mathcal{L}(M_{k-1})) - log(\mathcal{L}(M_{k})))\)
Many scientists use the paradigm of Null Hypothesis Significance Testing (NHST).
Examples: t-test, F-ratio, chi-squared statistic, etc.
Can we apply this to log likelihood ratio tests? Yes!
Wilks' theorem states that for likelihood ratio tests:
As the sample size increases, the distribution of the test statistic approaches a chi-squared distribution with degrees of freedom equal to the difference in degrees of freedom between the models.
The chi-squared distribution is well-understood. So all we have to do is look up our LRT in the appropriate chi-squared distribution and obtain a p-value.
If that p-value is sufficiently low, we can reject the null hypothesis that the reduced model accounts for the data as well as the full model.
To help illustrate this, let's consider several different chi-squared distributions with different degrees of freedom:
Clearly, the degrees of freedom is very important!
The same chi-squared statistic is differentially likely under different degrees of freedom.
Let's calculate the p-value for each of the log-likelihood ratios we simulated earlier, assuming DF = 1.
Now let's assume DF = 3.
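The lookup code isn't shown above; here's a minimal sketch, assuming the df_loglik_example data frame built earlier (the new column names are arbitrary):

# P-value for each simulated likelihood ratio statistic,
# under chi-squared distributions with 1 and 3 degrees of freedom
df_loglik_example = df_loglik_example %>%
  mutate(p_df1 = pchisq(lr, df = 1, lower.tail = FALSE),
         p_df3 = pchisq(lr, df = 3, lower.tail = FALSE))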
Let's apply what we've learned to the models we considered earlier when predicting height:
Is Model 2 significantly better than Model 1?
First, let's calculate the log likelihood of the full model:
m = lm(data = df_height, height ~ m_height)
logLik(m)[1]
## [1] -329.4188
Second, let's calculate the log likelihood of the reduced model:
m2 = lm(data = df_height, height ~ 1)
logLik(m2)[1]
## [1] -360.9135
As expected, the log likelihood of the full model is higher than that of the reduced model. But how much higher?
Let's calculate the log likelihood ratio:
delta = -2 * (logLik(m2)[1] - logLik(m)[1])
delta
## [1] 62.98936
Is this a "significant" improvement? How would we know?
Our models also differ by one degree of freedom. The full model contains a freely varying intercept and a freely varying slope coefficient for the mother's height variable; the reduced model contains only a freely varying intercept.
So we can calculate the p-value for this LRT value by looking it up in a chi-squared distribution with the appropriate degrees of freedom:
pchisq(delta, 1, lower.tail = FALSE)
## [1] 2.078258e-15
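Conveniently, the lmtest package we loaded at the start provides lrtest, which runs this same nested-model comparison in one call and should reproduce the statistic and p-value computed by hand above:

# Likelihood ratio test of the reduced model (m2) against the full model (m)
lrtest(m2, m)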
The likelihood of a model is akin to calculating the probability of the data, given a particular model and its parameters.
We can compare the likelihood of models to determine whether adding certain parameters improves model fit.
See the tutorial (https://seantrott.github.io/model_comparison/) for an application to mixed effects models!
Another common model evaluation (and comparison) approach is the Akaike Information Criterion, or AIC (Akaike, 1974). The formula is as follows:
\(AIC = 2k - 2log(\mathcal{L})\)
Where \(k\) is the number of parameters, and \(\mathcal{L}\) is the likelihood of the model.
To understand this formula, we can walk through the philosophy underlying AIC.
AIC is not a "statistical test".
AIC is not tied to any decision rule, whereas NHST is explicitly about decision thresholds.
But there is a deep connection between AIC and LRT.
AIC is rooted in Kullback-Leibler Divergence.
Ideally, we want the model that loses the least amount of information about "reality" relative to other models under consideration.
But we don't know what true reality is!
KL-information is expressed as follows:
\(I(f, g) =\int f(x)log(f(x))dx - \int f(x)log(g(x|\theta))dx\)
As Burnham & Anderson (2004) write, this can also be expressed as:
\(I(f, g) = E_{f}[log(f(x))] - E_{f}[log(g(x|\theta))]\)
Key insights:
Key takeaways:
Now let's walk through the formula in depth!
Recall that Akaike demonstrated the following equivalence:
\(\hat{E}(KL) = log(\mathcal{L}) - k\)
Akaike then multiplied this result by \(-2\) "for historical reasons", giving us the final value:
\(AIC = -2*log(\mathcal{L}) + 2k = 2k - 2*log(\mathcal{L})\)
A lower AIC is "better". Why?
Let's simulate some models with various likelihood values and numbers of parameters:
## Possible likelihood values
l = seq(.1, .99, length.out = 50)

## Possible number of parameters
k = seq(1, 100, length.out = 25)

## Expand into a "grid", with each
## row corresponding to a particular model.
df_models = expand.grid(likelihood = l,
                        parameters = k)
Now let's calculate AIC for each of these models.
## Formula to calculate AIC
aic_og = function(l, k) {
  aic = 2*k - 2*log(l)
  aic
}

## Calculate AIC
df_models = df_models %>%
  mutate(aic = aic_og(likelihood, parameters))
Now let's plot AIC by likelihood, and color each dot according to the number of parameters.
We can also view AIC directly against the number of parameters.
So far, we should have derived two key takeaways:
Both things make sense. AIC directly reflects both accuracy and parsimony, the two goals of model selection!
If the number of parameters is quite large relative to the number of observations, you are advised to use the corrected AIC. This just adds another correction term:
\(AIC_c = AIC + \frac{2K(K + 1)}{n - K - 1}\)
The correction is scaled to the relative difference between the number of parameters and the number of observations.
I'll plot out various values of n, K, and the resulting correction.
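The plotting code isn't shown here, but as a sketch, the correction term itself can be tabulated over a grid of illustrative n and K values:

# AICc correction term for a grid of (illustrative) sample sizes and parameter counts
df_correction = expand.grid(
  n = c(20, 50, 100, 500, 1000),
  K = c(1, 5, 10)
) %>%
  mutate(correction = (2 * K * (K + 1)) / (n - K - 1))

df_correction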
How do these measures compare?
The largest deviations from the line of parity are for the smallest sample sizes (e.g., n = 20).
Burnham and Anderson (2004) suggest you should use corrected AIC unless you have 40x more observations than parameters.
From my perspective, there doesn't seem to be much harm in using corrected AIC. With large enough samples, the penalty term approaches 0 anyway.
The whole point of this is model selection. How do we use AIC for this?
This makes it pretty straightforward: we can just rank our models from lowest to highest AIC.
The "absolute" AIC value isn't particularly informative. To better evaluate models, people sometimes rescale AIC according to the best model:
\(\Delta_i = AIC_i - AIC_{min}\)
So for example, if our AIC values were as follows:
models = c(1000, 1002, 1004, 1006, 1008, 1010)
models

## [1] 1000 1002 1004 1006 1008 1010
We would rescale them in the following way:
models_rescaled = models - min(models)
models_rescaled
## [1] 0 2 4 6 8 10
We can calculate the relative likelihood of two models by simply exponentiating the Delta value:
\(\ell_i = \mathcal{L}(m_i | D) = exp(-\frac{1}{2}\Delta_i)\)
Let's calculate \(\ell_i\) for each of our rescaled models:
L = exp(-(1/2) * models_rescaled)
L
## [1] 1.000000000 0.367879441 0.135335283 0.049787068 0.018315639 0.006737947
This also allows us to calculate the relative probability of each model, given the data and our set of candidate models, by normalizing each likelihood value:
\(p(m_i | D) = \frac{\ell_i}{\sum\limits_{j=1}^M\ell_j}\)
p = L / sum(L)
p
## [1] 0.633691323 0.233122010 0.085760795 0.031549633 0.011606461 0.004269779
Burnham and Anderson (2002) list several common pitfalls people encounter when using information-theoretic methods, including AIC. I'll summarize just a few below:
This last one is especially important if some of your predictors contain missing data, such that different models will actually be fit to different subsets of the dataset. Not all software packages alert you when certain rows of data are ignored or omitted from a model, so this is something to be aware of.
Clearly, AIC depends on likelihood, since it uses likelihood in its formula.
Further, both LRTs and AIC account in some way for the number of parameters in a model.
And in fact, the connection is even deeper. The difference in AIC between a reduced model \(i\) and a fuller model \(j\) is equivalent to the likelihood ratio statistic between those models, plus a term correcting for the difference in number of parameters:
\(AIC_i - AIC_j = LRT + 2(K_i - K_j)\)
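We can check this relationship with the height models from earlier. R's built-in AIC counts the residual variance as an estimated parameter, so the reduced model has \(K_i = 2\) and the full model \(K_j = 3\); both lines below should give the same number:

# Difference in AIC between the reduced model (i = m2) and the full model (j = m)
AIC(m2) - AIC(m)

# LRT statistic plus 2 * (K_i - K_j)
delta + 2 * (2 - 3)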
Unlike LRTs, AIC is (as far as I know) not typically used in an NHST framework.
This sometimes feels frustrating—we crave binary decision rules: "Is it significant or not? Do I reject the null hypothesis or not?"
But the world is fuzzy and uncertain, so perhaps it's good to acquaint ourselves with thinking about the relative strength of evidence for different models, rather than approaching theory construction from the perspective of which models are "correct" vs. "incorrect".
Thanks for coming!
See the tutorial (https://seantrott.github.io/model_comparison/) for more details on AIC, BIC, and multimodel inference.