Sean Trott
November 18, 2019

High-level goals

This tutorial is intended as an accompaniment to my 11/18/2019 workshop, "Mixed effects models in R". The goals of the lecture-portion of that workshop (and thus this tutorial) are to address the following questions:

  1. Why build mixed effects models?
  2. How can you decide what to model as a fixed vs. a random effect? And more broadly, which random effects do you need to account for?
  3. What should your data look like? How do you actually run a model in lme4?
  4. How do you interpret and report the results?

The examples given will likely focus on linear models in particular (where the critical contrast is between OLS regression and linear mixed models), but where possible I will describe extensions to other models1.

In writing this tutorial, I drew heavily from a number of existing tutorials and resources, many of which are listed in the References section at the end. I strongly endorse all of them.

Also, this website has beautiful visualizations illustrating the concept of nested structure in datasets.

Load libraries

First, logistics. We'll be working with two main libraries: library(tidyverse) and library(lme4). If you don't have them installed already, use install.packages("tidyverse") and install.packages("lme4") to install them.
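
Once both packages are installed, load them at the top of your script:

library(tidyverse)
library(lme4)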

Brief (and likely unnecessary) introduction

The point of building statistical models is to quantify relationships between two or more variables. For example, how correlated are A and B?

This quantification is done in the service of some theoretical goal, e.g., to adjudicate between competing hypotheses. Thus, researchers often want to know some second-order information about this relationship, such as how unexpected it is. In the traditional approach of null hypothesis significance testing, this amounts to asking about the probability of your data under the null hypothesis, e.g., p(d|h): a p-value2.

Another popular approach is to use some kind of goodness of fit measure, i.e., how successfully does your model fit the data? Using a model selection procedure, various models can be fit to the data and selected on the basis of their explanatory power (using some heuristic to enforce model parsimony). This is especially useful when examining complex, multivariate relationships because it allows researchers to ask whether their variable of interest explains variance above and beyond the other variables, i.e., how much adding that variable improves a model that omits it. The goal of a model selection procedure is to settle on the simplest model that best explains your data3.

Why use mixed models?

All statistical methods have a set of assumptions baked into them.

One assumption shared by many statistical methods is that each data point is independent, i.e., not dependent on or related to any other observation. For example, if you flip a coin, the probability of getting heads should be independent of other flips; in our theoretical model of coin-flipping, we assume that the generative process underlying a given coin flip is independent of the process underlying other coin flips. Similarly, a procedure like OLS regression assumes that each data point was generated by an independent causal process, and therefore, that your data points are all independent.

This is a useful assumption, and in many cases, is perfectly acceptable.

When the independence assumption is met

When is independence fair to assume?

Broadly speaking: your data points are independent if you've ruled out the possibility of some nested causal process generating multiple observations in your dataset. This causal process could be a person (e.g., the same participant contributing multiple data points), an item (e.g., the same item being answered by different participants, where the item itself isn't your variable of interest), or in certain contexts, even an institution (e.g., multiple participants coming from the same school).

Example 1: You want to know whether a new pharmaceutical drug increases self-reported happiness. You randomly assign people to take the drug vs. some placebo, and 1 month later, each participant answers a single question (e.g., "How happy are you (1-7)?"). You then use a t-test or OLS regression to compare the distribution of people assigned to the treatment group to the distribution of people assigned to the placebo group. Here, each participant contributed exactly 1 data point, so it's pretty reasonable to assume each data point is independent.

Example 2: You want to know whether people respond differently to one kind of metaphor than another. You assign 200 subjects to 1 of two conditions (A vs. B). Each subject is exposed to exactly 1 of 2 possible items (A vs. B). In other words, it's a between-subjects design with only one item. Again, it seems like these are independent observations; the only similarities across observations are exactly the ones you're interested in (i.e., experimental condition). Thus, you can run a t-test or OLS regression to ask whether responses are different in the A vs. B condition.

Examples 1-2 seem to meet the independence assumption4.

When this assumption is met, you can analyze your data with something like OLS regression; an example using simulated data is shown below.

Simulation and analysis

Simulate and visualize the data

Let's simulate some data. Just to make it obvious, we'll start out by simulating two normal distributions with different means and identical standard deviations. As depicted below, these groups will have some overlap but still look pretty different.

### Set random seed
set.seed(42)

### Illustration of two different distributions, A vs. B

A = rnorm(100, mean = 725, sd = 15)
B = rnorm(100, mean = 705, sd = 15)

df = data.frame(A = A,
           B = B) %>%
  gather() %>%
  mutate(group = key)

df %>%
  ggplot(aes(x = value,
             fill = group)) +
  geom_density(alpha = .6) +
  labs(title = "Group A vs. Group B") +
  theme_minimal()

Detour: explanation of OLS

These two groups look pretty different. One way to quantify their difference is to use ordinary least squares regression, predicting value from group, e.g., value ~ group.

Recall that the goal of OLS is to learn a set of coefficients for your predictors that minimize the difference between the predicted values of your model and the observed values. Abstractly, the formula for linear regression looks like this:

\(Y \sim X + \epsilon\)

Where \(Y\) represents our dependent variable, \(X\) represents our predictor (or fixed effect), and \(\epsilon\) represents an error term. Here, "error" broadly encompasses everything that the model doesn't "know" about where variance in \(Y\) might come from (i.e., variance not correlated with variance in \(X\)); the tighter the relationship between \(X\) and \(Y\), the smaller this error term will be.

With OLS, we want to estimate coefficients for our regressors. These coefficients are sometimes called slopes, because they refer to the slope of the line that best fits \(X\) to \(Y\). They're usually represented by the \(\beta\) symbol:

\(Y \sim \beta_{0} + X_{1}\beta_{1} + \epsilon\)

Here, \(\beta_{0}\) refers to the intercept, and \(\beta_{1}\) refers to the slope for the \(X_{1}\) predictor.

With continuous data, \(\beta_{0}\) is the predicted value of \(Y\) when \(X\) is 0. \(\beta_{1}\) means: with every 1-unit increase in \(X\), \(Y\) changes by \(\beta_{1}\).

Categorical data is conceptually similar. R chooses one of the levels of a categorical factor (e.g., group A) as the reference level, which is mapped onto the intercept, so \(\beta_{0}\) refers to the mean value of \(Y\) in that group. \(\beta_{1}\) is thus the average difference between the two groups, which can be expressed as the slope of the line between their means. For example, if the mean value of \(Y\) in group A is 40 and the mean value of \(Y\) in group B is 50, then \(\beta_{0}\) = 40, and \(\beta_{1}\) = 10.
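
If you want to see this coding scheme directly, model.matrix() shows how R expands a factor into a 0/1 dummy column. The toy data frame below is purely illustrative:

# Toy illustration of reference coding for a hypothetical two-level factor.
# The resulting "groupB" column is 1 for group B and 0 for group A, so the
# intercept estimates group A's mean and the groupB coefficient estimates
# the B - A difference.
toy = data.frame(group = factor(c("A", "A", "B", "B")))
model.matrix(~ group, data = toy)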

Model the data

Back to our simulated data. We can apply these principles to modeling the potential differences between our two groups.

Let's run a linear model:

simple_model = lm(data = df,
           value ~ group)

summary(simple_model)
## 
## Call:
## lm(formula = value ~ group, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -45.384  -8.757   0.775   9.320  41.841 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  725.488      1.463  495.97   <2e-16 ***
## groupB       -21.800      2.069  -10.54   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.63 on 198 degrees of freedom
## Multiple R-squared:  0.3593, Adjusted R-squared:  0.3561 
## F-statistic: 111.1 on 1 and 198 DF,  p-value: < 2.2e-16

There are a lot of results here, but for our purposes, we're primarily interested in the coefficient on our fixed effect, group. As noted above, R uses reference coding, which means that one level of a categorical variable is chosen as the reference and mapped onto the intercept; since A is our reference level, the intercept value corresponds to the mean of that distribution. We can verify this ourselves:

simple_model$coefficients["(Intercept)"]
## (Intercept) 
##    725.4877
mean(A)
## [1] 725.4877

If we look below the intercept value, we see the coefficient for our fixed effect, group. Here, it should say something like groupB. The coefficient should thus be interpreted in contrast (or in reference) to the intercept value.

coef_value = simple_model$coefficients["groupB"]
round(coef_value, 2)
## groupB 
##  -21.8

In other words: according to our model, observations in group B are on average 21.8 units lower than observations in group A. Again, we can verify this ourselves by calculating the difference between the means of these groups:

round(mean(B) - mean(A), 2)
## [1] -21.8

We can also express this visually. Below, we see that the mean (and median) of group A is equivalent to the intercept, i.e., \(\beta_{0}\), and the best-fit line, whose slope is \(\beta_{1}\), connects this intercept to the mean of group B.

df %>%
  ggplot() +
  geom_boxplot(aes(x = group,
                    y = value)) +
  geom_point(aes(x = as.integer(factor(group)),
                 y = value)) +
  geom_smooth(aes(x = as.integer(factor(group)),
                 y = value), method = "lm") +
  labs(title = "Value across groups A and B") +
  theme_minimal()
## `geom_smooth()` using formula 'y ~ x'

When independence is not met

Unfortunately, this independence assumption isn't always met. Most people conducting psychology experiments collect more than one data point from each participant5 (this is inevitable in within-subjects designs). Furthermore, most of these experiments probably reuse the same set of items across participants; that is, the same stimulus is likely presented and responded to by multiple participants.

This is problematic for traditional OLS approaches, regardless of what your dependent variable (DV) is:

  • Reaction time (RT)
  • Yes/No responses
  • Gaze fixations
  • EEG amplitude
  • Pretty much any other DV you might be interested in

For example, different subjects likely differ substantially in their reaction time; similarly, different subjects may have different biases towards responding "yes" or "no" to your items. Let's label this problem nested variability in the dependent variable. Even worse, different subjects may differ in the effect of your experimental manipulation on their response (or RT, etc.). That is, some subjects may show a stronger effect than others. Let's call this problem nested variability in the effect of interest. These two problems correspond, roughly, to two solutions offered by mixed models: random intercepts and random slopes. I'll discuss what these terms mean a bit later in the tutorial.

Example: violating the independence assumption

For a quick example, let's load a dataset that comes with the lme4 package (see also Bates et al (2014) for a more detailed walkthrough of the dataset).

This dataset, called sleepstudy, contains data about the average reaction time (Reaction, measured in ms) per day (Days) of subjects in a sleep deprivation study. There are 18 subjects total, with a total of 180 observations; in other words, each subject is responsible for 10 observations in the dataset.
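
A quick way to confirm this structure for yourself (a sketch; output omitted here):

head(sleepstudy)
nrow(sleepstudy)                    # 180 observations
length(unique(sleepstudy$Subject))  # 18 subjects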

Naive analysis

Let's first pretend we didn't know anything about this source of non-independence in the dataset. Naively, we could plot (and analyze) the data as a set of 180 independent observations.

sleepstudy %>%
  ggplot(aes(x = Days,
             y = Reaction)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "RT over successive days") +
  theme_minimal()
## `geom_smooth()` using formula 'y ~ x'

Looks like a positive relationship between Days and Reaction time, as we'd expect: more sleep deprivation leads to slower reaction times. We can use OLS regression to model this relationship:

naive_model = lm(data = sleepstudy,
           Reaction ~ Days)
summary(naive_model)
## 
## Call:
## lm(formula = Reaction ~ Days, data = sleepstudy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -110.848  -27.483    1.546   26.142  139.953 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  251.405      6.610  38.033  < 2e-16 ***
## Days          10.467      1.238   8.454 9.89e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 47.71 on 178 degrees of freedom
## Multiple R-squared:  0.2865, Adjusted R-squared:  0.2825 
## F-statistic: 71.46 on 1 and 178 DF,  p-value: 9.894e-15

The model gives us an intercept of 251.41; this is the estimated RT when \(X\)=0, i.e., before any days of sleep deprivation. The coefficient for Days is 10.47; this means that for each continued day of sleep deprivation, RT increases by an average of 10.47 ms.
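
As a quick sanity check on how to read these coefficients, we can ask the model for its predicted RT after, say, 9 days of deprivation; the predict() call below is just an illustrative sketch.

predict(naive_model, newdata = data.frame(Days = 9))
# equivalently, by hand: 251.405 + 9 * 10.467, i.e., roughly 345.6 ms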

But now we recall that there are only 18 subjects accounting for 180 observations. There must be non-independence in our dataset. That is, different subjects probably have variability in their overall reaction time, and might even show variability in how resilient they are to sleep deprivation. As shown below, subjects do appear to vary considerably in their overall reaction time (and also in the variance in each of their distributions of reaction times).

sleepstudy %>%
  ggplot(aes(x = reorder(Subject, Reaction),
             y = Reaction)) +
  geom_boxplot() +
  labs(title = "Subject-level variance in RT",
       x = "Subject") +
  theme_minimal()

Mixed effect models are one solution to this problem of non-independence6.

Mixed models: a walkthrough

The sleepstudy dataset is an example of data with nested structure. In this case, nested structure manifests as multiple observations coming from the same subject. We can control for this nested structure in two ways:

  1. Random intercepts: Control for subject-level variance in \(Y\), i.e., \(\beta_{0}\).
  2. Random slopes: Control for subject-level variance in the effect of \(X\) on \(Y\), i.e., \(\beta_{1}\).

Below I'll walk through both in turn.

Random intercepts: accounting for nested variance in \(Y\)

Different people probably differ in their average reaction time, regardless of how sleep-deprived they are. Rather than ignoring this individual variation, we can actually incorporate it into our model as something called a random effect7. Recall that the model from earlier looks like this in R:

Reaction ~ Days

Using mixed models, we can augment this model with a random intercept for each subject, which basically means: in addition to calculating the line of best-fit between Days and Reaction, also calculate the mean Reaction for each subject, rather than assuming that subjects don't vary at all. The R syntax to do this is as follows:

Reaction ~ Days + (1 | Subject)

We've now accounted for subject-level variance in \(Y\).
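
If you wanted to fit this random-intercepts-only specification right away, the call would look something like the sketch below (the name model_intercepts is just an illustrative label; we fit a fuller model later in the tutorial):

model_intercepts = lmer(data = sleepstudy,
                        Reaction ~ Days + (1 | Subject),
                        REML = FALSE)
summary(model_intercepts)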

Often, experiments will involve other kinds of nested structure, such as item-level variance. Our current dataset averages across all observations for a subject in a given day, so it doesn't include any fine-grained information about items. But if it did (e.g., Item), and these items exhibited nested structure, we could model it as follows:

Reaction ~ Days + (1 | Subject) + (1 | Item)

In fact, this is one of the major benefits of mixed models over repeated-measures ANOVAs: the same model, fit to the same dataset, can include many different random effects.

Random slopes: accounting for nested variance in the effect of \(X\) on \(Y\)

Random intercepts account for subject-level variance, e.g., the average reaction time of a subject. But there might be another kind of nested variability in our dataset: different subjects may vary in their effect of sleep deprivation on reaction time. This is actually not unreasonable to assume. We've all met people who can function perfectly well on 2-3 hours of sleep, while others (like myself) become a kind of zombie.

We can actually visualize this variability by drawing separate best-fit lines for each subject8. The graph will be kind of messy, but hopefully it illustrates the point that each subject shows a different effect of Days on Reaction. Some subjects exhibit a steep, positive slope: they're having a hard time adjusting to the lack of sleep. Other subjects exhibit milder slopes: they're more resilient. And one subject appears to have a slightly negative slope: somehow, they're getting faster with each day of sleep deprivation. See also this website for a nicer illustration of group-level variability in slope.

sleepstudy %>%
  ggplot(aes(x = Days,
             y = Reaction,
             color = Subject)) +
  # geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "RT over successive days") +
  theme_minimal()
## `geom_smooth()` using formula 'y ~ x'

If our ultimate goal is to come up with a coefficient estimate for the effect of Days, we need to account for the fact that not all subjects show this effect to equal degrees. By partialing out subject-level variance in this effect, we can actually obtain a better estimate, which is less biased by particular subjects.

The syntax for specifying a random slope (and random intercept) looks like this:

Reaction ~ Days + (1 + Days | Subject)

As above, we're accounting for subject-level variance in Reaction, but now, we're also accounting for subject-level variance in the effect of Days on Reaction.

Building and interpreting a model

Now we can actually construct our first linear mixed effects model. Let's construct the full model specified above9:

model_full = lmer(data = sleepstudy,
                  Reaction ~ Days + (1 + Days | Subject),
                  REML = FALSE)

As with a normal lm model object, we can view the output using summary:

summary(model_full)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: Reaction ~ Days + (1 + Days | Subject)
##    Data: sleepstudy
## 
##      AIC      BIC   logLik deviance df.resid 
##   1763.9   1783.1   -876.0   1751.9      174 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.9416 -0.4656  0.0289  0.4636  5.1793 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr
##  Subject  (Intercept) 565.48   23.780       
##           Days         32.68    5.717   0.08
##  Residual             654.95   25.592       
## Number of obs: 180, groups:  Subject, 18
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  251.405      6.632  37.907
## Days          10.467      1.502   6.968
## 
## Correlation of Fixed Effects:
##      (Intr)
## Days -0.138

This looks pretty similar to the summary of an lm object. We see an estimate for our intercept, and a coefficient estimate for Days.

One difference you might notice is that, by default, this summary doesn't tell us anything about the p-value associated with this coefficient. This is an intentional design decision by the authors of lme4, and has to do with the way lme4 calculates its t- and F-statistics10. If you really want to estimate p-values for each coefficient, you can import the package lmerTest, and rerun the model.
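
For completeness, here's a sketch of the lmerTest route (assuming the package is installed); loading lmerTest masks lme4::lmer with a version whose summary() reports approximate degrees of freedom and p-values.

library(lmerTest)
model_full_p = lmer(Reaction ~ Days + (1 + Days | Subject),
                    data = sleepstudy,
                    REML = FALSE)
summary(model_full_p)  # coefficient table now includes df and p-values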

However, this isn't really standard practice where mixed models are concerned. Rather, the standard approach for mixed models is to use model comparison.

Model comparison: Background

The core idea behind the model comparison approach is to compare the explanatory power of two or more models, with the goal of identifying the model that best explains variance in \(Y\). Of course, a model with more parameters will always explain more variance than a model with fewer. That is, if \(N\) is the number of observations and \(k\) is the number of parameters, the model will necessarily improve its predictions as \(k\) approaches \(N\). This can result in overfitting: a model whose parameters are too closely tuned to the variance specific to a particular dataset, and which thus generalizes poorly to other datasets. So explaining variance in \(Y\) isn't our only goal: we also want a parsimonious model.

Theoretically, you can compare any two models. Here, we'll be comparing nested models fit to the same data, in which only a single parameter differs, e.g., \(M_{k}\) and \(M_{k-1}\). This allows us to infer the explanatory power of the parameter added in the full model \(M_{k}\): if such a parameter substantially improves the model by some measure of model fit, we can infer that this parameter explains variance in \(Y\).

There are a number of different model comparison methods, each designed with the twin goals of:

  1. Comparing the amount of variance explained by \(M_{k}\) and \(M_{k-1}\).
  2. Enforcing model parsimony; all else being equal, we prefer a simpler model (e.g., \(M_{k-1}\)) over a model with more parameters (\(M_{k}\)).

We'll be using something called the likelihood ratio test11. For more details on likelihood ratio tests, see my tutorial on model comparison strategies more generally.

Model comparison: Implementation

Earlier, we built a model called model_full, which looked like this: Reaction ~ Days + (1 + Days | Subject). Now our question is: does the fixed effect of Days explain a significant amount of variance in Reaction? Another way of putting this is: does Days improve the fit of a model?

To answer this question, we can compare model_full to a model omitting the fixed effect of Days. Fortunately, R wraps all the complicated likelihood ratio test calculations into a single, convenient function call: anova({model_1}, {model_2}).

First, let's build our reduced model, omitting the fixed effect of Days. Apart from the overall intercept, this model doesn't include any fixed effects; it's only modeling by-subject variability in Reaction (the random intercept for Subject), as well as by-subject variability in the effect of Days on Reaction (the random slope). In other words, its substantive terms are all random effects.

model_reduced = lmer(data = sleepstudy,
                  Reaction ~ (1 + Days | Subject),
                  REML = FALSE)

The reduced model will attempt to capture all the variance in \(Y\) that's accounted for by these random effects. Now we can compare our full model to this reduced model using the anova command:

comparison = anova(model_full, model_reduced)
comparison
## Data: sleepstudy
## Models:
## model_reduced: Reaction ~ (1 + Days | Subject)
## model_full: Reaction ~ Days + (1 + Days | Subject)
##               npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)    
## model_reduced    5 1785.5 1801.4 -887.74   1775.5                         
## model_full       6 1763.9 1783.1 -875.97   1751.9 23.537  1  1.226e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
chisq = round(comparison$Chisq[2], 2)
chisq
## [1] 23.54
df = comparison$`Df`[2]
df
## [1] 1
p = comparison$`Pr(>Chisq)`[2]
p
## [1] 1.22564e-06

There's a fair amount of output here. At the top we see the formulas for the two models printed out, just in case we forgot what we were comparing. We also see a bunch of labeled columns. Let's focus on a few of these columns:

  • logLik: this column refers to the log-likelihood values for each of our models, respectively. Note that model_reduced has a more negative value (indicating a lower likelihood) than model_full. Our likelihood ratio statistic can be computed as \(LR = -2 \times (LL_{reduced} - LL_{full})\); a hand computation is sketched just after this list.
  • Chisq: this is the chi-squared statistic we'll be looking up to obtain our p-value. Note that this is identical to the log-likelihood ratio \(LR\) (see Wilks' Theorem).
  • Df: this is the difference in degrees of freedom between our models. model_full only adds one parameter, so as expected, \(Df = 1\).
  • Pr(>Chisq): this is the probability of obtaining a chi-squared statistic that high under the null hypothesis, i.e., our p-value.
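
To make the arithmetic concrete, here's a sketch of computing the likelihood-ratio statistic and its p-value by hand from the two models' log-likelihoods; the values should match the anova() output above.

ll_reduced = as.numeric(logLik(model_reduced))
ll_full = as.numeric(logLik(model_full))
lr_stat = -2 * (ll_reduced - ll_full)        # same as the Chisq column
lr_stat
pchisq(lr_stat, df = 1, lower.tail = FALSE)  # same as Pr(>Chisq)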

The p-value indicates this difference in model likelihoods is quite significant, i.e., this Chisq statistic is very unlikely under the corresponding null distribution.

In model comparison terms, we might interpret the result as follows: adding a fixed effect of Days significantly increases the explanatory power of our model.

If you wanted to report this result in a paper, you'd write something like this:

A model including a fixed effect of Days, as well as by-subject random slopes for the effect of Days (and by-subject random intercepts), explained significantly more variance than a reduced model omitting only the fixed effect of Days \([\chi^{2}(1) = 23.54, p = 1.23 \times 10^{-6}]\).

Note that this model comparison doesn't directly tell us about the direction of the effect of Days. For that, we'd need to inspect the coefficient as usual:

coef_days = fixef(model_full)["Days"]
coef_days
##     Days 
## 10.46729

As in the OLS regression, the coefficient is positive: for every day of sleep deprivation, we expect participants' Reaction to slow by an average of 10.47 ms. The key difference is that this parameter estimate now accounts for nested variance in the data, i.e., individual differences in Reaction overall (random intercepts for each Subject), and individual differences in the effect of Days of sleep deprivation on Reaction (by-participant random slopes for Reaction ~ Days).
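
If you want to see what those subject-specific adjustments look like, coef() returns each subject's own intercept and Days slope (the fixed effect plus that subject's random deviation); here's a quick peek at the first few subjects:

head(coef(model_full)$Subject)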

Interim summary

So far, this walkthrough has attempted to address the following questions:

  1. Why use mixed models? (Answer: to address non-independence in your dataset.)
  2. How do you run a mixed model? (Answer: see above.)
  3. What's the standard practice for interpreting your effects in a mixed model? (Answer: likelihood ratio test12.)

Below, I'll briefly address a few other "best practices" that I'm aware of in the domain of mixed models. I'll also describe in a bit more detail what it means to calculate a random effect.

Best practices

A few best practices and ways of dealing with common questions have emerged over the years. I'll try to address some of these common questions below. Bear in mind that there's not necessarily a single "recipe" to follow when building mixed models (or, indeed, any statistical model); any decision you make has trade-offs. That said, it's important to understand what these trade-offs are so that you make informed decisions.

Which effects are fixed, and which are random?

Sometimes it's very straightforward to determine which variables ought to be entered as random vs. fixed effects. For example, a variable like Subject ID pretty intuitively maps onto the notion of a random effect, as does something like Item or School District. But this distinction isn't always so clear, especially in observational analyses involving many variables you might want to model and control for.

Unfortunately, as Andrew Gelman has pointed out on his blog, there's not even a clear consensus on the distinction between random and fixed effects, which is why he prefers not to use those terms at all. Sometimes the distinction is defined in terms of "things you care about modeling directly" (fixed) vs. "effects related to the population you're interested in modeling" (random), sometimes it's defined in terms of whether a sample exhausts a population (fixed) vs. samples only negligibly from that population (random), and sometimes it's defined in terms of whether the coefficients are calculated using maximum likelihood estimation (fixed) vs. shrinkage (random). As Gelman notes: not only are these definitions different, they're not even at the same level of analysis.

Here are some rules of thumb I use when I encounter this question in my own research; these are by no means the "best" heuristics, so I welcome any comments or criticism about them.

Should I include random factor \(X\) in the model? If \(X\) demonstrates nested structure in my dataset, then I include it. It seems preferable to me to try to account for potential nested variance, and perhaps be overly cautious, than to miss something important--i.e., I try to keep it maximal. That said, you can also use model comparison approaches to determine your random effects structure, as Bates et al (2015) demonstrate. This can be helpful for avoiding overly/unnecessarily complex random factor specifications and the dreaded convergence errors.

Should I enter variable \(X\) as a random or fixed effect? This is a little trickier. For me, it comes down to my a priori assumptions about the relationship between this variable and \(Y\). If I think that \(X\) covaries in important, systematic ways with \(Y\), then I enter it as a fixed effect. If I think that each cluster of nested structure in \(X\) is essentially random with respect to \(Y\)--that is, each group in \(X\) varies, but not systematically, such that we can assume each group has been pulled from some overarching distribution--then I enter it as a random effect. This latter decision actually mirrors how random effects are computed via partial pooling (see the section on "Computing random effects" below).

How do I know how many random effects to include?

Should you include random effects for both Item and Subject ID? Further, should you model them as random intercepts, random slopes, or both?

Standard advice, as described in a famous paper by Barr et al (2013), is to "keep it maximal". Using simulation-based analyses, the authors demonstrate that the best way to minimize false positive rates and improve generalizability is to begin with a maximally specified model: include both slopes and intercepts for all random factors. This can sometimes result in pretty complicated models, such as the one below:

\(Y \sim X1 * X2 + (X1 * X2 | Subject) + (X1 * X2 | Item) + (X1 * X2 | Group)\)

This model specification is essentially saying: calculate by-subject, by-item, and by-group random slopes for the effect of \(X1\), \(X2\), and their interaction, as well as by-subject, by-item, and by-group random intercepts.

This best practice isn't always possible, however. For example, a common issue is that your model won't converge. Given a convergence warning, there are a few approaches you can take:

  1. Rescale and center continuous predictors. Differences in the scales of your predictors can lead to issues estimating standard deviations of your fixed effects.
  2. Use a different optimizer. For example, I sometimes use the bobyqa optimizer, using the following syntax: lmerControl(optimizer="bobyqa") (see the sketch below).
  3. Simplify your model.

If (1) and (2) both fail, then you can take a principled approach to (3). You can use the same model comparison procedure described earlier in the tutorial to identify random factors that actually explain substantial amounts of variance in your model; Bates et al (2015): Parsimonious mixed models argue that this should actually be standard practice in determining your model specification.
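
As a rough sketch of what options (1) and (2) might look like for the sleepstudy model (the rescaled variable and model name here are just illustrative):

# (1) center and rescale a continuous predictor
sleepstudy$Days_z = as.numeric(scale(sleepstudy$Days))

# (2) refit using the bobyqa optimizer
model_bobyqa = lmer(Reaction ~ Days_z + (1 + Days_z | Subject),
                    data = sleepstudy,
                    REML = FALSE,
                    control = lmerControl(optimizer = "bobyqa"))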

How are estimates for random factors computed?

Earlier, I mentioned that estimates for random factors are calculated using something called shrinkage (sometimes called partial pooling or regularization). What does this mean?

Note: The descriptions below draw heavily from several of the sources listed in the References section.

Theoretical background

The short definition is that when we compute our estimates for each group (e.g., intercept and slope), we assume that they are pulled from a normal distribution around the overall mean for those values across all groups. Hence the term "shrinkage": rather than being allowed to vary freely, the estimates are constrained ("shrunk") in terms of the values they can take on13.

We can compare this to two extreme cases, focusing specifically on random intercepts. Let's first consider the case in which we have no random factors--i.e., typical OLS regression. In this case, we're not accounting for any nested structure, and we just compute a single intercept value for the entire dataset. This intercept is "pooled" over all observations:

\(Y \sim \beta_{0} + \beta_{1}X + \epsilon\)

We can also consider the extreme case in which we compute a different intercept for each group (e.g., each subject), and this intercept is allowed to vary freely for each participant \(i\):

\(Y \sim \beta_{0i} + \beta_{1}X + \epsilon\)

In this case, any given \(\beta_{0i}\) will be equivalent to that participant's mean \(Y\) (e.g., mean Reaction).

Finally, we can consider the case of shrinkage: we allow each person's intercept to vary, but we model these intercepts as normally distributed around the group mean.
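
In standard multilevel-model notation (the symbols \(\gamma_{0}\), \(u_{0i}\), and \(\tau_{0}^{2}\) are my labels for the overall mean intercept, subject \(i\)'s deviation from it, and the variance of those deviations), that assumption can be written as:

\(\beta_{0i} = \gamma_{0} + u_{0i}, \quad u_{0i} \sim \mathcal{N}(0, \tau_{0}^{2})\)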

Example

It's helpful to illustrate this procedure with an example. Let's return to our sleepstudy data from earlier, and consider how random intercepts for each Subject are computed in light of what we now know about shrinkage.

First, let's compute the mean Reaction for each subject. This is equivalent to allowing their intercepts to vary freely. We also might want to center this variable around the group mean:

group_mean = mean(sleepstudy$Reaction)

by_subject = sleepstudy %>%
  group_by(Subject) %>%
  summarise(mean_rt = mean(Reaction)) %>%
  mutate(mean_rt_centered = mean_rt - group_mean)
## `summarise()` ungrouping output (override with `.groups` argument)
by_subject %>%
  ggplot(aes(x = mean_rt_centered)) +
  geom_histogram(bins = 20) +
  labs(title = "Mean RT by subject (Centered)",
       x = "Mean RT (Centered)") +
  scale_x_continuous(limits = c(-100, 100)) +
  theme_minimal()

Now let's extract the random intercepts from a model only including our random intercepts:

model_null = lmer(data = sleepstudy,
                  Reaction ~ (1 | Subject),
                  REML = FALSE)
by_subject$random_intercepts = ranef(model_null)$Subject$`(Intercept)`

by_subject %>%
  ggplot(aes(x = random_intercepts)) +
  geom_histogram(bins = 20) +
  labs(title = "Random intercepts for each subject",
       x = "Random intercept") +
  scale_x_continuous(limits = c(-100, 100)) +
  theme_minimal()

Note that this plot looks very similar to the one showing each subject's mean RT. This is exactly what we'd expect. Recall that the intercepts are meant to account for subject-level variance in Reaction; each intercept thus roughly corresponds to that subject's mean Reaction.

However, the random intercepts are also subject to shrinkage, meaning they're assumed to be drawn from a normal distribution around the group mean. We can illustrate exactly what this looks like by plotting each subject's mean Reaction against that subject's random intercept; furthermore, we can shade each point by the difference between the subject's random intercept and their mean RT:

by_subject$difference = abs(by_subject$random_intercepts - by_subject$mean_rt_centered)

# This shows:
# 1) perfectly linear (as you'd expect)
# 2) slight shrinkage (max/min real RT are slightly larger magnitude than random intercept estimates)
# 3) the differences between real RT and random intercept grow larger as you approach the ends of the distribution.
by_subject %>%
  ggplot(aes(x = mean_rt_centered,
             y = random_intercepts,
             color = difference)) +
  geom_point() +
  scale_x_continuous(limits = c(-100, 100)) +
  scale_y_continuous(limits = c(-100, 100)) +
  labs(title = "By-subject intercepts and by-subject mean RTs",
       x = "By-subject mean RT (centered)",
       y = "Random intercept") +
  theme_minimal()

This visualization should illustrate a few points:

  1. First, the relationship is perfectly linear.
  2. Second, there's evidence of slight shrinkage towards the mean. That is, for any given data point, the value on the x-axis tends to be somewhat larger in magnitude than the value on the y-axis.
  3. The point made in (2) is especially true around the margins of the distribution, as evidenced by the shading. That is, points that are extreme at either end (i.e., subjects that are either very fast or very slow on average) are the most subject to shrinkage.

For a more detailed description (and animated visualization!), I recommend checking out this blog post on shrinkage in mixed models.

Contact

If you have any questions, comments, or criticisms about anything in this tutorial, don't hesitate to contact me by email: the username is "sttrott", at ucsd.edu. I'll likely augment this tutorial with more details (or pointers to other resources) over time.

References

Barr, D. J., Levy, R., Scheepers, C., & Tily, H. J. (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of memory and language, 68(3), 255-278. Link

Bates, D., Mächler, M., Bolker, B., & Walker, S. (2014). Fitting linear mixed-effects models using lme4. arXiv preprint arXiv:1406.5823. Link

Bates, D., Kliegl, R., Vasishth, S., & Baayen, H. (2015). Parsimonious mixed models. arXiv preprint arXiv:1506.04967. Link

Clark (2019, May 14). Michael Clark: Shrinkage in Mixed Effects Models. Retrieved from https://m-clark.github.io/posts/2019-05-14-shrinkage-in-mixed-models/ Link

Freeman, M. Hierarchical Models. Link

Gelman, A., & Hill, J. (2006). Data analysis using regression and multilevel/hierarchical models. Cambridge university press.

Piccininni, P. Linear mixed effects models. Link

Winter, B. (2013). Linear models and linear mixed effects models in R with linguistic applications. arXiv preprint arXiv:1308.5499. Link

Citation

For citation, please cite as:

Trott, S. (2019). Mixed models in R. Retrieved from https://seantrott.github.io/mixed_models_R/.

Footnotes


  1. Note that glmer is the command to fit a generalized linear mixed model, which allows you to fit gaussian models (though gaussian family models are now deprecated in favor of just using lmer) and also binomial, poisson, and more.

  2. This is not the probability that the null hypothesis is true, nor is it the probability that your effect will replicate.

  3. One useful way to think about a statistical model is as a description of a phenomenon. There are many ways to describe the same phenomenon, and these distinct descriptions (or explanations) involve different factors.

  4. Of course, one can imagine circumstances under which we'd be worried that this assumption is violated; e.g., maybe the participants assigned to take the drug happened to come from one part of the country, and the participants assigned to take the placebo happened to come from another. If your assignment procedure is truly randomized, this shouldn't be the case, but nonetheless: it's always important to watch out for signs of non-independence.

  5. In case you're wondering, the same problem arises in observational analyses, even when you only have one data point per person. The nesting just occurs at a different level. For example, let's say you're comparing test scores on the SAT across a bunch of individuals. There's likely implicit nested structure in this dataset, such as the School District, School, or even City. You'll need to account for this nested structure in your model specifications.

  6. Another historically popular solution has been to use repeated-measures ANOVAs. One advantage of mixed models is that they allow crossed random effects, e.g., modeling random effects for subjects, items, group, and more in the same model. Repeated-measures ANOVAs aren't so flexible, requiring you to build separate models partitioning out subject-level or item-level variance, then making inferences across those models.

  7. You might wonder: why not just include subject-level variance as a fixed effect? There's always some debate about which variables to include as fixed vs. random effects in your model, and the answer will depend on multiple things. For example, what sorts of things are you actually interested in modeling theoretically? And what sorts of things do you have reason to believe vary systematically with \(Y\) as opposed to randomly? I'll go into more detail on this later in the tutorial, but the quick answer is that adding something as a random effect means that its values (e.g., its intercept) are subject to partial pooling, which is essentially a form of regularization. This places useful constraints on the range of values that these intercepts and slopes can take.

  8. Note that this plot will also draw different intercepts for each subject.

  9. I'm adding the parameter REML = FALSE, which tells lmer to fit the model with ordinary maximum likelihood rather than restricted maximum likelihood (REML). We want maximum likelihood in this case, to facilitate model comparison. Note that if you do run lmer with REML = TRUE and use this model in a model comparison, R will actually just refit the model with REML = FALSE anyway.

  10. Douglas Bates describes it much better than I can here, but the basic idea is that p-values for model parameters are normally calculated by looking up the t-statistics for a parameter in a t-distribution with the appropriate degrees of freedom; this isn't really sensible for mixed models, in which the t-/F-statistics are calculated using a fixed denominator based on the overall model specifications.

  11. Other well-known approaches include AIC and BIC, both of which also involve calculating the likelihood of a model.

  12. Again, there are alternatives, such as AIC or BIC. R's anova function conveniently gives us these values as well.

  13. This is why "shrinkage" is also sometimes called "regularization". Regularization is an approach to reducing the problem of overfitting. Well-known regularization algorithms for regression are ridge and LASSO, which place constraints on the values your coefficients can adopt, and penalize overly large coefficients (or having too many large coefficients).