Analysis of experimental data usually involves asking whether there is a larger difference between experimental groups than you’d expect from random sampling error alone.
For example, you might want to know whether the metaphor you use to describe happiness affects people’s Happiness Score. You use two metaphors: PURSUIT (e.g., “in search of happiness”) and SUBSTANCE (e.g., “full of joy”).
With only two metaphors, you can use a t-test to determine whether there’s a difference between the groups. But sometimes you want to compare more than two groups. For example, maybe you add a third metaphor: FLIGHT (“flying high”). You could, of course, run a series of pairwise t-tests, but then you’d run into a multiple comparisons problem. Thus, you need a way to ask: is there more variability between groups than within groups?
This is exactly what an ANOVA (“Analysis of Variance”) aims to do. As with a t-test, an ANOVA tests a null hypothesis: namely, that the means of each group are equal. We can express this as follows:
\(H_0: \mu_1 = \mu_2 = \mu_3 = ... = \mu_k\)
Where \(k\) is the number of groups.
\(H_0\) is put to the test by comparing the variance between groups to the variance within groups. This comparison is called the F-value (or F-statistic):
\(F = \frac{\text{Between-group variance}}{\text{Within-group variance}}\)
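For reference, each of these variances is standardly computed as a “mean square”: a sum of squared deviations divided by its degrees of freedom. Written out in that notation (which the rest of this section keeps verbal):

\(F = \frac{MS_{\text{between}}}{MS_{\text{within}}} = \frac{SS_{\text{between}} / (k - 1)}{SS_{\text{within}} / (N - k)}\)

Here \(SS_{\text{between}}\) sums the squared deviations of each group mean from the grand mean (the mean of all observations pooled together), \(SS_{\text{within}}\) sums the squared deviations of each observation from its own group mean, and \(N\) is the total number of observations.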
If there is a treatment effect of some variable \(X\) on our dependent variable \(Y\), that should show up as variance between groups. If there isn’t, \(H_0\) predicts that the variance between groups should be about the same as the variance within groups.
One way to think about a null effect here is that in the absence of a treatment effect, the \(\text{Between-group variance}\) should be equivalent to random error, i.e., the same as our \(\text{Within-group variance}\).
Imagine if we sampled people from the population and randomly assigned them to three groups. Each group was assigned a different color hat: red, green, and blue. Now imagine that we measure each individual’s blood pressure after they receive their hat.
My intuition is that the color of the hat each person was given should have no relationship to their blood pressure. Accordingly, our different “groups” shouldn’t show much variability at all. Any variability they do show should just be a function of random error, the same as the variability within each group. That is, the means of each group \(\{\mu_{\text{red}}, \mu_{\text{green}}, \mu_{\text{blue}}\}\) should be approximately the same.
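To make that intuition concrete, here’s a quick base-R sketch of the hypothetical hat experiment (the sample size, means, and variable names are all invented for illustration):

set.seed(42)

# 30 people per hat color; blood pressure drawn from one common distribution,
# i.e., the hat color truly has no effect
hat = rep(c("red", "green", "blue"), each = 30)
blood_pressure = rnorm(90, mean = 120, sd = 10)

tapply(blood_pressure, hat, mean)  # three group means, all hovering around 120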
Recall that the denominator of our \(F\) equation, \(\text{Within-group variance}\), is meant to reflect the amount of variability we expect by chance. Thus, we can also think of this as being equal to \(\text{Random error}\).
Now, if our groups aren’t meaningfully different, then our \(\text{Between-group variance}\) should also be equal to \(\text{Random error}\). But if they are, then it should be equal to \(\text{Treatment effect} + \text{Random error}\).
Thus, we can rewrite our hypotheses as follows:
\(H_0: F = \frac{\text{Random error}}{\text{Random error}}\)
\(H_1: F = \frac{\text{Treatment effect + Random error}}{\text{Random error}}\)
Thus, logically, \(H_0\) predicts a value for \(F\) that’s close to \(1\), while \(H_1\) predicts a value for \(F\) that’s larger than \(1\).
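We can check that logic with a small simulation (a sketch of my own, not part of the metaphor example): repeatedly generate data in which the groups genuinely don’t differ, run a one-factor ANOVA with R’s built-in aov(), and record the F-value each time.

set.seed(1)

null_f = replicate(2000, {
  # three groups of 20, all drawn from the same distribution: no treatment effect
  fake = data.frame(group = factor(rep(c("a", "b", "c"), each = 20)),
                    y = rnorm(60))
  # pull the F-value out of the one-factor ANOVA table
  summary(aov(y ~ group, data = fake))[[1]][["F value"]][1]
})

mean(null_f)  # averages out close to 1, just as H0 predicts

If we instead built a real treatment effect into the simulation (say, by shifting one group’s mean upward), the recorded F-values would tend to land well above 1.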
At its core, then, a one-factor ANOVA asks whether we should reject the null hypothesis that there is no difference between our groups.
Let’s return to the example at the beginning. We assign participants to one of three metaphor conditions: PURSUIT, FLIGHT, and SUBSTANCE. Each participant fills out a survey indicating their Happiness Score (out of 100). There are four participants in each condition.
Our data is as follows:
library(tidyverse)  # for %>%, gather(), and the dplyr/ggplot2 calls below

# Four Happiness Scores per metaphor condition, reshaped to long format
df_metaphor = data.frame(
  pursuit = c(95, 90, 97, 95),
  flight = c(85, 89, 92, 89),
  substance = c(75, 77, 79, 80)
) %>%
  gather(key = "metaphor", value = "happiness")
This yields the following group means:
# Mean and total Happiness Score for each metaphor condition
df_sum = df_metaphor %>%
  group_by(metaphor) %>%
  summarise(mean_happiness = mean(happiness),
            total_happiness = sum(happiness))

df_sum
## # A tibble: 3 x 3
##   metaphor  mean_happiness total_happiness
##   <chr>              <dbl>           <dbl>
## 1 flight              88.8             355
## 2 pursuit             94.2             377
## 3 substance           77.8             311
Our group means and totals certainly look different. We can plot them out to visualize these differences:
df_metaphor %>%
  ggplot(aes(x = metaphor,
             y = happiness)) +
  geom_boxplot() +
  scale_y_continuous(limits = c(0, 100)) +
  theme_minimal()
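Before testing this formally, it can help to see the F-value assembled “by hand” from the definitions above. The sketch below reuses df_metaphor and the tidyverse functions already loaded; the intermediate variable names (ss_between, ss_within, and so on) are just my own labels.

k = 3                                    # number of groups
n = nrow(df_metaphor)                    # total number of observations
grand_mean = mean(df_metaphor$happiness)

# Between-group sum of squares: how far each group mean sits from the grand mean
ss_between = df_metaphor %>%
  group_by(metaphor) %>%
  summarise(group_mean = mean(happiness),
            group_n = n()) %>%
  summarise(ss = sum(group_n * (group_mean - grand_mean)^2)) %>%
  pull(ss)

# Within-group sum of squares: how far each observation sits from its own group mean
ss_within = df_metaphor %>%
  group_by(metaphor) %>%
  mutate(group_mean = mean(happiness)) %>%
  ungroup() %>%
  summarise(ss = sum((happiness - group_mean)^2)) %>%
  pull(ss)

# Divide each sum of squares by its degrees of freedom, then take the ratio
f_value = (ss_between / (k - 1)) / (ss_within / (n - k))
f_value

For this data, the ratio should come out well above 1, which is exactly what the boxplot would lead you to expect.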