Sean Trott

The goal of this tutorial is to give both a conceptual and technical introduction to correlations and linear regression.

Part 1: Correlation

A correlation is a measure between \([-1.0, 1.0]\) of the linear relationship between two quantitative variables \(X\) and \(Y\). Examples include:

  • The relationship between a parent’s height and a child’s height.
  • The relationship between country GDP and average years of education.
  • The relationship between average temperature and CO2 concentration.
  • The relationship between the number of hours slept per night and GPA.
  • The relationship between stress level and reaction time.

In this tutorial, we’ll be focusing on Pearson’s r as a measure of correlation.
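For reference, Pearson's r for \(n\) paired observations is the covariance of \(X\) and \(Y\) scaled by the product of their standard deviations:

\[
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
\]

It is this scaling that bounds \(r\) to \([-1, 1]\).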

Conceptual background: why compute correlations?

Put simply, a correlation allows you to infer: when \(X\) goes up, does \(Y\) tend to go up or down? A positive correlation (\(r > 0\)) means that when \(X\) goes up, \(Y\) also tends to go up; a negative correlation (\(r < 0\)) means that when \(X\) goes up, \(Y\) tends to go down.

Furthermore, the magnitude of \(r\) tells us about the degree of relationship. A larger absolute value (closer to \(1\) or \(-1\)) generally indicates that \(Y\) tends to change in linear, systematic ways with respect to \(X\); that is, that variance in \(X\) has corresponding, systematic (either positive or negative) variance in \(Y\).
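To make the sign and magnitude concrete, here is a small sketch using base R's cor() function. The noisy example uses simulated data, so the exact value depends on the seed:

x = c(1:100)

# Perfect positive and perfect negative relationships
cor(x, x)    # r = 1
cor(x, -x)   # r = -1

# Adding random noise weakens the relationship:
# r stays positive but drops below 1
set.seed(42)
cor(x, x + rnorm(100, sd = 30))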

An example of a perfect positive correlation can be illustrated by plotting two identical datasets, i.e., \(X = Y\):

library(tidyverse)  # for ggplot2 and %>%

x = c(1:100)
y = c(1:100)

df = data.frame(X = x,
                Y = y)

df %>%
  ggplot(aes(x = X,
             y = Y)) +
  geom_point() +
  labs(title = "Perfect positive correlation")
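To verify this numerically, we can rebuild the same vectors and compute the correlation with base R's cor() (a quick check on the plotted data, not part of the plot itself):

x = c(1:100)
y = c(1:100)
cor(x, y)  # a perfect positive correlation: r = 1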