Sean Trott

The goal of this tutorial is to give both a conceptual and technical introduction to correlations and linear regression.

Part 1: Correlation

A correlation is a measure between \([-1.0, 1.0]\) of the linear relationship between two quantitative variables \(X\) and \(Y\). Examples include:

  • The relationship between a parent’s height and a child’s height.
  • The relationship between country GDP and average years of education.
  • The relationship between average temperature and CO2 concentration.
  • Number of hours slept per night and GPA.
  • Stress level and reaction time.

In this tutorial, we’ll be focusing on Pearson’s r as a measure of correlation.

Conceptual background: why compute correlations?

Put simply, a correlation allows you to infer: when \(X\) goes up, does \(Y\) tend to go up or down? A positive correlation (\(r > 0\)) means that when \(X\) goes up, \(Y\) also tends to go up; a negative correlation (\(r < 0\)) means that when \(X\) goes up, \(Y\) tends to go down.

Furthermore, the magnitude of \(r\) tells us about the degree of relationship. A larger value (closer to \(1\) or \(-1\)) generally indicates that \(Y\) tends to change in linear, systematic ways with respect to \(X\)—i.e., that any variance in \(X\) has corresponding, systematic (either positive or negative) variance in \(Y\).
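
To make that concrete, here's a minimal sketch with made-up data (x_demo is just an illustrative variable): the noisier the relationship, the smaller the magnitude of \(r\).

## A variable plus a little noise: r should be fairly close to 1
set.seed(1)
x_demo = rnorm(100)
cor(x_demo, x_demo + rnorm(100, sd = .5))

## The same variable plus a lot of noise: r should be much closer to 0
cor(x_demo, x_demo + rnorm(100, sd = 3))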

An example of a perfectly positive correlation can be illustrated by plotting two identical datasets, i.e., \(X = Y\):

library(tidyverse)  ## for the %>% pipe, gather(), and ggplot2

x = c(1:100)
y = c(1:100)

df = data.frame(X = x,
                Y = y)

df %>%
  ggplot(aes(x = X,
             y = Y)) +
  geom_point() +
  labs("Perfect positive correlation") +
  theme_minimal()

Sure enough, \(r = 1\) for this dataset:

cor(df$X, df$Y)
## [1] 1

Importantly, this is true even if we apply linear transformations to \(Y\), such as multiplying everything by 2 or dividing everything by 2. All of the correlations visualized below have an \(r = 1\)!

df$multiply = df$Y * 2
df$divide = df$Y / 2

df_gathered = df %>%
  gather(y_type, value, -X)

df_gathered %>%
  ggplot(aes(x = X,
             y = value,
             color = y_type)) +
  geom_point() +
  theme_minimal()

cor(df$X, df$multiply)
## [1] 1
cor(df$X, df$divide)
## [1] 1

We can also create a perfectly negative correlation by multiplying everything in \(Y\) by \(-1\):

df$Y_neg = df$Y * -1

df %>%
  ggplot(aes(x = X,
             y = Y_neg)) +
  geom_point() +
  labs("Perfect negative correlation") +
  theme_minimal()

cor(df$X, df$Y_neg)
## [1] -1

Obligatory caveat: Correlation != Causation

You’ve likely heard this before, but it never hurts to repeat: finding evidence of a linear correlation between two variables does not imply that one causes the other.

Causal inference and causality more broadly is a heady topic beyond the scope of this tutorial. But it’s important to remember that just because \(X\) and \(Y\) systematically covary, does not mean \(X\) causes \(Y\). The relationship could be driven by a series of mediating variables, or may even be entirely spurious.

How do you compute correlation?

The formula for computing Pearson’s r is as follows:

\(r(X, Y) = \frac{SP_\text{XY}}{\sqrt{SS_X*SS_Y}}\)

Where \(SP_\text{XY}\) is the sum of products of \(X\) and \(Y\), and \(SS\) refers to the sum of squares for \(X\) and \(Y\), respectively. Let’s walk through computing each of these terms, using the following observations as our starting point:

a = c(1, 3, 3, 5, 7, 8, 10, 10)
b = c(2, 2, 4, 6, 9, 10, 11, 9)

plot(a, b)

Sum of squares

The sum of squares is a way of quantifying the total squared deviation of your observations from their mean.

To calculate it (e.g., for \(SS_X\)), follow these steps:

  1. Calculate the mean of \(X\) (\(\bar{X}\)).
  2. For each data point in \(X\), subtract \(\bar{X}\).
  3. Square each of these scores.
  4. Sum the squared difference scores.

Here’s the series of calculations in R:

## Get our difference scores
a_diff = a - mean(a)
b_diff = b - mean(b)

## Square the scores
a_sq = a_diff ** 2
b_sq = b_diff ** 2

## Sum them
ss_a = sum(a_sq)
ss_b = sum(b_sq)

Sum of products

Here’s the step-by-step approach. You should see some parallels to calculating the sum of squares!

  1. Calculate the mean of \(X\) (\(\bar{X}\)) and the mean of \(Y\) (\(\bar{Y}\)).
  2. For each data point in \(X\), subtract \(\bar{X}\); for each data point in \(Y\), subtract \(\bar{Y}\).
  3. Now, for each of these pairs of difference scores \((x_i, y_i)\), take the product.
  4. Sum these products together.

And here’s the series of calculations in R:

## Get our difference scores
a_diff = a - mean(a)
b_diff = b - mean(b)

## Take the product
ab_product = a_diff * b_diff

## Sum the products
sp = sum(ab_product)

Putting it all together

Now, we divide \(SP_\text{XY}\) by \(\sqrt{SS_X*SS_Y}\):

r = sp/(sqrt(ss_a*ss_b))
r
## [1] 0.9469289

We can double-check this against R’s built-in correlation function:

R_r = cor(a, b)
R_r
## [1] 0.9469289

Building an intuition

Personally, I find it helpful to think about why one would use this formula. Why this series of steps and not another?

First let’s think about what the sum of products is calculating. Essentially, we’re asking: does each pair \((x_i, y_i)\) vary in similar ways with respect to the means of \(X\) and \(Y\)? Notably, the sum of products will only be positive if most of the deviations have the same sign—i.e., if \(X\) goes up when \(Y\) goes up. Moreover, the sum of products will have a larger absolute value as more of the values change together. If \(X\) and \(Y\) are exactly the same, then each pair \((x_i, y_i)\) will vary in identical ways from their respective means.
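
As a quick sanity check, here's a minimal sketch reusing the a and b vectors from above: flipping the sign of one variable flips the sign of the sum of products.

## Sum of products for the original, positively related data
sp_pos = sum((a - mean(a)) * (b - mean(b)))

## Flip the sign of b: large values of a now pair with small values of -b,
## so most of the deviation products become negative
sp_neg = sum((a - mean(a)) * (-b - mean(-b)))

sp_pos
## [1] 81.625
sp_neg
## [1] -81.625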

But now we need to normalize this value against something else. It’s hard to interpret the sum of products value without knowing how much \(X\) and \(Y\) vary in general. This is exactly what we can quantify with the sum of squares.

\(SS_X\) and \(SS_Y\) are measures of how much observations in \(X\) and \(Y\) vary from their respective means. If all values in \(X\) are the same—i.e., there is no variability—then \(SS_X\) will be 0. The more variable these observations are, the larger sum of squares will be. Multiplying \(SS_X\) and \(SS_Y\) gives us a measure of the combined variability in both measures; we take the square root of that to put it on the same playing field as \({SP_\text{XY}}\).

This may also be clear by considering, once again, the case where \(X = Y\). This will mean that \(SS_X = SS_Y = SP_\text{XY}\), which in turn means that \(r = 1\).
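
We can verify this numerically with the df from earlier, where \(X = Y\) (a minimal check):

## When X = Y, the sum of squares and the sum of products are all identical
ss_x = sum((df$X - mean(df$X))**2)
ss_y = sum((df$Y - mean(df$Y))**2)
sp_xy = sum((df$X - mean(df$X)) * (df$Y - mean(df$Y)))

## ...so the ratio reduces to SS_X / SS_X = 1
sp_xy / sqrt(ss_x * ss_y)
## [1] 1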

Limitations of correlation

Measurements of linear correlation have a number of limitations. As noted above, linear correlation doesn’t tell us anything about causality.

But even disregarding theoretical inferences these measures afford, there are other limitations:

  1. Linear correlations will not capture trends in non-linear data.
  2. \(r\) can be easily distorted by outliers (see the sketch after this list).
  3. Restricted ranges (e.g., accidentally sampling from a biased portion of the underlying population) will strongly bias \(r\).
  4. \(r\) doesn’t actually give a measure of the central tendency in bivariate data—i.e., the slope of the line.
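
To illustrate (2), here's a minimal sketch with made-up data: a single extreme point added to two otherwise unrelated variables can dramatically inflate \(r\).

## Two unrelated variables: r should be close to 0
set.seed(42)
x_out = rnorm(20)
y_out = rnorm(20)
cor(x_out, y_out)

## Add one extreme point (10, 10) to both: r jumps much closer to 1
cor(c(x_out, 10), c(y_out, 10))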

The problem in (4) is highlighted in the figure from earlier, showing how the same \(r\) value can be achieved from very different relationships. Here’s the figure again, just to reinforce the idea:

df_gathered %>%
  ggplot(aes(x = X,
             y = value,
             color = y_type)) +
  geom_point() +
  theme_minimal()

These three relationships all seem pretty different, right? But for all of them, our value for \(r\) is 1. Thus, if all we were looking at was our \(r\) value, we’d never know how different they were.

This is exactly where linear regression comes in1.

Part 2: Linear regression

Linear regression is a way to capture the central tendency in the relationship between two or more variables. Specifically, linear regression computes the slope of the line that best fits some set of variables. For the purposes of this tutorial, we’ll be limiting our discussion to linear regression with only one predictor variable.

This addresses one of the key limitations mentioned above—namely, that \(r\) doesn’t really tell us anything about the actual predicted value of \(Y\) for a certain value of \(X\). It just tells us that in general, \(Y\) increases or decreases when \(X\) increases, and how systematic that trend is.

In contrast, linear regression aims to quantify the predicted value for \(Y\), which we’ll refer to here as \(Y'\), for values of \(X\). This is achieved by identifying a set of parameters which, when used in a linear function, give us \(Y'\) for some value of \(X\). The function looks like this2:

\(Y' = bX + a\).

Here, \(b\) refers to the slope of the line, and \(a\) refers to the intercept (the predicted value for \(Y\) when \(X = 0\)). These terms will hopefully become clear in a moment, but first let’s discuss the assumptions of linear regression.

Assumption 1: Linearity

First, the relationship in our data is assumed to be linear. Of course, you can fit a line to any set of data, but if the relationship is clearly curvilinear then a line won’t characterize that relationship very well.

Here’s an example of a clearly linear relationship:

df %>%
  ggplot(aes(x = X,
             y = Y)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "Y = X",
       y = "Y") +
  theme_minimal()

The problem with fitting a line to nonlinear data is illustrated below by plotting a regression line over a relationship that’s fundamentally nonlinear:

df$Y_quad = df$X ** 2

df %>%
  ggplot(aes(x = X,
             y = Y_quad)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "Y = X ^ 2",
       y = "X ^ 2") +
  theme_minimal()

The problem is even worse for an exponential relationship:

df$Y_exp = 1.1 ** df$X

df %>%
  ggplot(aes(x = X,
             y = Y_exp)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "Y = 1.1 ^ X",
       y = "1.1 ^ X") +
  theme_minimal()

So just as with linear correlations, it’s important to plot your data to make sure you’re not dealing with nonlinear relationships!

Assumption 2: Homoscedasticity

Second, data points are assumed to be distributed evenly across the regression line. This property is called homoscedasticity3.

The opposite of this is called heteroscedasticity, meaning that our values for \(Y\) aren’t distributed evenly across the regression line. A classic example of this is the relationship between income and weekly expenditures. Overall, this relationship is usually positive, and will thus yield a positive slope (and positive \(r\)). But while lower-income individuals might be relatively consistent in their spending, there’s sometimes much more variability among higher-income individuals—i.e., some individuals spend a huge amount of money each week, and others spend a much more modest amount. This means that our values for \(Y\) don’t have equal variance across the regression line.

Heteroscedasticity can be identified by a signature cone-shape in a regression plot:

df$Y_heteroscedasticity = df$X * rnorm(mean = 1, sd = .2, n = 100)

df %>%
  ggplot(aes(x = X,
             y = Y_heteroscedasticity)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "Y = 1.1 ^ X",
       y = "1.1 ^ X") +
  theme_minimal()

Heteroscedasticity doesn’t necessarily bias our coefficient estimates for ordinary least squares regression, but it can bias our estimates of the variance in a given relationship, meaning that the standard errors for our coefficients are suspect.
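
One common diagnostic is to plot the model's residuals against its fitted values and look for that same cone shape. Here's a minimal sketch using the simulated Y_heteroscedasticity variable from above (model_het is just an illustrative name):

model_het = lm(data = df, Y_heteroscedasticity ~ X)

## Residuals vs. fitted values: under homoscedasticity, the vertical spread
## should look roughly constant across the x-axis
data.frame(fitted = fitted(model_het),
           residual = resid(model_het)) %>%
  ggplot(aes(x = fitted,
             y = residual)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dotted") +
  theme_minimal()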

Assumption 3: Independence

A very important assumption of ordinary least squares regression (the method we’ll be discussing today) is that each data point is independent: i.e., generated by an independent causal process. I won’t go into that here, but see my tutorial on mixed models in R for more background on independence.

Finding the line of best fit

Intuitively, many of us could probably draw a pretty good line through a set of data points. But what does it mean to find the best-fitting line?

The basic idea is that we want to find values for our parameters \(b\) and \(a\) that minimize the difference between our predicted values \(Y'\) and our actual values \(Y\).

The analytical solution is as follows:

\(b = \frac{SP_\text{XY}}{SS_X}\)
\(a = \bar{Y} - b\bar{X}\)

Alternatively, \(b\) can also be found via:

\(b = \sqrt{\frac{SS_Y}{SS_X}} * r\)
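
As a quick check of this identity, here's a minimal sketch reusing sp, ss_a, ss_b, and r from the correlation walkthrough above; both expressions give the same slope.

## b as the sum of products over the sum of squares of X
sp / ss_a

## b as the square root of the ratio of the sums of squares, scaled by r
sqrt(ss_b / ss_a) * r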

Walkthrough

Consider the following \(X\) and \(Y\) variables:

x = c(2, 4, 9, 10, 11, 14, 14, 15, 16, 19, 22)
y = c(5, 6, 10, 14, 15, 20, 22, 22, 23, 27, 33)

df_new = data.frame(x = x,
                    y = y)

df_new %>%
  ggplot(aes(x = x,
             y = y)) +
  geom_point() +
  theme_minimal()

By eye, we could probably draw a pretty straight line between \(X\) and \(Y\). But what would our parameter values for the line of best fit be?

First, we need to find \(b\). Recall that \(b = \frac{SP_\text{XY}}{SS_X}\).

sp_xy = sum((x - mean(x)) * (y - mean(y)))
ss_x = sum((x - mean(x))**2)

b_value = sp_xy / ss_x
b_value
## [1] 1.44574

This means: as \(X\) increases by 1, our predicted value of \(Y\) increases by roughly 1.45.

Now we can find \(a\), which is \(\bar{Y} - b\bar{X}\):

a_value = mean(y) - b_value * mean(x)
a_value
## [1] 0.03448276

This means: when \(X = 0\), our predicted value of \(Y\) is about 0.03.

Just to check, let’s validate against the results from R:

lm(y ~ x)
## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##     0.03448      1.44574

Building our intuition

Making predictions

We can use our fit parameters to generate predictions. This allows us to:

  • Evaluate how well our model captures the actual data in our dataset.
  • Make predictions about data we’ve never seen before.

To generate predictions, simply plug in a value of \(X\) into the equation with the fit parameters. Let’s first try this with our original set of \(X\) points:

df_new$y_pred = b_value * x + a_value

We can plot our original \(Y\) values against the predicted values, and we see that it’s pretty good!

df_new %>%
  ggplot(aes(x = y_pred,
             y = y)) +
  geom_point() +
  theme_minimal()

We can also visualize the differences (i.e., the error) between our predictions and the actual values (the dotted line is at 0, reflecting 0 prediction error). Again, we see that we do pretty well. Most of our predictions are within a couple units of our actual values:

df_new$diff = df_new$y_pred - df_new$y

df_new %>%
  ggplot(aes(x = diff)) +
  geom_histogram() +
  geom_vline(xintercept = 0, linetype = "dotted") +
  theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Finally, we can also make predictions about new data points. When these data points fall within the original range of \(X\) (here, 2 to 22), this is called interpolation. When they fall outside that range, it’s called extrapolation. In both cases, you can just think of this as continuing the line that we fit—either to fill in missing values, or to extend it past the range of our initial observations. Note, however, that extrapolation can be dangerous: there’s no guarantee that the same linear relationship will hold for larger or smaller values of \(X\).
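
For example, here's a minimal sketch reusing b_value and a_value (the specific X values are arbitrary):

## Interpolation: X = 12 lies inside the observed range (2 to 22)
b_value * 12 + a_value

## Extrapolation: X = 40 lies well outside the observed range, so this
## prediction assumes the same linear trend continues
b_value * 40 + a_value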

Correlations and regressions in R!

The above steps show you how to compute correlations and linear regressions essentially “by hand” (with R as a calculator). But R has many of these functions built in. For example, to compute a linear regression, simply use the lm function:

model = lm(data = df_new,
           y ~ x)
summary(model)
## 
## Call:
## lm(formula = y ~ x, data = df_new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0461 -0.4977 -0.1663  0.7193  2.0740 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.03448    1.05835   0.033    0.975    
## x            1.44574    0.07772  18.603 1.72e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.472 on 9 degrees of freedom
## Multiple R-squared:  0.9747, Adjusted R-squared:  0.9718 
## F-statistic: 346.1 on 1 and 9 DF,  p-value: 1.718e-08

This model can then be used to generate predictions about novel data using the predict function:

df_newdata = data.frame(x = c(50:60))

df_newdata$y_pred = predict(model, newdata = df_newdata)

df_newdata %>%
  ggplot(aes(x = x,
             y = y_pred)) +
  geom_point() +
  theme_minimal()

Similarly, the cor function can be used to find correlations between variables:

cor(df_new$x, df_new$y)
## [1] 0.9872449

Conclusion

This tutorial only scratches the surface of correlations and linear regression. Linear regression, in particular, is in my opinion a great starting point for learning more about statistics. It involves building a model of your data with certain assumptions (e.g., linearity), which allows you to fit parameters and make novel predictions; it also gives you a grounds for thinking about model fit, i.e., how well your model describes your data, by comparing your predictions to the actual \(Y\) values.
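
For example, one common summary of model fit is \(R^2\): the proportion of variance in \(Y\) accounted for by the model. Here's a minimal sketch computing it from the predictions we generated earlier; it should match the Multiple R-squared value (about 0.97) reported by summary(model) above.

## Residual sum of squares: squared differences between actual and predicted Y
ss_res = sum((df_new$y - df_new$y_pred)**2)

## Total sum of squares: squared deviations of the actual Y values from their mean
ss_total = sum((df_new$y - mean(df_new$y))**2)

## Proportion of variance explained
1 - ss_res / ss_total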

Footnotes


  1. Note that linear regression still won’t solve the problem of nonlinear data. Anscombe’s Quartet highlights the limitations of assuming linearity when analyzing bivariate data.

  2. You might also see it written like this, depending on the context: \(Y' = \beta_1X + \beta_0\).

  3. I sometimes wonder whether the length and difficulty of remembering this word contributes to the difficulty in understanding the concept.