*Sean Trott*

The goal of this tutorial is to give both a **conceptual** and **technical** introduction to correlations and linear regression.

# Part 1: Correlation

A **correlation** is a measure, bounded in \([-1.0, 1.0]\), of the linear relationship between two quantitative variables \(X\) and \(Y\). Examples include:

- The relationship between a parent’s height and a child’s height.
- The relationship between country GDP and average years of education.
- The relationship between average temperature and CO2 concentration.
- The relationship between the number of hours slept per night and GPA.
- The relationship between stress level and reaction time.

In this tutorial, we’ll be focusing on **Pearson’s r** as a measure of correlation.
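Base R's `cor()` function computes Pearson's r by default. As a quick sketch (using two small, made-up vectors):

```
# Two hypothetical vectors of paired observations
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# cor() uses Pearson's method unless told otherwise
r <- cor(x, y, method = "pearson")
r  # ~0.77: a fairly strong positive correlation
```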

## Conceptual background: why compute correlations?

Put simply, a correlation allows you to infer: when \(X\) goes up, does \(Y\) tend to go up or down? A **positive** correlation (\(r > 0\)) means that when \(X\) goes up, \(Y\) also tends to go up; a **negative** correlation (\(r < 0\)) means that when \(X\) goes up, \(Y\) tends to go down.

Furthermore, the **magnitude** of \(r\) tells us about the degree of relationship. A larger value (closer to \(1\) or \(-1\)) generally indicates that \(Y\) tends to change in linear, systematic ways with respect to \(X\)—i.e., that any variance in \(X\) has corresponding, systematic (either positive or negative) variance in \(Y\).
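One way to see this is to compare a relationship with a little noise to one with a lot of noise. In the simulated sketch below (the `sd` values are arbitrary choices for illustration), both correlations are positive, but the noisier relationship yields a smaller \(r\):

```
set.seed(42)  # for reproducibility
x <- 1:100

# Y changes systematically with X, plus a little vs. a lot of random noise
y_clean <- x + rnorm(100, sd = 5)
y_noisy <- x + rnorm(100, sd = 50)

cor(x, y_clean)  # close to 1
cor(x, y_noisy)  # still positive, but smaller in magnitude
```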

An example of a perfectly positive correlation can be illustrated by plotting two identical datasets, i.e., \(X = Y\):

```
library(tidyverse)  # for ggplot2 and the %>% pipe

x <- c(1:100)
y <- c(1:100)
df <- data.frame(X = x,
                 Y = y)

df %>%
  ggplot(aes(x = X,
             y = Y)) +
  geom_point() +
  labs(title = "Perfect positive correlation") +
  theme_minimal()
```
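We can also confirm numerically that this relationship has \(r = 1\) (rebuilding the same data frame for a self-contained check):

```
df <- data.frame(X = 1:100, Y = 1:100)

# X and Y are identical, so Pearson's r is exactly 1
cor(df$X, df$Y)
```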