- Fitting models with lm.
- Interpreting lm model outputs: coefficients, p-values, and \(R^2\).

What are “models”, and why should we build them?
A statistical model is a mathematical model representing a data-generating process.
Statistical models help us understand our data, and also predict new data.
Note
Both serve a useful function, and both are compatible! In general, models can help you think more clearly about your data.
A statistical model often represents a function mapping from \(X\) (inputs) to \(Y\) (outputs).
\(Y = \beta X + \epsilon\)
💭 Check-in
Think of a research question from your domain. What would the \(X\) and \(Y\) be if it were “translated” into a statistical model?
No model is perfect; all models have some amount of prediction error, typically called residuals or error.

In general, there is a trade-off between a model's flexibility and its interpretability.
Linear equation, basic premise, key assumptions.
The goal of linear regression is to find the line of best fit between some variable(s) \(X\) and the continuous dependent variable \(Y\).
Given some bivariate data, there are many possible lines we could draw. Each line is defined by the linear equation:
\(Y = \beta_1 X_1 + \beta_0\)
To illustrate this, let’s simulate some data:
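The simulation code isn't shown here; a minimal sketch (assuming n = 30 points with a true intercept near 3 and slope near 0.5, to match the candidate lines plotted below) might look like this:

# Simulate bivariate data (assumed values: n = 30, intercept 3, slope 0.5)
set.seed(42)
df <- data.frame(x = runif(30, min = 0, max = 10))
df$y <- 3 + 0.5 * df$x + rnorm(30, mean = 0, sd = 2)
head(df)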

Now let’s plot different lines with the same slope but different intercepts.
# Create the plot
ggplot(df, aes(x = x, y = y)) +
# Add several "possible" lines
geom_abline(intercept = 2, slope = 0.5, color = "gray60", linetype = "dashed", linewidth = 1) +
geom_abline(intercept = 3, slope = 0.5, color = "gray60", linetype = "dashed", linewidth = 1) +
geom_abline(intercept = 4, slope = 0.5, color = "gray60", linetype = "dashed", linewidth = 1) +
# Add data points
geom_point(color = "steelblue", size = 3, alpha = 0.7) +
theme_minimal() 
We can also try the same intercept but different slopes.
# Create the plot
ggplot(df, aes(x = x, y = y)) +
# Add several "possible" lines
geom_abline(intercept = 3, slope = 0.75, color = "gray60", linetype = "dashed", linewidth = 1) +
geom_abline(intercept = 3, slope = 0.5, color = "gray60", linetype = "dashed", linewidth = 1) +
geom_abline(intercept = 3, slope = 0.25, color = "gray60", linetype = "dashed", linewidth = 1) +
# Add data points
geom_point(color = "steelblue", size = 3, alpha = 0.7) +
theme_minimal() 
The line of best fit minimizes the residual error, i.e., the difference between the predictions (the line) and the actual values.
\(RSS = \sum_{i=1}^{N} (\hat{y}_i - y_i)^2\)
Tip
Intuition: A “better” line is one that has smaller differences between the predicted and actual values.
The mean-squared error (MSE) is the average squared error (as opposed to the sum).
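In symbols, using the RSS defined above:

\(MSE = \frac{RSS}{n} = \frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i - y_i)^2\)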
We can compare the MSE for two different lines for the same data.
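Here is a sketch of that comparison, assuming the simulated df from above and two of the candidate lines plotted earlier (intercept 3 with slopes 0.5 and 0.25):

# Predictions from two candidate lines
pred_a <- 3 + 0.5 * df$x
pred_b <- 3 + 0.25 * df$x
# Mean squared error for each line: the average squared residual
mse_a <- mean((df$y - pred_a)^2)
mse_b <- mean((df$y - pred_b)^2)
c(mse_a = mse_a, mse_b = mse_b)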
The standard error of the estimate is a measure of the expected prediction error, i.e., how much your predictions are “wrong” on average.
\(S_{Y|X} = \sqrt{\frac{RSS}{n-2}}\)
We can calculate the standard error of the estimate:
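For example, still assuming the simulated df and the candidate line with intercept 3 and slope 0.5:

# Residual sum of squares for the candidate line
pred <- 3 + 0.5 * df$x
rss <- sum((df$y - pred)^2)
# Standard error of the estimate: sqrt(RSS / (n - 2))
s_yx <- sqrt(rss / (nrow(df) - 2))
s_yx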
The \(R^2\), or coefficient of determination, measures the proportion of variance in \(Y\) explained by the model.
\(R^2 = 1 - \frac{RSS}{SS_Y}\)
Where \(SS_Y\) is the total sum of squares in \(Y\), i.e., \(\sum_{i} (y_i - \bar{y})^2\).
💭 Check-in
What does this formula mean and why does it measure the proportion of variance in \(Y\) explained by the model?
\(R^2 = 1 - \frac{RSS}{SS_Y}\)
Ordinary least squares (OLS) regression has a few key assumptions.
| Assumption | What it means | Why it matters |
|---|---|---|
| Linearity | The relationship between \(X\) and \(Y\) is linear | OLS fits a straight line, so if the true relationship is curved, predictions will be systematically biased |
| Independence | The observations are independent of each other | Dependent observations (e.g., repeated measures, time series) violate the assumption that errors are uncorrelated, leading to underestimated standard errors and invalid p-values |
| Homoscedasticity | The variance of residuals is constant across all levels of \(X\) (equal spread) | If variance changes with \(X\) (heteroscedasticity), standard errors will be incorrect: some coefficients appear more/less significant than they truly are |
| Normality of residuals | The errors are approximately normally distributed | Needed for valid confidence intervals and hypothesis tests (p-values). Less critical with large samples due to the Central Limit Theorem. |
Using and interpreting fitted lm models, with help from the broom package.
The lm function
A linear model can be fit using the lm function.
- The model is specified with a formula (e.g., y ~ x).
- Basic usage: lm(data = df_name, y ~ x), where y and x are columns in df_name.
To illustrate linear regression in R, we’ll work with a sample dataset.

As we discussed before, geom_smooth(method = "lm") can be used to plot a regression line over your data.
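For example, with the Income and Education columns from df_income (assuming ggplot2 is already loaded, as in the earlier plots):

# Scatterplot with a fitted regression line overlaid
ggplot(df_income, aes(x = Education, y = Income)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm") +
  theme_minimal()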

Tip
But to actually fit a model, we need to use lm.
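A minimal call, storing the result in an object (here called mod, a name chosen for illustration):

# Fit a linear model predicting Income from Education
mod <- lm(Income ~ Education, data = df_income)
summary(mod)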
Summarizing a fitted lm model
Calling summary on a fitted lm model object returns information about the coefficients and the overall model fit.
Call:
lm(formula = Income ~ Education, data = df_income)
Residuals:
Min 1Q Median 3Q Max
-19.568 -8.012 1.474 5.754 23.701
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -41.9166 9.7689 -4.291 0.000192 ***
Education 6.3872 0.5812 10.990 1.15e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.93 on 28 degrees of freedom
Multiple R-squared: 0.8118, Adjusted R-squared: 0.8051
F-statistic: 120.8 on 1 and 28 DF, p-value: 1.151e-11
Interpreting the summary output
Calling summary returns information about the coefficients of our model, as well as indicators of model fit.
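The coefficient table and \(R^2\) shown below can also be pulled directly from the summary object (again assuming the fitted model is stored in mod):

# Extract the coefficient matrix and R-squared from the summary
summary(mod)$coefficients
summary(mod)$r.squared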
Estimate Std. Error t value Pr(>|t|)
(Intercept) -41.916612 9.7689490 -4.290801 1.918257e-04
Education 6.387161 0.5811716 10.990148 1.150567e-11
[1] 0.8118069
The r.squared value shown above is the proportion of variance in y explained by x.
There are a few relevant things to note about coefficients:
- The intercept (\(\beta_0\)) is the model’s predicted value of \(Y\) when \(X = 0\).
- The slope (\(\beta_1\)) is the predicted change in \(Y\) for a one-unit increase in \(X\).
- Each coefficient comes with a standard error, a t value, and a p-value testing whether it reliably differs from zero.
💭 Check-in
How would you report and interpret the intercept and slope we obtained for Income ~ Education? (As a reminder, \(\beta_0 = -41.9\) and \(\beta_1 = 6.4\).)
The broom package
The broom package is also an easy way to quickly (and tidily) extract coefficient estimates.
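For example, assuming the fitted model object mod from above:

library(broom)
# Tidy the model into a data frame with one row per term
tidy(mod)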
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -41.9 9.77 -4.29 1.92e- 4
2 Education 6.39 0.581 11.0 1.15e-11
Once coefficients are in a dataframe, we can plot them using ggplot: a great way to visualize model fits!
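Here is a sketch of such a plot, assuming the tidied coefficients from above; the error bars show ±2 standard errors (an interval width chosen for illustration):

coefs <- tidy(mod)
# Plot each coefficient estimate with an approximate uncertainty interval
ggplot(coefs, aes(x = term, y = estimate)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = estimate - 2 * std.error,
                    ymax = estimate + 2 * std.error),
                width = 0.1) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  theme_minimal()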
glance
broom::glance() provides a tidy summary of overall model statistics.
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.812 0.805 11.9 121. 1.15e-11 1 -116. 238. 242.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
- r.squared: proportion of variance explained
- adj.r.squared: adjusted for number of predictors
- sigma: residual standard error
- p.value: p-value for the F-statistic

lm with %>%
If you like the %>% syntax, you can integrate lm into a series of pipe operations.
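One way to write it (assuming df_income as before; the dot passes the piped data to lm’s data argument):

library(dplyr)  # for %>%
library(broom)
df_income %>%
  lm(Income ~ Education, data = .) %>%
  tidy()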
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -41.9 9.77 -4.29 1.92e- 4
2 Education 6.39 0.581 11.0 1.15e-11
Once we’ve fit a model, we can use it to make predictions for new data!
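For example, predicting income for a couple of hypothetical education values (12 and 16 are illustrative values, not from the dataset):

# Predict Income for new Education values
new_data <- data.frame(Education = c(12, 16))
predict(mod, newdata = new_data)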
We can also assess the residuals.
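Here is a sketch of a residual check, assuming the fitted model mod:

# Plot residuals against fitted values; a flat, even band suggests the
# linearity and homoscedasticity assumptions are reasonable
df_income$fitted <- fitted(mod)
df_income$resid <- residuals(mod)
ggplot(df_income, aes(x = fitted, y = resid)) +
  geom_point(alpha = 0.7) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  theme_minimal()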

A categorical (or qualitative) variable takes on one of several discrete values.
# A tibble: 3 × 2
Condition RT
<chr> <dbl>
1 Congruent 12.1
2 Congruent 16.8
3 Congruent 9.56
💭 Check-in
How might you model and interpret the effect of a categorical variable?
A common approach (dummy coding) is to use the mean of one level (e.g., Congruent) as the intercept; the slope then represents the difference in means between the levels.
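Here is a sketch of that model, assuming the Condition/RT data shown above is stored in a data frame called df_rt (a name assumed here for illustration):

# R treats Condition as a factor; the first level becomes the reference level
mod_rt <- lm(RT ~ Condition, data = df_rt)
tidy(mod_rt)
# The intercept is the mean RT for the reference level (Congruent); the other
# level's coefficient is the difference in mean RT relative to that reference.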
- lm: fit the model.
- summary / broom::tidy / broom::glance: interpret the model coefficients and \(R^2\).
- predict: get predictions from the model.

CSS 211 | UC San Diego