The goal of this hands-on exercise is to familiarize you with building and interpreting multiple regression models in R, and with debugging common issues in those models.
We’ll be working with a dataset of housing prices in California. Our goal is to build a model predicting median_house_value in a given district/county.
Load dataset
To get started, let’s load the dataset.
library(tidyverse)
Rows: 20640 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): ocean_proximity
dbl (9): longitude, latitude, housing_median_age, total_rooms, total_bedroom...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Exercise 1: Understand your data
To get started, let’s explore our data.
Look at each of the variables and create histograms of them (or bar plots, etc.). What do they look like? What are they?
Which variables correlate with each other? You could use ggcorrplot to visualize the correlations.
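One possible starting point is sketched below. It assumes the data sits in a data frame called housing (a name chosen here for illustration) and builds a small synthetic stand-in with made-up values so that the snippet runs on its own; swap in the real data when you work through the exercise. The ggcorrplot call is optional and only runs if that package is installed.

```r
library(ggplot2)

# Synthetic stand-in for the housing data: the column names follow the
# exercise, but every value here is made up purely for illustration.
set.seed(1)
n <- 200
housing <- data.frame(
  housing_median_age = sample(1:52, n, replace = TRUE),
  total_bedrooms     = round(runif(n, 50, 2000)),
  median_income      = runif(n, 0.5, 15),
  ocean_proximity    = sample(c("INLAND", "NEAR BAY", "NEAR OCEAN"), n, replace = TRUE)
)
housing$population <- round(housing$total_bedrooms * runif(n, 2, 3))
housing$median_house_value <- 40000 * housing$median_income + rnorm(n, sd = 50000)

# Histogram for a numeric variable, bar plot for a categorical one.
p_hist <- ggplot(housing, aes(x = median_income)) +
  geom_histogram(bins = 30)
p_bar <- ggplot(housing, aes(x = ocean_proximity)) +
  geom_bar()
print(p_hist)
print(p_bar)

# Pairwise correlations among the numeric columns only.
# "pairwise.complete.obs" guards against missing values, in case there are any.
numeric_cols <- sapply(housing, is.numeric)
cor_mat <- cor(housing[, numeric_cols], use = "pairwise.complete.obs")
print(round(cor_mat, 2))

# Optional heatmap of the correlation matrix, if ggcorrplot is available.
if (requireNamespace("ggcorrplot", quietly = TRUE)) {
  print(ggcorrplot::ggcorrplot(cor_mat, lab = TRUE))
}
```

Repeating the histogram for each numeric column (or the bar plot for each categorical one) covers the first question; the correlation matrix addresses the second.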
Exercise 2: Build univariate models
Now, build a series of univariate models to predict median_house_value.
Try building models for each of the following predictors: housing_median_age, total_bedrooms, population, median_income, and ocean_proximity.
For each model, interpret the coefficients and also the \(R^2\).
Optionally, make a barplot comparing the \(R^2\) of these univariate models.
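As a rough sketch of how this exercise might be organized, the snippet below fits one lm() per predictor and collects the \(R^2\) values. The data frame name housing and the synthetic values in it are assumptions made so the code runs on its own; the numbers it prints say nothing about the real dataset.

```r
# Synthetic stand-in for the housing data (made-up values, real column names).
set.seed(1)
n <- 200
housing <- data.frame(
  housing_median_age = sample(1:52, n, replace = TRUE),
  total_bedrooms     = round(runif(n, 50, 2000)),
  median_income      = runif(n, 0.5, 15),
  ocean_proximity    = sample(c("INLAND", "NEAR BAY", "NEAR OCEAN"), n, replace = TRUE)
)
housing$population <- round(housing$total_bedrooms * runif(n, 2, 3))
housing$median_house_value <- 40000 * housing$median_income + rnorm(n, sd = 50000)

predictors <- c("housing_median_age", "total_bedrooms", "population",
                "median_income", "ocean_proximity")

# Fit one univariate model per predictor: median_house_value ~ <predictor>.
fits <- lapply(predictors, function(p) {
  lm(reformulate(p, response = "median_house_value"), data = housing)
})
names(fits) <- predictors

# Coefficient table for one model: estimate, standard error, t value, p value.
print(coef(summary(fits[["median_income"]])))

# R^2 of every univariate model, sorted from best to worst.
r2 <- sapply(fits, function(m) summary(m)$r.squared)
print(sort(r2, decreasing = TRUE))

# Optional bar plot comparing the R^2 values.
barplot(sort(r2, decreasing = TRUE), las = 2, ylab = "R^2")
```

For ocean_proximity, lm() treats the variable as categorical, so each coefficient is the difference in mean house value between one category and the reference category rather than a slope.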
Exercise 3: Build a big multivariate model!
Now let’s construct a multivariate model. Combine all the predictors from Exercise 2 into a single multivariate model.
How does this change your interpretation of the coefficients?
How does this change the overall model fit?
Using vif (from the car package), determine whether there’s multicollinearity in your model.
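The sketch below shows one way to fit the combined model and check for multicollinearity. It uses a synthetic stand-in data frame named housing (an assumption, with made-up values in which population is deliberately tied to total_bedrooms so there is something to detect). The manual calculation is included only to illustrate what a variance inflation factor measures, namely \(1 / (1 - R_j^2)\), where \(R_j^2\) comes from regressing predictor \(j\) on the other predictors; the car::vif call runs only if car is installed.

```r
# Synthetic stand-in for the housing data (made-up values, real column names).
set.seed(1)
n <- 200
housing <- data.frame(
  housing_median_age = sample(1:52, n, replace = TRUE),
  total_bedrooms     = round(runif(n, 50, 2000)),
  median_income      = runif(n, 0.5, 15),
  ocean_proximity    = sample(c("INLAND", "NEAR BAY", "NEAR OCEAN"), n, replace = TRUE)
)
housing$population <- round(housing$total_bedrooms * runif(n, 2, 3))
housing$median_house_value <- 40000 * housing$median_income + rnorm(n, sd = 50000)

# Multivariate model with all predictors from Exercise 2.
full_fit <- lm(median_house_value ~ housing_median_age + total_bedrooms +
                 population + median_income + ocean_proximity,
               data = housing)
print(coef(summary(full_fit)))
print(summary(full_fit)$r.squared)

# Manual VIF for each numeric predictor: regress it on all other predictors
# and compute 1 / (1 - R^2) of that auxiliary regression.
numeric_preds <- c("housing_median_age", "total_bedrooms",
                   "population", "median_income")
all_preds <- c(numeric_preds, "ocean_proximity")
vif_manual <- sapply(numeric_preds, function(p) {
  aux <- lm(reformulate(setdiff(all_preds, p), response = p), data = housing)
  1 / (1 - summary(aux)$r.squared)
})
print(round(vif_manual, 2))

# The function named in the exercise, if the car package is available.
if (requireNamespace("car", quietly = TRUE)) {
  print(car::vif(full_fit))
}
```

Because ocean_proximity has more than one degree of freedom, car::vif reports generalized VIFs for this model. A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic collinearity, though the threshold is a convention rather than a hard rule.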
Exercise 4: A more principled approach
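Now let’s use a technique called forward stepwise regression, a greedy procedure that adds one predictor at a time and yields a locally (not necessarily globally) optimal set of predictors.
Start with the best model from Exercise 2, i.e., the univariate model with the highest \(R^2\).
Then, build a series of 2-variable models, each adding one of the remaining predictors to that model, and choose the best of those.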
Do this until you’ve added all the variables from Exercise 2 in order of how much they help the model.
What’s the \(R^2\) of each of these progressively more complicated models?
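The procedure described above can be sketched as a short greedy loop. This is one possible implementation, not a definitive one: it uses \(R^2\) as the selection criterion, and it runs on a synthetic stand-in data frame named housing with made-up values, so the ordering it prints is not the answer for the real data.

```r
# Synthetic stand-in for the housing data (made-up values, real column names).
set.seed(1)
n <- 200
housing <- data.frame(
  housing_median_age = sample(1:52, n, replace = TRUE),
  total_bedrooms     = round(runif(n, 50, 2000)),
  median_income      = runif(n, 0.5, 15),
  ocean_proximity    = sample(c("INLAND", "NEAR BAY", "NEAR OCEAN"), n, replace = TRUE)
)
housing$population <- round(housing$total_bedrooms * runif(n, 2, 3))
housing$median_house_value <- 40000 * housing$median_income + rnorm(n, sd = 50000)

remaining <- c("housing_median_age", "total_bedrooms", "population",
               "median_income", "ocean_proximity")
selected <- character(0)  # predictors chosen so far, in order of entry
r2_path  <- numeric(0)    # R^2 of the model after each addition

while (length(remaining) > 0) {
  # R^2 of the current model extended by each remaining candidate.
  cand_r2 <- sapply(remaining, function(p) {
    f <- reformulate(c(selected, p), response = "median_house_value")
    summary(lm(f, data = housing))$r.squared
  })
  # Keep the candidate that gives the highest R^2.
  best <- names(which.max(cand_r2))
  selected <- c(selected, best)
  remaining <- setdiff(remaining, best)
  r2_path[best] <- max(cand_r2)
}

# Predictors in order of entry, with the R^2 reached at each step.
print(round(r2_path, 4))
```

Base R also provides step(), but it selects by AIC rather than \(R^2\), so its ordering may differ from the loop above. Note that \(R^2\) can only stay the same or increase as predictors are added to a model fit on the same rows, so the size of each increase matters more than the fact that it goes up.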
Exercise 5: Plot a map of California!
Our data also includes latitude and longitude information. Use that (e.g., in a scatterplot) to make a visualization of how housing prices change across California.
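A minimal sketch of such a map is shown below, assuming the columns are named longitude, latitude, and median_house_value as in the dataset, and that the data lives in a data frame called housing. The coordinates and prices here are randomly generated placeholders so the code runs by itself; with the real data the points should trace out the shape of the state.

```r
library(ggplot2)

# Synthetic stand-in: random points in a bounding box roughly covering
# California, with made-up house values.
set.seed(1)
n <- 200
housing <- data.frame(
  longitude          = runif(n, -124, -114),
  latitude           = runif(n, 32.5, 42),
  median_house_value = runif(n, 50000, 500000)
)

# Scatterplot of locations, coloured by median house value.
p_map <- ggplot(housing,
                aes(x = longitude, y = latitude, colour = median_house_value)) +
  geom_point(alpha = 0.5, size = 1) +
  scale_colour_viridis_c(labels = scales::comma) +
  coord_quickmap() +  # keeps the aspect ratio approximately map-like
  labs(x = "Longitude", y = "Latitude", colour = "Median house value")
print(p_map)
```

With around twenty thousand rows in the real data, a low alpha and small point size help with overplotting.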