Week 5: Multiple regression

Introduction

The goal of this hands-on exercise is to familiarize you with building and interpreting multiple regression models in R, and with debugging common issues that arise in them.

We’ll be working with a dataset of housing prices in California. Our goal is to build a model predicting median_house_value in a given district/county.

Load dataset

To get started, let’s load the dataset.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggcorrplot)

### Housing dataset
df_housing <- read_csv("https://raw.githubusercontent.com/seantrott/ucsd_css211_datasets/main/main/regression/housing.csv")
Rows: 20640 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): ocean_proximity
dbl (9): longitude, latitude, housing_median_age, total_rooms, total_bedroom...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Exercise 1: Understand your data

To get started, let’s explore our data.

  • Look at each of the variables and create histograms of them (or bar plots, etc.). What do they look like? What are they?
  • Which variables correlate with each other? Try using ggcorrplot.
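
One possible way to start on this exploration is sketched below. The helper names plot_numeric_histograms and plot_correlations are hypothetical (they are not part of the exercise), and the sketch assumes the data frame has at least two numeric columns. Pairwise-complete correlations are used in case any column has missing values.

```r
library(tidyverse)
library(ggcorrplot)

# Histograms of every numeric column, one panel per variable.
# (Hypothetical helper; `bins = 30` is an arbitrary starting point.)
plot_numeric_histograms <- function(data, bins = 30) {
  data %>%
    select(where(is.numeric)) %>%
    pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
    ggplot(aes(x = value)) +
    geom_histogram(bins = bins, na.rm = TRUE) +
    facet_wrap(~ variable, scales = "free")
}

# Correlation matrix of the numeric columns, drawn with ggcorrplot.
# (Hypothetical helper; rows with missing values are dropped pair by pair.)
plot_correlations <- function(data) {
  corr <- data %>%
    select(where(is.numeric)) %>%
    cor(use = "pairwise.complete.obs")
  ggcorrplot(corr, lab = TRUE)
}
```

With the dataset loaded above, plot_numeric_histograms(df_housing) and plot_correlations(df_housing) would produce the two figures. For the one categorical column, a bar plot such as ggplot(df_housing, aes(x = ocean_proximity)) + geom_bar() is a natural counterpart.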

Exercise 2: Build univariate models

Now, build a series of univariate models to predict median_house_value.

  • Try building models for each of the following predictors: housing_median_age, total_bedrooms, population, median_income, and ocean_proximity.
  • For each model, interpret the coefficients and also the \(R^2\).
  • Optionally, make a bar plot comparing the \(R^2\) of these univariate models.
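
A minimal sketch of how these models could be fit and compared follows. The helpers fit_univariate and univariate_r2 are hypothetical names introduced here for illustration; they only wrap base R's lm() and summary().

```r
# Fit outcome ~ predictor and return the fitted lm object.
# (Hypothetical helper; the default outcome matches this exercise.)
fit_univariate <- function(data, predictor, outcome = "median_house_value") {
  lm(reformulate(predictor, response = outcome), data = data)
}

# R^2 of each predictor's univariate model, as a named numeric vector.
univariate_r2 <- function(data, predictors, outcome = "median_house_value") {
  sapply(predictors, function(p) {
    summary(fit_univariate(data, p, outcome))$r.squared
  })
}
```

For example, summary(fit_univariate(df_housing, "median_income")) prints the coefficients for one model, and univariate_r2(df_housing, c("housing_median_age", "total_bedrooms", "population", "median_income", "ocean_proximity")) collects the \(R^2\) values for a bar plot. Two things to keep in mind when interpreting: R dummy-codes a categorical predictor like ocean_proximity by default, so its coefficients are differences from the reference level; and lm() drops rows with missing values in the variables it uses, so different models may be fit to slightly different subsets of rows.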

Exercise 3: Build a big multivariate model!

Now let’s construct a multivariate model. Combine all the predictors from Exercise 2 into a single multivariate model.

  • How does this change your interpretation of the coefficients?
  • How does this change the overall model fit?
  • Using vif (from the car package), determine whether there’s multicollinearity in your model.
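
The steps above might be sketched as follows. The helper fit_with_vif is a hypothetical name; it fits one model containing every predictor and returns it together with the output of car::vif().

```r
library(car)  # provides vif()

# Fit outcome ~ all predictors, and compute variance inflation factors.
# (Hypothetical helper; returns a list with the model and its VIFs.)
fit_with_vif <- function(data, predictors, outcome = "median_house_value") {
  model <- lm(reformulate(predictors, response = outcome), data = data)
  list(model = model, vif = vif(model))
}
```

Calling fit_with_vif(df_housing, c("housing_median_age", "total_bedrooms", "population", "median_income", "ocean_proximity")) would give both pieces; summary() on the model element shows the coefficients and \(R^2\). When a model includes a categorical predictor with more than two levels, vif() reports generalized VIFs (a GVIF column, a Df column, and a scaled GVIF) rather than a plain vector. Common rules of thumb treat values above roughly 5 or 10 as a sign of problematic multicollinearity, though these cutoffs are conventions rather than hard limits.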

Exercise 4: A more principled approach

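Now let’s use a technique called forward stepwise regression, which builds up a set of predictors greedily: each step adds the one predictor that most improves the current model, so the result is locally optimal rather than guaranteed to be the best possible subset.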

  • Start with the best model from Exercise 2.
  • Then, build a series of 2-variable models, each adding one of the remaining predictors to that model, and choose the best of those.
  • Do this until you’ve added all the variables from Exercise 2 in order of how much they help the model.

What’s the \(R^2\) of each of these progressively more complicated models?
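
The procedure can be sketched in a few lines of base R. This is one possible implementation under stated assumptions, not the only way to do it: the function name forward_stepwise is hypothetical, "best" is taken to mean highest \(R^2\), and rows with missing values in any candidate are dropped up front so that every model is fit to the same observations.

```r
# Greedy forward selection by R^2.
# (Hypothetical helper; returns one row per step with the predictor added
# at that step and the R^2 of the resulting model.)
forward_stepwise <- function(data, candidates, outcome = "median_house_value") {
  # Keep only complete rows so all models are compared on the same data.
  data <- data[complete.cases(data[, c(outcome, candidates)]), ]
  selected <- character(0)
  r2_path <- numeric(0)
  while (length(candidates) > 0) {
    # R^2 of the current model plus each remaining candidate, one at a time.
    r2 <- sapply(candidates, function(p) {
      f <- reformulate(c(selected, p), response = outcome)
      summary(lm(f, data = data))$r.squared
    })
    best <- names(which.max(r2))
    selected <- c(selected, best)
    r2_path <- c(r2_path, max(r2))
    candidates <- setdiff(candidates, best)
  }
  data.frame(step = seq_along(selected), added = selected, r_squared = r2_path)
}
```

Running forward_stepwise(df_housing, c("housing_median_age", "total_bedrooms", "population", "median_income", "ocean_proximity")) would list the predictors in the order they were added. Because each model nests the previous one and is fit to the same rows, \(R^2\) can never go down from one step to the next; the interesting question is how much each addition buys.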

Exercise 5: Plot a map of California!

Our data also includes latitude and longitude information. Use that (e.g., in a scatterplot) to make a visualization of how housing prices change across California.
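
A minimal sketch of such a scatterplot is shown below. The function name plot_price_map and the styling choices (point size, transparency, the viridis color scale) are illustrative assumptions; only the column names come from the dataset.

```r
library(ggplot2)

# Scatterplot of districts by location, colored by median house value.
# (Hypothetical helper; coord_quickmap() keeps the aspect ratio map-like.)
plot_price_map <- function(data) {
  ggplot(data, aes(x = longitude, y = latitude, color = median_house_value)) +
    geom_point(alpha = 0.4, size = 0.8) +
    scale_color_viridis_c() +
    coord_quickmap() +
    labs(x = "Longitude", y = "Latitude", color = "Median house value")
}
```

Calling plot_price_map(df_housing) should trace out the rough shape of California. Low alpha helps in dense areas where many points overlap.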