Week 6: Logistic regression

Introduction

The goal of this hands-on exercise is to familiarize you with building and interpreting logistic regression models in R.

We’ll be working with the Titanic dataset we used earlier in the quarter. More information can be found here.

Here’s a handy reference table for the different columns of the dataset.

Column    Description                      Values / Notes
Survived  Survival                         0 = No, 1 = Yes
Pclass    Ticket class                     1 = 1st, 2 = 2nd, 3 = 3rd
Sex       Sex
Age       Age in years
SibSp     # of siblings / spouses aboard
Parch     # of parents / children aboard
Ticket    Ticket number
Fare      Passenger fare
Cabin     Cabin number
Embarked  Port of Embarkation              C = Cherbourg, Q = Queenstown, S = Southampton

Load dataset

To get started, let’s load the dataset.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggcorrplot)

### Titanic dataset
df_titanic <- read_csv("https://raw.githubusercontent.com/seantrott/ucsd_css211_datasets/main/main/wrangling/titanic.csv")
Rows: 891 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Name, Sex, Ticket, Cabin, Embarked
dbl (7): PassengerId, Survived, Pclass, Age, SibSp, Parch, Fare

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(df_titanic)
# A tibble: 6 × 12
  PassengerId Survived Pclass Name    Sex     Age SibSp Parch Ticket  Fare Cabin
        <dbl>    <dbl>  <dbl> <chr>   <chr> <dbl> <dbl> <dbl> <chr>  <dbl> <chr>
1           1        0      3 Braund… male     22     1     0 A/5 2…  7.25 <NA> 
2           2        1      1 Cuming… fema…    38     1     0 PC 17… 71.3  C85  
3           3        1      3 Heikki… fema…    26     0     0 STON/…  7.92 <NA> 
4           4        1      1 Futrel… fema…    35     1     0 113803 53.1  C123 
5           5        0      3 Allen,… male     35     0     0 373450  8.05 <NA> 
6           6        0      3 Moran,… male     NA     0     0 330877  8.46 <NA> 
# ℹ 1 more variable: Embarked <chr>

Exercise 1: Does Age predict Survived?

Build a logistic regression model predicting Survived from Age.

  • Is the coefficient for Age significant? If so, what does that mean?
  • Interpret the intercept and slope coefficient. What do they correspond to?
  • According to this model, what is the probability that a 20-year-old person would survive? What about a 30-year-old person?
  • What is the AIC of this model?
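
The steps in this exercise can be sketched as follows. This is a minimal sketch on simulated data, not the Titanic results: `df_demo`, `x`, and `y` are hypothetical names, and the simulation settings are arbitrary. To work the exercise, swap in `df_titanic`, `Age`, and `Survived`.

```r
# Simulated stand-in data (hypothetical; not the Titanic dataset)
set.seed(1)
df_demo <- data.frame(x = runif(200, min = 0, max = 80))
df_demo$y <- rbinom(200, size = 1, prob = plogis(0.5 - 0.02 * df_demo$x))

# Logistic regression: family = binomial uses a logit link by default
m_demo <- glm(y ~ x, data = df_demo, family = binomial)
summary(m_demo)   # coefficients (log-odds scale), standard errors, p-values

# Predicted probabilities at chosen predictor values
new_obs <- data.frame(x = c(20, 30))
p_hat <- predict(m_demo, newdata = new_obs, type = "response")
p_hat

# The same numbers by hand: inverse logit of intercept + slope * x
b <- coef(m_demo)
p_manual <- plogis(b[["(Intercept)"]] + b[["x"]] * new_obs$x)

# Akaike information criterion of the fitted model
aic_demo <- AIC(m_demo)
aic_demo
```

The coefficients are on the log-odds scale: the intercept is the log-odds of the outcome when the predictor equals 0, and the slope is the change in log-odds per one-unit increase in the predictor. `plogis()` applies the inverse logit, p = 1 / (1 + exp(-z)), to turn a log-odds value z into a probability.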

Exercise 2: Does Pclass predict Survived?

Build a logistic regression model predicting Survived from Pclass (passenger class).

  • Is the coefficient for Pclass significant? If so, what does that mean?
  • Interpret the intercept and slope coefficient. What do they correspond to?
  • According to this model, what is the probability that a person in 1st class would survive? What about a person in 3rd class?
  • What is the AIC of this model?

Exercise 3: Does Sex predict Survived?

Build a logistic regression model predicting Survived from Sex.

  • Is the coefficient for Sex significant? If so, what does that mean?
  • Interpret the intercept and slope coefficient. What do they correspond to?
  • According to this model, what is the probability that a person listed as female would survive? What about male?
  • What is the AIC of this model?
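
Because Sex is stored as text, it helps to see how `glm()` handles a categorical predictor. A minimal sketch on simulated data (`df_demo`, `group`, and `y` are hypothetical names, and the probabilities are arbitrary): R dummy-codes the predictor and treats the alphabetically first level as the reference. For `Sex`, that would make "female" the reference level, since it sorts before "male".

```r
# Simulated stand-in data (hypothetical; not the Titanic dataset)
set.seed(2)
df_demo <- data.frame(group = rep(c("a", "b"), each = 100))
df_demo$y <- rbinom(200, size = 1,
                    prob = ifelse(df_demo$group == "a", 0.7, 0.3))

# The character predictor is converted to a factor; level "a" sorts first,
# so it becomes the reference level absorbed into the intercept
m_cat <- glm(y ~ group, data = df_demo, family = binomial)
coef(m_cat)   # terms: "(Intercept)" and "groupb"

# Intercept          = log-odds of y = 1 for the reference level "a"
# Intercept + groupb = log-odds of y = 1 for level "b"
b <- coef(m_cat)
p_a <- plogis(b[["(Intercept)"]])
p_b <- plogis(b[["(Intercept)"]] + b[["groupb"]])

# With a single two-level predictor, the fitted probabilities
# reproduce the observed proportion of y = 1 in each group
obs <- tapply(df_demo$y, df_demo$group, mean)
```

The slope here is a difference in log-odds between the two levels, so `exp(b[["groupb"]])` is the odds ratio of level "b" relative to level "a".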

Exercise 4: Compare the single-variable models.

In Exercises 1-3, you calculated the AIC of each model. Now compare those AIC values. Which is lowest? What does that tell us about how well each predictor accounts for whether someone survived?
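
One caveat when comparing: AIC values are only comparable across models fit to the same observations. The preview above shows a missing Age (row 6), and `glm()` drops rows with missing predictors by default, so a model using Age may be fit to fewer rows than one using Pclass or Sex. A minimal sketch of one way to handle this, using simulated data with hypothetical names (`df_demo`, `x1`, `x2`): restrict to complete rows first, then fit and compare.

```r
library(tidyr)   # for drop_na()

# Simulated stand-in data (hypothetical; not the Titanic dataset)
set.seed(3)
n <- 300
df_demo <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df_demo$y <- rbinom(n, size = 1, prob = plogis(1.5 * df_demo$x1))
df_demo$x1[sample(n, 50)] <- NA   # inject missing values into one predictor

# Keep only rows where every predictor of interest is observed
df_complete <- drop_na(df_demo, x1, x2)

m_x1 <- glm(y ~ x1, data = df_complete, family = binomial)
m_x2 <- glm(y ~ x2, data = df_complete, family = binomial)

# Both models now use the same rows, so their AICs can be compared
aic_table <- AIC(m_x1, m_x2)
aic_table   # data frame with columns df and AIC; lower AIC is preferred
```

`nobs()` reports how many rows a fitted model actually used, which is a quick check that the models being compared share the same sample.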

Exercise 5: Build a multivariable model.

Now let’s combine those three variables in a single model predicting Survived.

  • Compared with the single-variable models, do any of the coefficients change? In which direction, and by how much?
  • Use the broom package (and ggplot2) to visualize the coefficients with their standard errors.
  • What is the AIC of this new model?
  • Write out the linear equation corresponding to this model to help think through what it means.
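
The coefficient plot can be sketched as follows. This is one possible approach on simulated data, not the Titanic results: `df_demo`, `x1`, `x2`, `group`, and `y` are hypothetical names, and showing plus or minus one standard error with `geom_pointrange()` is a choice made here, not a requirement of the exercise.

```r
library(broom)
library(ggplot2)

# Simulated stand-in data (hypothetical; not the Titanic dataset)
set.seed(4)
n <- 300
df_demo <- data.frame(
  x1 = rnorm(n),
  x2 = rnorm(n),
  group = sample(c("a", "b"), n, replace = TRUE)
)
df_demo$y <- rbinom(
  n, size = 1,
  prob = plogis(0.8 * df_demo$x1 - 0.5 * (df_demo$group == "b"))
)

# Logistic regression with several predictors
m_multi <- glm(y ~ x1 + x2 + group, data = df_demo, family = binomial)

# tidy() returns one row per coefficient:
# term, estimate, std.error, statistic, p.value
df_coef <- tidy(m_multi)

# Point = estimate (log-odds); bar = estimate +/- 1 standard error
p_coef <- ggplot(df_coef, aes(x = estimate, y = term)) +
  geom_pointrange(aes(xmin = estimate - std.error,
                      xmax = estimate + std.error)) +
  geom_vline(xintercept = 0, linetype = "dashed") +
  labs(x = "Estimate (log-odds)", y = NULL)
p_coef
```

For this sketch, the linear equation has the form log(p / (1 - p)) = b0 + b1 * x1 + b2 * x2 + b3 * groupb, where p is the probability that y = 1 and groupb is 1 for level "b" and 0 otherwise. The Titanic model has the same shape, with one term per predictor.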