**Sean Trott**

*February 17, 2020*

This tutorial is intended as an introduction to two^{1} approaches to binary classification: logistic regression and support vector machines. It will accompany my 02/18/2020 workshop, “Binary classification in R”.

Concepts covered will include:

- What is binary classification?
- Why use logistic regression instead of linear regression?
- Why use an SVM instead of logistic regression?
- How do we build a classification model in R, and how do we interpret the output?
- How do we generate predictions from a classification model?

As a second-order goal, the tutorial will also briefly cover certain aspects of data wrangling and visualization in R.

The examples and descriptions throughout the tutorial were sourced heavily from *An Introduction to Statistical Learning in R*. I also drew from several online blogs.

We’ll work with three main libraries: `library(tidyverse)`, `library(ISLR)`, and `library(e1071)`. If you don’t have them installed already, use `install.packages("tidyverse")`, `install.packages("ISLR")`, and `install.packages("e1071")` to install them.
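Once installed, the three libraries can be loaded at the top of your script:

```
# Load the packages used throughout this tutorial
library(tidyverse)  # data wrangling and plotting (dplyr, ggplot2, etc.)
library(ISLR)       # supplies the Default dataset
library(e1071)      # supplies svm() for support vector machines
```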

**Classification** is the task of predicting a *qualitative* or *categorical* response variable. This is a common situation: it’s often the case that we want to know whether manipulating some \(X\) variable changes the probability of a certain categorical outcome (rather than changing the value of a continuous outcome).

Examples of classification problems include:

- Classifying a person’s **medical condition** from a list of symptoms.

- Detecting whether a certain transaction is **fraudulent**.

- Predicting the **part of speech** of a word.

- Pretty much any experiment in which the task involves a forced choice between different responses (e.g., “Yes” or “No”).

**Binary classification** refers to a subset of these problems in which there are two possible outcomes. Given some variables \(X_1, ..., X_n\), we want to predict the *probability* that a particular observation belongs to one class or another.

In this tutorial, we’ll use several different datasets to demonstrate binary classification. We’ll start out by using the `Default` dataset, which comes with the `ISLR` package. We’ll then extend some of what we learn on this dataset to one of my own datasets, which involves trying to predict whether or not an utterance is a **request** (*request* vs. *non-request*) from a set of seven acoustic features.

To start off, let’s try to model the binary outcome of whether an individual will default on their loan. If you’ve imported the `ISLR` library, the `Default` dataset should be available.

We can get a sense for what the dataset looks like by looking at the first few rows:

`head(Default)`

```
## default student balance income
## 1 No No 729.5265 44361.625
## 2 No Yes 817.1804 12106.135
## 3 No No 1073.5492 31767.139
## 4 No No 529.2506 35704.494
## 5 No No 785.6559 38463.496
## 6 No Yes 919.5885 7491.559
```

And we can get a sense for the distribution of `default` values by looking at it in a table:

`table(Default$default)`

```
##
## No Yes
## 9667 333
```

It’s clearly an unbalanced distribution. Most people in this dataset didn’t default on their loans. We can calculate the proportion by dividing the values in this table by the total number of rows in the dataset:

`table(Default$default) / nrow(Default)`

```
##
## No Yes
## 0.9667 0.0333
```

Note that an alternative way to get this information would be to use several handy functions from the `tidyverse` library. We can *group* our observations by whether or not they defaulted using `group_by`, then count the number and proportion in each cell using some operations in the `summarise` function.

```
df_summary = Default %>%
  group_by(default) %>%
  summarise(count = n(),
            proportion = n() / nrow(Default))
```


`df_summary`

```
## # A tibble: 2 x 3
## default count proportion
## <fct> <int> <dbl>
## 1 No 9667 0.967
## 2 Yes 333 0.0333
```

Note the `%>%` syntax: this is called piping, and is a common way to **chain together** multiple functions in R. It’s a shorthand for saying: use the *output* of this function call (or dataframe) as the *input* to the next function. Personally, I think it makes for nice, readable code, especially if you write each function call on a different line; that makes it clear exactly which transformations you’re applying to the data.
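As a minimal illustration, the piped chain above is equivalent to nesting each function call inside the next, which quickly gets harder to read:

```
# Nested version of the same computation: each function's output
# becomes the first argument of the function wrapped around it
df_summary = summarise(group_by(Default, default),
                       count = n(),
                       proportion = n() / nrow(Default))
```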

Our \(Y\) variable is categorical: `yes` vs. `no`. We can’t use linear regression for a categorical variable.

Theoretically, however, we *could* recode this as a quantitative variable, setting `yes` to `1` and `no` to `0`. In fact, we could even do this for variables with >2 levels, like recoding `noun/verb/adjective` as `0/1/2`. But there are a couple of problems with this approach:

Recoding your \(Y\) variable as quantitative imposes an *ordering* on that variable. If there are only two levels, this is less problematic. But if your \(Y\) variable has more than two levels (e.g., `noun/verb/adjective`), then the ordering you choose will greatly affect the slope of your regression line. Given that nominal variables have no intrinsic ordering by definition, this makes linear regression unsuitable for predicting \(Y\) variables with >2 classes.
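To make this concrete, here is a small sketch with made-up data: recoding the same three-level nominal outcome under two different (equally arbitrary) numeric orderings changes the fitted slope.

```
# Hypothetical toy data: a nominal three-level outcome
x = 1:6
pos = c("noun", "verb", "adjective", "noun", "verb", "adjective")

# Two arbitrary numeric codings of the same outcome
y1 = c(noun = 0, verb = 1, adjective = 2)[pos]
y2 = c(adjective = 0, noun = 1, verb = 2)[pos]

# The slope of the "same" regression differs under each coding
coef(lm(y1 ~ x))["x"]
coef(lm(y2 ~ x))["x"]
```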

As noted above, this is less of a problem for \(Y\) variables with only two levels. We can simply impose a decision threshold, e.g., “if \(\hat{Y} > .5\), predict `yes`, otherwise `no`”.
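A sketch of such a threshold rule, applied to some hypothetical fitted values `y_hat`:

```
# Hypothetical fitted values from a linear model
y_hat = c(-0.04, 0.12, 0.61, 0.97)

# Apply a .5 decision threshold to turn them into class labels
predicted_class = ifelse(y_hat > .5, "yes", "no")
predicted_class
```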

Even for binary variables, another serious problem is that linear regression will produce estimates for \(\hat{Y}\) that fall outside our \([0, 1]\) range. Linear regression fits a *line*, and lines will extend past \([0, 1]\) at some constant rate (i.e., the slope).

Sometimes this is okay, depending on our objective. But it certainly makes the values for \(\hat{Y}\) very crude as probability estimates, since many values will be less than 0 or greater than 1.

We can demonstrate this empirically on the `Default` dataset. First, let’s recode our `default` variable as a numeric variable^{2}:

```
# Recode as numeric
Default = Default %>%
  mutate(
    default_numeric = case_when(
      default == "Yes" ~ 1,
      default == "No" ~ 0
    )
  )
```

The `mean` of our new `default_numeric` variable should match the proportion of individuals who defaulted on their loan, which we calculated earlier:

`mean(Default$default_numeric)`

`## [1] 0.0333`

We can then build a linear model using `lm`, predicting `default_numeric` from one or more of our possible predictors. Let’s start out just using `balance`:

```
simple_linear_model = lm(data = Default,
                         default_numeric ~ balance)
summary(simple_linear_model)
```

```
##
## Call:
## lm(formula = default_numeric ~ balance, data = Default)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.23533 -0.06939 -0.02628 0.02004 0.99046
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.519e-02 3.354e-03 -22.42 <2e-16 ***
## balance 1.299e-04 3.475e-06 37.37 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1681 on 9998 degrees of freedom
## Multiple R-squared: 0.1226, Adjusted R-squared: 0.1225
## F-statistic: 1397 on 1 and 9998 DF, p-value: < 2.2e-16
```

We see that `balance` has a positive coefficient, meaning that individuals are more likely to default when they have a higher balance. We can plot the relationship like so:

```
Default %>%
  ggplot(aes(x = balance,
             y = default_numeric)) +
  geom_point(alpha = .4) +
  geom_smooth(method = "lm") +
  labs(y = "Default (1 = yes, 0 = no)",
       title = "Default outcome by balance") +
  theme_minimal()
```
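The coefficients above also let us see the out-of-range problem directly: because the intercept is negative (about -0.075), the fitted line predicts a negative “probability” for anyone with a balance of zero. A quick check with `predict`:

```
# Predicted values at a few balances; the prediction at balance = 0
# is just the (negative) intercept
predict(simple_linear_model,
        newdata = data.frame(balance = c(0, 1000, 2000)))
```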