Sean Trott
February 17, 2020
This tutorial is intended as an introduction to two approaches to binary classification: logistic regression and support vector machines. It will accompany my 02/18/2020 workshop, “Binary classification in R”.
Concepts covered will include:
As a second-order goal, the tutorial will also briefly cover certain aspects of data wrangling and visualization in R.
The examples and descriptions throughout the tutorial were sourced heavily from An Introduction to Statistical Learning in R. I also drew from several online blogs, including:
We’ll work with three main libraries: library(tidyverse), library(ISLR), and library(e1071). If you don’t have them installed already, use install.packages("tidyverse"), install.packages("ISLR"), and install.packages("e1071") to install them.
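Once the packages are installed, load them at the top of your script so the functions used below are available:

library(tidyverse)
library(ISLR)
library(e1071)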
Classification is the task of predicting a qualitative or categorical response variable. This is a common situation: it’s often the case that we want to know whether manipulating some \(X\) variable changes the probability of a certain categorical outcome (rather than changing the value of a continuous outcome).
Examples of classification problems include:
Binary classification refers to a subset of these problems in which there are two possible outcomes. Given some variables \(X_1, ..., X_n\), we want to predict the probability that a particular observation belongs to one class or another.
In this tutorial, we’ll use several different datasets to demonstrate binary classification. We’ll start out by using the Default dataset, which comes with the ISLR package. We’ll then extend some of what we learn on this dataset to one of my own datasets, which involves trying to predict whether or not an utterance is a request (request vs. non-request) from a set of seven acoustic features.
To start off, let’s try to model the binary outcome of whether an individual will default on their loan. If you’ve imported the ISLR library, the Default dataset should be available.
We can get a sense for what the dataset looks like by looking at the first few rows:
head(Default)
## default student balance income
## 1 No No 729.5265 44361.625
## 2 No Yes 817.1804 12106.135
## 3 No No 1073.5492 31767.139
## 4 No No 529.2506 35704.494
## 5 No No 785.6559 38463.496
## 6 No Yes 919.5885 7491.559
And we can get a sense for the distribution of default values by looking at it in a table:
table(Default$default)
##
## No Yes
## 9667 333
It’s clearly an unbalanced distribution. Most people in this dataset didn’t default on their loans. We can calculate the proportion by dividing the values in this table by the total number of rows in the dataset:
table(Default$default) / nrow(Default)
##
## No Yes
## 0.9667 0.0333
Note that an alternative way to get this information would be to use several handy functions from the tidyverse library. We can group our observations by whether or not they defaulted using group_by, then count the number and proportion in each cell using some operations in the summarise function.
df_summary = Default %>%
group_by(default) %>%
summarise(count = n(),
proportion = n() / nrow(Default))
df_summary
## # A tibble: 2 x 3
## default count proportion
## <fct> <int> <dbl>
## 1 No 9667 0.967
## 2 Yes 333 0.0333
Note the %>% syntax: this is called piping, and is a common way to chain together multiple functions in R. It’s shorthand for saying: use the output of this function call (or dataframe) as the input to the next function. Personally, I think it makes for nice, readable code, especially if you write each function call on a different line; that makes it clear exactly which transformations you’re applying to the data.
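Just to illustrate, here’s what the same summary looks like without pipes, written as nested function calls (equivalent to the piped version above):

# The same summary without pipes: each function call wraps the previous one
df_summary = summarise(group_by(Default, default),
                       count = n(),
                       proportion = n() / nrow(Default))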
Our \(Y\) variable is categorical: yes vs. no. We can’t use linear regression for a categorical variable.
Theoretically, however, we could recode this as a quantitative variable, setting yes to 1 and no to 0. In fact, we could even do this for variables with >2 levels, like recoding noun/verb/adjective as 0/1/2. But there are a couple of problems with this approach:
Recoding your \(Y\) variable as quantitative imposes an ordering on that variable. If there are only two levels, this is less problematic. But if your \(Y\) variable has more than two levels (e.g., noun/verb/adjective), then the ordering you choose will greatly affect the slope of your regression line (see the toy example below). Given that nominal variables have no intrinsic ordering by definition, this makes linear regression unsuitable for predicting \(Y\) variables with >2 classes.
As noted above, this is less of a problem for \(Y\) variables with only two levels. We can simply impose a decision threshold, e.g., “if \(\hat{Y} > .5\), predict yes, otherwise no”.
Even for binary variables, another serious problem is that linear regression will produce estimates for \(\hat{Y}\) that fall outside our \([0, 1]\) range. Linear regression fits a line, and lines will extend past \([0, 1]\) at some constant rate (i.e., the slope).
Sometimes this is okay, depending on our objective. But it certainly makes the values for \(\hat{Y}\) very crude as probability estimates, since many values will be less than 0 or greater than 1.
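To make the ordering problem described above concrete, here is a small toy illustration with simulated data (the variable names and the simulated relationship are invented purely for this sketch). Coding the same three-level outcome in two different orders yields two different slopes:

set.seed(1)
x = rnorm(100)
# Simulate a 3-level outcome that depends on x
pos = cut(x, breaks = c(-Inf, -0.5, 0.5, Inf),
          labels = c("noun", "verb", "adjective"))
# Two arbitrary numeric codings of the same three levels
coding_a = as.numeric(factor(pos, levels = c("noun", "verb", "adjective")))
coding_b = as.numeric(factor(pos, levels = c("verb", "noun", "adjective")))
# The estimated slope for x differs depending on which coding we chose
coef(lm(coding_a ~ x))["x"]
coef(lm(coding_b ~ x))["x"]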
We can demonstrate the second problem empirically on the Default dataset. First, let’s recode our default variable as a numeric variable:
# Recode as numeric
Default = Default %>%
mutate(
default_numeric = case_when(
default == "Yes" ~ 1,
default == "No" ~ 0
)
)
The mean of our new default_numeric variable should match the proportion of individuals who defaulted on their loan, which we calculated earlier:
mean(Default$default_numeric)
## [1] 0.0333
We can then build a linear model using lm, predicting default_numeric from one or more of our possible predictors. Let’s start out just using balance:
simple_linear_model = lm(data = Default,
default_numeric ~ balance)
summary(simple_linear_model)
##
## Call:
## lm(formula = default_numeric ~ balance, data = Default)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.23533 -0.06939 -0.02628 0.02004 0.99046
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.519e-02 3.354e-03 -22.42 <2e-16 ***
## balance 1.299e-04 3.475e-06 37.37 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1681 on 9998 degrees of freedom
## Multiple R-squared: 0.1226, Adjusted R-squared: 0.1225
## F-statistic: 1397 on 1 and 9998 DF, p-value: < 2.2e-16
We see that balance has a positive coefficient, meaning that individuals are more likely to default when they have a higher balance. We can plot the relationship like so:
Default %>%
ggplot(aes(x = balance,
y = default_numeric)) +
geom_point(alpha = .4) +
geom_smooth(method = "lm") +
labs(y = "Default (1 = yes, 0 = no)",
title = "Default outcome by balance") +
theme_minimal()
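As a quick sanity check of the point made earlier (this assumes the simple_linear_model fit from above is still in your workspace; predicted_default is just a name introduced for this sketch), we can inspect the fitted values directly and apply the kind of .5 decision threshold described earlier. Some fitted values fall below 0, which makes them hard to interpret as probabilities:

# Fitted values from the linear model are not constrained to [0, 1]
range(fitted(simple_linear_model))
sum(fitted(simple_linear_model) < 0)

# A crude decision rule: predict "Yes" whenever the fitted value exceeds .5
predicted_default = ifelse(fitted(simple_linear_model) > .5, "Yes", "No")
table(predicted_default, Default$default)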