**Sean Trott**

*February 17, 2020*

This tutorial is intended as an introduction to two^{1} approaches to binary classification: logistic regression and support vector machines. It will accompany my 02/18/2020 workshop, “Binary classification in R”.

Concepts covered will include:

- What is binary classification?
- Why use logistic regression instead of linear regression?
- Why use an SVM instead of logistic regression?
- How do we build a classification model in R, and how do we interpret the output?
- How do we generate predictions from a classification model?

As a second-order goal, the tutorial will also briefly cover certain aspects of data wrangling and visualization in R.

The examples and descriptions throughout the tutorial were sourced heavily from *An Introduction to Statistical Learning in R*. I also drew from several online blogs.

We’ll work with three main libraries: `library(tidyverse)`, `library(ISLR)`, and `library(e1071)`. If you don’t have them installed already, use `install.packages("tidyverse")`, `install.packages("ISLR")`, and `install.packages("e1071")` to install them.
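Once installed, the three libraries can be loaded at the top of your script:

```
# Load the packages used throughout this tutorial
library(tidyverse)  # data wrangling and plotting (dplyr, ggplot2, etc.)
library(ISLR)       # supplies the Default dataset
library(e1071)      # supplies svm() for support vector machines
```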

**Classification** is the task of predicting a *qualitative* or *categorical* response variable. This is a common situation: it’s often the case that we want to know whether manipulating some \(X\) variable changes the probability of a certain categorical outcome (rather than changing the value of a continuous outcome).

Examples of classification problems include:

- Classifying a person’s **medical condition** from a list of symptoms.

- Detecting whether a certain transaction is **fraudulent**.

- Predicting the **part of speech** of a word.

- Pretty much any experiment in which the task involves a forced choice between different responses (e.g., “Yes” or “No”).

**Binary classification** refers to a subset of these problems in which there are two possible outcomes. Given some variables \(X_1, ..., X_n\), we want to predict the *probability* that a particular observation belongs to one class or another.

In this tutorial, we’ll use several different datasets to demonstrate binary classification. We’ll start out by using the `Default` dataset, which comes with the `ISLR` package. We’ll then extend some of what we learn on this dataset to one of my own datasets, which involves trying to predict whether or not an utterance is a **request** (*request* vs. *non-request*) from a set of seven acoustic features.

To start off, let’s try to model the binary outcome of whether an individual will default on their loan. If you’ve imported the `ISLR` library, the `Default` dataset should be available.

We can get a sense for what the dataset looks like by looking at the first few rows:

`head(Default)`

```
## default student balance income
## 1 No No 729.5265 44361.625
## 2 No Yes 817.1804 12106.135
## 3 No No 1073.5492 31767.139
## 4 No No 529.2506 35704.494
## 5 No No 785.6559 38463.496
## 6 No Yes 919.5885 7491.559
```

And we can get a sense for the distribution of `default` values by looking at it in a table:

`table(Default$default)`

```
##
## No Yes
## 9667 333
```

It’s clearly an unbalanced distribution. Most people in this dataset didn’t default on their loans. We can calculate the proportion by dividing the values in this table by the total number of rows in the dataset:

`table(Default$default) / nrow(Default)`

```
##
## No Yes
## 0.9667 0.0333
```

Note that an alternative way to get this information would be to use several handy functions from the `tidyverse` library. We can *group* our observations by whether or not they defaulted using `group_by`, then count the number and proportion in each cell using some operations in the `summarise` function.

```
df_summary = Default %>%
  group_by(default) %>%
  summarise(count = n(),
            proportion = n() / nrow(Default))
```


`df_summary`

```
## # A tibble: 2 x 3
## default count proportion
## <fct> <int> <dbl>
## 1 No 9667 0.967
## 2 Yes 333 0.0333
```

Note the `%>%` syntax: this is called piping, and is a common way to **chain together** multiple functions in R. It’s a shorthand for saying: use the *output* of this function call (or dataframe) as the *input* to the next function. Personally, I think it makes for nice, readable code, especially if you write each function call on a different line; that makes it clear exactly which transformations you’re applying to the data.
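As a minimal illustration, the piped chain above is equivalent to nesting each function call inside the next, which quickly gets harder to read:

```
# Nested version of the same computation: each function's output
# becomes the first argument of the function wrapped around it
df_summary = summarise(group_by(Default, default),
                       count = n(),
                       proportion = n() / nrow(Default))
```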

Our \(Y\) variable is categorical: `yes` vs. `no`. We can’t use linear regression for a categorical variable.

Theoretically, however, we *could* recode this as a quantitative variable, setting `yes` to `1` and `no` to `0`. In fact, we could even do this for variables with >2 levels, like recoding `noun/verb/adjective` as `0/1/2`. But there are a couple of problems with this approach:

Recoding your \(Y\) variable as quantitative imposes an *ordering* on that variable. If there are only two levels, this is less problematic. But if your \(Y\) variable has more than two levels (e.g., `noun/verb/adjective`), then the ordering you choose will greatly affect the slope of your regression line. Given that nominal variables have no intrinsic ordering by definition, this makes linear regression unsuitable for predicting \(Y\) variables with >2 classes.
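To make this concrete, here is a small sketch with made-up data: recoding the same three-level nominal outcome under two different (equally arbitrary) numeric orderings changes the fitted slope.

```
# Hypothetical toy data: a nominal three-level outcome
x = 1:6
pos = c("noun", "verb", "adjective", "noun", "verb", "adjective")

# Two arbitrary numeric codings of the same outcome
y1 = c(noun = 0, verb = 1, adjective = 2)[pos]
y2 = c(adjective = 0, noun = 1, verb = 2)[pos]

# The slope of the "same" regression differs under each coding
coef(lm(y1 ~ x))["x"]
coef(lm(y2 ~ x))["x"]
```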

As noted above, this is less of a problem for \(Y\) variables with only two levels. We can simply impose a decision threshold, e.g., “if \(\hat{Y} > .5\), predict `yes`, otherwise `no`”.
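A sketch of such a threshold rule, applied to some hypothetical fitted values `y_hat`:

```
# Hypothetical fitted values from a linear model
y_hat = c(-0.04, 0.12, 0.61, 0.97)

# Apply a .5 decision threshold to turn them into class labels
predicted_class = ifelse(y_hat > .5, "yes", "no")
predicted_class
```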

Even for binary variables, another serious problem is that linear regression will produce estimates for \(\hat{Y}\) that fall outside our \([0, 1]\) range. Linear regression fits a *line*, and lines will extend past \([0, 1]\) at some constant rate (i.e., the slope).

Sometimes this is okay, depending on our objective. But it certainly makes the values for \(\hat{Y}\) very crude as probability estimates, since many values will be less than 0 or greater than 1.

We can demonstrate this empirically on the `Default` dataset. First, let’s recode our `default` variable as a numeric variable^{2}:

```
# Recode as numeric
Default = Default %>%
  mutate(
    default_numeric = case_when(
      default == "Yes" ~ 1,
      default == "No" ~ 0
    )
  )
```

The `mean` of our new `default_numeric` variable should match the proportion of individuals who defaulted on their loan, which we calculated earlier:

`mean(Default$default_numeric)`

`## [1] 0.0333`

We can then build a linear model using `lm`, predicting `default_numeric` from one or more of our possible predictors. Let’s start out just using `balance`:

```
simple_linear_model = lm(data = Default,
                         default_numeric ~ balance)
summary(simple_linear_model)
```

```
##
## Call:
## lm(formula = default_numeric ~ balance, data = Default)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.23533 -0.06939 -0.02628 0.02004 0.99046
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.519e-02 3.354e-03 -22.42 <2e-16 ***
## balance 1.299e-04 3.475e-06 37.37 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1681 on 9998 degrees of freedom
## Multiple R-squared: 0.1226, Adjusted R-squared: 0.1225
## F-statistic: 1397 on 1 and 9998 DF, p-value: < 2.2e-16
```

We see that `balance` has a positive coefficient, meaning that individuals are more likely to default when they have a higher balance. We can plot the relationship like so:

```
Default %>%
  ggplot(aes(x = balance,
             y = default_numeric)) +
  geom_point(alpha = .4) +
  geom_smooth(method = "lm") +
  labs(y = "Default (1 = yes, 0 = no)",
       title = "Default outcome by balance") +
  theme_minimal()
```
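The coefficients above also let us see the out-of-range problem directly: because the intercept is negative (about -0.075), the fitted line predicts a negative “probability” for anyone with a balance of zero. A quick check with `predict`:

```
# Predicted values at a few balances; the prediction at balance = 0
# is just the (negative) intercept
predict(simple_linear_model,
        newdata = data.frame(balance = c(0, 1000, 2000)))
```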