Week 4: Regression

Introduction

The goal of this hands-on exercise is to familiarize you with building and interpreting regression models in R.

We’ll be working (again) with the CLEAR Corpus, a dataset of text excerpts rated for their readability (Crossley et al., 2021).

Load dataset

To get started, let’s load the dataset.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggcorrplot)

### CLEAR corpus
df_clear <- read_csv("https://raw.githubusercontent.com/seantrott/ucsd_css211_datasets/main/main/viz/CLEAR_corpus_final.csv")
Rows: 4724 Columns: 28
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (11): Author, Title, Anthology, URL, Categ, Sub Cat, Lexile Band, Locati...
dbl (17): ID, Pub Year, MPAA #Max, MPAA# Avg, Google WC, Sentence Count, Par...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

We’ve already explored this dataset in past weeks, so we can jump straight to modeling. Our goal is to figure out which predictors are most useful for predicting human judgments of readability (BT_easiness), and how they relate to readability.

Part 1: Modeling with continuous predictors

In this first section, we’ll try to model BT_easiness with continuous predictors, i.e., numeric variables.

Exercise 1: Does Sentence Count predict readability?

Build a linear model predicting BT_easiness from Sentence Count. Call it m_sentence.

  • What are the intercept and slope? Interpret both.
  • Is the slope coefficient significant? What does that mean?
  • What is the \(R^2\) of the model? Interpret what it means.
  • Visualize the relationship in a scatterplot and draw the regression line on top of it.
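The workflow in these steps can be sketched on simulated data. This is a minimal sketch, not a solution: df_toy, m_toy, p_toy, x, and y are hypothetical names, and the numbers are made up rather than taken from the CLEAR Corpus.

```r
library(ggplot2)

# Toy data (NOT the CLEAR Corpus): y depends linearly on x, plus noise
set.seed(1)
df_toy <- data.frame(x = runif(200, 0, 10))
df_toy$y <- 2 - 0.5 * df_toy$x + rnorm(200)

# Fit the linear model; summary() reports coefficients, p-values, and R-squared
m_toy <- lm(y ~ x, data = df_toy)
summary(m_toy)

# Scatterplot with the fitted regression line drawn on top
p_toy <- ggplot(df_toy, aes(x = x, y = y)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x) +
  theme_minimal()
p_toy
```

When adapting this to df_clear, note that column names containing spaces must be wrapped in backticks inside a formula, e.g. lm(BT_easiness ~ `Sentence Count`, data = df_clear).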

Exercise 2: Does Paragraphs predict readability?

Build a linear model predicting BT_easiness from Paragraphs. Call it m_paragraphs.

  • What are the intercept and slope? Interpret both.
  • Is the slope coefficient significant? What does that mean?
  • What is the \(R^2\) of the model? Interpret what it means.
  • Visualize the relationship in a scatterplot and draw the regression line on top of it.
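Rather than reading values off the printed summary, the intercept, slope, and \(R^2\) can also be pulled out of a fitted model as plain numbers. The sketch below uses simulated data; df_toy, m_toy, and the variable names are assumptions for illustration only.

```r
# Toy data (NOT the CLEAR Corpus)
set.seed(2)
df_toy <- data.frame(x = rnorm(150))
df_toy$y <- 1 + 0.3 * df_toy$x + rnorm(150)

m_toy <- lm(y ~ x, data = df_toy)

# coef() returns a named vector of estimates
coefs <- coef(m_toy)
intercept <- coefs[["(Intercept)"]]  # predicted y when x is 0
slope <- coefs[["x"]]                # predicted change in y per 1-unit change in x

# R-squared: proportion of variance in y accounted for by the model
r_squared <- summary(m_toy)$r.squared

# With a single continuous predictor, R-squared equals the squared correlation
all.equal(r_squared, cor(df_toy$x, df_toy$y)^2)
```

Storing these values in variables makes it easier to collect them across several models later.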

Exercise 3: Does Flesch-Kincaid-Grade-Level predict readability?

Build a linear model predicting BT_easiness from Flesch-Kincaid-Grade-Level. Call it m_flesch.

  • What are the intercept and slope? Interpret both.
  • Is the slope coefficient significant? What does that mean?
  • What is the \(R^2\) of the model? Interpret what it means.
  • Visualize the relationship in a scatterplot and draw the regression line on top of it.
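One way to check significance is to look at the coefficient table directly, along with a confidence interval for the slope. The following is a sketch on simulated data with hypothetical names (df_toy, m_toy, p_slope, ci_slope), not output from the CLEAR Corpus.

```r
# Toy data (NOT the CLEAR Corpus)
set.seed(3)
df_toy <- data.frame(x = rnorm(100))
df_toy$y <- 0.5 * df_toy$x + rnorm(100)

m_toy <- lm(y ~ x, data = df_toy)

# Matrix with columns: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- summary(m_toy)$coefficients
coef_table

# p-value for the slope: tests the null hypothesis that the true slope is 0
p_slope <- coef_table["x", "Pr(>|t|)"]

# 95% confidence interval for the slope
ci_slope <- confint(m_toy, "x", level = 0.95)
ci_slope
```

The two views agree: the 95% interval excludes zero exactly when the p-value is below 0.05.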

Part 2: Modeling with categorical predictors

Now we’ll try to model BT_easiness with categorical predictors, i.e., with discrete levels.

Exercise 4: Does Categ predict readability?

Some texts are categorized as Info and others as Lit. Does this difference covary with differences in readability? Build a model to find out. Call it m_categ.

  • What are the intercept and slope? Interpret both.
  • Is the slope coefficient significant? What does that mean?
  • What is the \(R^2\) of the model? Interpret what it means.

Question: Would you visualize this in a scatterplot? Or a different kind of plot?
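To see what the intercept and slope mean for a two-level predictor, it can help to compare them against the group means. The sketch below assumes R’s default treatment (dummy) coding and uses simulated data; df_toy, group, and the levels "A" and "B" are hypothetical.

```r
# Toy data (NOT the CLEAR Corpus): two groups with different means
set.seed(4)
df_toy <- data.frame(group = rep(c("A", "B"), each = 50))
df_toy$y <- ifelse(df_toy$group == "A", 0, 1) + rnorm(100)

# lm() treats a character column as a factor; the alphabetically first
# level ("A") becomes the reference level under the default coding
m_toy <- lm(y ~ group, data = df_toy)
coef(m_toy)  # named "(Intercept)" and "groupB"

# Compare against the raw group means
group_means <- tapply(df_toy$y, df_toy$group, mean)
group_means
```

Here the intercept matches the mean of the reference group, and the groupB coefficient matches the difference between the two group means.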

Exercise 5: Does Location predict readability?

Does the Location in a text from which an excerpt is drawn predict readability? Build a model to find out.

  • What are the intercept and slope(s)? Interpret each.
  • Is each slope coefficient significant? What does that mean?
  • What is the \(R^2\) of the model? Interpret what it means.

Question: How is this model similar to and different from the model with Categ as a predictor? Focus on the number of distinct slope coefficients.
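For a predictor with more than two levels, the number of coefficients depends on the number of levels and on which level serves as the reference. This sketch uses a simulated three-level factor; the names df_toy, level, and the levels "a", "b", "c" are hypothetical and are not the values of Location.

```r
# Toy data (NOT the CLEAR Corpus): a predictor with three levels
set.seed(5)
df_toy <- data.frame(level = rep(c("a", "b", "c"), each = 40))
df_toy$y <- rnorm(120)

# Default coding: reference level is "a" (alphabetically first)
m_toy <- lm(y ~ level, data = df_toy)
names(coef(m_toy))  # intercept plus one coefficient per non-reference level

# Changing the reference level changes what each coefficient compares against
df_toy$level2 <- relevel(factor(df_toy$level), ref = "c")
m_toy2 <- lm(y ~ level2, data = df_toy)
names(coef(m_toy2))

# Overall fit is unchanged by the choice of reference level
c(summary(m_toy)$r.squared, summary(m_toy2)$r.squared)
```

Each slope is a contrast between one level and the reference, so interpreting a coefficient requires knowing which level is the reference.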

Part 3: Comparing models

Now we’ll compare the models we’ve built. Create a dataframe with two columns:

  • The name of the model.
  • The r_squared of the model.

Then construct a barplot showing the model on the x-axis and the r-squared on the y-axis. Make sure to reorder the models from lowest to highest r-squared.
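One possible way to assemble such a comparison is sketched below with two simulated models. The names df_toy, models, m_x1, m_x2, df_r2, and p_r2 are hypothetical; the same pattern would apply to a list of the models built in the exercises above.

```r
library(ggplot2)

# Toy data (NOT the CLEAR Corpus): x1 is a stronger predictor than x2
set.seed(6)
df_toy <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
df_toy$y <- 0.8 * df_toy$x1 + 0.2 * df_toy$x2 + rnorm(100)

# Keep the fitted models in a named list
models <- list(
  m_x1 = lm(y ~ x1, data = df_toy),
  m_x2 = lm(y ~ x2, data = df_toy)
)

# One row per model: its name and its R-squared
df_r2 <- data.frame(
  model = names(models),
  r_squared = sapply(models, function(m) summary(m)$r.squared)
)

# reorder() sorts the bars by r_squared, from lowest to highest
p_r2 <- ggplot(df_r2, aes(x = reorder(model, r_squared), y = r_squared)) +
  geom_col() +
  labs(x = "Model", y = "R-squared") +
  theme_minimal()
p_r2
```

Keeping the models in a named list means adding another model to the comparison only requires one more list entry.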