The goal of this lab is to familiarize you with building and interpreting regression models in R.
You’ll be:
Calculating the error associated with different \(\beta\) values.
Fitting lm models to actual data.
Plotting and interpreting the coefficients of those models.
We’ll be working (again) with several datasets measuring properties of English words:
A dataset of concreteness and frequency information.
A dataset of age of acquisition information.
A dataset of response times to individual words.
We’re interested in how the accessibility of a word (as measured by response time and accuracy) is predicted by various other features, such as its frequency, concreteness, and age of acquisition. We’ll be building regression models to help us tease apart these effects.
Load datasets
To get started, let’s load the datasets.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Rows: 28612 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Word, Dom_Pos
dbl (2): Concreteness, Frequency
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 31124 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Word
dbl (1): AoA
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nrow(df_aoa)
[1] 31124
### Dataset with response times and accuracy
df_blp <- read_csv("https://raw.githubusercontent.com/seantrott/ucsd_css211_datasets/main/main/wrangling/blp.csv")
Rows: 55867 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): spelling, lexicality
dbl (6): rt, zscore, accuracy, rt.sd, zscore.sd, accuracy.sd
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nrow(df_blp)
[1] 55867
Part 0: Wrangling
Exercise 1: Join the datasets
To get started, let’s join these datasets on a common column. You might need to add a new column to the BLP data to make that work. You might also want to drop any NA rows.
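One possible approach is sketched below. It assumes the concreteness and age-of-acquisition data frames are called df_concreteness and df_aoa (the names are not shown in the output above, so adjust to match your own). The BLP data stores its word column as spelling, so we copy it to Word before joining.

```r
library(tidyverse)

# Sketch: align column names, then join on "Word" and drop NA rows.
# df_concreteness and df_aoa are assumed names for the first two datasets.
df_merged <- df_blp |>
  mutate(Word = spelling) |>
  inner_join(df_concreteness, by = "Word") |>
  inner_join(df_aoa, by = "Word") |>
  drop_na()
```

An inner join keeps only words present in all three datasets, which is usually what you want here; a left join would instead preserve all BLP rows and introduce NAs.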
Part 1: Building and interpreting simple models
Now that we’ve joined the datasets, we can start building regression models to help us predict response times and accuracy.
Exercise 2: A concreteness advantage? (pt. 1)
Some research on word processing has argued for a “concreteness advantage”: i.e., the idea that concrete words are easier to process. Here, “processing ease” is often measured in terms of response time (faster = easier) and also accuracy (more accurate = easier).
Create a scatterplot showing the relationship between rt and Concreteness.
Build a regression model predicting rt from Concreteness.
What are the coefficients? Interpret them.
Are any coefficients significant? What does it mean if they are?
What is the overall model fit (i.e., \(R^2\))?
Finally, are these results consistent with a concreteness advantage?
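The steps above can be sketched as follows (assuming the joined data frame from Exercise 1 is called df_merged):

```r
# Scatterplot of response time against concreteness.
ggplot(df_merged, aes(x = Concreteness, y = rt)) +
  geom_point(alpha = 0.1)

# Simple linear regression; summary() reports the coefficients,
# their p-values, and the R-squared.
mod_conc <- lm(rt ~ Concreteness, data = df_merged)
summary(mod_conc)
```

A negative Concreteness coefficient would mean that more concrete words have faster response times, which is the pattern a concreteness advantage predicts.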
Exercise 3: A concreteness advantage? (pt. 2)
Now replicate Exercise 2, but using accuracy as a dependent measure instead of rt. Do you see a similar pattern of results?
Exercise 4: What about raw frequency?
Another well-attested finding in psycholinguistics is that frequent words are easier to process. Is that true of these data too?
Create a scatterplot showing the relationship between rt and Frequency.
Build an lm model predicting rt from raw Frequency. Interpret the coefficients.
What is the \(R^2\) of this model? How does it compare to the concreteness model from exercise 2?
Exercise 5: What about log frequency?
Frequency is a right-skewed variable (if you’re curious, make a histogram of Frequency to illustrate what this means!). Skewed data can sometimes influence regression coefficients or model fit. In some cases, researchers address this by log-transforming frequency.
Apply a log transformation to Frequency, creating a new variable called log_freq.
Redo Exercise 4 using log_freq instead. What differences do you observe?
How does having a log\(X\) variable change our interpretation of the coefficients?
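A sketch of the transformation (again assuming df_merged from Exercise 1). Note that if any word has a Frequency of 0, log(0) is -Inf; a common workaround is to add 1 before taking the log.

```r
# Sketch: create a log-transformed frequency variable and refit.
df_merged <- df_merged |>
  mutate(log_freq = log(Frequency + 1))

mod_log_freq <- lm(rt ~ log_freq, data = df_merged)
summary(mod_log_freq)
```

With a logged predictor, the coefficient reflects the change in rt associated with a multiplicative (rather than additive) change in frequency.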
Exercise 6: An age of acquisition advantage?
Another common argument is that words that are learned earlier are easier to process, perhaps because they’ve been stored in the lexical network for longer. Is that true of these data?
Redo exercises 2-3 using AoA instead of Concreteness.
How does the \(R^2\) of this model compare to the models with Concreteness, Frequency, or log_freq?
Part 2: Complex models
One limitation of the models we built in Part 1 is that they’re all univariate models. But many of the predictors (like Concreteness and Age of Acquisition) are actually correlated! That makes it harder to figure out the unique effect of Concreteness, independent from AoA.
We can use multiple regression to get at this question.
Exercise 7: A multivariate model
Build a multiple regression model predicting rt with Concreteness, AoA, and log_freq as predictors.
Interpret each of the coefficients (including the Intercept). How is this different from our interpretation of the coefficients in the univariate models?
Use broom::tidy to store the coefficients in a dataframe. Now, create a visualization of the coefficient values (e.g., a scatterplot with error bars).
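A sketch of the multivariate model and the coefficient plot (df_merged and log_freq assumed from earlier exercises; the 2 × std.error interval is a rough approximation of a 95% confidence interval):

```r
# Multiple regression with three predictors.
mod_full <- lm(rt ~ Concreteness + AoA + log_freq, data = df_merged)

# Store the coefficients in a tidy data frame.
df_coefs <- broom::tidy(mod_full)

# Plot each coefficient estimate with error bars.
ggplot(df_coefs, aes(x = term, y = estimate)) +
  geom_point() +
  geom_errorbar(aes(ymin = estimate - 2 * std.error,
                    ymax = estimate + 2 * std.error),
                width = 0.2)
```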
Exercise 8: Evaluating this model
Interpret the \(R^2\) of the model from Exercise 7. How does this compare to the univariate models?
Extract predictions from the model using the predict function and store them in a new column called rt_pred. Create a scatterplot showing the relationship between rt_pred and rt.
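For example (assuming the model from Exercise 7 is called mod_full):

```r
# Sketch: add fitted values as a column and plot predicted vs. observed.
df_merged <- df_merged |>
  mutate(rt_pred = predict(mod_full))

ggplot(df_merged, aes(x = rt_pred, y = rt)) +
  geom_point(alpha = 0.1) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed")
```

The closer the points cluster around the dashed identity line, the better the model's predictions track the observed response times.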
Exercise 9: Checking for multicollinearity
Finally, let’s check our model for signs of multicollinearity using the vif function. Is there anything to worry about?
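The vif function lives in the car package (an assumption about the intended function; install it with install.packages("car") if needed):

```r
# Sketch: variance inflation factors for each predictor in mod_full.
# Values above roughly 5-10 are often taken as a warning sign.
library(car)
vif(mod_full)
```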
Exercise 10: A multivariate model of accuracy
Replicate Exercises 7-8 but with accuracy as a dependent variable instead of rt.
Submission Instructions
Make sure all your code chunks run without errors
Save this file with your name in the filename (e.g., “Lab1_YourLastName.qmd”)
Render the document to HTML
Submit both the .qmd file and the rendered HTML file to Canvas