The goal of this hands-on exercise is to familiarize you with the basics of data wrangling and descriptive analysis. It will be a “semi-guided session”, with periods for independent (or collaborative) work.
Specific topics and skills:
Familiarity with tools for describing data, e.g., group_by and summarise, as well as simple plots like hist.
Experience transforming or manipulating data using filter and mutate.
Experience reshaping and merging datasets using pivot_longer and various joins.
The datasets
We’ll be working with a few Linguistics datasets:
A dataset about word concreteness, which you’ve already seen in class.
A dataset about word iconicity (the extent to which words sound like what they mean).
To get started, let’s load the datasets.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
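The read_csv calls that produce the messages below aren’t shown here. As a rough sketch, assuming the files are named concreteness.csv and iconicity.csv (hypothetical paths) and that we store them as df_concreteness and df_iconicity:

df_concreteness <- read_csv("concreteness.csv")   # word concreteness ratings
df_iconicity <- read_csv("iconicity.csv")         # word iconicity ratings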
Rows: 28612 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Word, Dom_Pos
dbl (2): Concreteness, Frequency
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 14774 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): word
dbl (5): n_ratings, n, prop_known, rating, rating_sd
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Exercise 1: Understand the structure of your data
First, use functions like nrow, colnames, head, and more to understand the structure and content of each dataset.
Try to answer the following questions:
What does each column mean?
Would you say the datasets are in wide or long format?
Do the datasets have any overlapping columns?
How many words are in each dataset? Are they the same words?
What’s the range of values for iconicity and concreteness ratings?
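If you’re not sure where to start, here is a minimal sketch (using the df_concreteness and df_iconicity names assumed above):

nrow(df_concreteness)                 # how many rows (words)?
colnames(df_iconicity)                # what columns does each dataset have?
head(df_concreteness)                 # peek at the first few rows
range(df_iconicity$rating)            # range of iconicity ratings
range(df_concreteness$Concreteness)   # range of concreteness ratings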
Exercise 2: Join your data
Use a join operation to merge the concreteness and iconicity datasets on a common column name. (Hint: You might need to use mutate to create a new column in at least one of the datasets.)
What are the consequences of using left/right/full/inner_join? Which join did you use, and what does the resulting table look like in terms of lost (or gained) rows?
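One possible approach, assuming the only mismatch is that one dataset calls its key column Word and the other calls it word (a sketch, not the only reasonable choice of join):

df_joined <- df_concreteness %>%
  mutate(word = Word) %>%                 # create a matching key column named word
  inner_join(df_iconicity, by = "word")   # try swapping in left_join, full_join, etc.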
Exercise 3: Explore your data
Now that you’ve merged your datasets, let’s conduct some initial explorations.
3a: How many ratings?
The iconicity dataset records the number of ratings each word received (n_ratings), and some words were rated more times than others.
What is the range of n_ratings?
Which word was rated the most times? What about the least?
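A possible starting point, assuming the merged data frame is called df_joined as in the sketch above:

range(df_joined$n_ratings)                  # smallest and largest number of ratings
df_joined %>% slice_max(n_ratings, n = 1)   # word(s) rated the most times
df_joined %>% slice_min(n_ratings, n = 1)   # word(s) rated the fewest times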
3b: Which words are most and least iconic?
Which word (or words) has the highest iconicity rating in the merged dataset? What about the word(s) with the lowest rating? Do these results make sense to you? (Hint: Use slice_max and slice_min.)
What about the top 5 iconic words and the bottom 5 iconic words? (Hint: Use arrange and slice_head.)
Can you draw any conclusions from this preliminary exploration about which kinds of words are most and least iconic?
Are the extreme words the same before and after merging?
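One way to approach this, again assuming a merged data frame called df_joined:

df_joined %>% slice_max(rating, n = 1)                      # most iconic word(s)
df_joined %>% slice_min(rating, n = 1)                      # least iconic word(s)
df_joined %>% arrange(desc(rating)) %>% slice_head(n = 5)   # top 5 most iconic
df_joined %>% arrange(rating) %>% slice_head(n = 5)         # bottom 5 least iconic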
3c: Which words have the most and least variance in their iconicity ratings?
Repeat exercise 3b, but for rating_sd instead of rating.
3d: Which words are most and least frequent?
Repeat exercise 3b, but now for Frequency. Which words are the most and least frequent in the joined dataset?
3e: Which words are most and least concrete?
Repeat exercise 3b, but now for Concreteness. Which words are the most and least concrete in the joined dataset?
Exercise 4: Grouping and summarizing
The dataset also includes information about each word’s part-of-speech. Let’s use that information to learn more about our data.
4a: Which part-of-speech is most and least iconic?
Use group_by and summarise to calculate the mean iconicity for each part of speech.
Which part-of-speech has the highest iconicity on average, and which has the least?
Is this broadly consistent with your qualitative inspection of the most and least iconic words earlier?
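A sketch of one way to do this, assuming df_joined and that Dom_Pos holds the part-of-speech labels:

df_joined %>%
  group_by(Dom_Pos) %>%
  summarise(mean_iconicity = mean(rating, na.rm = TRUE)) %>%
  arrange(desc(mean_iconicity))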
4b: Which part-of-speech is most and least frequent?
Use group_by and summarise to calculate the mean frequency for each part of speech.
Which part-of-speech has the highest frequency on average, and which has the least?
Do you think these estimates might be affected by the shape of the Frequency distribution? If so, recalculate the above using a log-transformed measure.
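A sketch, again assuming df_joined; the log10(Frequency + 1) transform is just one option (the + 1 guards against zero counts, if any):

df_joined %>%
  group_by(Dom_Pos) %>%
  summarise(mean_freq = mean(Frequency, na.rm = TRUE),
            mean_log_freq = mean(log10(Frequency + 1), na.rm = TRUE))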
Exercise 5: Correlations between our variables
5a: A correlation matrix
Now, create a correlation matrix between n_ratings, rating, rating_sd, Concreteness, and Frequency. You may need to use select to pull out just those columns first.
What are the strongest correlations?
Are more concrete words also more iconic?
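One way to build the matrix, assuming df_joined (pairwise.complete.obs simply drops missing values pair by pair):

df_joined %>%
  select(n_ratings, rating, rating_sd, Concreteness, Frequency) %>%
  cor(use = "pairwise.complete.obs")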
5b: Does the strength of correlation depend on part-of-speech?
Focusing specifically on the relationship between Concreteness and rating, calculate the correlation between these variables for each part-of-speech.
Which parts-of-speech show the highest correlation? Which show the lowest?
Compare these values to the count of observations for each part of speech.
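A possible sketch, assuming df_joined:

df_joined %>%
  group_by(Dom_Pos) %>%
  summarise(r = cor(Concreteness, rating, use = "complete.obs"),
            n_words = n()) %>%
  arrange(desc(r))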
5c: Concrete vs. abstract words
Now, create a binary variable called Concrete that categorizes words with Concreteness > 2.5 as “Concrete” and words with Concreteness <= 2.5 as “Abstract”.
Then, calculate the mean iconicity for words categorized as Concrete vs. Abstract.
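One way to do this, assuming df_joined:

df_joined %>%
  mutate(Concrete = if_else(Concreteness > 2.5, "Concrete", "Abstract")) %>%
  group_by(Concrete) %>%
  summarise(mean_iconicity = mean(rating, na.rm = TRUE))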
Exercise 6: Free exploration!
Try to come up with at least one additional question to ask about your data.
If necessary, that could include merging with additional datasets (e.g., the AoA dataset used in class).
Alternatively, you could try to build a regression model using lm to model the relationships between the variables more directly.
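As a hypothetical starting point for the regression route, assuming df_joined:

# model iconicity as a function of concreteness and (log) frequency
model <- lm(rating ~ Concreteness + log10(Frequency + 1), data = df_joined)
summary(model)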