“Tidy datasets are easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table.”
Let’s address each of these in turn:
Each variable is a column.
Each observation is a row.
Each type of observational unit is a table.
Note💭 Check-in
With a partner, discuss what each of these criteria might mean.
Each variable is a column
A variable contains values that measure the same attribute across units.
An observation contains all values measured on the same unit across attributes.
One row = one complete case or measurement.
No splitting observations across multiple rows.
Untidy example:

| state      | measure         | value |
|------------|-----------------|-------|
| California | population      | 39.5  |
| California | electoral_votes | 54    |
| Texas      | population      | 31    |
| Texas      | electoral_votes | 40    |
Note💭 Check-in
Why is this not tidy? How would you make it tidy?
Each observation is a row
population and electoral_votes are separate measures for the same case (observation), so they should be in the same row.
Tidy version:

| state      | population_millions | electoral_votes |
|------------|---------------------|-----------------|
| California | 39.5                | 54              |
| Texas      | 31.0                | 40              |
Each type of observational unit is a table
Different types of things being measured should be in separate tables.
Example: Record student test scores in one table, and school-level information in another.
Example 2: Record monthly temperatures for cities in one table, and city altitude in another.
This avoids repeating information (like school name for every student).
Once we’re ready, we can join these tables.
Note💭 Check-in
Can you think of other examples of “observational units” you would want to keep in separate tables?
Tidy data revisited
“Tidy datasets are easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table.” (Wickham, 2014)
These guidelines aren’t arbitrary.
A standard tidy structure is helpful for visualizing and analyzing data.
Note
In fact, I often conceptualize “tidy” data in terms of:
What are the axes in a plot I want to make? (These should be separate columns.)
What are the terms in a regression formula I want to write? (These should be separate columns.)
Welcome to the tidyverse
“The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.”
Fundamentally, the tidyverse rests on the philosophy of tidy data.
Includes a range of functions, from importing data (read_csv) to merging datasets (inner_join).
Note💭 Installation Help
Install the tidyverse using install.packages("tidyverse").
Then, load it using library(tidyverse).
Loading the tidyverse
### Loading the tidyverse
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
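The output below comes from reading in a dataset of word concreteness ratings (df_conc). The read_csv call itself isn’t shown above; it was presumably something along these lines (the exact filename is an assumption):

### Reading in the concreteness dataset (filename assumed; not shown in source)
df_conc <- read_csv("https://raw.githubusercontent.com/seantrott/ucsd_css211_datasets/main/main/wrangling/concreteness.csv")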
Rows: 28612 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Word, Dom_Pos
dbl (2): Concreteness, Frequency
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nrow(df_conc)
[1] 28612
Note💭 Check-in
Use some of the functions we’ve discussed (nrow, names, head, etc.) to explore the dataset.
Basics of dplyr
dplyr is a package in the tidyverse that contains functions for transforming and manipulating dataframes.
Each function is a “verb”, for instance:
filter relevant rows.
mutate the dataset by adding new variables.
pull specific columns by name.
rename specific columns.
group_by specific values or factors.
summarize dataset with summary statistics.
The filter verb
filter works by selecting a subset of rows based on some boolean condition.
df_conc %>% ### piping!
  filter(Word == "class")
# A tibble: 1 × 4
Word Concreteness Frequency Dom_Pos
<chr> <dbl> <dbl> <chr>
1 class 3.85 5985 Noun
Note💭 Check-in
Try using filter to return words that:
Have a Concreteness > 4.5.
Have a Dom_Pos of Noun.
Hint: You can chain multiple filter statements together with the %>% operator.
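One possible approach, as a minimal sketch (the two conditions could equivalently go in a single filter call):

### Chain filter calls with the %>% operator
df_conc %>%
  filter(Concreteness > 4.5) %>%
  filter(Dom_Pos == "Noun")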
mutate your data
mutate creates a new variable (or overwrites an existing one) by applying some transformation to an existing variable or set of variables.
Let’s create a log frequency variable, since Frequency is highly right-skewed.
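The mutate call itself isn’t shown above, but given the log_freq column that appears in the correlation matrix later, it was presumably something like the following (the log base is an assumption):

### Create a log frequency variable (log base assumed)
df_conc <- df_conc %>%
  mutate(log_freq = log10(Frequency))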
Hands-on practice
First, read in the Pokemon dataset (URL in the solution below).
Use group_by to count the number of Pokemon of each Type 1. Which Type 1 is most frequent?
Then, filter the dataset to only include Pokemon of the most frequent type (based on what you learned).
Of this filtered dataset, group_by Legendary status and calculate the mean(Attack).
Hands-on practice (solution)
### Part 1
df_pokemon <- read_csv("https://raw.githubusercontent.com/seantrott/ucsd_css211_datasets/main/main/wrangling/pokemon.csv")
Rows: 800 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Name, Type 1, Type 2
dbl (9): #, Total, HP, Attack, Defense, Sp. Atk, Sp. Def, Speed, Generation
lgl (1): Legendary
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
### Part 2
max_type = df_pokemon %>%
  group_by(`Type 1`) %>%
  summarise(count = n()) %>%
  slice_max(count) %>%
  pull(`Type 1`)

### Parts 3-4
df_pokemon %>%
  filter(`Type 1` == max_type) %>%
  group_by(Legendary) %>%
  summarise(mean_attack = mean(Attack))
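The correlation matrix below is for the concreteness data; the code that produced it isn’t shown in the source, but a sketch along these lines (selecting the numeric columns and passing them to cor) would generate this kind of output:

### Correlation matrix of the numeric columns (a sketch; not shown in source)
df_conc %>%
  select(log_freq, Concreteness, Frequency) %>%
  cor()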
log_freq Concreteness Frequency
log_freq 1.0000000 0.16983740 0.21158008
Concreteness 0.1698374 1.00000000 -0.01501261
Frequency 0.2115801 -0.01501261 1.00000000
Note💭 Check-in
Try building a correlation matrix of the numerical columns from the Pokemon dataset (e.g., Attack, Defense, etc.). Are there any conclusions you can draw?
Part 2: More complex wrangling
Missing data, reshaping, and joining.
Missing values
Real-world data often has missing values, which are treated as NA (“Not Available”) by R.
There are a few relevant issues here:
First, you need to identify the presence of missing values.
Second, you need to address the issue.
Ignore them?
Remove any rows with missing values?
Try to impute their values?
Identifying missing values
We can use the is.na function to check for missing values.
### Reading in dataset
df_titanic <- read_csv("https://raw.githubusercontent.com/seantrott/ucsd_css211_datasets/main/main/wrangling/titanic.csv")
Rows: 891 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Name, Sex, Ticket, Cabin, Embarked
dbl (7): PassengerId, Survived, Pclass, Age, SibSp, Parch, Fare
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
### Check for missing values; returns a vector of boolean values
is.na(df_titanic$Age)[1:5]
[1] FALSE FALSE FALSE FALSE FALSE
Note💭 Check-in
How many missing values are there total in the Age column? What about Survived or Cabin? (Hint: You can sum boolean values like TRUE and FALSE).
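One possible approach, since TRUE sums as 1 and FALSE as 0:

### Count missing values in a single column
sum(is.na(df_titanic$Age))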
Identifying missing values (pt. 2)
We can combine some handy functions (across and everything) to sum up the number of NA values per column.
## Get number of NA across all columns
df_titanic %>%
  summarise(across(everything(), ~ sum(is.na(.))))
# A tibble: 1 × 12
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin
<int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 0 0 0 0 0 177 0 0 0 0 687
# ℹ 1 more variable: Embarked <int>
Note
NA values seem disproportionately located in the Cabin column.
Addressing missing values
Now that we’ve identified missing values, we have a few options:
We could ignore them, i.e., perform analyses as if they don’t exist.
We could remove missing values from our dataset.
We could impute the missing values (guess).
Note💭 Check-in
What do you think are the trade-offs of each approach?
The problem with ignoring NA
Unsurprisingly, ignoring NA values doesn’t mean they go away.
In some cases, analyses or visualizations may “quietly” ignore them.
This is bad because it means you’re not fully aware of what goes into an analysis.
In other cases, it’ll interfere with your summary statistics.
## Mean defaults to NA b/c of missing values
mean(df_titanic$Age)
[1] NA
## Temporary fix is to remove them from this calculation...
mean(df_titanic$Age, na.rm = TRUE)
[1] 29.69912
Removing missing values
Another common approach to NA values is simply to remove them.
We can do this selectively in a given analysis, like below:
## Temporary fix is to remove them from this calculation...
mean(df_titanic$Age, na.rm = TRUE)
[1] 29.69912
Or we can actually filter our dataset to remove them entirely (or use drop_na).
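For example, either of the following (a sketch using the Age column) removes the affected rows:

### Remove rows where Age is missing
df_titanic %>% filter(!is.na(Age))
### Equivalent, using tidyr's drop_na
df_titanic %>% drop_na(Age)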
Note💭 Check-in
What is the relative rate of NA values in Age across levels of Survived?
Imputing missing values
Imputing values means inferring them based on some other set of properties or assumptions, i.e., making an “educated guess.”
Can set to mean or median value (e.g., mean(Age)).
Can “guess” value based on some other property (e.g., mean(Age) for that Pclass).
df_titanic = df_titanic %>%
  group_by(Pclass) %>% ### Group by passenger class
  mutate(Age_imputed = ifelse(is.na(Age),
                              mean(Age, na.rm = TRUE), ### Replace with mean for that Pclass
                              Age)) %>% ### otherwise just use original age
  ungroup()

df_titanic %>%
  group_by(Pclass) %>%
  summarise(mean_age = mean(Age_imputed))
Note💭 Check-in
Consider a “wide” table, df_messy_1, with one row per treatment and one column per person:

| treatment | john | mary | jane |
|-----------|------|------|------|
| a         | 10   | 20   | 5    |
| b         | 11   | 25   | 10   |

Why is this considered “messy”? And what would a “long” version look like?
pivot_longer: from wide to long
We can use the pivot_longer function to transform this into a long format.
df_messy_1 %>%
  pivot_longer(cols = c(john, mary, jane), ### which columns to pivot
               names_to = 'name',          ### what to call column with keys
               values_to = 'result')       ### what to call column with values
# A tibble: 6 × 3
treatment name result
<chr> <chr> <dbl>
1 a john 10
2 a mary 20
3 a jane 5
4 b john 11
5 b mary 25
6 b jane 10
Crucial argument is cols (which columns to pivot).
names_to and values_to are helpful for making the resulting table more readable.
### Part 1
df_work <- read_csv("https://raw.githubusercontent.com/seantrott/ucsd_css211_datasets/main/main/wrangling/missing_work.csv")
Rows: 110 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (7): Year, Child care problems, Maternity or paternity leave, Other fami...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(df_work, 3)
# A tibble: 3 × 7
Year `Child care problems` Maternity or paternity le…¹ Other family or pers…²
<dbl> <dbl> <dbl> <dbl>
1 2012 18 313 246
2 2012 35 278 230
3 2012 13 245 246
# ℹ abbreviated names: ¹`Maternity or paternity leave`,
# ²`Other family or personal obligations`
# ℹ 3 more variables: `Illness or injury` <dbl>, Vacation <dbl>, Month <dbl>
Note💭 Check-in
Right now, the table stores different reasons for missing work as different columns.
Instead, let’s pivot_longer so the table has a column for Reason (for missing) and a column for Count (number of days missed).
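One possible solution, as a sketch (column names taken from the output above):

### Pivot all the reason columns into long format, keeping Year and Month as identifiers
df_work %>%
  pivot_longer(cols = -c(Year, Month),
               names_to = "Reason",
               values_to = "Count")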
Other handy tidyr functions
Although pivoting is probably the main functionality tidyr enables, the package also contains other valuable reshaping functions:
separate “splits” values within a given column into multiple columns.
unite does the opposite.
table3 %>%
  separate(rate,                            ### column to separate
           into = c("cases", "population"), ### columns to split into
           convert = TRUE,                  ### convert into numeric
           sep = "/")                       ### separator currently separating values
# A tibble: 6 × 4
country year cases population
<chr> <dbl> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
Joining data
Joining is an operation in which columns from data Y are added to data X, matching observations based on shared variables (or “keys”).
In real-world analysis, data is often stored across multiple sources.
Joining brings together related information from separate tables.
Table 1: State election results (state, biden_pct, turnout)
Table 2: State demographics (state, median_income, college_pct)
Here, we might join the tables based on the overlapping state variable.
Goal: Analyze how median_income and college_pct relate to voting patterns.
Types of joins
Given tables X and Y, you have several options for how to join them.
inner_join: only keep observations from X with matching key in Y.
left_join: keep all observations from X, join matching ones from Y.
right_join: keep all observations from Y, join matching ones from X.
full_join: keep all observations in both X and Y.
Simple join example
Suppose we have two tables: table4a contains information about the cases in each country (with different columns for different years), and table4b contains information about the population of each country (again, with different columns for different years).
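The code that produced the joined table below isn’t shown in the source; one way to get there (a sketch using tidyr’s built-in table4a and table4b) is to pivot each table into long format and then join on the shared keys:

### Pivot each table to long format, then join on country and year
tidy4a <- table4a %>%
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
tidy4b <- table4b %>%
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "population")
left_join(tidy4a, tidy4b)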
# A tibble: 6 × 4
country year cases population
<chr> <chr> <dbl> <dbl>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
Note💭 Check-in
In this case, would it matter whether we use left_join or inner_join? If not, under what circumstances would this distinction matter?
A more complex example
We’ve already loaded information about word concreteness. Now let’s load another dataset about the average age at which words are learned (age of acquisition, or AoA), and combine it with the concreteness dataset.
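The read_csv call itself isn’t shown above; it was presumably something along these lines (the exact filename is an assumption):

### Reading in the AoA dataset (filename assumed; not shown in source)
df_aoa <- read_csv("https://raw.githubusercontent.com/seantrott/ucsd_css211_datasets/main/main/wrangling/aoa.csv")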
Rows: 31124 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Word
dbl (1): AoA
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nrow(df_aoa)
[1] 31124
head(df_aoa)
# A tibble: 6 × 2
Word AoA
<chr> <dbl>
1 a 2.89
2 aardvark 9.89
3 abacus 8.69
4 abalone 12.2
5 abandon 8.32
6 abandoner 11.9
Note💭 Check-in
How would you join this dataset with the concreteness dataset? What kind of join operation might you want to use?
A more complex example (pt. 2)
The concreteness and AoA datasets are not fully overlapping: Concreteness contains words not in AoA, and vice versa.
### Inner join (keep only intersection)
df_inner = df_conc %>% inner_join(df_aoa)
Joining with `by = join_by(Word)`
nrow(df_inner)
[1] 23568
### Left join (keep all concreteness)
df_left = df_conc %>% left_join(df_aoa)
Joining with `by = join_by(Word)`
nrow(df_left)
[1] 28612
### Right join (keep all AoA)
df_right = df_conc %>% right_join(df_aoa)
Joining with `by = join_by(Word)`
nrow(df_right)
[1] 31124
### Full join (keep everything)
df_full = df_conc %>% full_join(df_aoa)
Joining with `by = join_by(Word)`
nrow(df_full)
[1] 36168
Joining: the details
When joining, there are a few other details we’ve glossed over:
by: this specifies which keys to join on (will default to all overlapping).
multiple: how to handle cases where rows from X match multiple rows from Y.
unmatched: how to handle cases where rows would be dropped.
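For instance, the join key can be specified explicitly rather than relying on the default:

### Specify the join key explicitly (equivalent to the default behavior here)
df_conc %>% inner_join(df_aoa, by = join_by(Word))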
Putting it all together
These lectures have covered the basics of data wrangling with the tidyverse.
A typical workflow will involve:
Reading data (read_csv).
Exploring the data (is.na, summary, summarise).
Cleaning the data (mutate, filter).
Reshaping if needed (pivot_longer/wider).
Joining if needed (inner_join, etc.).
Note
The end result is a dataset ready for further analysis and visualization (coming up!).