Data wrangling with the tidyverse

Data wrangling refers to the process of transforming or manipulating raw data into a useful format for downstream analysis and visualization.
In CSS, you’ll find yourself needing to:

- Import data from files or URLs.
- Filter and transform variables.
- Summarize data by groups.
- Reshape data between wide and long formats.
- Merge multiple datasets.
These are all supported by the tidyverse! But what is tidy data?
Wickham (2014) defines tidy data as follows:
“Tidy datasets are easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table.”
Let’s address each of these in turn:
💭 Check-in
With a partner, discuss what each of these criteria might mean.
A variable contains values that measure the same attribute across units.
Examples: height, age, income, test_score, population.

Untidy example:
| country | gdp_2020 | gdp_2021 | gdp_2022 |
|---|---|---|---|
| USA | 21.35 | 23.368 | 26.01 |
| China | 14.72 | 18.2 | 18.32 |
| Germany | 3.94 | 4.35 | 4.16 |
💭 Check-in
Why is this not tidy? How would you make it tidy?
Now, we have a column for country, year, and gdp.
Tidy example:
| country | year | gdp_trillion |
|---|---|---|
| USA | 2020 | 21.35 |
| USA | 2021 | 23.368 |
| USA | 2022 | 26.01 |
| China | 2020 | 14.72 |
| China | 2021 | 18.2 |
| China | 2022 | 18.32 |
| Germany | 2020 | 3.94 |
| Germany | 2021 | 4.35 |
| Germany | 2022 | 4.16 |
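As a sketch in code (df_gdp is a hypothetical data frame holding the untidy table; pivot_longer is covered later in this lecture):

```r
library(tidyverse)

# Hypothetical wide-format GDP data, matching the untidy table above
df_gdp <- tibble(
  country  = c("USA", "China", "Germany"),
  gdp_2020 = c(21.35, 14.72, 3.94),
  gdp_2021 = c(23.368, 18.2, 4.35),
  gdp_2022 = c(26.01, 18.32, 4.16)
)

# Pivot the year columns into a single year/gdp_trillion pair
df_gdp %>%
  pivot_longer(cols = starts_with("gdp_"),
               names_to = "year",
               names_prefix = "gdp_",
               values_to = "gdp_trillion")
```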
An observation contains all values measured on the same unit across attributes.
Untidy example:
| state | measure | value |
|---|---|---|
| California | population | 39.5 |
| California | electoral_votes | 54 |
| Texas | population | 31 |
| Texas | electoral_votes | 40 |
💭 Check-in
Why is this not tidy? How would you make it tidy?
population and electoral_votes are separate measures for the same case (observation), so they should be in the same row.
Tidy version:
| state | population_millions | electoral_votes |
|---|---|---|
| California | 39.5 | 54 |
| Texas | 31.0 | 40 |
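A sketch of how pivot_wider (covered later) would produce this, assuming df_states holds the untidy version:

```r
# Hypothetical long-format state data, matching the untidy table above
df_states <- tibble(
  state   = c("California", "California", "Texas", "Texas"),
  measure = c("population", "electoral_votes",
              "population", "electoral_votes"),
  value   = c(39.5, 54, 31, 40)
)

# Spread each measure into its own column
df_states %>%
  pivot_wider(names_from = measure, values_from = value)
```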
Different types of things being measured should be in separate tables.
💭 Check-in
Can you think of other examples of “observational units” you would want to keep in separate tables?
“Tidy datasets are easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table.” (Wickham, 2014)
Note
In fact, I often conceptualize “tidy” data in terms of its shape: each row is an observation, and each column is a variable.
The tidyverse

“The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.”
The tidyverse rests on the philosophy of tidy data. It contains functions for everything from importing data (read_csv) to merging datasets (inner_join).

💭 Installation Help
You can install the tidyverse using install.packages("tidyverse"), and load it with library(tidyverse).
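The packages included in the tidyverse can be listed as follows (a sketch; the exact call used in lecture is an assumption):

```r
# List all packages in the tidyverse, including the tidyverse package itself
tidyverse_packages(include_self = TRUE)
```

[1] "broom" "conflicted" "cli" "dbplyr"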
[5] "dplyr" "dtplyr" "forcats" "ggplot2"
[9] "googledrive" "googlesheets4" "haven" "hms"
[13] "httr" "jsonlite" "lubridate" "magrittr"
[17] "modelr" "pillar" "purrr" "ragg"
[21] "readr" "readxl" "reprex" "rlang"
[25] "rstudioapi" "rvest" "stringr" "tibble"
[29] "tidyr" "xml2" "tidyverse"
- Some of these we’ll discuss later in the course, e.g., broom and ggplot2.
- For now, we’ll focus on readr, magrittr, dplyr, and tidyr.

The pipe operator (%>%)

The pipe operator (%>%) lets you chain functions together in a readable way.
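For example (a sketch, using a hypothetical data frame df with a Frequency column):

```r
# Without the pipe: nested calls read inside-out
summary(log10(df$Frequency))

# With the pipe: each step reads left-to-right
df$Frequency %>%
  log10() %>%
  summary()
```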
Importing, filtering, mutating, and summarizing data.
One of the most fundamental things you’ll need to do is import data, e.g., from a .csv file.
read_csv takes as input a filepath (either local or a URL) and returns a tibble.
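A sketch of the call (the filename here is an assumption; the output below is the row count of the concreteness dataset used in lecture):

```r
# Read the concreteness norms and check how many words we have
df_concreteness <- read_csv("data/concreteness.csv")
nrow(df_concreteness)
```

[1] 28612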
💭 Check-in
Use the functions we’ve discussed (nrow, names, head, etc.) to explore the dataset.
dplyr
dplyr is a package in the tidyverse that contains functions for transforming and manipulating dataframes.
Each function is a “verb”, for instance:
- filter relevant rows.
- mutate the dataset by adding new variables.
- pull specific columns by name.
- rename specific columns.
- group_by specific values or factors.
- summarize the dataset with summary statistics.

filter
filter works by selecting a subset of rows based on some boolean condition.
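For example, the output below is consistent with a call like (a sketch):

```r
# Keep only the row(s) where Word is "class"
df_concreteness %>%
  filter(Word == "class")
```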
# A tibble: 1 × 4
Word Concreteness Frequency Dom_Pos
<chr> <dbl> <dbl> <chr>
1 class 3.85 5985 Noun
💭 Check-in
Try using filter to return words that:
- Have a Concreteness > 4.5.
- Have a Dom_Pos of Noun.

Hint: You can chain multiple filter statements together with the %>% operator.
mutate your data
mutate works by creating a new variable (or overwriting an existing one) by applying some transformation to an existing variable or set of variables.
Let’s create a log frequency variable, since Frequency is highly right-skewed.
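A sketch of the call producing the summary below (log10 is an assumption, consistent with the scale of the values):

```r
# Add a log-transformed frequency column
df_concreteness <- df_concreteness %>%
  mutate(log_freq = log10(Frequency))

summary(df_concreteness$log_freq)
```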
Min. 1st Qu. Median Mean 3rd Qu. Max.
-Inf 0.8451 1.4624 -Inf 2.1004 6.3293
💭 Check-in
Why do you think the minimum log_freq value is -Inf?
mutate (pt. 2)

Some words in our dataset have a Frequency of 0. One strategy is to add one to those values before taking the log.
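As a sketch:

```r
# Adding 1 before the log maps Frequency = 0 to log_freq = 0 instead of -Inf
df_concreteness <- df_concreteness %>%
  mutate(log_freq = log10(Frequency + 1))

summary(df_concreteness$log_freq)
```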
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.9031 1.4771 1.5793 2.1038 6.3293
group_by specific factors and summarise

The group_by function is extremely useful for grouping your data by specific factors. You can then %>% this into summarise to calculate summary statistics for each group.
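A sketch of a call consistent with the output below (the filter to two parts of speech is an assumption):

```r
# Mean concreteness by dominant part of speech
df_concreteness %>%
  filter(Dom_Pos %in% c("Adjective", "Adverb")) %>%
  group_by(Dom_Pos) %>%
  summarise(mean_conc = mean(Concreteness))
```

# A tibble: 2 × 2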
Dom_Pos mean_conc
<chr> <dbl>
1 Adjective 2.50
2 Adverb 2.06
💭 Check-in
In addition to calculating mean, try also calculating the sd; additionally, try calculating the mean and sd for log_freq by Dom_Pos.
Often relevant for a group_by calculation is the number of observations per group.
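The n() function returns the number of rows in each group. A sketch consistent with the output below:

```r
# Mean concreteness and number of observations per part of speech
df_concreteness %>%
  filter(Dom_Pos %in% c("Adjective", "Adverb")) %>%
  group_by(Dom_Pos) %>%
  summarise(mean_conc = mean(Concreteness),
            count = n())
```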
# A tibble: 2 × 3
Dom_Pos mean_conc count
<chr> <dbl> <int>
1 Adjective 2.50 6112
2 Adverb 2.06 1876
Now let’s get some hands-on practice applying these new techniques.
1. Load the pokemon.csv dataset from the ucsd_css211_datasets GitHub repository using read_csv.
2. Use group_by to count the number of Pokemon of each Type 1. Which Type 1 is most frequent?
3. filter the dataset to only include Pokemon of the most frequent type (based on what you learned).
4. group_by legendary status and calculate the mean(Attack).

### Part 1
df_pokemon <- read_csv("https://raw.githubusercontent.com/seantrott/ucsd_css211_datasets/main/main/wrangling/pokemon.csv")
### Part 2
max_type = df_pokemon %>%
group_by(`Type 1`) %>%
summarise(count = n()) %>%
slice_max(count) %>%
pull(`Type 1`)
### Parts 3-4
df_pokemon %>%
filter(`Type 1` == max_type) %>%
group_by(Legendary) %>%
summarise(mean_attack = mean(Attack))

# A tibble: 2 × 2
Legendary mean_attack
<lgl> <dbl>
1 FALSE 72.8
2 TRUE 111.
💭 Check-in
Notice any functions we haven’t discussed?
Above, we used a combination of read_csv, filter, group_by, and summarise. But we also used some other new functions:
- slice_max: returns rows with the max values of a given column. Related functions include slice_head, slice_tail, slice_min, and slice_sample.
- pull: returns the individual vector/value from a given column.

💭 Check-in
Any questions before we move on to more dplyr functions and more aspects of the tidyverse more generally?
select specific columns

The select function can be used to select certain columns for further analysis.
This is especially useful if you want to %>% those columns into, say, a correlation matrix.
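A sketch of the call producing the matrix below (assuming log_freq was recomputed with the +1 correction, so there are no -Inf values):

```r
# Correlation matrix of the numeric columns of interest
df_concreteness %>%
  select(log_freq, Concreteness, Frequency) %>%
  cor()
```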
log_freq Concreteness Frequency
log_freq 1.0000000 0.16983740 0.21158008
Concreteness 0.1698374 1.00000000 -0.01501261
Frequency 0.2115801 -0.01501261 1.00000000
💭 Check-in
Try building a correlation matrix of the numerical columns from the Pokemon dataset (e.g., Attack, Defense, etc.). Are there any conclusions you can draw?
Missing data, reshaping, and joining.
Real-world data often has missing values, which are treated as NA (“Not Available”) by R.
There are a few relevant issues here: identifying missing values, understanding how they are distributed, and deciding what to do about them.
We can use the is.na function to check for missing values.
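For example (a sketch, assuming the Titanic dataset is loaded as df_titanic):

```r
# Check whether the first five Age values are missing
is.na(head(df_titanic$Age, 5))
```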
[1] FALSE FALSE FALSE FALSE FALSE
💭 Check-in
How many missing values are there total in the Age column? What about Survived or Cabin? (Hint: You can sum boolean values like TRUE and FALSE).
We can combine some handy functions (across and everything) to sum up the number of NA values per column.
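A sketch of the call producing the output below:

```r
# For every column, count how many values are NA
df_titanic %>%
  summarise(across(everything(), ~ sum(is.na(.x))))
```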
# A tibble: 1 × 12
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin
<int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 0 0 0 0 0 177 0 0 0 0 687
# ℹ 1 more variable: Embarked <int>
Note
NA values seem disproportionately located in the Cabin column.
Now that we’ve identified missing values, we have a few options: ignore them, remove them, or impute them.
💭 Check-in
What do you think are the trade-offs of each approach?
Ignoring NA

Unsurprisingly, ignoring NA values doesn’t mean they go away.
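Many functions will simply propagate them (a sketch):

```r
# mean() returns NA if any input value is NA
mean(df_titanic$Age)
#> [1] NA
```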
Another common approach to NA values is simply to remove them.
We can do this selectively in a given analysis, like below:
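A sketch of the call producing the output below:

```r
# na.rm = TRUE drops NA values before computing the mean
mean(df_titanic$Age, na.rm = TRUE)
```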
[1] 29.69912
Or we can actually filter our dataset to remove them entirely (or use drop_na).
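A sketch:

```r
# Keep only rows with a non-missing Age
df_titanic_complete <- df_titanic %>%
  filter(!is.na(Age))

# Equivalently, with tidyr:
# df_titanic_complete <- df_titanic %>% drop_na(Age)
```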
💭 Check-in
Do you think we should always just remove all NA values? Why or why not?
Not all NA values matter equally

Whether or not we care that something is NA depends on our research questions!
- Age might be relevant for analyses of Survived.
- Cabin may not really matter.

How do NA values distribute?

One factor that might affect our approach is how NA values distribute.
E.g., are values missing at different rates across levels of our outcome of interest (Survived)?
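The output below is consistent with the proportion of missing Cabin values by survival status; a sketch of the call:

```r
# Proportion of passengers with a missing Cabin, by survival status
df_titanic %>%
  group_by(Survived) %>%
  summarise(proportion_na = mean(is.na(Cabin)))
```

# A tibble: 2 × 2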
Survived proportion_na
<dbl> <dbl>
1 0 0.876
2 1 0.602
💭 Check-in
What is the relative rate of NA values across levels of Survived for Age?
Imputing values means inferring them based on some other set of properties or assumptions, i.e., an “educated guess”.
Common strategies include:

- Replacing NA values with the mean or median value (e.g., mean(Age)).
- Replacing NA values with the group mean (e.g., mean(Age) for that Pclass).

df_titanic = df_titanic %>%
group_by(Pclass) %>% ### Group by passenger class
mutate(Age_imputed = ifelse(
is.na(Age), mean(Age, na.rm = TRUE), ### Replace with mean for that Pclass
Age)) %>% ### otherwise just use original age
ungroup()
df_titanic %>%
group_by(Pclass) %>%
summarise(mean_age = mean(Age_imputed))

# A tibble: 3 × 2
Pclass mean_age
<dbl> <dbl>
1 1 38.2
2 2 29.9
3 3 25.1
There are also more sophisticated approaches to imputing NA values in your dataset.

Note
Sometimes missing values show up when we join datasets. We’ll discuss joining soon.
Reshaping data means converting it to either a “longer” format (from a “wide” format) or the other way around.
- The tidyr package contains key functions for this, such as pivot_longer and pivot_wider.
- tidyr also includes functions to separate information from a single column across multiple columns (or unite, to do the opposite).

Note
The best way to understand this is to look at some examples!
pivot_longer: from wide to long

Let’s start by creating a wide dataset, which might be considered “messy”.
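A sketch of code producing the data frame below:

```r
# A wide ("messy") dataset: one column per person
df_wide <- data.frame(
  john = c(10, 11),
  mary = c(20, 25),
  jane = c(5, 10),
  treatment = c("a", "b")
)
df_wide
```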
john mary jane treatment
1 10 20 5 a
2 11 25 10 b
💭 Check-in
Why is this considered “messy”? And what would a “long” version look like?
pivot_longer: from wide to long

We can use the pivot_longer function to transform this into a long format.
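A sketch of the call producing the output below:

```r
# Gather the person columns into name/result pairs
df_wide %>%
  pivot_longer(cols = c(john, mary, jane),
               names_to = "name",
               values_to = "result")
```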
# A tibble: 6 × 3
treatment name result
<chr> <chr> <dbl>
1 a john 10
2 a mary 20
3 a jane 5
4 b john 11
5 b mary 25
6 b jane 10
- cols specifies which columns to pivot.
- names_to and values_to are helpful for making the resulting table more readable.

pivot_longer: from wide to long

Let’s look at another example.
pivot_wider: back to wide format!
pivot_wider reshapes data into wide format.
If you’ve loaded tidyverse, you should have access to (at least) the following data tables:
- table1
- table2

💭 Check-in
How would you pivot_wider table2 such that it resembles table1?
Now let’s try pivot_longer with a more complex example. Load the missing work example from this Full Stack Economics graph.
# A tibble: 3 × 7
Year `Child care problems` Maternity or paternity le…¹ Other family or pers…²
<dbl> <dbl> <dbl> <dbl>
1 2012 18 313 246
2 2012 35 278 230
3 2012 13 245 246
# ℹ abbreviated names: ¹`Maternity or paternity leave`,
# ²`Other family or personal obligations`
# ℹ 3 more variables: `Illness or injury` <dbl>, Vacation <dbl>, Month <dbl>
💭 Check-in
Right now, the table stores different reasons for missing work as different columns.
Instead, let’s pivot_longer so the table has a column for Reason (for missing work) and a column for Count (number of days missed).
Other tidyr functions

Although pivoting is probably the main functionality enabled by tidyr, it also contains other valuable reshaping functions:
- separate “splits” values within a given column into multiple columns.
- unite does the opposite.
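For example, the output below is consistent with separating the rate column of the built-in table3 (a sketch):

```r
# Split "cases/population" strings into two numeric columns
table3 %>%
  separate(rate, into = c("cases", "population"), convert = TRUE)
```

# A tibble: 6 × 4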
country year cases population
<chr> <dbl> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
Joining is an operation in which columns from dataset Y are added to dataset X, matching observations based on shared variables (or “keys”).
Key functions include left_join(), right_join(), inner_join(), and full_join().

Example scenario:
- One table of election results by state (state, biden_pct, turnout).
- One table of state demographics (state, median_income, college_pct).

Here, we might join the tables based on the overlapping state variable.
Goal: Analyze how median_income and college_pct relate to voting patterns.
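A sketch (the table names here are hypothetical):

```r
# Add demographic columns to the election results, matching on state
df_joined <- df_election %>%
  left_join(df_demographics, by = "state")
```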
Given tables X and Y, you have several options for how to join them.
- inner_join: only keep observations from X with a matching key in Y.
- left_join: keep all observations from X, join matching ones from Y.
- right_join: keep all observations from Y, join matching ones from X.
- full_join: keep all observations in both X and Y.

Suppose we have two tables: table4a contains information about the cases in each country (with different columns for different years), and table4b contains information about the population of each country (again, with different columns for different years).
Ideally, we’d like to join these tables.
💭 Check-in
What steps will we need to take to join them?
(Hint: It may involve more than joining.)
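One approach (a sketch; the exact code in lecture may differ) is to pivot each table longer first, then join on the shared keys:

```r
# Reshape cases and population to long format, then join on country + year
table4a %>%
  pivot_longer(cols = c(`1999`, `2000`),
               names_to = "year", values_to = "cases") %>%
  left_join(
    table4b %>%
      pivot_longer(cols = c(`1999`, `2000`),
                   names_to = "year", values_to = "population"),
    by = c("country", "year")
  )
```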
# A tibble: 6 × 4
country year cases population
<chr> <chr> <dbl> <dbl>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
💭 Check-in
In this case, would it matter whether we use left_join or inner_join? If not, under what circumstances would this distinction matter?
We’ve already loaded information about word concreteness. Now let’s load another dataset about the average age at which words are learned (age of acquisition, or AoA), and combine it with the concreteness dataset.
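A sketch of the call (the filename is an assumption):

```r
# Read the age-of-acquisition norms and check the number of rows
df_aoa <- read_csv("data/aoa.csv")
nrow(df_aoa)
```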
[1] 31124
# A tibble: 6 × 2
Word AoA
<chr> <dbl>
1 a 2.89
2 aardvark 9.89
3 abacus 8.69
4 abalone 12.2
5 abandon 8.32
6 abandoner 11.9
💭 Check-in
How would you join this dataset with the concreteness dataset? What kind of join operation might you want to use?
The concreteness and AoA datasets are not fully overlapping: Concreteness contains words not in AoA, and vice versa.
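A sketch consistent with the four row counts below (inner join, the two original datasets, and a full join):

```r
nrow(inner_join(df_concreteness, df_aoa, by = "Word"))  # words in both
nrow(df_concreteness)                                   # concreteness only
nrow(df_aoa)                                            # AoA only
nrow(full_join(df_concreteness, df_aoa, by = "Word"))   # words in either
```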
[1] 23568
[1] 28612
[1] 31124
[1] 36168
When joining, there are a few other details we’ve glossed over:
- by: specifies which keys to join on (defaults to all overlapping columns).
- multiple: how to handle cases where rows from X match multiple rows from Y.
- unmatched: how to handle cases where rows would be dropped.

These lectures have covered the basics of data wrangling with the tidyverse.
A typical workflow will involve:
- Importing data (read_csv).
- Inspecting it and handling missing values (is.na, summary, summarise).
- Transforming variables (mutate, filter).
- Reshaping data (pivot_longer/pivot_wider).
- Joining it with other datasets (inner_join, etc.).

Note
The end result is a dataset ready for further analysis and visualization (coming up!).
CSS 211 | UC San Diego