Data wrangling with the tidyverse

Data wrangling refers to the process of transforming or manipulating raw data into a useful format for downstream analysis and visualization.
In CSS, you’ll find yourself needing to:

- Import data from files or URLs.
- Filter and transform variables.
- Summarize data by groups.
- Reshape data between wide and long formats.
- Merge multiple datasets.
These are all supported by the tidyverse! But what is tidy data?
Wickham (2014) defines tidy data as follows:
“Tidy datasets are easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table.”
Let’s address each of these in turn:
💭 Check-in
With a partner, discuss what each of these criteria might mean.
A variable contains values that measure the same attribute across units.
Examples: height, age, income, test_score, population.

Untidy example:
| country | gdp_2020 | gdp_2021 | gdp_2022 |
|---|---|---|---|
| USA | 21.35 | 23.368 | 26.01 |
| China | 14.72 | 18.2 | 18.32 |
| Germany | 3.94 | 4.35 | 4.16 |
💭 Check-in
Why is this not tidy? How would you make it tidy?
Now, we have a column for country, year, and gdp.
Tidy example:
| country | year | gdp_trillion |
|---|---|---|
| USA | 2020 | 21.35 |
| USA | 2021 | 23.368 |
| USA | 2022 | 26.01 |
| China | 2020 | 14.72 |
| China | 2021 | 18.2 |
| China | 2022 | 18.32 |
| Germany | 2020 | 3.94 |
| Germany | 2021 | 4.35 |
| Germany | 2022 | 4.16 |
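As a sketch in code (df_gdp is a hypothetical data frame holding the untidy table; pivot_longer is covered later in this lecture):

```r
library(tidyverse)

# Hypothetical wide-format GDP data, matching the untidy table above
df_gdp <- tibble(
  country  = c("USA", "China", "Germany"),
  gdp_2020 = c(21.35, 14.72, 3.94),
  gdp_2021 = c(23.368, 18.2, 4.35),
  gdp_2022 = c(26.01, 18.32, 4.16)
)

# Pivot the year columns into a single year/gdp_trillion pair
df_gdp %>%
  pivot_longer(cols = starts_with("gdp_"),
               names_to = "year",
               names_prefix = "gdp_",
               values_to = "gdp_trillion")
```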
An observation contains all values measured on the same unit across attributes.
Untidy example:
| state | measure | value |
|---|---|---|
| California | population | 39.5 |
| California | electoral_votes | 54 |
| Texas | population | 31 |
| Texas | electoral_votes | 40 |
💭 Check-in
Why is this not tidy? How would you make it tidy?
population and electoral_votes are separate measures for the same case (observation), so they should be in the same row.
Tidy version:
| state | population_millions | electoral_votes |
|---|---|---|
| California | 39.5 | 54 |
| Texas | 31.0 | 40 |
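A sketch of how pivot_wider (covered later) would produce this, assuming df_states holds the untidy version:

```r
# Hypothetical long-format state data, matching the untidy table above
df_states <- tibble(
  state   = c("California", "California", "Texas", "Texas"),
  measure = c("population", "electoral_votes",
              "population", "electoral_votes"),
  value   = c(39.5, 54, 31, 40)
)

# Spread each measure into its own column
df_states %>%
  pivot_wider(names_from = measure, values_from = value)
```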
Different types of things being measured should be in separate tables.
💭 Check-in
Can you think of other examples of “observational units” you would want to keep in separate tables?
“Tidy datasets are easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table.” (Wickham, 2014)
Note
In fact, I often conceptualize “tidy” data in terms of its shape: each row is an observation, and each column is a variable.
The tidyverse

“The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.”
The tidyverse rests on the philosophy of tidy data. It contains functions for everything from importing data (read_csv) to merging datasets (inner_join).

💭 Installation Help
You can install the tidyverse using install.packages("tidyverse"), and load it with library(tidyverse).
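The packages included in the tidyverse can be listed as follows (a sketch; the exact call used in lecture is an assumption):

```r
# List all packages in the tidyverse, including the tidyverse package itself
tidyverse_packages(include_self = TRUE)
```

[1] "broom" "conflicted" "cli" "dbplyr"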
[5] "dplyr" "dtplyr" "forcats" "ggplot2"
[9] "googledrive" "googlesheets4" "haven" "hms"
[13] "httr" "jsonlite" "lubridate" "magrittr"
[17] "modelr" "pillar" "purrr" "ragg"
[21] "readr" "readxl" "reprex" "rlang"
[25] "rstudioapi" "rvest" "stringr" "tibble"
[29] "tidyr" "xml2" "tidyverse"
- Some of these we’ll discuss later in the course, e.g., broom and ggplot2.
- For now, we’ll focus on readr, magrittr, dplyr, and tidyr.

The pipe operator (%>%)

The pipe operator (%>%) lets you chain functions together in a readable way.
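For example (a sketch, using a hypothetical data frame df with a Frequency column):

```r
# Without the pipe: nested calls read inside-out
summary(log10(df$Frequency))

# With the pipe: each step reads left-to-right
df$Frequency %>%
  log10() %>%
  summary()
```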
Importing, filtering, mutating, and summarizing data.
One of the most fundamental things you’ll need to do is import data, e.g., from a .csv file.
read_csv takes as input a filepath (either local or a URL) and returns a tibble.
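A sketch of the call (the filename here is an assumption; the output below is the row count of the concreteness dataset used in lecture):

```r
# Read the concreteness norms and check how many words we have
df_concreteness <- read_csv("data/concreteness.csv")
nrow(df_concreteness)
```

[1] 28612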
💭 Check-in
Use the functions we’ve discussed (nrow, names, head, etc.) to explore the dataset.
dplyr
dplyr is a package in the tidyverse that contains functions for transforming and manipulating dataframes.
Each function is a “verb”, for instance:
- filter relevant rows.
- mutate the dataset by adding new variables.
- pull specific columns by name.
- rename specific columns.
- group_by specific values or factors.
- summarize the dataset with summary statistics.

filter
filter works by selecting a subset of rows based on some boolean condition.
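For example, the output below is consistent with a call like (a sketch):

```r
# Keep only the row(s) where Word is "class"
df_concreteness %>%
  filter(Word == "class")
```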
# A tibble: 1 × 4
Word Concreteness Frequency Dom_Pos
<chr> <dbl> <dbl> <chr>
1 class 3.85 5985 Noun
💭 Check-in
Try using filter to return words that:
- Have a Concreteness > 4.5.
- Have a Dom_Pos of Noun.

Hint: You can chain multiple filter statements together with the %>% operator.
mutate your data
mutate works by creating a new variable (or overwriting an existing one) by applying some transformation to an existing variable or set of variables.
Let’s create a log frequency variable, since Frequency is highly right-skewed.
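A sketch of the call producing the summary below (log10 is an assumption, consistent with the scale of the values):

```r
# Add a log-transformed frequency column
df_concreteness <- df_concreteness %>%
  mutate(log_freq = log10(Frequency))

summary(df_concreteness$log_freq)
```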
Min. 1st Qu. Median Mean 3rd Qu. Max.
-Inf 0.8451 1.4624 -Inf 2.1004 6.3293
💭 Check-in
Why do you think the minimum log_freq value is -Inf?
mutate (pt. 2)

Some words in our dataset have a Frequency of 0. One strategy is to add one to those values before taking the log.
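As a sketch:

```r
# Adding 1 before the log maps Frequency = 0 to log_freq = 0 instead of -Inf
df_concreteness <- df_concreteness %>%
  mutate(log_freq = log10(Frequency + 1))

summary(df_concreteness$log_freq)
```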
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.9031 1.4771 1.5793 2.1038 6.3293
group_by specific factors and summarise

The group_by function is extremely useful for grouping your data by specific factors. You can then %>% this into summarise to calculate summary statistics for each group.
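A sketch of a call consistent with the output below (the filter to two parts of speech is an assumption):

```r
# Mean concreteness by dominant part of speech
df_concreteness %>%
  filter(Dom_Pos %in% c("Adjective", "Adverb")) %>%
  group_by(Dom_Pos) %>%
  summarise(mean_conc = mean(Concreteness))
```

# A tibble: 2 × 2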
Dom_Pos mean_conc
<chr> <dbl>
1 Adjective 2.50
2 Adverb 2.06
💭 Check-in
In addition to calculating mean, try also calculating the sd; additionally, try calculating the mean and sd for log_freq by Dom_Pos.
Often relevant for a group_by calculation is the number of observations per group.
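The n() function returns the number of rows in each group. A sketch consistent with the output below:

```r
# Mean concreteness and number of observations per part of speech
df_concreteness %>%
  filter(Dom_Pos %in% c("Adjective", "Adverb")) %>%
  group_by(Dom_Pos) %>%
  summarise(mean_conc = mean(Concreteness),
            count = n())
```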
# A tibble: 2 × 3
Dom_Pos mean_conc count
<chr> <dbl> <int>
1 Adjective 2.50 6112
2 Adverb 2.06 1876
Now let’s get some hands-on practice applying these new techniques.
1. Load the pokemon.csv dataset from the ucsd_css211_datasets GitHub repository using read_csv.
2. Use group_by to count the number of Pokemon of each Type 1. Which Type 1 is most frequent?
3. filter the dataset to only include Pokemon of the most frequent type (based on what you learned).
4. group_by legendary status and calculate the mean(Attack).

### Part 1
df_pokemon <- read_csv("https://raw.githubusercontent.com/seantrott/ucsd_css211_datasets/main/main/wrangling/pokemon.csv")
### Part 2
max_type = df_pokemon %>%
group_by(`Type 1`) %>%
summarise(count = n()) %>%
slice_max(count) %>%
pull(`Type 1`)
### Parts 3-4
df_pokemon %>%
filter(`Type 1` == max_type) %>%
group_by(Legendary) %>%
summarise(mean_attack = mean(Attack))

# A tibble: 2 × 2
Legendary mean_attack
<lgl> <dbl>
1 FALSE 72.8
2 TRUE 111.
💭 Check-in
Notice any functions we haven’t discussed?
Above, we used a combination of read_csv, filter, group_by, and summarise. But we also used some other new functions:
- slice_max: returns rows with the max values of a given column. Related functions include slice_head, slice_tail, slice_min, and slice_sample.
- pull: returns the individual vector/value from a given column.

💭 Check-in
Any questions before we move on to more dplyr functions and more aspects of the tidyverse more generally?
select specific columns

The select function can be used to select certain columns for further analysis.
This is especially useful if you want to %>% those columns into, say, a correlation matrix.
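A sketch of the call producing the matrix below (assuming log_freq was recomputed with the +1 correction, so there are no -Inf values):

```r
# Correlation matrix of the numeric columns of interest
df_concreteness %>%
  select(log_freq, Concreteness, Frequency) %>%
  cor()
```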
log_freq Concreteness Frequency
log_freq 1.0000000 0.16983740 0.21158008
Concreteness 0.1698374 1.00000000 -0.01501261
Frequency 0.2115801 -0.01501261 1.00000000
💭 Check-in
Try building a correlation matrix of the numerical columns from the Pokemon dataset (e.g., Attack, Defense, etc.). Are there any conclusions you can draw?
Missing data, reshaping, and joining.
Real-world data often has missing values, which are treated as NA (“Not Available”) by R.
There are a few relevant issues here: identifying missing values, understanding how they are distributed, and deciding what to do about them.
We can use the is.na function to check for missing values.
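For example (a sketch, assuming the Titanic dataset is loaded as df_titanic):

```r
# Check whether the first five Age values are missing
is.na(head(df_titanic$Age, 5))
```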
[1] FALSE FALSE FALSE FALSE FALSE
💭 Check-in
How many missing values are there total in the Age column? What about Survived or Cabin? (Hint: You can sum boolean values like TRUE and FALSE).
We can combine some handy functions (across and everything) to sum up the number of NA values per column.
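A sketch of the call producing the output below:

```r
# For every column, count how many values are NA
df_titanic %>%
  summarise(across(everything(), ~ sum(is.na(.x))))
```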
# A tibble: 1 × 12
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin
<int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 0 0 0 0 0 177 0 0 0 0 687
# ℹ 1 more variable: Embarked <int>
Note
NA values seem disproportionately located in the Cabin column.
Now that we’ve identified missing values, we have a few options: ignore them, remove them, or impute them.
💭 Check-in
What do you think are the trade-offs of each approach?
Ignoring NA

Unsurprisingly, ignoring NA values doesn’t mean they go away.
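Many functions will simply propagate them (a sketch):

```r
# mean() returns NA if any input value is NA
mean(df_titanic$Age)
#> [1] NA
```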
Another common approach to NA values is simply to remove them.
We can do this selectively in a given analysis, like below:
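A sketch of the call producing the output below:

```r
# na.rm = TRUE drops NA values before computing the mean
mean(df_titanic$Age, na.rm = TRUE)
```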
[1] 29.69912
Or we can actually filter our dataset to remove them entirely (or use drop_na).
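A sketch:

```r
# Keep only rows with a non-missing Age
df_titanic_complete <- df_titanic %>%
  filter(!is.na(Age))

# Equivalently, with tidyr:
# df_titanic_complete <- df_titanic %>% drop_na(Age)
```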
💭 Check-in
Do you think we should always just remove all NA values? Why or why not?
Not all NA values matter equally

Whether or not we care that something is NA depends on our research questions!
- Age might be relevant for analyses of Survived.
- Cabin may not really matter.

How do NA values distribute?

One factor that might affect our approach is how NA values distribute.
E.g., are values missing at different rates across levels of our outcome of interest (Survived)?
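The output below is consistent with the proportion of missing Cabin values by survival status; a sketch of the call:

```r
# Proportion of passengers with a missing Cabin, by survival status
df_titanic %>%
  group_by(Survived) %>%
  summarise(proportion_na = mean(is.na(Cabin)))
```

# A tibble: 2 × 2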
Survived proportion_na
<dbl> <dbl>
1 0 0.876
2 1 0.602
💭 Check-in
What is the relative rate of NA values across levels of Survived for Age?
Imputing values means inferring them based on some other set of properties or assumptions, i.e., an “educated guess”.
Common strategies include:

- Replacing NA values with the mean or median value (e.g., mean(Age)).
- Replacing NA values with the group mean (e.g., mean(Age) for that Pclass).

df_titanic = df_titanic %>%
group_by(Pclass) %>% ### Group by passenger class
mutate(Age_imputed = ifelse(
is.na(Age), mean(Age, na.rm = TRUE), ### Replace with mean for that Pclass
Age)) %>% ### otherwise just use original age
ungroup()
df_titanic %>%
group_by(Pclass) %>%
summarise(mean_age = mean(Age_imputed))

# A tibble: 3 × 2
Pclass mean_age
<dbl> <dbl>
1 1 38.2
2 2 29.9
3 3 25.1
There are also more sophisticated approaches to imputing NA values in your dataset.

Note
Sometimes missing values show up when we join datasets. We’ll discuss joining soon.
Reshaping data means converting it to either a “longer” format (from a “wide” format) or the other way around.
- The tidyr package contains key functions for this, such as pivot_longer and pivot_wider.
- tidyr also includes functions to separate information from a single column across multiple columns (or unite, to do the opposite).

Note
The best way to understand this is to look at some examples!
pivot_longer: from wide to long

Let’s start by creating a wide dataset, which might be considered “messy”.
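A sketch of code producing the data frame below:

```r
# A wide ("messy") dataset: one column per person
df_wide <- data.frame(
  john = c(10, 11),
  mary = c(20, 25),
  jane = c(5, 10),
  treatment = c("a", "b")
)
df_wide
```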
john mary jane treatment
1 10 20 5 a
2 11 25 10 b
💭 Check-in
Why is this considered “messy”? And what would a “long” version look like?
pivot_longer: from wide to long

We can use the pivot_longer function to transform this into a long format.
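A sketch of the call producing the output below:

```r
# Gather the person columns into name/result pairs
df_wide %>%
  pivot_longer(cols = c(john, mary, jane),
               names_to = "name",
               values_to = "result")
```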
# A tibble: 6 × 3
treatment name result
<chr> <chr> <dbl>
1 a john 10
2 a mary 20
3 a jane 5
4 b john 11
5 b mary 25
6 b jane 10
- cols specifies which columns to pivot.
- names_to and values_to are helpful for making the resulting table more readable.

pivot_longer: from wide to long

Let’s look at another example.
pivot_wider: back to wide format!
pivot_wider reshapes data into wide format.
If you’ve loaded tidyverse, you should have access to (at least) the following data tables:
- table1
- table2

💭 Check-in
How would you pivot_wider table2 such that it resembles table1?
Now let’s try pivot_longer with a more complex example. Load the missing work example from this Full Stack Economics graph.
# A tibble: 3 × 7
Year `Child care problems` Maternity or paternity le…¹ Other family or pers…²
<dbl> <dbl> <dbl> <dbl>
1 2012 18 313 246
2 2012 35 278 230
3 2012 13 245 246
# ℹ abbreviated names: ¹`Maternity or paternity leave`,
# ²`Other family or personal obligations`
# ℹ 3 more variables: `Illness or injury` <dbl>, Vacation <dbl>, Month <dbl>
💭 Check-in
Right now, the table stores different reasons for missing work as different columns.
Instead, let’s pivot_longer so the table has a column for Reason (for missing work) and a column for Count (number of days missed).
Other tidyr functions

Although pivoting is probably the main functionality enabled by tidyr, it also contains other valuable reshaping functions:
- separate “splits” values within a given column into multiple columns.
- unite does the opposite.
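For example, the output below is consistent with separating the rate column of the built-in table3 (a sketch):

```r
# Split "cases/population" strings into two numeric columns
table3 %>%
  separate(rate, into = c("cases", "population"), convert = TRUE)
```

# A tibble: 6 × 4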
country year cases population
<chr> <dbl> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
Joining is an operation in which columns from dataset Y are added to dataset X, matching observations based on shared variables (or “keys”).
Key functions include left_join(), right_join(), inner_join(), and full_join().

Example scenario:
- One table of election results by state (state, biden_pct, turnout).
- One table of state demographics (state, median_income, college_pct).

Here, we might join the tables based on the overlapping state variable.
Goal: Analyze how median_income and college_pct relate to voting patterns.
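A sketch (the table names here are hypothetical):

```r
# Add demographic columns to the election results, matching on state
df_joined <- df_election %>%
  left_join(df_demographics, by = "state")
```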
Given tables X and Y, you have several options for how to join them.
- inner_join: only keep observations from X with a matching key in Y.
- left_join: keep all observations from X, join matching ones from Y.
- right_join: keep all observations from Y, join matching ones from X.
- full_join: keep all observations in both X and Y.

Suppose we have two tables: table4a contains information about the cases in each country (with different columns for different years), and table4b contains information about the population of each country (again, with different columns for different years).
Ideally, we’d like to join these tables.
💭 Check-in
What steps will we need to take to join them?
(Hint: It may involve more than joining.)
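One approach (a sketch; the exact code in lecture may differ) is to pivot each table longer first, then join on the shared keys:

```r
# Reshape cases and population to long format, then join on country + year
table4a %>%
  pivot_longer(cols = c(`1999`, `2000`),
               names_to = "year", values_to = "cases") %>%
  left_join(
    table4b %>%
      pivot_longer(cols = c(`1999`, `2000`),
                   names_to = "year", values_to = "population"),
    by = c("country", "year")
  )
```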
# A tibble: 6 × 4
country year cases population
<chr> <chr> <dbl> <dbl>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
💭 Check-in
In this case, would it matter whether we use left_join or inner_join? If not, under what circumstances would this distinction matter?
We’ve already loaded information about word concreteness. Now let’s load another dataset about the average age at which words are learned (age of acquisition, or AoA), and combine it with the concreteness dataset.
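A sketch of the call (the filename is an assumption):

```r
# Read the age-of-acquisition norms and check the number of rows
df_aoa <- read_csv("data/aoa.csv")
nrow(df_aoa)
```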
[1] 31124
# A tibble: 6 × 2
Word AoA
<chr> <dbl>
1 a 2.89
2 aardvark 9.89
3 abacus 8.69
4 abalone 12.2
5 abandon 8.32
6 abandoner 11.9
💭 Check-in
How would you join this dataset with the concreteness dataset? What kind of join operation might you want to use?
The concreteness and AoA datasets are not fully overlapping: Concreteness contains words not in AoA, and vice versa.
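A sketch consistent with the four row counts below (inner join, the two original datasets, and a full join):

```r
nrow(inner_join(df_concreteness, df_aoa, by = "Word"))  # words in both
nrow(df_concreteness)                                   # concreteness only
nrow(df_aoa)                                            # AoA only
nrow(full_join(df_concreteness, df_aoa, by = "Word"))   # words in either
```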
[1] 23568
[1] 28612
[1] 31124
[1] 36168
When joining, there are a few other details we’ve glossed over:
- by: specifies which keys to join on (defaults to all overlapping columns).
- multiple: how to handle cases where rows from X match multiple rows from Y.
- unmatched: how to handle cases where rows would be dropped.

These lectures have covered the basics of data wrangling with the tidyverse.
A typical workflow will involve:
- Importing data (read_csv).
- Inspecting it and handling missing values (is.na, summary, summarise).
- Transforming variables (mutate, filter).
- Reshaping data (pivot_longer/pivot_wider).
- Joining it with other datasets (inner_join, etc.).

Note
The end result is a dataset ready for further analysis and visualization (coming up!).
CSS 211 | UC San Diego