High-level goals

The goal of this tutorial is to provide a smattering of useful libraries and tools for data wrangling and data visualization in R.

The term “data wrangling” refers to the processes of transforming or manipulating raw data into a useful format for downstream analysis and visualization. Here, data are conceptualized as some sort of livestock, which must be “herded” and/or organized—in this scenario, you presumably fill the role of the rancher. But regardless of the metaphor employed, the core idea is that working with data frequently requires transforming that data in a manner conducive to the set of analyses, summaries, or visualizations you want to produce. Form must be fit to the desired function.

Data visualization is perhaps a more intuitive term, referring to the process (and result) of representing data graphically. There are (in my view) at least two main reasons to generate data visualizations. First, it is a useful way to communicate information about data; to this end, data visualizations should be clear, accurate (obviously), and relatively concise 1. And second, data visualization can be helpful for exploring trends or pattern in your data. (Though, a brief caveat: data visualization is useful for generating hypotheses in an exploratory way, but you need to watch out for using it to hypothesize after the results are known, i.e., HARKing. That is, generating a bunch of different data visualizations to determine which statistical test to run isn’t much different from running a bunch of t-tests and selecting only the significant results.)

This tutorial is certainly not exhaustive. The point is simply to serve as an introduction or starting point for learning more about some of these useful tools. There are, of course, many additional resources for learning about data wrangling and visualization in R:

I also have a number of other statistics tutorials in case you’re interested.

Load libraries

This tutorial will use examples from the following libraries:

  • tidyverse
  • corrplot
  • forcats
  • ggridges
  • lme4
  • clue

Data wrangling

As noted above, “data wrangling” is a broad term encompassing pretty much any transformations you might need to apply to your data to get it into the proper format for visualization or analysis. I’ll be focusing on some of the common operations found in the tidyverse library (most of which are also summarized in this extremely helpful cheatsheet).

Tidy data: thinking about dataset structure

There are, of course, many philosophies on what makes data “messy” or “clean”. Central to the tidyverse set of packages is the notion of tidy data.

Tidy data, as defined by Hadley Wickham (Wickham, 2014), have the following characteristics:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

Here’s Wickham (2014, pg. 5) on why tidy data is appropriate:

Tidy data makes it easy for an analyst or a computer to extract needed variables because it provides a standard way of structuring a dataset… If you consider how many data analysis operations involve all of the values in a variable (every aggregation function), you can see how important it is to extract these values in a simple, standard way. Tidy data is particularly well suited for vectorised programming languages like R, because the layout ensures that values of different variables from the same observation are always paired.

The way I think about this is: if you imagine the statistical model you want to build (e.g., \(Y = X_1 + X_2\)), the summary statistic you want to calculate (e.g., mean \(Y\) for values of categorical factor \(X_1\)), or the visualization you want to produce (e.g., a boxplot of \(Y\) as a function of categorical factor \(X_1\)), each of the relevant dimensions should be a column in your dataset, and each of the observations you want to model, summarize, or plot should be a row in your dataset. Tidy data makes it easy to summarize, manipulate, visualize, and model your data.

As an example, let’s consider the following three data structures (adapted from Wickham, 2014) all representing the “same” underlying data—three participants (john, mary, and jane) in an experiment with two treatments (a/b/), with each treatment for each participant producing some result. In each case, we want to ask the same question of the data: does the treatment applied produce different a result, controlling for differences across subjects?

The formal specification for such a model might look as follows 2:

\(Y_{result} = X_{treatment} + W_{subject} + \epsilon\)

First up is an example of “messy” data. Each participant gets their own column of results, and the treatment is indicated by the treatment column.

df_messy_1 = data.frame(john = c(10, 11),
                        mary = c(20, 25),
                        jane = c(5, 10),
                        treatment = c('a', 'b'))

df_messy_1
##   john mary jane treatment
## 1   10   20    5         a
## 2   11   25   10         b

This dataset structure makes certain operations easy, like calculating the mean value for a given participant:

mean(df_messy_1$john)
## [1] 10.5

But other operations are much harder, like calculating the mean value for a treatment. And it’s also not clear how you’d easily build a statistical model using standard modeling functions in R (e.g., lm or aov).

Here’s another example of a messy data structure:

df_messy_2 = data.frame(name = c('john', 'mary', 'jane'),
                        a = c(10, 20, 5),
                        b = c(11, 25, 10))

df_messy_2
##   name  a  b
## 1 john 10 11
## 2 mary 20 25
## 3 jane  5 10

Now it’s easy to get the mean of a treatment:

mean(df_messy_2$a)
## [1] 11.66667

But now you can’t easily get the mean of each participant. And if you want to model the differences between a and b, it’s again not clear how you’d specify that model in R—particularly if you wanted to add a grouping factor for each participant.

Now here’s an example of tidy data:

df_tidy = data.frame(name = c('john', 'john', 'mary', 'mary', 'jane', 'jane'),
                     result = c(10, 11, 20, 25, 5, 10),
                     treatment = c('a', 'b', 'a', 'b', 'a', 'b'))

df_tidy
##   name result treatment
## 1 john     10         a
## 2 john     11         b
## 3 mary     20         a
## 4 mary     25         b
## 5 jane      5         a
## 6 jane     10         b

This data structure is well-suited to the question we want to ask: do differences in treatment affect result across different participants (i.e., name)? It translates very well to the formal specification of this question described above:

model_full = lmer(data = df_tidy,
                  result ~ treatment + (1 | name))

It’s also very easy to use other tidyverse functions to summarize this data, and critically, the same functions can be used to summarize different aspects of the data:

df_tidy %>%
  group_by(name) %>%
  summarise(mean_result = mean(result),
            sd_result = sd(result))
## Warning: package 'bindrcpp' was built under R version 3.4.4
## # A tibble: 3 x 3
##   name  mean_result sd_result
##   <fct>       <dbl>     <dbl>
## 1 jane          7.5     3.54 
## 2 john         10.5     0.707
## 3 mary         22.5     3.54
df_tidy %>%
  group_by(treatment) %>%
  summarise(mean_result = mean(result),
            sd_result = sd(result))
## # A tibble: 2 x 3
##   treatment mean_result sd_result
##   <fct>           <dbl>     <dbl>
## 1 a                11.7      7.64
## 2 b                15.3      8.39

Brief note on dataset structure

I recognize that thinking about data structures isn’t necessarily intuitive or even interesting. It certainly isn’t intuitive to me, and it’s hard to find it interesting without understanding it.

But anyone working with data will come up against these issues at some point, and it’s helpful to have a vocabulary with which to identify the problems you’re facing. For a long time, I was frustrated because I often had messy data and wanted it to be tidy, but I didn’t even know that’s what I wanted. Now I’m less frustrated: I still have messy data but I’m a little better at diagnosing exactly what’s messy about it, and that helps make it tidy.

I also think it’s important to point out that the structures in df_messy_1 or df_messy_2 aren’t inherently “bad” representations of data; they might be useful for certain purposes, like presenting a table in a paper. But the point here is that tidy data makes many computational operations very easy, and that’s part of the goal of data wrangling.

Reshaping: how to tidy messy data?

An obvious question at this point is: given that my data isn’t tidy, how do I make it tidy?

Fortunately, the tidyverse has a number of options for this. One of the most relevant is the gather function. From the documentation:

Gather takes multiple columns and collapses into key-value pairs, duplicating all other columns as needed. You use gather() when you notice that you have columns that are not variables.

In other words, gather is a really straightforward way to make data tidy, in which each row consists of an observation.

Let’s use gather on each of our messy dataframes from above.

Tidying messy data no. 1

First, consider the dataframe in which each participant gets their own column.

df_messy_1
##   john mary jane treatment
## 1   10   20    5         a
## 2   11   25   10         b

How could we make this tidy? There are currently four columns and two rows, but in reality, we actually have six observations: one per treatment per participant. This means that ultimately, we want our data structure to have three columns (following the format of df_tidy above): one column indicating the treatment, one column indicating the result for that treatment, and one column indicating the name of the participant from whom that result was obtained.

This is the call to gather that would produce such a data structure:

df_messy_1 %>%
  gather(... = c(john, mary, jane), key = 'name', value = 'result')
##   treatment name result
## 1         a john     10
## 2         b john     11
## 3         a mary     20
## 4         b mary     25
## 5         a jane      5
## 6         b jane     10

What exactly is happening here?

First, note the piping operator %>%: this is a way to chain together multiple functions in R. In this case, we can omit the data argument from gather because %>% is telling R to pass it in already. I like this approach. because each function/operation can be given its own line, making the code clear and readable.

Next, the ... = c(john, mary, jane) argument tells gather which columns to “gather”, essentially. If you leave this blank, it’ll try to gather all the columns, but that’s not what we want here: we don’t want to put treatment and name in the same column. In this case, we’re telling the function to collect the three named columns and put them into a single set of key/value columns.

Finally, we tell gather what we want to name these new columns. You can leave this blank too, in which case they’ll just be named key and value. But here we know that we want to name them name (the participant) and result (the outcome of that treatment).

Tidying messy data no. 2

Now let’s turn to the other messy data structure:

df_messy_2
##   name  a  b
## 1 john 10 11
## 2 mary 20 25
## 3 jane  5 10

This one has the opposite problem. Each name gets its own row, but there are separate columns for the two treatments. We need to gather these two columns and put them into a single set of key/value columns:

df_messy_2 %>%
  gather(... = c(a, b), key = 'treatment', value = 'result')
##   name treatment result
## 1 john         a     10
## 2 mary         a     20
## 3 jane         a      5
## 4 john         b     11
## 5 mary         b     25
## 6 jane         b     10

Just as above, this produces a data structure that’s identical (disregarding the order of the rows) to df_tidy!

Other reshaping operations

There are a number of operations that fall under the broad category of reshaping“, and even many other packages (including reshape) that have their own functions (e.g., melt) to reshape data.

This cheatsheet summarizes some of the tidyverse functions for reshaping very succintly.

Subsetting operations

Another very common operation is subsetting data in some way—either subsetting particular rows based on the values in their cells (i.e., filtering), or subsetting particular columns based on their name.

Subsetting rows

filter is probably the most frequent row subsetting operation I use. Combined with the pipe %>% syntax, it’s a great way to pipe in different subsets of your data into a visualization or aggregation function for exploratory data analysis. It’s also just an integral part of any data processing pipeline in which you need to exclude data on the basis of certain features.

The syntax is straightforward. Here’s a demonstration using the built-in mtcars dataset, which has information about the average miles-per-gallon (mpg), number of cylinders (cyl), etc., about 32 different cars. We can use filter to only consider cars with mpg better than the mean mpg in the dataset.

mean(mtcars$mpg)
## [1] 20.09062
mtcars %>%
  filter(mpg > mean(mtcars$mpg)) 
##     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## 1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## 2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## 3  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## 4  21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## 5  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## 6  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## 7  32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## 8  30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## 9  33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## 10 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## 11 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## 12 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## 13 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## 14 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Note that for numeric variables, standard mathematical operators all apply for filter: >, >=, <, <=, ==, !=. You can also use other operators like %in%—which, as the name implies, tests whether the value in a given row is contained in a given vector. For nominal (i.e., categorical) variables, you obviously can’t use comparators like >. But you can still test for string equality (!=) or use operations like %in%.

For instance, we can filter our df_tidy dataframe from earlier to only consider observations from a given participant:

df_tidy %>%
  filter(name == "john")
##   name result treatment
## 1 john     10         a
## 2 john     11         b

Another useful subsetting operation is distinct. This will return a dataframe with distinct rows, based on the column names you passed in. If you don’t pass any column names in, it will check for rows that are entirely the same, i.e., have the same value for each column. This is also a useful way to check for duplicates in your data.

You might also want to randomly sample rows from your data, either in terms of an absolute number of rows or in terms of a proportion of your dataset. tidyverse has functions for both of these operations. First, you can sample by an absolute number using sample_n:

df_tidy %>%
  sample_n(2, replace=FALSE)
##   name result treatment
## 4 mary     25