The goal of this tutorial is to provide a smattering of useful libraries and tools for **data wrangling** and **data visualization** in R.

The term “data wrangling” refers to the processes of *transforming* or *manipulating* raw data into a useful format for downstream analysis and visualization. Here, data are conceptualized as some sort of livestock, which must be “herded” and/or organized—in this scenario, you presumably fill the role of the rancher. But regardless of the metaphor employed, the core idea is that working with data frequently requires transforming that data in a manner conducive to the set of analyses, summaries, or visualizations you want to produce. Form must be fit to the desired function.

Data visualization is perhaps a more intuitive term, referring to the process (and result) of representing data graphically. There are (in my view) at least two main reasons to generate data visualizations. First, it is a useful way to **communicate information about data**; to this end, data visualizations should be clear, accurate (obviously), and relatively concise ^{1}. And second, data visualization can be helpful for exploring trends or patterns in your data. (Though, a brief caveat: data visualization is useful for generating hypotheses in an exploratory way, but you need to watch out for using it to hypothesize after the results are known, i.e., HARKing. That is, generating a bunch of different data visualizations to determine which statistical test to run isn’t much different from running a bunch of t-tests and selecting only the significant results.)

This tutorial is certainly not exhaustive. The point is simply to serve as an introduction or starting point for learning more about some of these useful tools. There are, of course, many additional resources for learning about data wrangling and visualization in R:

- data wrangling cheatsheet
- tidy data (Wickham, 2014)
- data visualization cheatsheet
- a layered grammar of graphics (Wickham, 2010)
- data visualization tutorial

I also have a number of other statistics tutorials in case you’re interested.

This tutorial will use examples from the following libraries:

- `tidyverse`
- `corrplot`
- `forcats`
- `ggridges`
- `lme4`
- `clue`

As noted above, “data wrangling” is a broad term encompassing pretty much any transformations you might need to apply to your data to get it into the proper format for visualization or analysis. I’ll be focusing on some of the common operations found in the `tidyverse` library (most of which are also summarized in this extremely helpful cheatsheet).

There are, of course, many philosophies on what makes data “messy” or “clean”. Central to the `tidyverse` set of packages is the notion of **tidy data**.

Tidy data, as defined by Hadley Wickham (Wickham, 2014), have the following characteristics:

- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.

Here’s Wickham (2014, pg. 5) on why tidy data is appropriate:

Tidy data makes it easy for an analyst or a computer to extract needed variables because it provides a standard way of structuring a dataset… If you consider how many data analysis operations involve all of the values in a variable (every aggregation function), you can see how important it is to extract these values in a simple, standard way. Tidy data is particularly well suited for vectorised programming languages like R, because the layout ensures that values of different variables from the same observation are always paired.

The way I think about this is: if you imagine the statistical model you want to build (e.g., \(Y = X_1 + X_2\)), the summary statistic you want to calculate (e.g., mean \(Y\) for values of categorical factor \(X_1\)), or the visualization you want to produce (e.g., a boxplot of \(Y\) as a function of categorical factor \(X_1\)), **each of the relevant dimensions should be a column in your dataset, and each of the observations you want to model, summarize, or plot should be a row in your dataset**. Tidy data makes it easy to summarize, manipulate, visualize, and model your data.

As an example, let’s consider the following three data structures (adapted from Wickham, 2014), all representing the “same” underlying data—three participants (`john`, `mary`, and `jane`) in an experiment with two `treatments` (`a`/`b`), with each treatment for each participant producing some `result`. In each case, we want to ask the same question of the data: does the `treatment` applied produce a different `result`, controlling for differences across subjects?

The formal specification for such a model might look as follows ^{2}:

\(Y_{result} = X_{treatment} + W_{subject} + \epsilon\)

First up is an example of “messy” data. Each participant gets their own column of results, and the treatment is indicated by the `treatment` column.

```
df_messy_1 = data.frame(john = c(10, 11),
                        mary = c(20, 25),
                        jane = c(5, 10),
                        treatment = c('a', 'b'))
df_messy_1
```

```
## john mary jane treatment
## 1 10 20 5 a
## 2 11 25 10 b
```

This dataset structure makes certain operations easy, like calculating the mean value for a given participant:

`mean(df_messy_1$john)`

`## [1] 10.5`

But other operations are much harder, like calculating the mean value for a `treatment`. And it’s also not clear how you’d easily build a statistical model using standard modeling functions in R (e.g., `lm` or `aov`).
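To see the awkwardness concretely, here’s one way you might compute the mean for treatment `a` in this layout (a sketch; note that we have to enumerate the participant columns by hand):

```
# (df_messy_1 as defined above)
df_messy_1 = data.frame(john = c(10, 11),
                        mary = c(20, 25),
                        jane = c(5, 10),
                        treatment = c('a', 'b'))

# Mean result for treatment 'a' in the wide layout: we have to
# manually list every participant column ourselves.
row_a = df_messy_1[df_messy_1$treatment == 'a', c('john', 'mary', 'jane')]
mean(unlist(row_a))
## [1] 11.66667
```

Adding a fourth participant would silently break this code unless we remember to update the column list—one symptom of a data structure that isn’t tidy.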

Here’s another example of a messy data structure:

```
df_messy_2 = data.frame(name = c('john', 'mary', 'jane'),
                        a = c(10, 20, 5),
                        b = c(11, 25, 10))
df_messy_2
```

```
## name a b
## 1 john 10 11
## 2 mary 20 25
## 3 jane 5 10
```

Now it’s easy to get the mean of a `treatment`:

`mean(df_messy_2$a)`

`## [1] 11.66667`

But now you can’t easily get the mean of each participant. And if you want to model the differences between `a` and `b`, it’s again not clear how you’d specify that model in R—particularly if you wanted to add a grouping factor for each participant.

Now here’s an example of **tidy** data:

```
df_tidy = data.frame(name = c('john', 'john', 'mary', 'mary', 'jane', 'jane'),
                     result = c(10, 11, 20, 25, 5, 10),
                     treatment = c('a', 'b', 'a', 'b', 'a', 'b'))
df_tidy
```

```
## name result treatment
## 1 john 10 a
## 2 john 11 b
## 3 mary 20 a
## 4 mary 25 b
## 5 jane 5 a
## 6 jane 10 b
```

This data structure is well-suited to the question we want to ask: do differences in `treatment` affect `result` across different participants (i.e., `name`)? It translates very well to the formal specification of this question described above:

```
library(lme4)

model_full = lmer(data = df_tidy,
                  result ~ treatment + (1 | name))
```

It’s also very easy to use other `tidyverse` functions to summarize this data, and critically, the **same functions** can be used to summarize different aspects of the data:

```
df_tidy %>%
  group_by(name) %>%
  summarise(mean_result = mean(result),
            sd_result = sd(result))
```

```
## # A tibble: 3 x 3
## name mean_result sd_result
## <fct> <dbl> <dbl>
## 1 jane 7.5 3.54
## 2 john 10.5 0.707
## 3 mary 22.5 3.54
```

```
df_tidy %>%
  group_by(treatment) %>%
  summarise(mean_result = mean(result),
            sd_result = sd(result))
```

```
## # A tibble: 2 x 3
## treatment mean_result sd_result
## <fct> <dbl> <dbl>
## 1 a 11.7 7.64
## 2 b 15.3 8.39
```

I recognize that thinking about data structures isn’t necessarily **intuitive** or even **interesting**. It certainly isn’t intuitive to me, and it’s hard to find it interesting without understanding it.

But anyone working with data will come up against these issues at some point, and it’s helpful to have a *vocabulary* with which to identify the problems you’re facing. For a long time, I was frustrated because I often had messy data and wanted it to be tidy, but I didn’t even know that’s what I wanted. Now I’m less frustrated: I still have messy data but I’m a *little* better at diagnosing exactly what’s messy about it, and that helps make it tidy.

I also think it’s important to point out that the structures in `df_messy_1` or `df_messy_2` aren’t inherently “bad” representations of data; they might be useful for certain purposes, like presenting a table in a paper. But the point here is that **tidy** data makes many computational operations very easy, and that’s part of the goal of data wrangling.

An obvious question at this point is: given that my data isn’t tidy, how do I make it tidy?

Fortunately, the `tidyverse` has a number of options for this. One of the most relevant is the `gather` function. From the documentation:

Gather takes multiple columns and collapses into key-value pairs, duplicating all other columns as needed. You use gather() when you notice that you have columns that are not variables.

In other words, `gather` is a really straightforward way to make data tidy, in which each row consists of an **observation**.

Let’s use `gather` on each of our messy dataframes from above.

First, consider the dataframe in which each participant gets their own column.

`df_messy_1`

```
## john mary jane treatment
## 1 10 20 5 a
## 2 11 25 10 b
```

How could we make this tidy? There are currently four columns and two rows, but in reality, we actually have *six* observations: one per treatment per participant. This means that ultimately, we want our data structure to have **three** columns (following the format of `df_tidy` above): one column indicating the `treatment`, one column indicating the `result` for that treatment, and one column indicating the `name` of the participant from whom that result was obtained.

This is the call to `gather` that would produce such a data structure:

```
df_messy_1 %>%
  gather(... = c(john, mary, jane), key = 'name', value = 'result')
```

```
## treatment name result
## 1 a john 10
## 2 b john 11
## 3 a mary 20
## 4 b mary 25
## 5 a jane 5
## 6 b jane 10
```

What exactly is happening here?

First, note the piping operator `%>%`: this is a way to **chain together** multiple functions in R. In this case, we can omit the `data` argument from `gather` because `%>%` is telling R to pass it in already. I like this approach because each function/operation can be given its own line, making the code clear and readable.
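As a quick illustration, the pipe simply passes its left-hand side in as the first argument, so these two calls produce the same result (a minimal sketch, re-using `df_messy_1` from above):

```
library(tidyverse)

# (df_messy_1 as defined above)
df_messy_1 = data.frame(john = c(10, 11),
                        mary = c(20, 25),
                        jane = c(5, 10),
                        treatment = c('a', 'b'))

# Equivalent: pass the dataframe in explicitly as the first argument...
gather(df_messy_1, key = 'name', value = 'result', john, mary, jane)

# ...or pipe it in with %>%, which supplies that first argument for us.
df_messy_1 %>%
  gather(key = 'name', value = 'result', john, mary, jane)
```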

Next, the `... = c(john, mary, jane)` argument tells `gather` which columns to “gather”, essentially. If you leave this blank, it’ll try to gather *all* the columns, but that’s not what we want here: we don’t want to put `treatment` and `name` in the same column. In this case, we’re telling the function to collect the three named columns and put them into a single set of `key/value` columns.

Finally, we tell `gather` what we want to name these new columns. You can leave this blank too, in which case they’ll just be named `key` and `value`. But here we know that we want to name them `name` (the participant) and `result` (the outcome of that treatment).

Now let’s turn to the other messy data structure:

`df_messy_2`

```
## name a b
## 1 john 10 11
## 2 mary 20 25
## 3 jane 5 10
```

This one has the opposite problem. Each `name` gets its own row, but there are separate columns for the two `treatments`. We need to `gather` these two columns and put them into a single set of `key/value` columns:

```
df_messy_2 %>%
  gather(... = c(a, b), key = 'treatment', value = 'result')
```

```
## name treatment result
## 1 john a 10
## 2 mary a 20
## 3 jane a 5
## 4 john b 11
## 5 mary b 25
## 6 jane b 10
```

Just as above, this produces a data structure that’s identical (disregarding the order of the rows) to `df_tidy`!

There are a number of operations that fall under the broad category of “reshaping”, and many other packages (including `reshape`) that have their own functions (e.g., `melt`) to reshape data.

This cheatsheet summarizes some of the `tidyverse` functions for reshaping very succinctly.
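One such function worth knowing is `spread`, the inverse of `gather`: it takes tidy (long) data back to a wide layout. A minimal sketch, using `df_tidy` from above:

```
library(tidyverse)

# (df_tidy as defined above)
df_tidy = data.frame(name = c('john', 'john', 'mary', 'mary', 'jane', 'jane'),
                     result = c(10, 11, 20, 25, 5, 10),
                     treatment = c('a', 'b', 'a', 'b', 'a', 'b'))

# spread() turns each level of 'treatment' into its own column,
# recovering the wide layout of df_messy_2.
df_tidy %>%
  spread(key = treatment, value = result)
```

This can be handy when you want the presentation-friendly wide format (e.g., for a table in a paper) after doing your analysis in tidy form.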

Another very common operation is **subsetting** data in some way—either subsetting particular *rows* based on the values in their cells (i.e., **filtering**), or subsetting particular *columns* based on their name.

`filter` is probably the most frequent row subsetting operation I use. Combined with the pipe `%>%` syntax, it’s a great way to pipe different subsets of your data into a visualization or aggregation function for exploratory data analysis. It’s also just an integral part of any data processing pipeline in which you need to exclude data on the basis of certain features.

The syntax is straightforward. Here’s a demonstration using the built-in `mtcars` dataset, which has information about the average miles-per-gallon (`mpg`), number of cylinders (`cyl`), and so on for 32 different cars. We can use `filter` to consider only cars with `mpg` better than the mean `mpg` in the dataset.

`mean(mtcars$mpg)`

`## [1] 20.09062`

```
mtcars %>%
  filter(mpg > mean(mtcars$mpg))
```

```
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## 5 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## 6 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## 7 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## 8 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## 9 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## 10 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## 11 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## 12 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## 13 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## 14 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
```

Note that for numeric variables, standard comparison operators all apply for `filter`: `>`, `>=`, `<`, `<=`, `==`, `!=`. You can also use other operators like `%in%`—which, as the name implies, tests whether the value in a given row is contained in a given vector. For nominal (i.e., categorical) variables, you obviously can’t use comparators like `>`. But you can still test for string equality (`==`) or inequality (`!=`), or use operations like `%in%`.
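For example, here’s a small sketch of `%in%` combined with `filter`, keeping only the 4- and 6-cylinder cars in `mtcars`:

```
library(tidyverse)

# Keep only rows whose 'cyl' value appears in the given vector.
mtcars %>%
  filter(cyl %in% c(4, 6))
```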

For instance, we can filter our `df_tidy` dataframe from earlier to only consider observations from a given participant:

```
df_tidy %>%
  filter(name == "john")
```

```
## name result treatment
## 1 john 10 a
## 2 john 11 b
```

Another useful subsetting operation is `distinct`. This will return a dataframe with **distinct** rows, based on the column names you pass in. If you don’t pass any column names in, it will check for rows that are entirely the same, i.e., have the same value in each column. This is also a useful way to check for duplicates in your data.
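For instance, here’s a sketch using `distinct` on `df_tidy` to pull out the unique `treatment` levels:

```
library(tidyverse)

# (df_tidy as defined above)
df_tidy = data.frame(name = c('john', 'john', 'mary', 'mary', 'jane', 'jane'),
                     result = c(10, 11, 20, 25, 5, 10),
                     treatment = c('a', 'b', 'a', 'b', 'a', 'b'))

# One row per unique value of 'treatment'.
df_tidy %>%
  distinct(treatment)
```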

You might also want to randomly sample rows from your data, either in terms of an absolute number of rows or a proportion of your dataset. The `tidyverse` has functions for both of these operations. First, you can sample an absolute number of rows using `sample_n`:

```
df_tidy %>%
  sample_n(2, replace = FALSE)
```

```
## name result treatment
## 4 mary 25 b
```
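And second, you can sample a proportion of rows using `sample_frac`. A minimal sketch (here 0.5 draws half of the six rows in `df_tidy`):

```
library(tidyverse)

# (df_tidy as defined above)
df_tidy = data.frame(name = c('john', 'john', 'mary', 'mary', 'jane', 'jane'),
                     result = c(10, 11, 20, 25, 5, 10),
                     treatment = c('a', 'b', 'a', 'b', 'a', 'b'))

# Draw 50% of the rows (3 of 6) without replacement.
df_tidy %>%
  sample_frac(0.5, replace = FALSE)
```

Because the rows are drawn at random, your output will likely differ from run to run; set a seed with `set.seed` if you need reproducible samples.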