Data visualization is the process (and result) of representing data graphically.
We’ll be focusing on common visualization techniques, such as:
Note
We’ll also discuss why visualization is so crucial.
Data visualization serves (at least) a few different purposes:

Florence Nightingale (1820-1910) was a social reformer, statistician, and founder of modern nursing.
John Snow (1813-1858) was a physician whose visualization of cholera outbreaks helped identify the source and spreading mechanism (water supply).
Edward Tufte argues:
Graphical excellence is the well-designed presentation of interesting data—a matter of substance, of statistics, and of design … [It] consists of complex ideas communicated with clarity, precision, and efficiency. … [It] is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space … [It] is nearly always multivariate … And graphical excellence requires telling the truth about the data. (Tufte, 1983, p. 51).
Some principles:
Tufte’s principle
“Above all else show the data.” - Edward Tufte
Tip
This will be relevant when thinking about the layers in ggplot.
Rule of thumb
Use consistent scales when inviting direct comparison; use free scales when showing patterns within each group.
ggplot2: theory and practice
ggplot2 is a system for creating graphics, based on the Grammar of Graphics.
Just like natural language has a grammar (nouns, verbs, adjectives), graphics have a grammar too:
Aesthetics (aes): How variables map to visual properties (x, y, color, size).
Geometries (geom): The type of plot (points, lines, bars).
Tip
ggplot builds plots by adding layers with the + operator.
Every ggplot needs, at minimum:
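As a rough sketch, a minimal plot pairs a dataset and an aesthetic mapping with at least one geom layer. The choice of the mpg data (bundled with ggplot2) and of the displ and hwy columns is an assumption made for illustration; the name p_minimal is hypothetical.

```r
library(ggplot2)

# Data + aesthetic mapping + one geom layer, joined with `+`.
# mpg ships with ggplot2; displ and hwy are illustrative column choices.
p_minimal <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

p_minimal
```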
A histogram is a visualization of a single continuous, quantitative variable (e.g., income or temperature).
A histogram can be created with geom_histogram.
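A minimal sketch of a histogram, assuming the mpg data bundled with ggplot2 and its hwy column (both chosen here for illustration). The bins argument sets the number of bins; binwidth could be supplied instead to fix the width of each bin.

```r
library(ggplot2)

# One continuous variable on the x-axis; geom_histogram counts rows per bin.
# bins = 20 is an arbitrary illustrative choice.
p_hist <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(bins = 20)

p_hist
```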

💭 Check-in
What happens if you modify bins or binwidth?
Histograms show important visual information about a distribution:
💭 Check-in
How does the skew of a distribution affect measures of central tendency such as the mean or median?
A density plot is a smoothed alternative to a histogram, created using kernel density estimation (KDE).
A density plot can be created with geom_density.
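A minimal sketch, again assuming the hwy column of the bundled mpg data as an illustrative variable:

```r
library(ggplot2)

# Kernel density estimate of a single continuous variable.
p_density <- ggplot(mpg, aes(x = hwy)) +
  geom_density()

p_density
```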

One benefit of a density plot is that it’s easier to overlay multiple distributions on the same plot. The alpha parameter controls the opacity of each distribution.
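One way to sketch an overlay is to map a categorical variable to fill and lower alpha so the curves remain visible where they overlap. Using drv as the grouping variable and alpha = 0.5 are assumptions for illustration.

```r
library(ggplot2)

# One density curve per drive type; alpha = 0.5 makes each fill half-transparent.
p_overlay <- ggplot(mpg, aes(x = hwy, fill = drv)) +
  geom_density(alpha = 0.5)

p_overlay
```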

A scatterplot shows the relationship between two continuous variables, where each point represents an observation.
A scatterplot can be created with geom_point.
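A minimal sketch, assuming the cty and hwy columns of the bundled mpg data as the two continuous variables:

```r
library(ggplot2)

# Each point is one car: city mpg on x, highway mpg on y.
p_scatter <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point()

p_scatter
```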

We can further modify the color, size, and shape of individual dots.

Tip
Use categorical variables for shape, categorical or continuous variables for color, and ordinal or continuous variables for size.
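A sketch of these mappings, assuming drv (categorical) for color and shape and cyl (numeric) for size; the alpha value is an arbitrary choice to reduce overplotting.

```r
library(ggplot2)

# Categorical drv drives color and shape; numeric cyl drives point size.
p_styled <- ggplot(mpg, aes(x = cty, y = hwy,
                            color = drv, shape = drv, size = cyl)) +
  geom_point(alpha = 0.5)

p_styled
```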
We can also use geom_smooth to plot a regression line (or another non-linear function) over our scatterplot.
(If there are multiple colors, etc., a different line will be plotted for each color.)
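A sketch of a linear fit layered over a scatterplot. Because color is mapped to drv in the shared aesthetics, geom_smooth fits one line per drive type. The method = "lm" setting is one option among several and is chosen here only for illustration.

```r
library(ggplot2)

# Points plus one linear regression line (with confidence band) per drv level.
p_smooth <- ggplot(mpg, aes(x = cty, y = hwy, color = drv)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x)

p_smooth
```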

A barplot visualizes the relationship between one continuous variable and (at least one) categorical variable.
A barplot can be created with geom_bar.
geom_bar will count occurrences (like a histogram for categorical variables).
Plot a summary statistic such as the mean using geom_bar(stat = "summary").
Plot pre-computed values using geom_col or geom_bar(stat = "identity").
By default, geom_bar will count the occurrences of each class.
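A minimal sketch of the default counting behavior, assuming the class column of the bundled mpg data as the categorical variable:

```r
library(ggplot2)

# With only x mapped, geom_bar counts the rows in each category.
p_counts <- ggplot(mpg, aes(x = class)) +
  geom_bar()

p_counts
```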

💭 Check-in
Create a barplot showing the counts of some other categorical variable in the mpg dataframe.
You can also calculate summary statistics, e.g., the mean of some y variable for each level of the x variable.
Use reorder to reorder the bars in terms of their values.
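A sketch combining both ideas, assuming drv as the x variable and hwy as the y variable. Here reorder(drv, hwy) sorts the drive types by their mean hwy (reorder uses the mean by default), and fun = "mean" asks the summary stat to compute the bar heights.

```r
library(ggplot2)

# Bars show mean highway mpg per drive type, ordered from lowest to highest mean.
p_means <- ggplot(mpg, aes(x = reorder(drv, hwy), y = hwy)) +
  geom_bar(stat = "summary", fun = "mean") +
  labs(x = "Drive Type", y = "Mean Highway MPG")

p_means
```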

group_by
Alternatively, you can calculate summary statistics using group_by %>% summarise, and pipe the output into a ggplot call.
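A sketch of this approach, assuming drv as the grouping variable and hwy as the variable being averaged. The names mpg_summary and mean_hwy are illustrative.

```r
library(dplyr)
library(ggplot2)

# Step 1: compute one mean per drive type.
mpg_summary <- mpg %>%
  group_by(drv) %>%
  summarise(mean_hwy = mean(hwy))

# Step 2: pipe the summary table into ggplot; geom_col plots values as given.
p_summary <- mpg_summary %>%
  ggplot(aes(x = reorder(drv, mean_hwy), y = mean_hwy)) +
  geom_col() +
  labs(x = "Drive Type", y = "Mean Highway MPG")

p_summary
```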

💭 Check-in
How would you instead plot the mean cty miles per gallon?
stat_summary
Often, you want to display some measure of dispersion in addition to the mean. One approach is to use stat_summary.
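A sketch using two stat_summary layers, one for the mean and one for the error bars. It assumes drv and hwy as the variables, and uses ggplot2's mean_se helper, which returns the mean plus and minus one standard error by default.

```r
library(ggplot2)

# Layer 1: bar at the mean. Layer 2: error bars at mean +/- 1 standard error.
p_stat <- ggplot(mpg, aes(x = drv, y = hwy)) +
  stat_summary(fun = mean, geom = "bar") +
  stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.2) +
  labs(x = "Drive Type", y = "Mean Highway MPG")

p_stat
```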

group_by
Alternatively, with the group_by method, you can first calculate the standard error, then pipe the result into geom_errorbar.
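library(tidyverse)  # loads ggplot2 and dplyr; the mpg data ships with ggplot2

mpg %>%
  group_by(drv) %>%
  summarise(
    mean_hwy = mean(hwy),
    se_hwy = sd(hwy) / sqrt(n())  # standard error of the mean
  ) %>%
  ggplot(aes(x = reorder(drv, mean_hwy), y = mean_hwy)) +
  geom_bar(stat = "identity") +
  # Error bars span one standard error on either side of the mean
  geom_errorbar(aes(ymin = mean_hwy - se_hwy,
                    ymax = mean_hwy + se_hwy),
                width = 0.2) +
  labs(x = "Drive Type", y = "Mean Highway MPG") +
  theme_minimal()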
💭 Check-in
How would you plot a confidence interval for two standard errors?
fill
You can add further information to a barplot using the fill parameter. Use position_dodge to show the bars side by side (rather than stacked).
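A sketch of a dodged barplot, assuming drv on the x-axis and model year as the fill variable. The year column is numeric in mpg, so it is wrapped in factor() to get discrete fill colors; both variable choices are illustrative.

```r
library(ggplot2)

# Mean highway mpg per drive type, split by model year, bars side by side.
p_dodge <- ggplot(mpg, aes(x = drv, y = hwy, fill = factor(year))) +
  geom_bar(stat = "summary", fun = "mean", position = position_dodge()) +
  labs(x = "Drive Type", y = "Mean Highway MPG", fill = "Year")

p_dodge
```

Dropping the position argument would stack the bars instead, which is the default for geom_bar.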

Boxplots and violin plots show more detailed information about the underlying distribution for a given category.
A boxplot shows the median, along with the inter-quartile range.
geom_boxplot
geom_violin
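A sketch of both geoms on the same data, assuming drv as the category and hwy as the continuous variable. The two plots share a base object so the only difference is the geom layer.

```r
library(ggplot2)

# Shared data and mapping for both plots.
p_base <- ggplot(mpg, aes(x = drv, y = hwy))

# Boxplot: median line, box spanning the inter-quartile range, whiskers, outliers.
p_box <- p_base + geom_boxplot()

# Violin plot: a mirrored density estimate for each category.
p_violin <- p_base + geom_violin()

p_box
p_violin
```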
Let’s revisit a plot we worked on earlier, with better labels and styling.
mpg %>%
  ggplot(aes(x = cty, y = hwy, color = drv, size = cyl, shape = drv)) +
  geom_point(alpha = .5) +
  labs(x = "City mpg",
       y = "Highway mpg",
       size = "Number of cylinders",
       color = "Drive Train",
       shape = "Drive Train") +
  theme_minimal() +
  scale_color_viridis_d() +
  theme(
    axis.title = element_text(size = 16),
    legend.title = element_text(size = 12),
    legend.text = element_text(size = 10)
  )
We can also annotate individual points, here by labeling selected observations with geom_text.
# Keep only the rows to annotate: compact cars with very high highway mpg
mpg_labeled <- mpg %>%
  filter(hwy > 42) %>%
  filter(class == "compact")

mpg %>%
  ggplot(aes(x = cty, y = hwy, color = drv, size = cyl, shape = drv)) +
  geom_point(alpha = .5) +
  # Text layer drawn from the filtered data; fixed size and color override the mapping
  geom_text(data = mpg_labeled,
            aes(label = model),
            hjust = 1.2, vjust = 0,
            size = 4,
            color = "black",
            show.legend = FALSE) +
  theme_minimal(base_size = 14)
ggplot has a ton of useful functions and geom types, and it’s probably all you “need”, but there are other options too.
ggcorrplot: gg-style correlation matrices.
ggridges: gg-style “ridge” plots (density plots).
ggcorrplot
ggcorrplot is a library (and function) that visualizes a correlation matrix.
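A sketch of typical usage, assuming the ggcorrplot package is installed. The selection of numeric mpg columns and the type = "lower" and lab = TRUE options are illustrative choices.

```r
library(ggplot2)     # for the mpg data
library(ggcorrplot)

# Correlation matrix of a few numeric columns (illustrative selection).
corr <- cor(mpg[, c("displ", "cyl", "cty", "hwy")])

# Show the lower triangle only and print the coefficient in each tile.
p_corr <- ggcorrplot(corr, type = "lower", lab = TRUE)

p_corr
```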

ggridges
ggridges is a library for arranging density plots in a staggered fashion.
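A sketch of a ridge plot, assuming the ggridges package is installed. Using hwy as the continuous variable and class as the grouping variable is an illustrative choice, as is the alpha value.

```r
library(ggplot2)
library(ggridges)

# One density curve per vehicle class, stacked along the y-axis.
p_ridges <- ggplot(mpg, aes(x = hwy, y = class, fill = class)) +
  geom_density_ridges(alpha = 0.6) +
  theme_minimal() +
  theme(legend.position = "none")  # the y-axis already names each class

p_ridges
```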

ggplot is a versatile and powerful library for creating clear, elegant figures.
CSS 211 | UC San Diego