Note
Note that today’s lecture will extend into Friday’s in-class lab slot. If we have time on Friday, we can also work on the take-home lab in class.
One of the most frustrating parts of programming is tooling: getting your computer set up to actually do the stuff you want to learn about.
In this class, we’ll be working with the R programming language using a desktop IDE called RStudio.
There are many different programming languages. Why use R?
“Base” R just refers to the set of functions and tools available “out of the box”, without using additional packages like tidyverse.
Base R includes (but is not limited to):
- plot, as well as core types like vectors.
- lm and anova (which we’ll discuss later).
Variables allow us to store information (values, vectors, etc.) so we can use it again later.
Here, we create a variable called account, so we can add to it.
[1] 45
Note
You can also use = for assignment, but <- is the R convention. (For most purposes in this course, it shouldn’t matter which you use, and I sometimes mix them up!)
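A minimal sketch of assignment and reuse. The starting balance of 20 and the deposit of 25 are assumed values, picked only so that the sum matches the 45 printed above; the variable savings is also hypothetical.

```r
# Assumed starting value (not from the original slide).
account <- 20

# Adding to a variable prints the result but does not change the variable.
account + 25

# To keep the new value, assign the result back to the variable.
account <- account + 25
account

# `=` also assigns, though `<-` is the R convention.
savings = 100
savings
```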
💭 Check-in
Try adding different numbers to account. What do you think will happen if you add a string like 'CSS'?
Each variable has a certain type or class.
You can do different things with different types of variables. For instance, you can’t calculate the mean of multiple characters, but you can for numeric types.
| Type | What it is | Example |
|---|---|---|
| numeric | Numbers (integers & decimals) | age <- 25, gpa <- 3.7 |
| character | Text strings | name <- "Alice" |
| logical | TRUE/FALSE values | passed <- TRUE |
| integer | Whole numbers only | count <- 5L |
| factor | Categorical data | grade <- factor("A") |
💭 Check-in
Use the typeof() or class() functions to check the type of different variables or values.
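As a rough illustration, the calls below check the example values from the table. Note that class() and typeof() can disagree: class() reports how R treats the object, while typeof() reports how it is stored internally.

```r
class(25)            # "numeric"
typeof(25)           # "double": plain numbers are stored as doubles
class(5L)            # "integer": the L suffix requests an integer
class("Alice")       # "character"
class(TRUE)          # "logical"
class(factor("A"))   # "factor"
typeof(factor("A"))  # "integer": factors are stored as integer codes
```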
numeric variables allow for a number of arithmetic operations (like a calculator).
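A short sketch of the common arithmetic operators, reusing the age value from the table above:

```r
age <- 25

age + 5    # addition: 30
age - 5    # subtraction: 20
age * 2    # multiplication: 50
age / 4    # division: 6.25
age ^ 2    # exponentiation: 625
age %% 7   # remainder after division: 4
age %/% 7  # integer division: 3
```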
A vector is a collection of elements with the same class.
Vectors can be created with the c(...) function.
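A minimal sketch of creating and indexing a vector. The name ages and its values are assumptions for illustration.

```r
# Hypothetical vector of three numeric values.
ages <- c(25, 30, 32)

class(ages)   # "numeric"
length(ages)  # 3

ages[1]       # first element: 25 (R indexes from 1, not 0)
ages[2:3]     # second and third elements: 30 32
ages[-1]      # everything except the first element: 30 32

# Mixing types coerces every element to one common class.
mixed <- c(1, "a", TRUE)
class(mixed)  # "character"
```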
💭 Check-in
Create your own vector, this time with character types in it. Try indexing into different parts of the vector.
Like scalars, numeric vectors can be manipulated mathematically.
[1] 26 31 33
[1] 125 150 160
[1] 26 32 35
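The three outputs above are consistent with the operations below. The original code is not shown here, so the vector name ages and its values are a reconstruction, not necessarily what the slide used.

```r
# Assumed vector; chosen so the results match the printed outputs.
ages <- c(25, 30, 32)

# A single number is applied to every element.
ages + 1         # 26 31 33
ages * 5         # 125 150 160

# Two vectors of the same length are combined element by element.
ages + c(1, 2, 3)  # 26 32 35
```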
💭 Check-in
What do you think will happen if you try to add two vectors of different lengths?
A function implements some operation; you can think of it as a verb applied to some input.
In CSS, you’ll often be using functions to summarize your data (like a vector).
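For instance, a few base R summary functions applied to one vector. The vector commute_minutes is hypothetical.

```r
# Hypothetical commute times, in minutes.
commute_minutes <- c(10, 20, 30, 40, 100)

length(commute_minutes)  # number of elements: 5
sum(commute_minutes)     # 200
mean(commute_minutes)    # 40
median(commute_minutes)  # 30 (less affected by the 100 than the mean is)
max(commute_minutes)     # 100
sd(commute_minutes)      # standard deviation
```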
💭 Check-in
Create a vector containing possible incomes; then calculate the mean and median of this vector.
In addition to creating vectors by hand, we can use functions to create random vectors by sampling from some distribution, e.g., a normal distribution (rnorm(n, mean, sd), where n is the number of values to draw).
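A small sketch of sampling and plotting. The seed, the sample size, and the mean and sd values are arbitrary choices, and heights is a hypothetical name.

```r
# Setting a seed makes the random draws reproducible.
set.seed(1)

# Draw 1000 values from a normal distribution.
heights <- rnorm(n = 1000, mean = 170, sd = 10)

length(heights)  # 1000
mean(heights)    # close to 170, but not exactly
sd(heights)      # close to 10, but not exactly

# A histogram shows the shape of the sampled distribution.
hist(heights)
```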

💭 Check-in
Try changing the different parameters of rnorm and then plotting the resulting vector again using hist. What do you notice about changing the mean or sd?
There are also many types of distributions beyond normal distributions.
- runif (uniform).
- rbinom (binomial).
- rpois (Poisson).
💭 Check-in
Create a uniform distribution with runif with 100 values ranging from 2 to 3. If you’re not sure how to do this, use ?runif to learn more about the function.
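The other samplers follow the same calling pattern as rnorm: the first argument is the number of draws, and the remaining arguments describe the distribution. The parameter values below are arbitrary, and the variable names are hypothetical.

```r
set.seed(2)  # arbitrary seed, for reproducibility

# Uniform: every value between min and max is equally likely.
unif_draws <- runif(n = 5, min = 0, max = 1)

# Binomial: number of successes in `size` trials, each with probability `prob`.
binom_draws <- rbinom(n = 5, size = 10, prob = 0.5)

# Poisson: counts of events with an average rate of `lambda`.
pois_draws <- rpois(n = 5, lambda = 3)

unif_draws
binom_draws
pois_draws
```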
So far, we’ve covered a number of core topics in base R.
- Plotting with hist.
💭 Check-in
Any questions before we move on to creating dataframes and other kinds of plots?
The data.frame class is a “tightly coupled collection of variables”; it’s also a fundamental data structure in R.
- Similar to a pandas.DataFrame in Python!
Note
Note that once we move to the tidyverse, we’ll be working with tibbles, which are basically like a data.frame.
A data.frame can be created using the data.frame function.
hours_studied test_score
1 0 70
2 2 85
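A sketch of how this data.frame could be built. The column values are taken from the str() output shown later in these notes, and the name df_example comes from the correlation output; the exact original call is not shown, so treat this as a reconstruction.

```r
# Each argument becomes a column; all columns must have the same length.
df_example <- data.frame(
  hours_studied = c(0, 2, 2, 3, 5, 8),
  test_score = c(70, 85, 89, 89, 94, 95)
)

# Show only the first two rows.
head(df_example, 2)
```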
💭 Check-in
Try creating your own data.frame with custom columns. For example, one column could be movie_title and another could be your rating of that movie. Make sure the columns are the same length!
We can use functions like nrow, head, and colnames to learn about our data.frame.
[1] 6
[1] "hours_studied" "test_score"
hours_studied test_score
1 0 70
2 2 85
'data.frame': 6 obs. of 2 variables:
$ hours_studied: num 0 2 2 3 5 8
$ test_score : num 70 85 89 89 94 95
NULL
hours_studied test_score
Min. :0.000 Min. :70.00
1st Qu.:2.000 1st Qu.:86.00
Median :2.500 Median :89.00
Mean :3.333 Mean :87.00
3rd Qu.:4.500 3rd Qu.:92.75
Max. :8.000 Max. :95.00
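The outputs above look like the results of the calls below, in this order. This is a reconstruction, since the original code is not shown. The stray NULL is what you would see if str() were wrapped in print(), because str() prints its summary and then returns NULL; that is a guess about the original code.

```r
df_example <- data.frame(
  hours_studied = c(0, 2, 2, 3, 5, 8),
  test_score = c(70, 85, 89, 89, 94, 95)
)

nrow(df_example)      # number of rows: 6
colnames(df_example)  # column names
head(df_example, 2)   # first two rows
str(df_example)       # structure: column types and first values
summary(df_example)   # min, quartiles, median, mean, max per column
```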
You can access individual columns using the dataframe$column_name syntax.
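A brief sketch of the $ syntax, using the same example data. The passed column and the cutoff of 80 are hypothetical additions for illustration.

```r
df_example <- data.frame(
  hours_studied = c(0, 2, 2, 3, 5, 8),
  test_score = c(70, 85, 89, 89, 94, 95)
)

# A column is just a vector, so vector functions work on it.
df_example$test_score
mean(df_example$test_score)  # 87

# The same syntax can create a new column (hypothetical cutoff of 80).
df_example$passed <- df_example$test_score >= 80
sum(df_example$passed)       # 5 scores are at or above 80
```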
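In base R, you can filter the rows of a data.frame using the df[CONDITION, ] syntax, where CONDITION is a logical statement evaluated for each row. The comma matters: the condition goes before it to select rows, and leaving the part after it empty keeps all columns.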
hours_studied test_score
4 3 89
5 5 94
6 8 95
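One condition that reproduces the three rows above is hours_studied greater than 2. The original condition is not shown, so this is an assumption.

```r
df_example <- data.frame(
  hours_studied = c(0, 2, 2, 3, 5, 8),
  test_score = c(70, 85, 89, 89, 94, 95)
)

# The condition is a logical vector with one TRUE/FALSE per row.
df_example$hours_studied > 2

# Rows go before the comma; leaving the column slot empty keeps all columns.
df_filtered <- df_example[df_example$hours_studied > 2, ]
df_filtered
```

Note that the original row numbers (4, 5, 6) are kept after filtering.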
Note
In the tidyverse, we can use the handy filter function.
Once you have multiple vectors, you can plot the relationship between them, e.g., using a simple scatterplot.
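A minimal scatterplot sketch with base R's plot function, using the example data from this lecture. The axis labels and title are my own choices.

```r
df_example <- data.frame(
  hours_studied = c(0, 2, 2, 3, 5, 8),
  test_score = c(70, 85, 89, 89, 94, 95)
)

# The first argument goes on the x-axis, the second on the y-axis.
plot(
  df_example$hours_studied,
  df_example$test_score,
  xlab = "Hours studied",
  ylab = "Test score",
  main = "Test score by hours studied"
)
```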

You can also quantify the relationship between variables, e.g., using Pearson’s correlation coefficient (r).
Pearson's product-moment correlation
data: df_example$hours_studied and df_example$test_score
t = 2.8958, df = 4, p-value = 0.0443
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.03391115 0.97998161
sample estimates:
cor
0.8228274
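The output above matches a call to cor.test on the two columns of df_example. The sketch below also shows how to pull individual numbers out of the result; the name result is hypothetical.

```r
df_example <- data.frame(
  hours_studied = c(0, 2, 2, 3, 5, 8),
  test_score = c(70, 85, 89, 89, 94, 95)
)

# Pearson's correlation is the default method.
result <- cor.test(df_example$hours_studied, df_example$test_score)
result

# The result is a list, so individual pieces can be extracted with $.
result$estimate  # the correlation coefficient
result$p.value   # the p-value
result$conf.int  # the 95 percent confidence interval

# cor() returns just the coefficient, without the test.
cor(df_example$hours_studied, df_example$test_score)
```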
Real data often contains missing values. R represents these as NA (Not Available). We’ll discuss these in more detail next week, but here’s a preview:
You can remove missing data by filtering the data.frame with a condition built from is.na, such as df[!is.na(df$column), ].
[1] 85 92 78 88
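A sketch of that filtering step. The data.frame df_scores, its column names, and the positions of the missing values are assumptions; only the four non-missing scores come from the output above.

```r
# Hypothetical data with two missing scores.
df_scores <- data.frame(
  student = c("a", "b", "c", "d", "e", "f"),
  score = c(85, NA, 92, 78, NA, 88)
)

# is.na returns TRUE wherever a value is missing.
is.na(df_scores$score)

# Many functions return NA if any input is missing, unless told otherwise.
mean(df_scores$score)                # NA
mean(df_scores$score, na.rm = TRUE)  # 85.75

# Keep only the rows where score is NOT missing.
df_complete <- df_scores[!is.na(df_scores$score), ]
df_complete$score                    # 85 92 78 88
```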
So far, we’ve discussed a number of useful concepts in R:
- Sampling with rnorm.
- Creating data.frame objects and plotting or analyzing them.
💭 Check-in
Now, let’s simulate data.
- Use rnorm to create a random normal distribution of parent_heights (use parameters that seem reasonable to you).
- Create child_heights that’s related to parent_heights, ideally with some random error added. (Hint: Think about how a regression line works.)
- Put both in a data.frame.
[1] 0.8720213
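One possible approach, sketched under assumed parameters: heights in centimeters, a slope of 0.7, an intercept of 50, and noise with a standard deviation of 4. None of these values come from the lecture, the name df_heights is hypothetical, and the resulting correlation will not necessarily match the number printed above.

```r
set.seed(42)  # arbitrary seed, for reproducibility

# Step 1: simulate parent heights (assumed mean and sd, in cm).
parent_heights <- rnorm(n = 100, mean = 170, sd = 8)

# Step 2: child height = intercept + slope * parent height + random error,
# which is the form of a regression line with noise added.
child_heights <- 50 + 0.7 * parent_heights + rnorm(n = 100, mean = 0, sd = 4)

# Step 3: combine both vectors in a data.frame.
df_heights <- data.frame(
  parent_heights = parent_heights,
  child_heights = child_heights
)

# Plot and quantify the relationship.
plot(df_heights$parent_heights, df_heights$child_heights)
cor(df_heights$parent_heights, df_heights$child_heights)
```

Increasing the sd of the error term should weaken the correlation; shrinking it should strengthen it.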
Next week, we’ll discuss the tidyverse: a set of packages and functions developed to make data analysis and visualization in R easier.
This includes (but is not limited to):
- Data transformation functions like filter or mutate.
- Join functions like left_join or inner_join.
- Plotting with ggplot.
This course is not primarily about programming in R, but programming in R is a foundational skill for other parts of this course.
This lecture (and accompanying lab) is intended to give you more comfort with the following concepts:
- data.frame objects.
CSS 211 | UC San Diego