An introduction to the tidyverse

The aim of this tutorial is to introduce students to the key features of the tidyverse using a public dataset (so there are no requirements other than having a working version of R/RStudio and installing some packages). The exact versions used in this tutorial are:

The tidyverse is a collection of data science tools to tidy and visualize data and includes several different R libraries suited for a broad range of tasks:

The basic workflow in tidying and visualizing data usually looks as follows:

In this tutorial we will learn about the tidyverse by exploring the palmerpenguins dataset, which contains measurements of three different species of penguins.

Image taken from: https://allisonhorst.github.io/palmerpenguins/reference/figures/lter_penguins.png

Note

Whenever we ask a question the code solution will be hidden. You can show the code by pressing on <Click me to see an answer>.

1 Some background information

1.1 Shortcuts and other helpful tips

When working in RMarkdown or Quarto documents, keyboard shortcuts let you insert pipes, code cells and other useful things without typing everything out, which speeds things up.

  • Ctrl/Cmd + Shift + M is used to insert a pipe.
  • Ctrl/Cmd + Alt + I is used to insert a new code cell block.
  • Pressing Tab while writing code auto-completes text. Try this by typing ?filt and pressing Tab: you can see all options coming up, and you can select the one you want by using the arrow keys and pressing Enter.

1.2 A word about Pipes

As we will see in a second, the tidyverse consists of a list of verbs that we can use to transform data. Each individual verb (e.g. filter, select, mutate) is quite simple, and solving more complex problems requires combining multiple verbs. We can do this with pipes, which chain verbs together using |> or its older version %>%. Using pipes results in shorter and more readable code compared to nesting function calls or storing the output of each step in a different variable. There are some differences between the two pipe versions, but for now you can use them interchangeably.

If you want RStudio to insert the newer version by default, go to Tools –> Global Options –> Code –> Use native pipe operator.
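To make the comparison between nested code, intermediate variables and pipes concrete, here is a minimal sketch in base R. The vector values is a made-up toy input and is not part of the penguins data.

```r
# Toy input (hypothetical): three numbers whose square roots are 1, 2 and 3
values <- c(1, 4, 9)

# Nested version: has to be read from the inside out
nested_result <- sum(sqrt(values))

# Intermediate-variable version: one new variable per step
roots <- sqrt(values)
stepwise_result <- sum(roots)

# Piped version: reads from top to bottom (native pipe, needs R >= 4.1)
piped_result <- values |>
  sqrt() |>
  sum()

piped_result
```

All three versions return the same number; the piped version simply lists the steps in the order in which they are executed.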

Now, with this out of the way, let’s get started with loading our data and learning about some useful verbs in the tidyverse.

Important

You might get an error when using |> if you are working with an R version below 4.1. If you get an error and can’t or don’t want to update R, use %>% instead. The native pipe |> is part of base R (from version 4.1 onwards) and needs no package, whereas %>% comes from the magrittr package and is available once the tidyverse library is loaded.

2 Initial data exploration

Notice:

If you are opening the notebook version (i.e. the qmd file) of this tutorial, you will sometimes see a little comment before the actual code, e.g. #| eval: false. This comment simply controls how the code chunk behaves when rendering the document to HTML and can be ignored when following the tutorial via the qmd file.

For this tutorial to run we need to install some tools first. Specifically, we need to download the tools of the tidyverse and the data we will explore today. To do this, type the following into your R notebook or console and execute it:

install.packages("palmerpenguins")
install.packages("tidyverse")

Once you have everything installed you need to load the libraries. Keep in mind that while you only need to install libraries once, you need to load them for every new R session.

#load packages
library(palmerpenguins)  #contains our dataset
library(tidyverse)      #tools for transforming and visualizing data

Let’s start exploring our data first by taking a look at the first few rows of our dataframe using the head() function.

Notice:

If you are viewing the HTML version, you can see all columns of the table below by clicking on the little arrow in the top right corner of the table.

head(penguins)

We see that we have 8 columns of data, with categorical data (species, island and sex) as well as different numerical measurements.

Next, let’s look in a bit more detail at the structure of the data using the str() function:

str(penguins)
tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
 $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
 $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
 $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
 $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
 $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
 $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...

Some key features we can take from this output:

  • The total amount of data we work with, i.e. 344 rows and 8 columns of data
  • The names and content of all 8 columns, e.g. species, bill_length_mm and sex
  • The type of data we have for each column, i.e. factor, numeric or integer
  • The unique levels for some variables, i.e. we compare 3 different penguin species
  • If we look closely, we can see whether or not we have to worry about NAs (i.e. missing data). For example, we see an NA in the sex column.

The summary() function is another useful way to get a quick overview of our data and gives us some useful summary statistics:

summary(penguins)
      species          island    bill_length_mm  bill_depth_mm  
 Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
 Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
 Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
                                 Mean   :43.92   Mean   :17.15  
                                 3rd Qu.:48.50   3rd Qu.:18.70  
                                 Max.   :59.60   Max.   :21.50  
                                 NA's   :2       NA's   :2      
 flipper_length_mm  body_mass_g       sex           year     
 Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
 1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
 Median :197.0     Median :4050   NA's  : 11   Median :2008  
 Mean   :200.9     Mean   :4202                Mean   :2008  
 3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
 Max.   :231.0     Max.   :6300                Max.   :2009  
 NA's   :2         NA's   :2                                 

For each factor column we get the number of observations per level, and for each numerical column we get basic summary statistics. Another useful piece of information we get is the number of missing values for each column.

3 Filter

We can use the filter function if we want to look at only a subset of our observations based on a condition.

To get started, let’s only select data from 2007.
To do this, we use our first operator, i.e. ==, which allows us to only select rows where the year is exactly 2007.

In the syntax below, we use the pipe to input the penguins dataframe into the filter function.

Important

Again a reminder: you might get an error when using |> if you are working with an R version below 4.1. If you get an error and can’t or don’t want to update R, use %>% instead. For %>% to work, you need to have the tidyverse library loaded.

#only print data of penguins if the year is (==, a conditional operator) 2007
penguins |>
  filter(year == 2007)

Now we see that we only have 110 rows compared to the 344 rows in our original dataframe.

Keep in mind that when doing this the original penguins dataframe is not changed. filter() simply returns a new dataset with fewer rows and prints it to the screen, but does not store it in the R environment. If we want to store the modified dataframe in our environment, we need to assign it to a new variable like this:

#store the output of the filtering step in a new variable, penguins_2007
penguins_2007 <-
  penguins |>
    filter(year == 2007)

#view the content of our modified dataframe by using the penguins_2007 variable
head(penguins_2007)

There are other operators that might come in handy for filtering dataframes:

Operator   Description
<          less than
<=         less than or equal to
>          greater than
>=         greater than or equal to
==         exactly equal to
!=         not equal to
!x         not x
x | y      x OR y
x & y      x AND y
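As a small sketch of how the logical operators from the table behave inside filter(), the example below uses a hypothetical four-row table called toy. The values are made up for illustration and are not taken from the penguins data.

```r
library(dplyr)

# Hypothetical toy data, not the real penguins measurements
toy <- tibble(
  species     = c("Adelie", "Adelie", "Gentoo", "Chinstrap"),
  year        = c(2007, 2008, 2008, 2009),
  body_mass_g = c(3700, 3900, 5200, 3600)
)

# AND: both conditions must be TRUE for a row to be kept
heavy_2008 <- toy |>
  filter(year == 2008 & body_mass_g > 4000)

# OR: at least one of the conditions must be TRUE
early_or_light <- toy |>
  filter(year < 2008 | body_mass_g <= 3600)

# NOT: negate a whole condition
not_adelie <- toy |>
  filter(!(species == "Adelie"))

nrow(heavy_2008)      # 1 row (the Gentoo record)
nrow(early_or_light)  # 2 rows
nrow(not_adelie)      # 2 rows
```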

Exercise

  1. How many observations, i.e. rows of data, do we have for the year 2009?
  2. How many observations, i.e. rows of data, do we have if we only consider penguins that have flippers equal to or longer than 200 mm?
Click me to see an answer

Question 1

#we have 120 rows of data
penguins |>
  filter(year == 2009)

Question 2

#we work with 152 rows of data
penguins |>
  filter(flipper_length_mm >= 200)

Question 2 alternative

#we can also use the pipe and use a base R function to count the number of rows instead of looking at the table
penguins |>
  filter(flipper_length_mm >= 200) |> 
  nrow()
[1] 152

3.1 Excluding data

We can also easily filter by excluding data using the != operator. For example, we can remove data from the year 2007 like this:

penguins |>
  filter(year != 2007)

3.2 Filtering using characters

We can also filter for characters, e.g. filter the data to only retain observations for a single penguin species. The only thing we have to watch out for is that if we work with characters (e.g. Adelie), we have to surround them with quotes.

#only return the data from Adelie penguins
penguins |>
  filter(species == "Adelie")

3.3 Filter using more than one column

We can also filter dataframes by setting conditions for different columns. For example, we might want to select data from 2007 and only from Adelie penguins. To do this we specify multiple conditions that we separate by a comma. Each == expression is called an argument.

#only return the data from Adelie penguins collected in 2007
penguins |>
  filter(year == 2007, species == "Adelie")

Exercise

How many observations do we have when we exclude data from 2008 and only look at penguins observed on the island Biscoe?

Click me to see an answer
penguins |>
  filter(year != 2008, island == "Biscoe")

We work with 104 rows of data.

3.4 Using filter with several targets

How would we filter if we wanted data from different targets, i.e. if we only wanted to select data collected from the islands Biscoe and Dream? There are different options.

  1. We could exclude data not matching our criteria
penguins |>
  filter(island != "Torgersen") |> 
  nrow()
[1] 292

In the code above, we use the pipe |> to filter our dataframe and then pass it into the nrow function to count the number of rows retained in our dataframe.

However, the command above is not ideal since we have to figure out, or know by heart, the name of the island that doesn’t match our criteria. With larger datasets that might not be as easy.

  2. We could use the OR, | , operator:
penguins |>
  filter(island == "Biscoe" | island == "Dream") |> 
  nrow()
[1] 292

This is definitely better, but requires a lot of typing, especially if we have more categories.

  3. We could store the target islands in a vector and filter with the %in% operator. For each row, this operator checks whether the value in the island column matches any element of the vector (here Biscoe or Dream). Rows for which the answer is TRUE are kept.
list_of_islands <- c("Biscoe", "Dream")
  
penguins |>
  filter(island %in% list_of_islands) |> 
  nrow()
[1] 292

Generally, this is the preferred option, since we can easily re-use list_of_islands for other things and it’s easy to see how we could add a long list of values to it.
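To see what %in% does under the hood, we can try it on a plain vector. This is a minimal base R sketch; the vector island below is a hypothetical stand-in for the island column.

```r
# Hypothetical stand-in for the island column
island <- c("Biscoe", "Torgersen", "Dream", "Biscoe")
list_of_islands <- c("Biscoe", "Dream")

# %in% returns one TRUE/FALSE per element of the left-hand side
keep <- island %in% list_of_islands
keep
# TRUE FALSE TRUE TRUE

# filter() keeps exactly the rows where this vector is TRUE
island[keep]

# Pitfall: == compares element by element and recycles the shorter vector,
# so it does NOT test membership
recycled <- island == list_of_islands
recycled
# TRUE FALSE FALSE FALSE
```

This is why we use %in% rather than == whenever we compare a column against more than one value.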

Exercise

How many observations do we have for Adelie and Gentoo penguins in 2007?

Click me to see an answer
list_of_species <- c("Adelie", "Gentoo")
  
penguins |>
  filter(species %in% list_of_species, year == 2007) |> 
  nrow()
[1] 84

We work with 84 rows of data.

4 Select

The select verb allows us to keep or drop columns using their names and types:

#select two columns of our penguin dataframe
penguins |> 
  select(species, sex)

Again, we can also negate a selection.

#remove a column
penguins |> 
  select(!species)

If we want to exclude more than one column, we have to combine the column names in a vector using c():

penguins |> 
  select(!c(species, sex))

5 Selection helpers

Selection helpers are a powerful feature of the tidyverse that we can use to select data using patterns:

  • starts_with(): Starts with an exact prefix
  • ends_with(): Ends with an exact suffix
  • contains(): Contains a literal string
  • matches(): Matches a regular expression
  • num_range(): Matches a numerical range like x01, x02, x03

For example, what if we only want to select columns that contain mm measurements? Easy, we can simply search for any column that contains our pattern of interest:

penguins |> 
  select(contains("mm"))

This becomes even more powerful if we use logical operators to combine different statements. For example, & and | take the intersection or the union, respectively, of two selections:

#only select columns with mm measurements AND from the bill
penguins |> 
  select(contains("mm") & contains("bill"))

#select everything if it contains mm OR ends with g
penguins |> 
  select(contains("mm") | ends_with("g"))
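The other helpers from the list work in the same way. As a sketch, here are starts_with() and matches() applied to a hypothetical one-row table called toy that reuses some of the penguins column names (the values are only placeholders):

```r
library(dplyr)

# Hypothetical one-row table that reuses some penguins column names
toy <- tibble(
  species           = "Adelie",
  bill_length_mm    = 39.1,
  bill_depth_mm     = 18.7,
  flipper_length_mm = 181,
  body_mass_g       = 3750
)

# starts_with(): keep columns whose name begins with "bill"
bill_cols <- toy |>
  select(starts_with("bill"))

# matches(): keep columns whose name ends in "_mm" or "_g" (regular expression)
measurement_cols <- toy |>
  select(matches("_(mm|g)$"))

names(bill_cols)
names(measurement_cols)
```

Here bill_cols contains the two bill columns, while measurement_cols contains all four measurement columns but not species.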

Exercise

Select only columns if they contain length measurements or species information.

Click me to see an answer
penguins |> 
  select(contains("length") | contains("species"))

6 Distinct

The distinct verb is a quick way to summarize all distinct values or characters in a column:

penguins |> 
  distinct(island)

Exercise

Over how many different years were measurements taken?

Click me to see an answer
penguins |> 
  distinct(year)

7 Arrange

Arrange sorts the observations in a dataset. This can be useful when you want to find the most extreme values. By default, values are sorted in ascending order, i.e. starting with the lowest value.

penguins |> 
  arrange(bill_length_mm)

We can also sort in descending order.

penguins |> 
  arrange(desc(bill_length_mm))

Exercise

What are the three largest body mass measurements?

Click me to see an answer
penguins |> 
  arrange(desc(body_mass_g))

The largest measurements are: 6300, 6050 and 6000 g

8 Mutate

Mutate changes or adds variables in your data set.

For example, we could change the body mass from the measurement in g to kg using basic math operations.
By writing body_mass_kg = we give the new column a descriptive name. New columns are added at the end of our table.

penguins |> 
  mutate(body_mass_kg = body_mass_g / 1000)

We can also easily combine different columns, e.g. we could multiply the bill length by the bill depth:

penguins |> 
  mutate(bill_area = bill_length_mm * bill_depth_mm)

Exercise

Add a new column with the bill length in cm.

Click me to see an answer
penguins %>%
  mutate(bill_length_cm = bill_length_mm / 10)

9 Count

The count verb is ideal to count observations for each group:

penguins |> 
  count(species) 

Exercise

Count the number of observations per species and island.

Click me to see an answer
penguins |> 
  count(species, island) 

10 Other useful verbs

There are other verbs that we don’t cover in this tutorial but that are nonetheless useful for data transformation. Below you find a list of verbs with links to more detailed explanations including some examples.

  • Unite multiple columns into one by pasting strings together
  • Separate a character column into multiple columns with a regular expression or numeric locations
  • Pivot longer and wider to convert datasets from wide to long format and vice versa
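To give a first impression of what pivoting does, here is a minimal sketch using pivot_longer() and pivot_wider() from the tidyr package (part of the tidyverse). The table wide and its values are hypothetical and only serve as an illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: one column per measurement
wide <- tibble(
  species        = c("Adelie", "Gentoo"),
  bill_length_mm = c(39.1, 46.1),
  bill_depth_mm  = c(18.7, 13.2)
)

# Wide -> long: one row per species and measurement
long <- wide |>
  pivot_longer(
    cols      = c(bill_length_mm, bill_depth_mm),
    names_to  = "measurement",
    values_to = "value"
  )

# Long -> wide: back to one column per measurement
wide_again <- long |>
  pivot_wider(names_from = measurement, values_from = value)

long
wide_again
```

The long table has four rows (two species times two measurements), and pivoting it wider again restores the original layout.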

11 Combining verbs

As we said in the beginning, we can use the pipe to combine several verbs.

Let’s identify which islands the penguins with the three highest bill length measurements in 2007 came from:

penguins |> 
  filter(year == 2007) |> 
  select(species, island, bill_length_mm) |> 
  arrange(desc(bill_length_mm)) |> 
  head(3)

Exercise

Answer the following question:

What is the highest body mass in kg recorded for Gentoo penguins in 2007, and which island did that specimen come from?

Click me to see an answer
penguins |>
  mutate(body_mass_kg = body_mass_g / 1000) |> 
  filter(species == "Gentoo", year == 2007) |> 
  arrange(desc(body_mass_kg)) |> 
  select(island, body_mass_kg) |> 
  head(1)

Answer: 6.30 kg and from Biscoe.

12 Missing values

When we initially explored our dataframe, we saw that some columns contain missing data, shown as NA (not available). In other datasets you might also encounter NaN (not a number), which marks the result of an undefined numeric operation.
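The difference between NA and NaN can be checked directly in the console. This is a small base R sketch; the variable names are chosen only for illustration.

```r
missing_value <- NA      # a missing observation
not_a_number  <- 0 / 0   # an undefined numeric result, printed as NaN

na_flags <- c(is.na(missing_value), is.na(not_a_number))
na_flags
# TRUE TRUE: is.na() flags both NA and NaN

nan_flags <- c(is.nan(missing_value), is.nan(not_a_number))
nan_flags
# FALSE TRUE: is.nan() only flags NaN
```

In practice this means that functions relying on is.na(), such as the ones we use in this section, treat NaN as missing as well.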

There are different ways to deal with missing values and what you do depends on your data. However, there are some general strategies:

  • Impute data, i.e. fill missing values with a computed number, such as the mean
  • Drop rows or columns with missing values

12.1 Replace the missing values with the mean (or median, or …)

To do this we mutate the content of a column by using the replace function in which we provide:

  • The column affected
  • What values we want to replace (NAs)
  • What we want to replace the NAs with (the mean in our example)

Additionally, since we cannot calculate the mean if the values contain NAs, we tell mean() to ignore them with na.rm = TRUE.

penguins |> 
  mutate(bill_length_mm = replace(bill_length_mm,
                                  is.na(bill_length_mm),
                                  mean(bill_length_mm, na.rm = TRUE)))
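To follow what happens inside replace(), it can help to run the individual pieces on a small vector. The vector x below is a made-up example, not taken from the penguins data.

```r
# Hypothetical measurements with one missing value
x <- c(10, NA, 30)

mean(x)                          # NA, because one value is missing
mean_x <- mean(x, na.rm = TRUE)  # 20, the mean of 10 and 30

is.na(x)                         # FALSE TRUE FALSE: marks the positions to replace

# replace(values, positions, replacement)
filled <- replace(x, is.na(x), mean_x)
filled
# 10 20 30
```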

12.2 Drop rows with missing values

If we want to drop rows we can use the drop_na() function. drop_na() drops rows where any column contains a missing value and we will go into what that means in a bit.

penguins |> 
  drop_na() 

We see that we went from a dataframe with 344 rows to a dataframe with 333 rows, so 11 rows of data were removed.

13 Summarize data

13.1 Group by and summarize

group_by() takes an existing table and converts it into a grouped table where operations are performed “by group”. This function does not change how the data looks to us, however, it changes the underlying structure of the dataframe. Whenever groups are specified the functions that follow are performed separately on each level of the grouping variable.

summarise() creates a new data frame. It returns one row for each combination of grouping variables; if there are no grouping variables, the output will have a single row summarising all observations in the input.

ungroup() removes any grouping structure from our dataframe. It is recommended to always ungroup() when we are done with our calculations, especially if we store the output of our calculations in a dataframe.
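Since grouping does not change how the data looks, it can be handy to check the grouping status explicitly. The sketch below uses the dplyr helpers is_grouped_df() and group_vars() on a hypothetical three-row table called toy.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  species     = c("Adelie", "Adelie", "Gentoo"),
  body_mass_g = c(3700, 3900, 5200)
)

grouped <- toy |>
  group_by(species)

is_grouped_df(grouped)   # TRUE
group_vars(grouped)      # "species"

ungrouped <- grouped |>
  ungroup()

is_grouped_df(ungrouped) # FALSE
```

The rows and columns are identical in all three objects; only the grouping information attached to the table differs.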

Useful functions we can use with summarize:

  • Center: mean(), median()
  • Spread: sd(), IQR(), mad()
  • Range: min(), max()
  • Position: first(), last(), nth()
  • Count: n(), n_distinct()
  • Logical: any(), all()

Let’s group our data by species and calculate the mean body mass per species:

penguins |> 
  group_by(species)  |> 
  summarize(avg_weight = mean(body_mass_g)) |> 
  ungroup()

Now we see a problem. We get a value for the Chinstrap penguins but NA for the other two species… Let’s try to figure out what the reason is.

To figure this out, let’s first summarize some things:

penguins |> 
  count(species, sex)

If we count the number of observations, we see that only the records for the Adelie and Gentoo penguins contain missing values. These are the same two species for which we could not calculate the mean.

This means that to be able to summarize our data, we first have to deal with the NAs. As we discussed, there are different options; the easiest one is to drop rows where any column contains a missing value.

In the code below, we drop rows with missing data and we also add a second summary function in which we count the number of observations using the n() function. We separate different summary functions with a comma:

penguins |> 
  drop_na() |> 
  group_by(species)  |> 
  summarize(avg_weight = mean(body_mass_g), nr_observations = n()) |> 
  ungroup()

Now we get average values for all 3 species and can easily see that the Gentoo penguins are the heaviest.

An alternative way to handle missing values is to use the na.rm = TRUE argument and the is.na() function:

penguins |> 
  group_by(species)  |> 
  summarize(avg_weight = mean(body_mass_g, na.rm = TRUE), 
            nr_observations = sum(!is.na(body_mass_g))) |> 
  ungroup()

We see that the values for the Adelie and Gentoo penguins are slightly different compared to the first calculation. The number of observations also differs between the two approaches.

The difference is that drop_na() drops the full row, regardless of which column contains the missing value. In total, 11 rows get deleted. In contrast, na.rm = TRUE leaves the dataframe untouched and only ignores the NAs in the body_mass_g column when calculating the mean, so only 2 values are left out.

In the end it’s up to you and your experimental setup how you deal with missing data. However, it is important to understand what happens when removing data in a certain way.
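The effect is easiest to see on a tiny table. In the sketch below, toy is a hypothetical three-row table with one NA in sex and one NA in body_mass_g. It also shows drop_na(body_mass_g), a middle ground that only drops rows with a missing value in the named column.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with missing values in two different columns
toy <- tibble(
  species     = c("Adelie", "Adelie", "Adelie"),
  sex         = c("male", NA, "female"),
  body_mass_g = c(3800, 3600, NA)
)

# drop_na() removes every row with an NA in any column: only row 1 is left
all_dropped <- toy |>
  drop_na()

# drop_na(body_mass_g) removes only rows where body_mass_g is NA: rows 1 and 2 are left
mass_dropped <- toy |>
  drop_na(body_mass_g)

mean_after_drop_na <- mean(all_dropped$body_mass_g)        # 3800
mean_with_na_rm    <- mean(toy$body_mass_g, na.rm = TRUE)  # 3700
```

The two means differ because drop_na() also discarded the 3600 g measurement, whose row only had a missing value in the sex column.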

Exercise

Compare the average bill length of male and female penguins: calculate the mean, median and standard deviation.

Click me to see an answer
penguins |> 
  drop_na() |> 
  group_by(sex)  |> 
  summarize(mean_bill = mean(bill_length_mm),
            median_bill = median(bill_length_mm),
            sd_bill = sd(bill_length_mm)) |> 
  ungroup()

Males seem to have slightly longer bills, but since the ranges of mean ± sd overlap this might not be significant. Also, we can see that there is no large difference between the median and the mean, so the data are likely not strongly skewed by outliers.

Notice: in this case it doesn’t make a difference if we use drop_na or na.rm=TRUE.

Notice:

Let us briefly discuss why ungrouping is important. Let’s assume we want to calculate the average body mass for each species and add it as a new column to our dataframe. Additionally, we want to calculate the total number of observations.

penguins |> 
  select(species, body_mass_g) |> 
  group_by(species)  |> 
  #calculate mean body mass
  mutate(mean_bm = mean(body_mass_g, na.rm = TRUE)) |> 
  #calculate nr of observations by species
  mutate(observations = n()) |> 
  ungroup()

Now, let us do the same but instead ungroup before the second mutate statement:

penguins |> 
  select(species, body_mass_g) |> 
  group_by(species)  |> 
  #calculate mean body mass
  mutate(mean_bm = mean(body_mass_g, na.rm = TRUE)) |> 
  ungroup() |> 
  #calculate nr of all species measured
  mutate(observations = n())

We can see how the number of observations changes: the first is calculated on the grouped data, the second on the ungrouped data. Neither number is wrong. However, this shows that you need to be aware of how the grouping is applied, especially if you save the output in a variable and use it for other calculations.

13.2 Add-on: summarizing multiple columns with across()

We can also calculate the mean using na.rm = TRUE on more than one data column. One way to do this is:

penguins |> 
  group_by(species)  |> 
  summarize(avg_weight = mean(body_mass_g, na.rm = TRUE),
            avg_length = mean(bill_length_mm, na.rm = TRUE))

Now, this is fine for a few calculations, but gets tedious very quickly. Luckily, we can use the across() function to make our life easier.

To calculate the mean across all columns, we identify all columns with numeric values using the where function and do the same transformations (here calculating the mean) across multiple columns.

across() takes the following inputs:

  • Columns we want to transform, here all the columns that are numeric
  • The function we want to apply, here we want to calculate the mean
  • Any additional arguments we want to use, i.e. remove NAs
penguins %>% 
  group_by(species) %>% 
  summarize(across(where(is.numeric), mean, na.rm = TRUE))

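Of note, when passing additional arguments this way it can be ambiguous how they are used, e.g. is na.rm evaluated once per across() call or once per group? For this reason, passing extra arguments through across() is deprecated since dplyr 1.1.0 and triggers a warning. The newer way to do the same thing (which, however, requires some additional syntax we have not covered) is: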

penguins %>% 
  group_by(species) %>% 
  summarize(across(where(is.numeric),  ~ mean(.x, na.rm = TRUE)))

When we use ~ we create what is called a lambda function, or anonymous function, and we use .x to indicate where each column selected in across() is inserted. This shorthand comes from the purrr package, which is worth exploring if you want to write effective functions in R.

Explaining the details is out of the scope of this tutorial, but if you want to read more about this (and see some good examples for using across()), check out this post.
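If the ~ and .x shorthand looks unfamiliar, it may help to see that it does the same as an ordinary anonymous function written with function(x). The sketch below compares both styles on a hypothetical four-row table called toy.

```r
library(dplyr)

# Hypothetical toy data with one NA per numeric column
toy <- tibble(
  species        = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
  bill_length_mm = c(39, NA, 46, 48),
  body_mass_g    = c(3700, 3900, 5000, NA)
)

# Formula shorthand: .x stands for the current column
formula_style <- toy |>
  group_by(species) |>
  summarize(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))

# Ordinary anonymous function: x stands for the current column
function_style <- toy |>
  group_by(species) |>
  summarize(across(where(is.numeric), function(x) mean(x, na.rm = TRUE)))

formula_style
```

Both versions give the same table: a mean bill length of 39 and 47 mm and a mean body mass of 3800 and 5000 g for Adelie and Gentoo, respectively.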