An introduction to ggplot

1 Introduction

This tutorial will introduce the basics of plotting with ggplot.

Visualizing data is important not only for publishing but also for exploring our data. For example, imagine that we run a regression analysis and find that two variables (e.g. height and income) are positively correlated. While this might be an exciting result, without visualizing the data we might not realize that we have outliers or that the data points are not normally distributed. That is why plotting the data is so important. For example, in the plot below we have the exact same regression line, but the data points behave very differently.
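A classic illustration of this point is Anscombe's quartet, which ships with base R as the anscombe data frame. It is not part of the tutorial's own data and is used here only as a sketch: the code fits the same kind of linear model to each of the four x/y pairs and shows that the coefficients are nearly identical, even though the point clouds look very different once plotted.

```r
# Anscombe's quartet: four x/y pairs with (almost) the same regression line
data(anscombe)

# fit y1 ~ x1, y2 ~ x2, ... and collect intercept and slope for each pair
fits <- sapply(1:4, function(i) {
  model <- lm(anscombe[[paste0("y", i)]] ~ anscombe[[paste0("x", i)]])
  unname(coef(model))
})
rownames(fits) <- c("intercept", "slope")
colnames(fits) <- paste0("set", 1:4)

# the coefficients agree to about two decimal places
round(fits, 2)

# plotting the four sets side by side reveals how different they are
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i))
  abline(fits["intercept", i], fits["slope", i])
}
par(op)
```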

2 Setting up the working directory

This tutorial was prepared in RStudio with the following dependencies:

R version 4.2.
The palmerpenguins_0.1.1 package
The tidyverse_1.3.2 package

Notice:

If you are opening the notebook version (i.e. the qmd file) of this tutorial, you will sometimes see a little comment before the actual code, e.g. #| eval: false. This comment simply controls how the code chunk behaves when rendering the document to HTML and can be ignored when following the tutorial via the qmd file.

If you want to follow this tutorial, you need to install some tools first. Specifically, we need to download the data set that we will explore today as well as the tidyverse library. The tidyverse includes several packages, such as dplyr for data transformation and ggplot2, our plotting library.

If you don’t have the palmerpenguins data set and the tidyverse installed yet, you can type the following into your R notebook or console. You only need to run this kind of code once.

#install required data package and libraries
#the second install might take a bit
install.packages("palmerpenguins")
install.packages("tidyverse")

Once you have everything installed, you can load the libraries. Keep in mind that while you only need to install libraries once, you need to load them in every new R session.

library(tidyverse)
library(palmerpenguins)

3 Data exploration and cleaning

During this tutorial we will work with two datasets:

The palmerpenguins dataset, a data set that records details of 344 penguins. This will be the main data set we will explore.
The beaver dataset, a time series recording the temperature and activity of two beavers. This dataset is part of base R and doesn’t need to be installed. Time-series are best represented in line plots and we will explore this data when talking about these types of plots.
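In base R the beaver data is stored as two data frames, beaver1 and beaver2 (one per animal), in the datasets package. A quick look at them, assuming a standard R installation:

```r
# beaver1 and beaver2 ship with base R (datasets package), no install needed
str(beaver1)

# first rows of the second beaver's measurements
head(beaver2)
```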

Let’s have a first look at the penguin dataset by looking at the first few rows using the head() function.

Notice:

If you are viewing the HTML version, you can see all columns of the table above by clicking on the little arrow on the top, right corner of the table.

head(penguins)

We see that we have 8 columns of data. The columns contain categorical data (species, sex and island) as well as numerical data (flipper_length_mm, body_mass_g, year).

Next, let’s explore the data structure (str() lets us know whether our columns are factors, integers or other numeric values).

#view the data structure of the penguin data
str(penguins)

tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
 $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
 $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
 $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
 $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
 $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
 $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...

Some key features we can take from this:

The total amount of data we work with, i.e. 344 rows and 8 columns of data
The names and content of all 8 columns, e.g. species, bill_length_mm and sex
The type of data we have for each column, i.e. factor, numeric or integer
The unique levels for some variables, e.g. we compare 3 different penguin species
If we look closely, we can see whether or not we have to worry about NAs (i.e. missing data). For example, we see an NA in the sex column

Finally, let’s calculate basic summary statistics of our data using the summary() function:

summary(penguins)

      species          island    bill_length_mm  bill_depth_mm  
 Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
 Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
 Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
                                 Mean   :43.92   Mean   :17.15  
                                 3rd Qu.:48.50   3rd Qu.:18.70  
                                 Max.   :59.60   Max.   :21.50  
                                 NA's   :2       NA's   :2      
 flipper_length_mm  body_mass_g       sex           year     
 Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
 1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
 Median :197.0     Median :4050   NA's  : 11   Median :2008  
 Mean   :200.9     Mean   :4202                Mean   :2008  
 3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
 Max.   :231.0     Max.   :6300                Max.   :2009  
 NA's   :2         NA's   :2

After our first data exploration steps, we can see that we have 344 rows and eight columns of data. The columns contain categorical or numerical data, such as species, bill length or year. We know how many unique levels we have for categorical data, e.g. we look at three species of penguins. Additionally, for each column we got the number of observations per level (for factor data) or basic statistics (for numerical data). Another useful piece of information we get is the number of missing values for each column.
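If we only care about the missing values, we do not have to read them off the summary() output. A minimal sketch using base R functions (the variable name na_per_column is chosen here for illustration):

```r
library(palmerpenguins)

# count missing values per column
na_per_column <- colSums(is.na(penguins))
na_per_column

# TRUE if there is at least one NA anywhere in the data
anyNA(penguins)
```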

Since we noticed that the penguin dataset has rows with missing data (i.e. NA), our first task is to remove any row with missing data:

#discard rows with NAs and store the resulting dataframe in a new variable
penguins_clean <-
  penguins |> 
  drop_na()

Important

The native pipe |> was introduced in R 4.1, so you will get an error when using it with an older R version. If you get an error and can’t or don’t want to update R, use %>% instead. For this to work, you need to have the tidyverse library loaded.
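For reference, the same cleaning step written with %>% could look like this (the variable name penguins_clean_alt is only used here to avoid overwriting penguins_clean):

```r
library(tidyverse)
library(palmerpenguins)

# same cleaning step written with the magrittr pipe instead of |>
penguins_clean_alt <- penguins %>%
  drop_na()

dim(penguins_clean_alt)
```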

Before proceeding, let’s first check how many rows of data we removed:

#sanity check to control how many rows were dropped
#print the dimensions for our original dataset
print(dim(penguins))

[1] 344   8

#print the dimensions for our cleaned dataset
print(dim(penguins_clean))

[1] 333   8

Note

Sanity checks are an important step when manipulating data and we recommend always doing them.

In the example above, we check the dimensions of our data frame before and after cleaning and ensure that the number of rows that were removed makes sense.
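Such a check can also be written so that R stops with an error when something is off. The sketch below is one possible way to do this: it compares the number of dropped rows with the number of rows that contain at least one NA (the names n_removed and n_incomplete are chosen for illustration).

```r
library(tidyr)
library(palmerpenguins)

penguins_clean <- drop_na(penguins)

# number of rows removed by cleaning
n_removed <- nrow(penguins) - nrow(penguins_clean)

# number of rows in the original data with at least one NA
n_incomplete <- sum(!complete.cases(penguins))

# stop with an error if the two numbers disagree
stopifnot(n_removed == n_incomplete)
n_removed
```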

4 Our first scatterplot

Now we can start to generate our first plot. Let’s start with a scatterplot. In the code below we:

Initialize the plot using the ggplot() function in the first line of code. Inside the function, we provide the input dataframe (i.e. penguins_clean) and we also map variables (i.e. flipper_length_mm and body_mass_g) to the aes(thetics) in our graph. More specifically, we define what we want to plot on the x- and/or y-axes.
Define how we want to plot our data in the second line of code. For example, we specify that we want to generate a scatterplot via geom_point(). When using the + symbol after the first ggplot call, we add a layer to our plot. We can add as many layers to our plot as we want, more on that later.

ggplot(penguins_clean, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()

By default, ggplot expects the first aesthetic to be mapped to the x variable and the second aesthetic to be mapped to the y variable. We could therefore write the code a bit shorter and still get the exact same result, as shown below:

ggplot(penguins_clean, aes(flipper_length_mm, body_mass_g)) +
  geom_point()

Note

While omitting x = and y = is definitely shorter, the code becomes a little harder to understand. Therefore, it’s important to find a good balance between keeping code short and keeping it readable.

Exercise

Plot bill length on the x and bill depth on the y-axis.
Plot the body mass on the y-axis and the species on the x-axis and generate a plot. Notice a difference?

Click me to see an answer

Question 1

ggplot(penguins_clean, aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point()

Question 2

ggplot(penguins_clean, aes(x = species, y = body_mass_g)) + 
  geom_point()

Notice:

In the second example we generate a dot plot instead of a scatter plot. This type of plot is useful if we have a low number of observations and want to show the reader all the data. While we could also just plot the mean and standard errors, plotting all data points can be more informative as it gives a general idea about the spread of the data and the presence of any outliers.

At the moment the plot is not ideal since points are on top of each other, but we will learn later how to improve this.

5 Saving plots as variables

Plots can be saved as variables and we can add more layers to these variables using the + operator. This is really useful if you want to make multiple related plots from a common base, e.g. we could store a common theme in one variable and reuse it for different plot types.

For example, we could save the mappings when using the ggplot() function as a variable called plt_pengs and then use this variable whenever we want to add new layers:

#save our first layer in the plt_pengs variable
plt_pengs <- ggplot(penguins_clean, aes(species, flipper_length_mm))

#add the second layer to plt_pengs to generate a scatter plot
plt_pengs +
  geom_point()

By storing our mappings in a variable, we can easily change how the plot looks. For example, instead of generating a dot plot let us generate a boxplot (more about boxplots later). Notice, how we don’t have to retype the mappings but just use our variable instead:

plt_pengs +
  geom_boxplot()
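The same idea extends to other plot components. As a sketch, the snippet below stores a theme and axis labels in a list (the name shared_style and the label texts are hypothetical) and adds that list to two different plot types built from the same base.

```r
library(ggplot2)
library(palmerpenguins)

# drop rows with missing values (base R equivalent of drop_na())
penguins_clean <- na.omit(penguins)

# base plot with the shared mappings
plt_pengs <- ggplot(penguins_clean, aes(species, flipper_length_mm))

# a list of components can be added with + just like a single layer
shared_style <- list(
  theme_minimal(),
  labs(x = "Species", y = "Flipper length (mm)")
)

plt_points <- plt_pengs + geom_point() + shared_style
plt_boxes <- plt_pengs + geom_boxplot() + shared_style

plt_points
plt_boxes
```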

6 Saving plots to your computer

How would we save this plot to our computer?

Important

When saving images it is generally better to save a plot as a vector file, such as svg, ai or pdf (in most cases), instead of a raster file such as png or tiff. Vector files use paths of points and lines to create an image, while raster graphics are created using pixels.

Using vector images allows us to modify every line or dot in a plot using tools such as Illustrator or Inkscape and thereby customize plots outside of R.

6.1 Using the export function in RStudio

In RStudio, we can use the export function of the Plots tab, which sits in the same pane as the Files tab. If a figure is shown in the Plots tab, you can export it by clicking on Export and deciding how you want to save your image.

6.2 Using the pdf function

The pdf() function takes the file name as an argument, including, if needed, the directory we want to save our file in (in the case below, the images folder found on the same level as the code directory). We can add additional arguments such as the width or height of our plot, or even the target paper size.

#call the pdf command to tell R we want to generate a plot
pdf(file = "../images/ExamplePlot_BaseR.pdf", paper = "a4") 

#create a plot
ggplot(penguins_clean, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()

# Run dev.off() to create the file
dev.off()
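One caveat, stated here as general R behaviour rather than something taken from the tutorial: a ggplot object is only drawn automatically when it is evaluated at the top level of the console or a notebook chunk. Inside a function, a loop, or a script run with source(), it has to be printed explicitly, otherwise no plot is written to the PDF. A minimal sketch (the helper name save_scatter_pdf and the temporary output file are assumptions for illustration):

```r
library(ggplot2)
library(palmerpenguins)

penguins_clean <- na.omit(penguins)

save_scatter_pdf <- function(path) {
  pdf(file = path, paper = "a4")
  # make sure the device is closed even if plotting fails
  on.exit(dev.off())

  p <- ggplot(penguins_clean, aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point()

  # explicit print() is required inside a function
  print(p)
  invisible(path)
}

out_file <- save_scatter_pdf(tempfile(fileext = ".pdf"))
file.exists(out_file)
```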

6.3 Using ggsave

The ggsave() function is part of the ggplot2 package and can be used to save any plot generated with ggplot. Whenever you use other libraries to generate plots or to extend ggplot, it might be better to save plots using pdf() or via the Plots pane.

#create a plot and store it in a variable
my_plot <- 
ggplot(penguins_clean, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()

ggsave("../images/ExamplePlot_GGplot.pdf", my_plot, width = 21, height = 29.7, units = "cm")

Notice:

If you run the code on your own and compare the two plots generated with pdf() and ggsave, you will see the plots have different margins despite both being set to A4. Therefore, it is often useful to play with the width and height a bit to get things right.

Exercise

Plot the bill_length_mm against the bill_depth_mm and save the plot as pdf using pdf() using default settings.
Plot the bill_length_mm against the bill_depth_mm and save the plot as pdf using ggsave() using default settings.
Do the plots look the same?

Click me to see an answer

Question 1

pdf(file = "image1.pdf") 

ggplot(penguins_clean, aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point()

dev.off()

Question 2

ggplot(penguins_clean, aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point()

ggsave("image2.pdf")

Question 3

The plots might look slightly different because pdf() by default uses a width and height of 7 inches, while ggsave() uses the size of the current graphics device (i.e. how you see the plot displayed in the Plots panel in RStudio). That is why choosing explicit values for the dimensions gives you a bit more control over how the plot looks.
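To get comparable output from both approaches, one option is to pass the dimensions to ggsave() explicitly. The sketch below uses 7 by 7 inches, which is the documented default size of pdf(); the temporary file name is only used for illustration.

```r
library(ggplot2)
library(palmerpenguins)

penguins_clean <- na.omit(penguins)

bill_plot <- ggplot(penguins_clean, aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point()

# temporary file used only for this illustration
out_file <- tempfile(fileext = ".pdf")

# 7 x 7 inches matches the default size of pdf()
ggsave(out_file, plot = bill_plot, width = 7, height = 7, units = "in")
```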

7 Addon: Plotting more than one data set

We can plot different data frames on each layer. To generate an example, let’s fit a model on our data, use it to predict some data and plot the results as a regression line on top of our individual data points.

While going into how to fit models is out of the scope of this tutorial, feel free to look at the links below if you want to go deeper:

Intro to linear regression
Some examples for the modelbased package

Notice:

If you want to follow examples in Addon sections, make sure you have any additional libraries installed before running the code.

#load some libraries
library(modelbased)

#create linear model and predict data
mod <- lm(flipper_length_mm ~ body_mass_g , data = penguins_clean)
pred_data <- estimate_expectation(mod)

#add the original data to our df
pred_data$flipper_length_mm <- penguins_clean$flipper_length_mm

#lets look at our predictions
head(pred_data)

Now, we can add our predictions as a line to our scatter plot by adding an additional layer in which we provide our second dataset with the predicted values:

ggplot(penguins_clean, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point() +
  geom_line(data = pred_data, aes(x = Predicted, y = body_mass_g ))

You can easily use this approach to add layers to your plot for summary statistics, labels for outliers, etc.
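As a sketch of the summary-statistics case, the code below builds a second data frame with one row per species (the name species_means is chosen for illustration) and draws it as large crosses on top of the individual observations.

```r
library(tidyverse)
library(palmerpenguins)

penguins_clean <- penguins |> drop_na()

# second data frame: one row per species with the mean of both variables
species_means <- penguins_clean |>
  group_by(species) |>
  summarize(
    flipper_length_mm = mean(flipper_length_mm),
    body_mass_g = mean(body_mass_g)
  )

ggplot(penguins_clean, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(alpha = 0.3) +
  # this layer uses its own data but inherits the x and y mappings
  geom_point(data = species_means, size = 5, shape = 4)
```

Because the column names in species_means match those used in aes(), the second layer does not need its own mappings.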

8 Visible aesthetics

We already learned how we map variables onto the x and y axes. For example, we mapped the flipper length onto the x and the body mass onto the y aesthetic. We typically provide such mappings in the aes() function.

We can easily add more mappings:

color: changes the color of points, lines and text; for geoms that also have a fill, it colors the outline
fill: changes the fill color
size: changes the area or radius of points as well as the thickness of lines
shape: changes the shape of our data points
alpha: adjusts the transparency
linetype: changes the dash pattern of a line
label: sets the text shown on a plot, e.g. in geom_text()

Let’s quickly talk about the distinction between aesthetics and attributes in ggplot syntax: aesthetics are defined inside aes() and map a variable in the data to a visual property, while attributes are set outside aes() and fix a visual property to a constant value. For example, we can map the species to the color aesthetic and thereby control how the species are colored. Let’s look at some examples.

8.1 Colors

First, let’s map species to the color aesthetic:

ggplot(penguins_clean, aes(flipper_length_mm, body_mass_g, color = species)) +
  geom_point()

If we do this, we can see that the dots are now colored according to the species each measurement was collected from.

In contrast, attributes control how something looks regardless of the data. For example, we can make all dots blue by using a color attribute outside the aes() call:

ggplot(penguins_clean, aes(flipper_length_mm, body_mass_g)) +
  geom_point(color = "blue")

We can define an aesthetic when we initialize the plot or in an individual layer. If we define the aesthetic while initializing the plot, it is used for every layer. In contrast, when we define the aesthetic in an individual layer, it only applies to that layer.

For example, we can also map species to color inside the geom_point layer:

ggplot(penguins_clean, aes(flipper_length_mm, body_mass_g)) +
  geom_point(aes(color = species))

An example of when it can be useful to control the color in individual layers is shown later in the tutorial, but if you want to have a quick look go to Section 11.1.

Exercise

Make a scatterplot comparing the flipper and bill lengths and assign the islands to different colors by changing the color aesthetic. It is up to you in what layer you assign the color aesthetic and feel free to try both options.

Click me to see an answer

ggplot(penguins_clean, aes(flipper_length_mm, bill_length_mm, color = island)) +
  geom_point()

8.2 Sizes

We can also change the size of our data points using the size aesthetic.

ggplot(penguins_clean, aes(flipper_length_mm, body_mass_g, size = bill_depth_mm)) +
  geom_point()

8.3 Shapes

Additionally, for categorical data, we can give different shapes to our data points.

Notice that there are a limited number of shapes available, so this works only for data with a limited number of categories, such as our dataset where we look at three species.

The default geom_point() uses shape = 19: a solid circle. An alternative is shape = 21: a circle that allows you to use both fill for the inside and color for the outline. This lets you map two aesthetics to each point. All info on what number refers to which shape can be found here.

ggplot(penguins_clean, aes(flipper_length_mm, body_mass_g, shape = species)) +
  geom_point()

We can only map shape to categorical data, and we would get an error if we tried to add shapes for the different years (which, if we remember the output from str(penguins), are stored as integers). If we wanted to add shapes for different years, we would first have to convert the year to a factor.

Feel free to try this without changing the year to a factor first.

ggplot(penguins_clean, aes(flipper_length_mm, body_mass_g, shape = factor(year))) +
  geom_point()

Notice: When we run this, we see that the legend title is now factor(year). This is not what we want and we will discuss in Section 12 how we can change this.

8.4 Alpha

geom_point() has an alpha argument that controls the opacity of the points:

A value of 1 (the default) means that the points are totally opaque
A value of 0 means the points are totally transparent (and therefore invisible)
Values in between specify transparency.

Changing the alpha is a good way to make overlapping data points easier to see.

ggplot(penguins_clean, aes(flipper_length_mm, body_mass_g)) +
  geom_point(alpha = 0.2)

Exercises

Make a scatterplot to compare bill length and bill depth and use different colors for different islands. In the point layer, change the shape attribute to 21.
Do the same as in exercise 1 but instead of mapping the islands to the colors map them to the fill aesthetic instead. Do you understand what is happening?
Make a scatterplot to compare bill length and bill depth. Map the shape aesthetic to the island variable. Also change the size attribute to make the symbols bigger and make the dots more transparent (it is up to you what values you choose).

Click me to see an answer

#question 1
ggplot(penguins_clean, aes(bill_length_mm, bill_depth_mm,  color = island)) +
  geom_point(shape = 21)

#question 2
ggplot(penguins_clean, aes(bill_length_mm, bill_depth_mm,  fill = island)) +
  geom_point(shape = 21)

Notice:

Fill controls how a shape is filled and color controls the color of the outline/border. The default shape 19 used by geom_point behaves a bit differently: it is drawn in a single color and ignores fill, while shape 21 has a separate fill and outline.

#question 3
ggplot(penguins_clean, aes(bill_length_mm, bill_depth_mm, shape = island)) +
  geom_point(size = 2, alpha = 0.5)

9 Position adjustments

Position adjustments apply minor tweaks to the position of elements within a layer. There are three position adjustments that are primarily useful for geom_point; other geoms have other adjustments (more on that a bit later):

position_nudge(): moves points by a fixed offset
position_jitter(): adds random noise to every position
position_jitterdodge(): dodges points within groups, then adds a little random noise

By default, ggplot2 uses position = "identity" when we are using geom_point. If you want to know the default values used by functions you can figure this out by typing ?geom_point.

Therefore, if we write geom_point(position = "identity") we will get exactly the same result as if we had only written geom_point():

#use the default but add the default option for the position argument
ggplot(penguins_clean, aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point(position = "identity")

If you look closely at the graph, there is a small issue with the data points: several points overlap, but we cannot see how many. We have seen before that changing the alpha helps to show the data better. Another way to visualize this a bit better is to add random noise to each position, a process that is also called jittering.

The amount of noise added is very small, so you might need to look closely at what is happening.

ggplot(penguins_clean, aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point(position = "jitter")

We can also define the exact level of noise using the position_jitter() function:

ggplot(penguins_clean, aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point(position = position_jitter(width = 0.4))

Exercise

Generate a scatterplot comparing bill depth and bill length. Map the species to the color aesthetic and in geom_point() use position_jitter() to add random noise (adjust the width to 0.8). Rerun the code several times and keep an eye on what the dots do. Do you notice something?

Click me to see an answer

ggplot(penguins_clean, aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  geom_point(position = position_jitter(width = 0.8))

A small issue with jittering is that, if we use the code above, we cannot exactly reproduce the plot. Every time we run the code and regenerate the plot, the points will be at slightly different positions, since the noise is randomly generated each time we plot. However, we can control this by setting a fixed value for the random seed. Now, every time we run the code below, we will get the exact same output.

ggplot(penguins_clean, aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  geom_point(position = position_jitter(width = 0.8, seed = 136))
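The first adjustment from the list at the start of this section, position_nudge(), has not been shown yet. It moves every point by the same fixed offset, which is mostly useful for text labels but works for points as well. A minimal sketch, with offsets chosen arbitrarily for illustration:

```r
library(ggplot2)
library(palmerpenguins)

penguins_clean <- na.omit(penguins)

# shift every point 2 mm to the right and 100 g up
p_nudged <- ggplot(penguins_clean, aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point(position = position_nudge(x = 2, y = 100))

p_nudged
```

Unlike jittering, nudging involves no randomness, so the plot is identical every time it is generated.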

10 Geoms

Geometric objects, or geoms for short, perform the actual rendering of the layer, controlling the type of plot that you create. Some useful geoms are:

geom_point() produces a scatterplot
geom_bar() and geom_col() make bar charts
geom_boxplot() visualizes five summary statistics: the median, two hinges and two whiskers as well as outliers
geom_violin() shows the density of values in each group
geom_line() makes a line plot. geom_line() connects points from left to right while geom_path() is similar but connects points in the order they appear in the data
geom_area() draws an area plot, which is a line plot filled down to y = 0 (filled lines). Multiple groups will be stacked on top of each other
geom_rect(), geom_tile() and geom_raster() draw rectangles. geom_rect() is parameterised by the four corners of the rectangle, xmin, ymin, xmax and ymax. geom_tile() is exactly the same, but parameterised by the center of the rect and its size, x, y, width and height. geom_raster() is a fast special case of geom_tile() used when all the tiles are the same size
geom_errorbar() allows us to add error bars

Below we will look at some examples starting with the barplot.
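Before turning to barplots, here is a quick sketch of one geom from the list, geom_errorbar(). It assumes we want to show the mean body mass per species with error bars of one standard deviation; the summary table mass_summary and its column names are made up for this illustration.

```r
library(tidyverse)
library(palmerpenguins)

penguins_clean <- penguins |> drop_na()

# mean and standard deviation of body mass per species
mass_summary <- penguins_clean |>
  group_by(species) |>
  summarize(
    mean_mass = mean(body_mass_g),
    sd_mass = sd(body_mass_g)
  )

ggplot(mass_summary, aes(x = species, y = mean_mass)) +
  geom_point() +
  # error bars spanning one standard deviation around the mean
  geom_errorbar(aes(ymin = mean_mass - sd_mass, ymax = mean_mass + sd_mass), width = 0.2)
```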

10.1 Barplots

Classical barplots have a categorical x-axis (i.e. we can map the species to our x-axis) and we can generate barplots using two geoms:

geom_bar(): counts the number of cases at each x-position
geom_col(): plots the actual values

10.1.1 Geom_bar

Let’s first plot the number of observations we have for each penguin species. Here, geom_bar() does the work for us and does the counting and we only need to map the species to the x-variable:

ggplot(penguins_clean, aes(x = species)) +
  geom_bar()

We can also plot the counts in case they are already part of our data frame. To do this, let’s first create a summary for our data using the tidyverse. Specifically, we want to count the number of observations per species.

peng_summary <- penguins_clean |>
  #select the variables we want to work with
  select(species, flipper_length_mm) |> 
  #group our data by species
  group_by(species) |> 
  #get summary stats
  summarize(observations = n()) |> 
  arrange(desc(observations))

#view the data
peng_summary
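The same table can be produced more compactly with the dplyr function count(), shown here as an alternative sketch (the name peng_summary_short is used only to keep it apart from peng_summary):

```r
library(tidyverse)
library(palmerpenguins)

penguins_clean <- penguins |> drop_na()

# count() groups, tallies and (with sort = TRUE) orders in one step
peng_summary_short <- penguins_clean |>
  count(species, name = "observations", sort = TRUE)

peng_summary_short
```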

Now, let’s plot the observations. To do so we need to make some changes:

We map not only the species to the x-axis but also the number of observations to the y-axis
We add stat = "identity" to the geom_bar function. A statistical transformation, or stat, transforms the data, typically by summarizing it in some manner. By default, geom_bar uses stat = "count" to automatically count the number of observations without us having to write this out. If we want to provide already existing values to geom_bar instead of letting it count for us, we have to change this argument to stat = "identity".

ggplot(peng_summary, aes(x = species, y = observations)) +
  geom_bar(stat = "identity")

10.1.2 Geom_col

Instead of using geom_bar we can use geom_col. geom_col() won’t try to aggregate the data by default and it expects us to already have the y values calculated and uses these values directly.

ggplot(peng_summary, aes(x = species, y = observations)) +
  geom_col()

10.1.3 Addon: Ordering data in plots

Now, another thing to keep in mind. When we generated the summary statistics, we ordered the data based on the number of observations but in the plot we see that the bars are ordered alphabetically by species name.

There are different ways to change the order of our bar plot but the main thing we need to accomplish is to modify the factor levels of our ordering column. We can easily view the default order of our species by typing:

levels(peng_summary$species)

[1] "Adelie"    "Chinstrap" "Gentoo"

We can re-order the factor levels using base R or using the forcats library, which comes with the tidyverse. Let us use the forcats library to bring some order into our plot.

We can order the factors in different ways, for now let us use the fct_reorder function that is part of the forcats library. For fct_reorder, we need to tell the function our factor variable (“species”) and the values we want to reorder it by (the column corresponding to the y-axis, i.e. “observations”).

When using fct_reorder, the x-label gets renamed to fct_reorder(species,observations), which is … not very pretty. In order to change this we can use the labs function to change the x-axis label to Species.

If you are unsure what is happening in the code when we use the labs function, run the code without labs(x = "Species"). We will explain how to customize labels and other parts of the plot in more detail in Section 12.

ggplot(peng_summary, aes(x = fct_reorder(species, observations), y = observations)) +
  geom_col() +
  labs(x = "Species")

We can easily reverse the order by adding an extra argument, .desc = TRUE:

ggplot(peng_summary, aes(x = fct_reorder(species, observations, .desc = TRUE), y = observations)) +
  geom_col() +
  labs(x = "Species")

If we don’t calculate the number of observations and want to use geom_bar, we can reorder by using another function fct_infreq as follows:

ggplot(penguins_clean, aes(fct_infreq(species))) +
  geom_bar()

Or in reverse:

ggplot(penguins_clean, aes(fct_rev(fct_infreq(species)))) +
  geom_bar()

There are more ways that you can use the forcats library to order data, but that’s out of the scope of this tutorial. For more, feel free to start by having a look at the forcats documentation.
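For completeness, the base R route mentioned at the start of this section might look like the sketch below. It uses reorder(), which sorts the factor levels by the values of a second vector in ascending order; the summary table is rebuilt here so that the example runs on its own.

```r
library(dplyr)
library(palmerpenguins)

# rebuild the summary table: one row per species with its number of observations
peng_summary <- penguins |>
  tidyr::drop_na() |>
  count(species, name = "observations")

# reorder() sorts the factor levels by the values in observations (ascending)
peng_summary$species <- reorder(peng_summary$species, peng_summary$observations)

levels(peng_summary$species)
```

After this step the species column can be mapped to x directly, and the bars follow the new level order without changing the axis label.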

10.1.4 Positions in barplots

We have three different ways to adjust the positions of barplots:

position_stack(): stack overlapping bars (or areas) on top of each other
position_fill(): stack overlapping bars, scaling so the top is always at 1
position_dodge(): place overlapping bars side-by-side

Let’s first generate a barplot comparing the observations per year and mapping the sex to the fill aesthetic:

ggplot(penguins_clean, aes(x = year, fill = sex)) +
  geom_bar()

By default, geom_bar produces a stacked barplot because it uses position = "stack" in the background. We can change this behavior and plot proportional values by using position = "fill" instead:

ggplot(penguins_clean, aes(x = year, fill = sex)) +
  geom_bar(position = "fill")

To plot the values next to each other we can “dodge” the bars:

ggplot(penguins_clean, aes(x = year, fill = sex)) +
  geom_bar(position = "dodge")

Exercise

Generate a dodged barplot comparing the number of observations for different penguin species across years.

If you followed the section on ordering observations, consider also changing the order in the plot (i.e. order from large number of counts to small).

Click me to see an answer

ggplot(penguins_clean, aes(x = year, fill = fct_infreq(species))) +
  geom_bar(position = "dodge")

10.2 Boxplots

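A boxplot is a standardized way of displaying data based on five summary statistics: the median, the first and third quartiles (the hinges), and two whiskers. In geom_boxplot the whiskers extend to the most extreme observations within 1.5 times the interquartile range of the hinges; observations beyond that are shown as individual outlier points.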

Let’s first compare the summary statistics of the body mass across different penguin species:

ggplot(penguins_clean, aes(x = species, y = body_mass_g)) +
  geom_boxplot()
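To see where the elements of one box come from, the statistics can be computed by hand. The sketch below does this for the Adelie penguins, assuming the whisker rule described in the geom_boxplot documentation (whiskers end at the most extreme observations within 1.5 times the interquartile range of the hinges); the variable names are chosen for illustration.

```r
library(palmerpenguins)

penguins_clean <- na.omit(penguins)
adelie_mass <- penguins_clean$body_mass_g[penguins_clean$species == "Adelie"]

# hinges (first and third quartile) and median
q <- quantile(adelie_mass, probs = c(0.25, 0.5, 0.75))
iqr <- q[["75%"]] - q[["25%"]]

# whiskers end at the most extreme observations within 1.5 * IQR of the hinges
lower_whisker <- min(adelie_mass[adelie_mass >= q[["25%"]] - 1.5 * iqr])
upper_whisker <- max(adelie_mass[adelie_mass <= q[["75%"]] + 1.5 * iqr])

c(lower_whisker = lower_whisker, q, upper_whisker = upper_whisker)

# observations beyond the whiskers would be drawn as individual outlier points
outliers <- adelie_mass[adelie_mass < lower_whisker | adelie_mass > upper_whisker]
outliers
```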

We could easily add the individual datapoints to this plot by adding another layer. To avoid overplotting, we can use geom_point but adjust the position with position_jitter.

ggplot(penguins_clean, aes(x = species, y = body_mass_g)) +
  geom_boxplot() + 
  geom_point(position = position_jitter())

We can easily add another dimension, for example by mapping the year to the fill aesthetic. When we map a factor to an aesthetic, geom_boxplot will automatically generate dodged boxplots.

ggplot(penguins_clean, aes(x = species, y = body_mass_g, fill = factor(year))) +
  geom_boxplot()

If we want to add individual data points to this plot, we need to add the position adjustment position_jitterdodge in geom_point to ensure that we simultaneously dodge and jitter our points:

ggplot(penguins_clean, aes(x = species, y = body_mass_g, fill = factor(year))) +
  geom_boxplot() +
  geom_point(position = position_jitterdodge())

Notice:

When generating a dodged boxplot, the width is automatically calculated by the total width of all elements in a position. For example, if we add an aesthetic for the island (only Adelie is found on 3 different islands) then the individual width of each boxplot is not the same:

ggplot(penguins_clean, aes(x = species, y = body_mass_g, fill = island)) +
  geom_boxplot()

We can adjust this by preserving the width of a single element instead. Don’t forget, if you are ever unsure about the behavior of a plot, every detail of a function can be checked with ?geom_boxplot.

ggplot(penguins_clean, aes(x = species, y = body_mass_g, fill = island)) +
  geom_boxplot(position = position_dodge(preserve = "single"))

Exercise

Boxplots are useful summaries, but hide the shape of the distribution. Try plotting the species on the x, body_mass on the y-axis and dodge by year (beware that years need to be factors for this to work). To show the shape of the distribution, use geom_violin instead of geom_boxplot. Additionally, show the individual data points and change the transparency a bit in order to make the points stand out less:

Click me to see an answer

ggplot(penguins_clean, aes(x = species, y = body_mass_g, fill = factor(year))) +
  geom_violin() + 
  geom_point(alpha = 0.3, position = position_jitterdodge())

Comment:

For violin plots, it’s a bit harder to get the individual points to stay inside the violins. If you are interested in making this work, we recommend looking at the ggbeeswarm library.

10.3 Histograms

A histogram displays numerical data by grouping data into “bins” of equal width. We only need to provide a single aesthetic, x, which needs to be a continuous variable. For example, let’s look at the distribution of the flipper length measurements using a histogram:

ggplot(penguins_clean, aes( x = flipper_length_mm )) +
  geom_histogram()

By default, geom_histogram() uses 30 bins and prints a message asking us to pick a better value. We can set the width of the bins ourselves like this:

ggplot(penguins_clean, aes( x =flipper_length_mm )) +
  geom_histogram(binwidth = 5)

Some things to keep in mind when visualizing histograms:

Ensure that we set meaningful bin widths for the data
Don’t show spaces between the bars
X-labels should fall between the bars as they represent intervals and not actual values

The last point we can control with the center argument:

ggplot(penguins_clean, aes( x =flipper_length_mm )) +
  geom_histogram(binwidth = 2, center = 0.05)
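As an alternative to center, geom_histogram() also accepts a boundary argument that fixes where the bin edges fall. The following is a minimal sketch in which bins of width 5 start at multiples of 5, so that axis breaks generated with seq() line up with the edges of the bars. It assumes that penguins_clean is the penguins data with incomplete rows removed; the bin width and the break positions are arbitrary choices.

```r
library(ggplot2)
library(palmerpenguins)

# Assumption: penguins_clean is the penguins data without incomplete rows
penguins_clean <- na.omit(penguins)

p_hist <- ggplot(penguins_clean, aes(x = flipper_length_mm)) +
  # boundary = 0 places the bin edges on multiples of the bin width
  geom_histogram(binwidth = 5, boundary = 0) +
  # put the axis breaks on the same positions as the bin edges
  scale_x_continuous(breaks = seq(170, 235, 5))

p_hist

# inspect the computed bins: xmin and xmax are the bin edges
bins <- layer_data(p_hist)[, c("xmin", "xmax", "count")]
head(bins)
```

layer_data() returns the data that ggplot computed for a layer, which is a convenient way to check what a statistic such as the binning actually did.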

10.3.1 Use aesthetics in histograms

Same as for other geoms we can map a variable, such as species, to different aesthetics:

ggplot(penguins_clean, aes( x =flipper_length_mm, fill = species)) +
  geom_histogram(binwidth = 2, center = 0.05)

However, a problem with this representation is that it is not immediately clear whether the bars overlap or are stacked on top of each other.

We can change the plot by using positional adjustments:

The default position of geom_histogram() is “stack”, i.e. the bars are stacked on top of each other. We can change this with the position argument.
We can also dodge the bars, i.e. place the bars of each category side by side.

ggplot(penguins_clean, aes( x =flipper_length_mm, fill = species)) +
  geom_histogram(binwidth = 2, center = 0.05, position = "dodge")

The fill position normalizes each bin so that the bars show the proportion of observations from each group within that bin:

ggplot(penguins_clean, aes( x =flipper_length_mm, fill = species)) +
  geom_histogram(binwidth = 2, center = 0.05, position = "fill")

Exercise

Generate a histogram looking at the body mass across different islands. Change the transparency to 0.5.

Click me to see an answer

ggplot(penguins_clean, aes( x =body_mass_g, fill = island )) +
  geom_histogram(alpha = 0.5)

10.4 Line plots

Line plots are ideal if we want to plot time series, such as the beaver data. The beaver data comes with two datasets, beaver1 and beaver2. These datasets record the temperature and activity of two different beavers over time.

Let’s first have a look at our data:

head(beaver1)

str(beaver1)

'data.frame':   114 obs. of  4 variables:
 $ day  : num  346 346 346 346 346 346 346 346 346 346 ...
 $ time : num  840 850 900 910 920 930 940 950 1000 1010 ...
 $ temp : num  36.3 36.3 36.4 36.4 36.5 ...
 $ activ: num  0 0 0 0 0 0 0 0 0 0 ...

summary(beaver1)

      day             time             temp           activ        
 Min.   :346.0   Min.   :   0.0   Min.   :36.33   Min.   :0.00000  
 1st Qu.:346.0   1st Qu.: 932.5   1st Qu.:36.76   1st Qu.:0.00000  
 Median :346.0   Median :1415.0   Median :36.87   Median :0.00000  
 Mean   :346.2   Mean   :1312.0   Mean   :36.86   Mean   :0.05263  
 3rd Qu.:346.0   3rd Qu.:1887.5   3rd Qu.:36.96   3rd Qu.:0.00000  
 Max.   :347.0   Max.   :2350.0   Max.   :37.53   Max.   :1.00000

We have 114 temperature and activity observations collected over 2 days and different time intervals. If we do the same for beaver2 we would see a very similar looking dataset.

Let’s start by plotting the temperature records for our first beaver over the whole time interval:

ggplot(beaver1, aes(x = time, y = temp)) +
  geom_line()

Also here, we can add an aesthetic for example by mapping the activity measurements to the color aesthetic.

ggplot(beaver1, aes(x = time, y = temp, color = activ)) +
  geom_line()
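One caveat: according to the dataset documentation (?beaver1), time is stored as a number in hhmm format (for example, 840 means 08:40), and the recordings run past midnight into a second day. Plotting time directly therefore places the early-morning readings of day 347 to the left of the readings of day 346. The following is a minimal sketch of one way to build a continuous time axis; the object name beaver1_t and the column name elapsed_min are our own choices for illustration.

```r
library(ggplot2)

# work on a copy of the built-in dataset
beaver1_t <- datasets::beaver1

# split the hhmm number into hours and minutes
hours   <- beaver1_t$time %/% 100
minutes <- beaver1_t$time %% 100

# minutes since midnight of the first recording day
beaver1_t$elapsed_min <- (beaver1_t$day - min(beaver1_t$day)) * 24 * 60 +
  hours * 60 + minutes

p_beaver <- ggplot(beaver1_t, aes(x = elapsed_min / 60, y = temp)) +
  geom_line() +
  labs(x = "Hours since midnight of the first day", y = "Body temperature")

p_beaver
```

With this axis the line runs in chronological order instead of folding the second day back onto the first.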

10.4.1 Line plots for several species

We can easily plot the data for both our beavers in one plot. To do this, let us first combine the data for beaver 1 and beaver 2:

#add a new column that identifies the individual for beaver1 and beaver2
beaver1$species = "beaver1"
beaver2$species = "beaver2"

#combine the two datasets
beaver_all <- rbind(beaver1, beaver2)

#check the number of observations
dim(beaver_all)

[1] 214   5
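The same result can be reached in a single step with dplyr, which is part of the tidyverse. The following is a minimal sketch using bind_rows(), where the .id argument creates the identifier column from the names of the list. The object name beaver_all2 is only used for illustration.

```r
library(dplyr)

# the list names become the values of the new "species" column
beaver_all2 <- bind_rows(
  list(beaver1 = datasets::beaver1, beaver2 = datasets::beaver2),
  .id = "species"
)

# same dimensions as with rbind; the identifier column comes first
dim(beaver_all2)

# number of observations per beaver
table(beaver_all2$species)
```

This avoids modifying beaver1 and beaver2 themselves, which can be handy if the original data frames are needed again later.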

Now we can plot again and distinguish beaver1 and 2 by different types of lines:

ggplot(beaver_all, aes(x = time, y = temp, linetype = species)) +
  geom_line()

Exercise

Compare the changes in temperature over time for the two beavers by mapping the species column to the color aesthetic. Make the lines a bit thicker using the linewidth argument:

Click me to see an answer

ggplot(beaver_all, aes(x = time, y = temp, color = species)) +
  geom_line(linewidth = 1.5)

10.5 geom_smooth and adding trend lines

geom_smooth() adds a smooth trend curve and as such aids the eye in seeing patterns in the presence of overplotting.

Let’s add a trend curve to a scatterplot, where we compare the flipper length and body mass:

ggplot(penguins_clean, aes(flipper_length_mm, body_mass_g)) +
  geom_point() +
  geom_smooth()

We can also change the method on how the trend curve is calculated. By default NULL is chosen, where the smoothing method is chosen based on the size of the largest group.

We should also see a small message appearing when we generate the plot that tells us the method chosen, i.e: geom_smooth() using method = ‘loess’ and formula = ‘y ~ x’.

Loess is a non-parametric smoothing algorithm that is usually used when we have fewer than 1000 observations. It works by passing a sliding window along the x-axis and fitting a weighted local regression within each window.
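How closely the loess curve follows the data depends on the size of the window, which geom_smooth() exposes through the span argument. The following is a minimal sketch comparing a small and a large span. It assumes that penguins_clean is the penguins data with incomplete rows removed; the two span values and the colors are arbitrary choices for illustration.

```r
library(ggplot2)
library(palmerpenguins)

# Assumption: penguins_clean is the penguins data without incomplete rows
penguins_clean <- na.omit(penguins)

p_span <- ggplot(penguins_clean, aes(flipper_length_mm, body_mass_g)) +
  geom_point(alpha = 0.3) +
  # small window: the curve follows local fluctuations closely
  geom_smooth(method = "loess", formula = y ~ x, span = 0.3,
              se = FALSE, color = "red") +
  # large window: the curve becomes smoother
  geom_smooth(method = "loess", formula = y ~ x, span = 1,
              se = FALSE, color = "blue")

p_span
```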

We can change the function and for example use a linear model like this:

ggplot(penguins_clean, aes(flipper_length_mm, body_mass_g)) +
  geom_point() +
  geom_smooth(method = "lm")

If you want to know more about how to change things and the math used, check the help function with ?geom_smooth.

We can also calculate a trend line whilst using the color aesthetic. In the example below, we remove the confidence band (the shaded area around the line) that was shown before by using se = FALSE. This makes our graph a little less cluttered.

ggplot(penguins_clean, aes(x =bill_length_mm, y = body_mass_g, color = species )) + 
  geom_point() +
  geom_smooth(se = FALSE, method = "lm")

By default, each model is bound to the values of its own group. We can change this by defining the fullrange to make predictions over the full range of data. The plot below is not very pretty, and we will later see how to improve this.

ggplot(penguins_clean, aes(x =bill_length_mm, y = body_mass_g, color = species )) + 
  geom_point() +
  geom_smooth(method = "lm", fullrange = TRUE)

10.5.1 Addon: Adding math

This is a bit outside of the scope of this tutorial, so we won’t cover details here. However, since many of you might want to know how to add a regression equation and R2, you can do this quite easily with the help of another package, ggpubr, and two of its functions, stat_cor and stat_regline_equation:

Notice:

If you want to test this, make sure that you have installed ggpubr first.

library(ggpubr)

ggplot(penguins_clean, aes(flipper_length_mm, body_mass_g)) +
  geom_point() +
  geom_smooth(method = "lm") +
  #plot both the r2 (rr) and p-value next to each other, separated by a comma
  #label.y is used to adjust the position of the text in our plot
  stat_cor(label.y = 6200, aes(label = paste(..rr.label.., ..p.label.., sep = "~`,`~"))) +
  #plot the regression equation
  stat_regline_equation(label.y = 6000)
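The numbers shown in the plot can be cross-checked with base R. The following is a minimal sketch that fits the same linear model with lm() and extracts the coefficients and R2. It assumes that penguins_clean is the penguins data with incomplete rows removed.

```r
library(palmerpenguins)

# Assumption: penguins_clean is the penguins data without incomplete rows
penguins_clean <- na.omit(penguins)

# same model as geom_smooth(method = "lm"): y ~ x
fit <- lm(body_mass_g ~ flipper_length_mm, data = penguins_clean)

# intercept and slope of the regression line
coef(fit)

# coefficient of determination (R2)
r2 <- summary(fit)$r.squared
r2
```

For a regression with a single predictor, R2 equals the squared Pearson correlation between the two variables, which is why both routes should agree.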

Exercise

Generate a scatterplot and plot the bill_length_mm against the bill depth. Add a trend line using the lm method. If you have looked at the addon section, also add the regression equation and R2. Based on visual inspection (and on the R2, if you added it), would you say we see a strong correlation?

Click me to see an answer

ggplot(penguins_clean, aes(bill_length_mm, bill_depth_mm   )) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm") +
  stat_cor(label.y = 22, aes(label = paste(..rr.label.., ..p.label.., sep = "~`,`~"))) +
  stat_regline_equation(label.y = 21.5)

If we look at the plot below, we can see that the data points are very far from the line and that the R2 is close to 0 suggesting a weak, negative association.

Going into the stats is outside the scope of this tutorial but feel free to check the links below for more detail:

10.6 Annotations layer

annotate() adds a geom to a plot but unlike a typical geom function, the properties of the geoms are not mapped from variables of a data frame, but are instead passed in as vectors. This is useful for adding small annotations (such as text labels) or if you have your data in vectors, and for some reason don’t want to put them in a data frame.

We can add text to our scatter plots like this:

ggplot(penguins_clean, aes(bill_length_mm, bill_depth_mm   )) +
  geom_point(alpha = 0.5) +
  annotate("text", x = 58, y = 10, label = "some text")

We also can add rectangles to highlight things:

ggplot(penguins_clean, aes(bill_length_mm, bill_depth_mm   )) +
  geom_point(alpha = 0.5) +
  annotate("rect", xmin = 50, xmax = 60, ymin = 10, ymax = 12, alpha = .2)

We can also add line segments:

ggplot(penguins_clean, aes(bill_length_mm, bill_depth_mm   )) +
  geom_point(alpha = 0.5) +
  annotate("segment",  x = 35, xend = 50, y = 12.5, yend = 22, colour = "purple")

Notice:

We can add both a text and an arrow to highlight a specific data point. To add an arrow we use the segment geom, but we need some more syntax to convert it into an arrow. To do this, we can use the arrow function from the grid package.

ggplot(penguins_clean, aes(bill_length_mm, bill_depth_mm   )) +
  geom_point(alpha = 0.5) +
  annotate("text", x = 58, y = 10, label = "outlier") +
  annotate("segment", x = 58, y = 10.5, xend = 59.6, yend = 16.8, linewidth = 0.7,
         arrow = arrow(type = "closed", length = unit(0.02, "npc")))

11 Facets

Facets partition a plot into a matrix of panels. Each panel shows a different subset of the data.

facet_grid() will produce a grid of plots for each combination of variables that you specify, even if some plots are empty.
facet_wrap() will only produce plots for the combinations of variables that have values, which means it won’t produce any empty plots.

Let’s start by creating a histogram while mapping the fill aesthetic to the species variable:

ggplot(penguins_clean, aes(bill_length_mm,  fill = species)) +
  geom_histogram()

We have discussed before that this way, we don’t know for sure if there is over plotting and discussed ways to better visualize this. A new one we want to discuss now is faceting.

We can facet with one group in vertical direction, or by rows, by adding another layer with facet_grid():

ggplot(penguins_clean, aes(bill_length_mm)) +
  geom_histogram() + 
  facet_grid(rows = vars(species))

Note

ggplot2 before version 3.0.0 used formulas to specify how plots are faceted. If you encounter facet_grid() or facet_wrap() code containing ~, it uses this older formula interface. The vars() interface shown above was added to give the user more flexibility when writing functions.

So the above could also be written as:

ggplot(penguins_clean, aes(bill_length_mm)) +
  geom_histogram() + 
  facet_grid(species ~ .)

While the older way of writing the code is a lot shorter, it is also less explicit about what gets plotted where.

We can also change the plot to horizontal direction, i.e. by column:

ggplot(penguins_clean, aes(bill_length_mm)) +
  geom_histogram() + 
  facet_grid(cols = vars(species))

We can also facet using two variables at the same time. For example, we can create faceted rows for both species and year:

ggplot(penguins_clean, aes(bill_length_mm)) +
  geom_histogram() + 
  facet_grid(rows = vars(species, year))

Or we can facet by two variables, using one for the rows and the other for the columns:

ggplot(penguins_clean, aes(bill_length_mm)) +
  geom_histogram() + 
  facet_grid(rows = vars(species), cols = vars(year))

facet_grid() expects the rows as its first argument and the columns as its second. So a shorter way to write this is:

ggplot(penguins_clean, aes(bill_length_mm)) +
  geom_histogram() + 
  facet_grid(vars(species), vars(year))

By default, all the panels have the same scales and scales="fixed" is used in the background. While for a lot of things it makes sense to have the same scale, we can also make scales independent, by setting scales to free, free_x, or free_y.

ggplot(penguins_clean, aes(bill_length_mm)) +
  geom_histogram() + 
  facet_grid(cols = vars(year), rows = vars(species), scales = "free")

11.1 Addon: Playing with facets

This is just an example showing the amount of flexibility you have with ggplot. For example, below we can use different layers if we wanted to facet by species BUT still wanted to show all data points in each subplot:

#duplicate our dataframe but remove the column we want to use to facet
#this way facet_grid won't separate the points by species for this dataframe
df2 <- penguins_clean |> 
  select(-species)

ggplot(penguins_clean, aes(x =bill_length_mm, y = body_mass_g )) + 
  #add all datapoints to the facets by using our second dataframe
  geom_point(data = df2, color = "grey", alpha = 0.5) +
  #assign the species to the color aesthetic
  geom_point(aes(color = species)) +
  #facet the data
  facet_grid(cols = vars(species))

Exercise:

Generate a scatterplot comparing the bill length and body mass. Add a trendline (color the trend line in grey) and use facet_grid to create subplots for different species. It’s up to you whether you prefer a horizontal or vertical orientation.
Do the same as above but allow each facet to use a different scale.
Generate a histogram showing the body weight. Use facet_grid to generate subplots for different islands and species.
Do the same as in exercise 3 but use facet_wrap instead. Beware, facet_wrap uses a slightly different syntax, so try to figure out what you need to change by using the help function. Once you have generated the plot, do you see how the two functions behave differently?

Click me to see an answer

#question 1
ggplot(penguins_clean, aes(x =bill_length_mm, y = body_mass_g )) + 
  geom_point() +
  geom_smooth(method = "lm", color = "grey") + 
  facet_grid(cols = vars(species))

#question 2
ggplot(penguins_clean, aes(x =bill_length_mm, y = body_mass_g )) + 
  geom_point() +
  geom_smooth(method = "lm", color = "grey") + 
  facet_grid(cols = vars(species), scales = "free")

#question3
ggplot(penguins_clean, aes(body_mass_g)) +
  geom_histogram() + 
  facet_grid(cols = vars(species), rows = vars(island))

#question4
ggplot(penguins_clean, aes(body_mass_g)) +
  geom_histogram() + 
  facet_wrap(vars(species, island))
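The difference between the last two plots comes down to which combinations of species and island actually occur in the data. The following is a minimal sketch that counts them, assuming that penguins_clean is the penguins data with incomplete rows removed.

```r
library(palmerpenguins)

# Assumption: penguins_clean is the penguins data without incomplete rows
penguins_clean <- na.omit(penguins)

# combinations of species and island that occur in the data
combos <- unique(penguins_clean[, c("species", "island")])
combos

# facet_wrap() draws one panel per existing combination
n_wrap <- nrow(combos)
n_wrap

# facet_grid() draws one panel per cell of the full species x island grid
n_grid <- nlevels(penguins_clean$species) * nlevels(penguins_clean$island)
n_grid
```

For the penguins data there are 5 existing combinations but 9 grid cells, which is why facet_grid() shows empty panels and facet_wrap() does not.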

12 Modifying scales and axes

12.1 Scale functions and transformations

Scales control the details of how data values are translated to visual properties. We can override the default scales to tweak details like the axis labels or legend keys.

Some of the available families of scale functions, one for each aesthetic, are:

scale_x_*()
scale_y_*()
scale_color_*()
scale_fill_*()
scale_shape_*()
scale_linetype_*()
scale_size_*()

Importantly, we need to define the scales based on the type of data we have, continuous or discrete:

Discrete variables take a limited set of distinct values, such as categories (e.g. the species) or counts (e.g. the number of observations).
Continuous variables represent measurable amounts (e.g. body weight).

We distinguish between these two options by appending either continuous or discrete to the scale_:

scale_x_continuous()
scale_color_discrete()

12.2 Changing the text for axes and legends

Let’s start by changing the text labels of the x- and y-axis as well as of the legend that gets generated when we use the color aesthetic. Notice how the x- and y-axis each get their own scale and how we have to use both continuous and discrete scales.

ggplot(penguins_clean, aes( x =flipper_length_mm , y = body_mass_g , color = species)) +
  geom_point(position = "jitter") +
  scale_y_continuous("Body mass (g)") +
  scale_x_continuous("Flipper length (mm)") +
  scale_color_discrete("Penguin species")

12.3 Scale transformations

When working with continuous data, the default is to map linearly from the data space onto the aesthetic space. It is possible to override this default using transformations. Every continuous scale takes a trans argument by which we can change the transformation. Built in functions for axis transformations are :

scale_x_log10(), scale_y_log10() : for log10 transformation
scale_x_sqrt(), scale_y_sqrt() : for sqrt transformation
scale_x_reverse(), scale_y_reverse() : to reverse coordinates
coord_trans(x ="log10", y="log10") : is different to scale transformations in that it occurs after statistical transformation. Possible values for x and y are “log2”, “log10”, “sqrt”, etc.
scale_x_continuous(trans='log2'), scale_y_continuous(trans='log2') : another allowed value for the argument trans is ‘log10’

In the example below, we reverse the scale for the y-axis:

ggplot(penguins_clean, aes( x =flipper_length_mm , y = body_mass_g , color = species)) +
  geom_point(position = "jitter") +
  scale_y_reverse()
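The difference between a scale transformation and coord_trans() becomes visible as soon as a statistic is involved. In the sketch below, the linear model is fitted to the log-transformed values in the first plot and to the untransformed values in the second plot, where the fitted line is then bent by the coordinate transformation. It assumes that penguins_clean is the penguins data with incomplete rows removed.

```r
library(ggplot2)
library(palmerpenguins)

# Assumption: penguins_clean is the penguins data without incomplete rows
penguins_clean <- na.omit(penguins)

base_plot <- ggplot(penguins_clean, aes(flipper_length_mm, body_mass_g)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", formula = y ~ x)

# scale transformation: the data are transformed before the model is fitted
p_scale <- base_plot + scale_y_log10()

# coordinate transformation: the model is fitted first, then everything is transformed
p_coord <- base_plot + coord_trans(y = "log10")

p_scale
p_coord
```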

12.4 Changing the axis limits

Limits describe the range of a scale. There are different functions to set axis limits:

xlim() and ylim()
expand_limits()
scale_x_continuous() and scale_y_continuous()

12.4.1 xlim and ylim

Let’s start by changing the range of the y-axis using ylim and let the range go from 0 to 6500:

ggplot(penguins_clean, aes( x =flipper_length_mm , y = body_mass_g , color = species)) +
  geom_point(position = "jitter") +
  ylim(0,6500)
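Keep in mind that limits set with xlim(), ylim() or the limits argument of a scale remove all observations outside the range before any statistics are computed, and ggplot prints a warning about the removed rows. If we only want to zoom in, coord_cartesian() keeps all the data. The following is a minimal sketch contrasting the two, using an arbitrary lower limit of 4000 g and assuming that penguins_clean is the penguins data with incomplete rows removed.

```r
library(ggplot2)
library(palmerpenguins)

# Assumption: penguins_clean is the penguins data without incomplete rows
penguins_clean <- na.omit(penguins)

p <- ggplot(penguins_clean, aes(x = species, y = body_mass_g)) +
  geom_boxplot()

# observations below 4000 g are dropped, so the boxes are recomputed
p_limits <- p + ylim(4000, 6500)

# all observations are kept, the plot is only zoomed
p_zoom <- p + coord_cartesian(ylim = c(4000, 6500))

p_limits
p_zoom

# number of observations that ylim() would drop
n_dropped <- sum(penguins_clean$body_mass_g < 4000)
n_dropped
```

In the first plot the boxes describe only the heavier penguins, while the second plot shows the upper part of the boxes computed from all penguins.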

12.4.2 expand_limits

expand_limits() can be used to quickly make sure that both axes include a given value, for example the origin (0, 0).

ggplot(penguins_clean, aes( x =flipper_length_mm , y = body_mass_g , color = species)) +
  geom_point(position = "jitter") +
  expand_limits(x=0, y=0)

12.4.3 scale_x/y_continuous

We can also use scale_x_continuous() and scale_y_continuous() to change the x- and y-axis limits, respectively. Using these functions is usually beneficial since we can control more details such as:

name : x or y axis labels
breaks : to control the breaks in the guide (axis ticks, grid lines, …). Among the possible values, there are :
- NULL : hide all breaks
- waiver() : the default break computation
- a character or numeric vector specifying the breaks to display
labels : labels of axis tick marks. Allowed values are :
- NULL for no labels
- waiver() for the default labels
- character vector to be used for break labels
limits : a numeric vector specifying x or y axis limits (min, max)
trans for axis transformations. Possible values are “log2”, “log10”, …

Let’s start by controlling the limits and manually set the lower limit to 0 and the upper limit to a value slightly above the largest observation:

ggplot(penguins_clean, aes( x =flipper_length_mm , y = body_mass_g , color = species)) +
  geom_point(position = "jitter") +
  scale_x_continuous(limits = c(0,250)) +
  scale_y_continuous(limits = c(0,7000))

We can also use basic math operations to automatically create the limits for us. Below, we calculate the maximum body mass and use it as the upper limit. Notice: We add + 1 to give the plot a bit more room to fit the last data point.

ggplot(penguins_clean, aes( x =flipper_length_mm , y = body_mass_g , color = species)) +
  geom_point(position = "jitter") +
  scale_x_continuous(limits = c(0,250)) +
  scale_y_continuous(limits = c(0, max(penguins_clean$body_mass_g) + 1 ))

If we want to control where the tick marks are placed, we use the breaks argument. Here we generate the breaks with seq(), whose arguments are the start, the end and the step size. Let us start by using a finer step size for the x-axis:

ggplot(penguins_clean, aes( x =flipper_length_mm , y = body_mass_g , color = species)) +
  geom_point(position = "jitter") +
  scale_x_continuous(limits = c(0,250), breaks = seq(0,250,25)) +
  scale_y_continuous(limits = c(0,7000))
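To see what is passed to breaks, we can run seq() on its own. A minimal sketch (the second call, with a coarser step size, is only there for comparison):

```r
# from 0 to 250 in steps of 25
x_breaks <- seq(0, 250, 25)
x_breaks

# the same call with named arguments and a coarser step size
x_breaks_coarse <- seq(from = 0, to = 250, by = 50)
x_breaks_coarse
```

Any numeric vector works for breaks, so the values can also be typed by hand with c() if they are not evenly spaced.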

We can also control the padding that ggplot adds around the plot limits with the expand argument. Setting expand = c(0,0) removes this padding. To see what is happening, compare the 0,0 position between the plot above and below.

ggplot(penguins_clean, aes( x =flipper_length_mm , y = body_mass_g , color = species)) +
  geom_point(position = "jitter") +
  scale_x_continuous(limits = c(0,250), 
                     breaks = seq(0,250,25),
                     expand = c(0,0)) +
  scale_y_continuous(limits = c(0,7000),
                     expand = c(0,0))

The labels argument adjusts the category names and is an easy way if we want to beautify the legend.

ggplot(penguins_clean, aes( x =flipper_length_mm , y = body_mass_g , color = species)) +
  geom_point(position = "jitter") +
  scale_color_discrete("Penguin species", 
                       labels = c("Ad", "Ch", "Ge"))

12.5 The labs function

labs() is another way to change the axis and legend labels.

ggplot(penguins_clean, aes( x =flipper_length_mm , y = body_mass_g , color = species)) +
  geom_point(position = "jitter") +
  labs(x = "Flipper length (mm)" , 
       y = "Body mass (g)",
       color = "Penguin species" )

12.6 Adding a plot title with ggtitle

We can also quickly add a title. If the title is very long, we can break it into several lines using \n.

ggplot(penguins_clean, aes( x =flipper_length_mm , y = body_mass_g , color = species)) +
  geom_point(position = "jitter") +
  ggtitle("Comparing the body mass (g) and \nflipper length (mm) of three different penguin species")

12.7 Defining our own colors

There are many ways to decide how to assign colors. While we will not go into detail, we briefly want to discuss how we can manually change colors or use existing color palettes.

12.7.1 Using existing color palettes

scale_color_brewer (and others) provide sequential, diverging and qualitative colour schemes. Some examples on how to use them are found here.

ggplot(penguins_clean, aes( x =flipper_length_mm , y = body_mass_g , color = species)) +
  geom_point(position = "jitter") +
  scale_color_brewer(palette = "Spectral")

12.7.2 Manually assigning colors

With scale_color_manual() we can define the colors of the color scale ourselves. The first argument sets the legend title, and for the values argument we provide a vector of colors to use, which can optionally be named.

Below we use simple words, e.g. red and blue, to describe what colors we want to use. However, it’s recommended to use a hexcode instead to have a bit more control over our color scheme. For example, #377EB8 would also give a blue color (and in newer RStudio versions, the hexcode gets highlighted in the color it represents).

Let us start by defining some vectors in which we define the colors we want to map our three species to:

palette1 in which we are explicit about what species is mapped to what color
palette2 in which we only create a vector with three written out colors
palette3 in which we create a vector with colors using a hexcode

#lets start by defining some color vectors
palette1 <- c(Gentoo = "#377EB8", Chinstrap = "#E41A1C", Adelie = "purple")
palette2 <- c("blue",  "red", "purple")
palette3 <- c("#377EB8",  "#E41A1C", "#B300B2")

Now, let us use scale_color_manual to assign our species to these colors:

ggplot(penguins_clean, aes( x =flipper_length_mm , y = body_mass_g , color = species)) +
  geom_point(position = "jitter") +
  scale_color_manual("Penguins", values = palette1)

Exercise

Use palette2 and palette3 instead of palette1. Do you see how they behave differently?

Click me to see an answer

ggplot(penguins_clean, aes( x =flipper_length_mm , y = body_mass_g , color = species)) +
  geom_point(position = "jitter") +
  scale_color_manual("Penguins", values = palette2)

We can see that if we don’t add the species names to the vector, we have less control over how colors are changed. If we don’t specify an order, the colors get assigned to the default order of the mapped variable, i.e. species:

Adelie: blue
Chinstrap: red
Gentoo: purple

In contrast, using palette1 we can be explicit and decide that Adelie is purple regardless of the order of the vector.

ggplot(penguins_clean, aes( x =flipper_length_mm , y = body_mass_g , color = species)) +
  geom_point(position = "jitter") +
  scale_color_manual("Penguins", values = palette3)

Palette3 is an example of us using hexcodes. See how we use blue, red and purple as before but by using the hexcode we can exactly specify the intensity of the colors. In the example here, I decide to use a less bright color intensity compared to when we used the other palettes.

When making your own color palettes you need to keep in mind:

You need as many colors as you have categories (3 in our case)
Depending on how you assign the colors, you might not have control over the order of the colors and need to watch out that the colors get assigned to the right variable, i.e. species in the examples above.

13 Themes

Themes allow us to control all non-data ink on our plot, i.e. all visual elements that are not part of the data.

There are three types of theme elements:

text, which can be modified using element_text()
line, which can be modified using element_line()
rectangle, which can be modified using element_rect()

13.1 Modifying text elements

In general, we can modify all text in a plot, such as the titles of the plot, the legend and the axes.

Let’s modify part of the axes, specifically the color of the axis title. For this, we access the axis.title via the theme function and change text theme element via element_text().

ggplot(penguins_clean, aes( x = bill_length_mm, y = body_mass_g, color = species  )) +
  geom_point() +
  theme(axis.title = element_text(color = "blue"))

We can change specific parts in a hierarchical manner.

For example, when using axis.title, we can change both, the x- and y-axis. With axis.title.x we only change the x-axis and so on.
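A minimal sketch of this hierarchy: the general setting colors both axis titles blue, and the more specific setting then overrides the color for the x-axis title only. It assumes that penguins_clean is the penguins data with incomplete rows removed; the colors are arbitrary.

```r
library(ggplot2)
library(palmerpenguins)

# Assumption: penguins_clean is the penguins data without incomplete rows
penguins_clean <- na.omit(penguins)

p_theme <- ggplot(penguins_clean,
                  aes(x = bill_length_mm, y = body_mass_g, color = species)) +
  geom_point() +
  theme(axis.title   = element_text(color = "blue"),  # applies to both axis titles
        axis.title.x = element_text(color = "red"))   # overrides the x-axis title only

p_theme
```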

13.2 Modify lines

Lines are the axis lines as well as the tick marks or the lines around the plot. Same as for text, this is changed in a hierarchical manner.

As an example, let us change the color of the x axis line:

ggplot(penguins_clean, aes( x = bill_length_mm, y = body_mass_g, color = species  )) +
  geom_point() +
  theme(axis.line.x = element_line(color = "blue"))

13.3 element_blank()

We can use element_blank() to remove any item of a plot. If you are unsure what each line removes, try running the code while leaving out individual elements.

ggplot(penguins_clean, aes( x = bill_length_mm, y = body_mass_g, color = species  )) +
  geom_point() + 
  theme(line = element_blank(),
        rect = element_blank(),
        text = element_blank())

13.4 Moving the legend

Legend positions can be changed via legend.position. Here, we can use the following:

“top”, “bottom”, “left”, or “right”: place it at that side of the plot.
“none”: don’t draw it.
We can provide exact positions, using c(x, y): here, c(0, 0) means the bottom-left and c(1, 1) means the top-right.

Exercise

With the information above, try to remove the legend.
Add the legend at the bottom of the plot
Position the legend with x at 0.6 and y at 0.1

Click me to see an answer

#question 1
ggplot(penguins_clean, aes( x = bill_length_mm, y = body_mass_g, color = species  )) +
  geom_jitter(alpha =0.6) + 
  theme(legend.position = "none")

#question 2
ggplot(penguins_clean, aes( x = bill_length_mm, y = body_mass_g, color = species  )) +
  geom_jitter(alpha =0.6) + 
  theme(legend.position = "bottom")

#question 3
ggplot(penguins_clean, aes( x = bill_length_mm, y = body_mass_g, color = species  )) +
  geom_jitter(alpha =0.6) + 
  theme(legend.position = c(0.6, 0.1))

13.5 Modifying theme elements

Many plot elements have multiple properties that can be set. For example, line elements, such as axes and grid lines, have a color, a thickness (linewidth, called size before ggplot2 3.4.0), and a line type (solid line, dashed, or dotted). To set the style of a line, you use element_line().

Below, we will go through a few examples but for a full list of options, go here.

For example, we can give the panel background (panel.background) a “white” fill and a grey outline, and remove the background of the legend keys (legend.key) by setting its fill to be missing (NA):

ggplot(penguins_clean, aes( x = bill_length_mm, y = body_mass_g, color = species  )) +
  geom_jitter(alpha =0.6) + 
  theme(panel.background = element_rect(fill = "white", color = "grey"),
        legend.key = element_rect(fill = NA) )

We can also remove the axis ticks (axis.ticks) by using element_blank(). We can also remove the panel grid lines, panel.grid in the same way.

ggplot(penguins_clean, aes( x = bill_length_mm, y = body_mass_g, color = species  )) +
  geom_jitter(alpha =0.6) +
  theme(axis.ticks = element_blank(),
        panel.grid = element_blank())

If we wanted to at least add the major grid lines back to the plot above we would do the following.

ggplot(penguins_clean, aes( x = bill_length_mm, y = body_mass_g, color = species  )) +
  geom_jitter(alpha =0.6) +
  theme(axis.ticks = element_blank(),
        panel.grid = element_blank(),
        panel.grid.major = element_line(color = "black", linewidth = 0.5, linetype = "dotted"))

We can also easily change the text. For example, let us make the axis.text less prominent by changing the color to “grey”. Additionally, let’s add a title, increase the size of the plot.title to 16 and change its font face to “italic”.

ggplot(penguins_clean, aes( x = bill_length_mm, y = body_mass_g, color = species  )) +
  geom_jitter(alpha =0.6) +
  ggtitle("Penguins are fun") +
  theme(axis.text = element_text(color = "grey"),
        plot.title = element_text(size = 16, face = "italic"))

Another useful thing: We can increase the distance between the axis titles and the plot by increasing the margins:

ggplot(penguins_clean, aes( x = bill_length_mm, y = body_mass_g, color = species  )) +
  geom_jitter(alpha =0.6) +
  ggtitle("Penguins are fun") +
  theme(axis.title.x = element_text(margin=margin(t=10)),
        axis.title.y = element_text(margin=margin(r=10)),
        axis.text = element_text(color = "grey"),
        plot.title = element_text(size = 16, face = "italic"))

13.6 Modify white space

When talking about whitespace, we mean all the non-visible margins and spacing in the plot.

To set a single whitespace value, use unit(x, unit), where x is the amount and unit is the unit of measure. The default unit is “pt” (points), which scales well with text. Other options include “cm”, “in” (inches) and “lines” (of text). If you want to have a list of all possible units, type ?grid::unit.

For example, we could make longer axis ticks like this:

ggplot(penguins_clean, aes( x = bill_length_mm, y = body_mass_g, color = species  )) +
  geom_jitter(alpha =0.6) +
  theme(axis.ticks.length = unit(12, "points"))

Borders require you to set 4 positions, so use margin(top, right, bottom, left, unit). To remember the margin order, think TRouBLe. We set the legend.margin to 20 points (“pt”) on the top, 30 pts on the right, 40 pts on the bottom, and 50 pts on the left like this:

ggplot(penguins_clean, aes( x = bill_length_mm, y = body_mass_g, color = species  )) +
  geom_jitter(alpha =0.6) +
  theme(legend.margin = margin(20,30,40,50, "pt"))

Exercise

Give the legend key size, legend.key.size, a unit of 3 centimeters (“cm”).
Set the plot margin, plot.margin, to 10, 30, 50, and 70 millimeters (“mm”).

Click me to see an answer

#question1
ggplot(penguins_clean, aes( x = bill_length_mm, y = body_mass_g, color = species  )) +
  geom_jitter(alpha =0.6) +
  theme(legend.key.size = unit(3, "cm"))

#question2
ggplot(penguins_clean, aes( x = bill_length_mm, y = body_mass_g, color = species  )) +
  geom_jitter(alpha =0.6) +
  theme(plot.margin = margin(10,30,50,70, "mm"))

13.7 Modifying facets

With strip we can change the appearance of facets. Let’s change the facet text font and the box.

ggplot(penguins_clean, aes(bill_length_mm)) +
  geom_histogram() + 
  facet_grid(species ~ year, scales = "free") +
  theme(strip.text = element_text(size=12, face="bold"),
        strip.background = element_rect(colour="black", fill="white",linetype="solid"))
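The column and row labels of a facet grid can also be styled separately with strip.text.x and strip.text.y. The following is a minimal sketch, assuming penguins_clean is the penguins data without incomplete rows; the chosen styles are arbitrary.

```r
library(ggplot2)
library(palmerpenguins)

# Assumption: penguins_clean is penguins without incomplete rows
penguins_clean <- na.omit(penguins)

p_strip <- ggplot(penguins_clean, aes(bill_length_mm)) +
  geom_histogram(bins = 20) +
  facet_grid(species ~ year) +
  theme(
    # column labels (year) in bold
    strip.text.x = element_text(face = "bold"),
    # row labels (species) in italic and written horizontally
    strip.text.y = element_text(face = "italic", angle = 0),
    # remove the box behind the labels
    strip.background = element_blank()
  )

p_strip
```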

14 Theme flexibility

Ways to use themes:

From scratch (as shown above)
Using theme layer objects
Using built-in themes (from ggplot2)
Using built-in themes (from other packages)

14.1 Make your own themes

For now, let’s have a look at the second point. Making our own themes is useful for consistency across several plots.

Let’s look at one of the plots we made before and store it in the variable z. In the next step, we can add some custom theme settings to z:

z <-
  ggplot(penguins_clean, aes( x = bill_length_mm, y = body_mass_g, color = species  )) +
  geom_jitter(alpha =0.6) +
  scale_x_continuous("Bill length (mm)") +
  scale_y_continuous("Body weight (g)") +
  scale_color_brewer("Penguins", palette = "Dark2", labels = c("Ad", "Ch", "Ge"))

z

Now, let’s change some theme elements to make our plot look a bit more professional.

z + theme(text = element_text(family = "serif", size = 14),
          rect = element_blank(),
          panel.grid = element_blank(),
          axis.line = element_line(color = "black"))

Now, let’s define a theme that we can reuse.

theme_pengs <- theme(text = element_text(family = "serif", size = 14),
          rect = element_blank(),
          panel.grid = element_blank(),
          axis.line = element_line(color = "black"))

Use our new theme for the plot.

z + theme_pengs

The advantage of this approach is that we can apply the same theme to any other plot. For example, let’s apply it to the plot below:

m <- ggplot(penguins_clean, aes(x = body_mass_g)) +
  geom_histogram() +
  scale_x_continuous(expand = c(0,0))+
  scale_y_continuous(expand = c(0,0)) 

m

Now, we can add our theme like this:

m + theme_pengs
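Instead of adding a theme to every single plot, ggplot2 also lets us change the default theme for the whole session with theme_set(). The sketch below is an optional variation, assuming penguins_clean is the penguins data without incomplete rows. Because the default theme is looked up when a plot is drawn, the plot is printed before the old default is restored.

```r
library(ggplot2)
library(palmerpenguins)

# Assumption: penguins_clean is penguins without incomplete rows
penguins_clean <- na.omit(penguins)

# theme_set() returns the previous default, so we can restore it later
old_theme <- theme_set(theme_classic(base_size = 14))

m_default <- ggplot(penguins_clean, aes(x = body_mass_g)) +
  geom_histogram(bins = 30)

# drawn with theme_classic(), although no theme was added to the plot
print(m_default)

# restore the previous default theme
theme_set(old_theme)
```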

We can still modify a theme by adding another theme layer, which overwrites the previous settings.

m + 
  theme_pengs +
  theme(axis.line.x = element_blank())
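If we want to adjust a reusable theme per plot (for example the font size), one option is to wrap it in a function. The name theme_pengs_fn and its arguments are hypothetical and only serve as an illustration; penguins_clean is again assumed to be the penguins data without incomplete rows.

```r
library(ggplot2)
library(palmerpenguins)

# Assumption: penguins_clean is penguins without incomplete rows
penguins_clean <- na.omit(penguins)

# Hypothetical helper: same settings as theme_pengs, but adjustable
theme_pengs_fn <- function(base_size = 14, base_family = "serif") {
  theme(text = element_text(family = base_family, size = base_size),
        rect = element_blank(),
        panel.grid = element_blank(),
        axis.line = element_line(color = "black"))
}

# Use a smaller font size for this particular plot
p_small <- ggplot(penguins_clean,
                  aes(x = bill_length_mm, y = body_mass_g, color = species)) +
  geom_jitter(alpha = 0.6) +
  theme_pengs_fn(base_size = 10)

p_small
```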

14.2 Accessing built-in themes

Use theme_*() to access built-in themes. A full list of themes can be found in the ggplot2 documentation. In general, the following built-in themes are useful to get started:

theme_gray() is the default.
theme_bw() is useful when you use transparency.
theme_classic() is more traditional.
theme_void() removes everything but the data.

z +
  theme_classic()
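To compare the four themes listed above side by side, we can apply each of them to the same plot in a loop. This is a small sketch with made-up variable names; penguins_clean is assumed to be the penguins data without incomplete rows.

```r
library(ggplot2)
library(palmerpenguins)

# Assumption: penguins_clean is penguins without incomplete rows
penguins_clean <- na.omit(penguins)

base_plot <- ggplot(penguins_clean,
                    aes(x = bill_length_mm, y = body_mass_g, color = species)) +
  geom_jitter(alpha = 0.6)

# The built-in themes we want to compare
themes <- list(gray = theme_gray(), bw = theme_bw(),
               classic = theme_classic(), void = theme_void())

# One plot per theme, titled with the name of the theme function
theme_plots <- lapply(names(themes), function(nm) {
  base_plot + themes[[nm]] + ggtitle(paste0("theme_", nm, "()"))
})
names(theme_plots) <- names(themes)

# Look at a single result, e.g. the black and white theme
theme_plots$bw
```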

Built-in themes offer a convenient way to change all text at once via the base_size argument (the default is 11). Increasing it is often useful for presentations:

z +
  theme_classic(base_size = 12)

Again, we can modify every specific element we want.

z +
  theme_classic() +
  theme(text = element_text(family = "serif"))

Exercise

Create a scatter plot comparing the bill length and body mass. Map the species variable to the color aesthetic. Add a black and white theme, theme_bw(), to the penguin plot.

Click me to see an answer

ggplot(penguins_clean, aes( x = bill_length_mm, y = body_mass_g, color = species  )) +
  geom_jitter(alpha =0.6) +
  theme_bw()

15 Playground

The code below is outside the scope of the tutorial but lists some other packages that can be useful in combination with ggplot.

15.1 Cowplot

cowplot provides various features that help with creating publication-quality figures, such as a set of themes, functions to align plots and arrange them into complex compound figures, and functions that make it easy to annotate plots and/or mix plots with images.

#load library
library(cowplot)

Let us start by creating some plots first:

hist <- ggplot(penguins_clean, aes( x = flipper_length_mm, color = species )) +
  geom_histogram() 

box <- ggplot(penguins_clean, aes(x = species, y = body_mass_g, color = species)) +
  geom_boxplot() + 
  geom_point(position = position_jitter())

scatter <- ggplot(penguins_clean, aes(flipper_length_mm, body_mass_g, color = species)) +
  geom_point()

15.1.1 Cowplot themes

Cowplot comes with some built-in themes that can be used to beautify the plot:

scatter +
  theme_cowplot(12)

15.1.2 Arranging multiple plots

With the plot_grid function, we can arrange multiple plots in a grid:

plot_grid(scatter, box, labels = c("A", "B"), label_size = 12)

We see that we have a redundant legend. We can deal with this as follows:

# extract a legend 
legend <- get_legend(scatter +
                       theme(legend.direction = "vertical",
                             legend.justification="top", 
                             legend.box.just = "top")
                     )

#add legend to the plot
p <-plot_grid(
  scatter + theme(legend.position="none"), 
  box  + theme(legend.position="none"),
  legend,
  nrow = 1,
  rel_widths = c(1,1,0.3),
  labels = c("A", "B", ""), 
  label_size = 12)

p
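Once a compound figure looks right, it can be written to a file. The sketch below uses ggsave() from ggplot2 and writes a PDF to a temporary directory; the file name and the dimensions are arbitrary choices, and penguins_clean is assumed to be the penguins data without incomplete rows.

```r
library(ggplot2)
library(cowplot)
library(palmerpenguins)

# Assumption: penguins_clean is penguins without incomplete rows
penguins_clean <- na.omit(penguins)

scatter <- ggplot(penguins_clean, aes(flipper_length_mm, body_mass_g, color = species)) +
  geom_point()

box <- ggplot(penguins_clean, aes(x = species, y = body_mass_g, color = species)) +
  geom_boxplot()

# Arrange both plots next to each other
combined <- plot_grid(scatter, box, labels = c("A", "B"), label_size = 12)

# Save the compound figure; width and height are given in inches
out_file <- file.path(tempdir(), "penguin_panels.pdf")
ggsave(out_file, plot = combined, width = 10, height = 4)
```

A wide format like the one above tends to work well for two panels in one row; for other layouts the width and height need to be adjusted.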

15.2 ggstatsplot

ggstatsplot is an extension of the ggplot2 package for creating graphics in which the details of statistical tests are included in the plots themselves. More information can be found in the package documentation.

library(ggstatsplot)

There are different types of plots. For example, we can use violin plots for comparisons between groups/conditions. The following functions are available:

Function	Plot	Description
ggbetweenstats()	violin plots	for comparisons between groups/conditions
ggwithinstats()	violin plots	for comparisons within groups/conditions
gghistostats()	histograms	for the distribution of a numeric variable
ggdotplotstats()	dot plots/charts	for the distribution of a labeled numeric variable
ggscatterstats()	scatterplots	for correlation between two variables
ggcorrmat()	correlation matrices	for correlations between multiple variables
ggpiestats()	pie charts	for categorical data
ggbarstats()	bar charts	for categorical data
ggcoefstats()	dot-and-whisker plots	for regression models and meta-analysis

We can also choose between different statistical approaches via the type argument. For a one-sample test, for example, the following tests are used:

Type	Test	Function used
Parametric	One-sample Student’s t-test	stats::t.test()
Non-parametric	One-sample Wilcoxon test	stats::wilcox.test()
Robust	Bootstrap-t method for one-sample test	WRS2::trimcibt()
Bayesian	One-sample Student’s t-test	BayesFactor::ttestBF()

An example for a scatterplot:

ggscatterstats(data = penguins_clean, x = bill_length_mm, y = body_mass_g)

A simple example for a violin plot:

ggbetweenstats(data = penguins, x = species, y = body_mass_g,  type = "robust")

A histogram of the distribution of a numeric variable with gghistostats():

gghistostats(data = penguins, x = body_mass_g,  type = "np",
             normal.curve = TRUE, normal.curve.args = list(color = "black", size = 1))

We can also extract the stats like this:

p <- gghistostats(data = penguins, x = body_mass_g,  type = "np")

extract_stats(p)

$subtitle_data
# A tibble: 1 × 12
  statistic  p.value method                    alternative effectsize       
      <dbl>    <dbl> <chr>                     <chr>       <chr>            
1     58653 8.19e-58 Wilcoxon signed rank test two.sided   r (rank biserial)
  estimate conf.level conf.low conf.high conf.method n.obs expression
     <dbl>      <dbl>    <dbl>     <dbl> <chr>       <int> <list>    
1        1       0.95        1         1 normal        342 <language>

$caption_data
NULL

$pairwise_comparisons_data
NULL

$descriptive_data
NULL

$one_sample_data
NULL

$tidy_data
NULL

$glance_data
NULL

We can also do a grouped comparison, e.g. let’s compare the body mass of different species grouped by sex. Since we do several comparisons, let’s also apply a Bonferroni correction here:

grouped_ggbetweenstats(data = penguins_clean, x = species, y= body_mass_g,  grouping.var = sex,
                       p.adjust.method  = "bonferroni", centrality.type = "parametric")

We can also use this tool to tag outliers:

ggbetweenstats(data = penguins, x = species, y = body_mass_g,  outlier.tagging = TRUE, outlier.color = "pink")

Finally, we can also generate a correlogram (a matrix of correlation coefficients):

ggcorrmat(data = penguins_clean)

15.3 ggpp

ggpp is a relatively new package that allows us to further modify plots using tibbles.

To add summary tables and plots to a figure, we first install (if necessary) and load the following libraries:

library(tibble)
library(ggpp)
library(ggpubr)

#store a custom theme that we want to use for all plots in a variable
#theme_pubr is part of the ggpubr package
theme_manual <- 
  theme_pubr(base_size = 10) +
  theme(legend.position = "right",
        axis.title.x = element_text(margin=margin(t=10)),
        axis.title.y = element_text(margin=margin(r=10)))

We can easily add a table on top of our plot:

#generate summary statistics
#we use rename to prettify the column headers
#and round so that the mean is displayed without decimal digits
summary <- penguins_clean |> 
  group_by(species) |> 
  summarise(avg_body_mass = round(mean(body_mass_g),0)) |> 
  rename(Species = species, `Mean body\n mass (g)` = avg_body_mass)

#generate a tibble from our dataframe
#the positions will later be used to place our table in the plot 
data.tb <- tibble(x = 30, y = 6400, tb = list(summary))

#generate a dot plot and add summary stats
ggplot(penguins_clean, aes(bill_length_mm, body_mass_g)) +
  #geom_table allows us to plot a df or tibble
  geom_table(data = data.tb, aes(x, y, label = tb)) +
  geom_point(alpha = 0.5)  +
  theme_manual +
  ylim(NA,6500) + 
  xlim(NA,65)

Or a plot:

#generate a boxplot and store it in a variable
boxplot <-
ggplot(penguins_clean, aes(species, body_mass_g, fill = species )) +
  geom_boxplot() +
  theme_bw(base_size = 8) + 
  theme(legend.position = "none",
        panel.grid = element_blank(),
        axis.title = element_blank())

#wrap the plot in a tibble
data.tb <- tibble(x = 20, y = 6400, 
                  plot = list(boxplot))

#generate a scatter plot and add the boxplot in an additional layer
ggplot(penguins_clean, aes(bill_length_mm,body_mass_g , color = species)) + 
  geom_point() +
  geom_plot(data = data.tb, aes(x, y, label = plot)) +
  theme_manual
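For comparison, a similar inset can be built without ggpp by converting the small plot into a grob and placing it with annotation_custom() from ggplot2. This is a swapped-in alternative, not what geom_plot() does internally; the inset coordinates are arbitrary and penguins_clean is assumed to be the penguins data without incomplete rows.

```r
library(ggplot2)
library(palmerpenguins)

# Assumption: penguins_clean is penguins without incomplete rows
penguins_clean <- na.omit(penguins)

# Small boxplot that will be used as the inset
inset <- ggplot(penguins_clean, aes(species, body_mass_g, fill = species)) +
  geom_boxplot() +
  theme_bw(base_size = 8) +
  theme(legend.position = "none",
        panel.grid = element_blank(),
        axis.title = element_blank())

# Place the inset in the upper left corner, given in data coordinates
p_inset <- ggplot(penguins_clean, aes(bill_length_mm, body_mass_g, color = species)) +
  geom_point() +
  annotation_custom(ggplotGrob(inset),
                    xmin = 32, xmax = 42, ymin = 5000, ymax = 6300)

p_inset
```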

We can also zoom into plots:

p <-
ggplot(penguins_clean, aes(bill_length_mm, body_mass_g)) +
  geom_point(alpha = 0.5) +
  ylim(NA,6500) + 
  xlim(NA,65)

data.tb <- 
  tibble(x = 90, y = 2500, 
         plot = list(p + 
                       coord_cartesian(xlim = c(49, 61), 
                                       ylim = c(5000, 6400)) +
                       labs(x = NULL, y = NULL) +
                       theme_bw(8) +
                       scale_colour_discrete(guide = "none")))

ggplot(penguins_clean, aes(bill_length_mm, body_mass_g)) +
  geom_plot(data = data.tb, aes(x, y, label = plot)) +
  annotate(geom = "rect", 
           xmin = 49, xmax = 61, ymin = 5000, ymax = 6400,
           linetype = "dotted", fill = NA, colour = "black") +
  geom_point() +
  annotate("segment",  x = 49, xend = 65, y = 5000, yend = 2700, linetype = "dotted") +
  annotate("segment",  x = 61, xend = 89, y = 6400, yend = 4200, linetype = "dotted") +
  theme_manual + 
  scale_x_continuous(breaks = seq(0,60,10))
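The inset above zooms with coord_cartesian() rather than xlim() and ylim(). The difference matters as soon as a layer computes something from the data: coord_cartesian() only changes the visible region, whereas xlim() and ylim() drop the points outside the limits before any statistic is calculated. The following sketch illustrates this with a regression line, which is not part of the plot above; penguins_clean is assumed to be the penguins data without incomplete rows.

```r
library(ggplot2)
library(palmerpenguins)

# Assumption: penguins_clean is penguins without incomplete rows
penguins_clean <- na.omit(penguins)

base <- ggplot(penguins_clean, aes(bill_length_mm, body_mass_g)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x)

# Zoom only: the regression line is still fitted to all penguins
zoom_view <- base +
  coord_cartesian(xlim = c(49, 61), ylim = c(5000, 6400))

# Limits on the scales: points outside are removed (with a warning)
# and the regression line is fitted to the remaining points only
zoom_data <- base +
  xlim(49, 61) +
  ylim(5000, 6400)

zoom_view
zoom_data
```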