An introduction to ggplot
1 Introduction
This tutorial will introduce the basics of plotting with ggplot.
Visualizing data is important not only for publishing but also for exploring our data. For example, imagine that we do a regression analysis and find that two variables (e.g. height and income) are positively correlated. While this might be an exciting result, without visualizing the data we might not realize that we have outliers or that the data points are not normally distributed. That is why plotting the data is so important. For example, in the plot below we have the exact same regression line but the data points behave very differently.
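A well-known illustration of this point is Anscombe's quartet, which ships with base R as the anscombe data frame. The sketch below is an added illustration and not the data used elsewhere in this tutorial; the helper name fit_pair is made up for this example. It fits one linear model per x/y pair and compares the coefficients.

```r
# Anscombe's quartet: four x/y pairs with nearly identical regression lines
# `fit_pair` is a hypothetical helper name used only in this sketch
fit_pair <- function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  coef(lm(y ~ x))
}

# one column per data set, rows are intercept and slope
coefs <- sapply(1:4, fit_pair)
round(coefs, 2)
```

The four intercepts and slopes agree to two decimal places, yet scatterplots of the four pairs look completely different.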
2 Setting up the working directory
This tutorial was prepared in RStudio with the following dependencies:
- R version 4.2.
- The palmerpenguins_0.1.1 package
- The tidyverse_1.3.2 package
If you are opening the notebook version (i.e. the qmd file) of this tutorial, you will sometimes see a little comment before the actual code, e.g. #| eval: false. This comment simply controls how the code chunk behaves when rendering the document to HTML and can be ignored when following the tutorial via the qmd file.
If you want to follow this tutorial, you need to install some tools first. Specifically, we need to download the data set that we will explore today as well as the tidyverse library. The tidyverse includes several packages such as dplyr for data transformation and ggplot2, our plotting library.
If you don’t have the palmerpenguins data set and the tidyverse installed yet then you can type the following into your R notebook or console. You only need to run this kind of code once.
#install required data package and libraries
#the second install might take a bit
install.packages("palmerpenguins")
install.packages("tidyverse")
Once everything is installed, load the libraries. Keep in mind that while you only need to install libraries once, you need to load them in every new R session.
library(tidyverse)
library(palmerpenguins)
3 Data exploration and cleaning
During this tutorial we will work with two datasets:
- The palmerpenguins dataset, a data set that records details of 344 penguins. This will be the main data set we will explore.
- The beaver dataset, a time series recording the temperature and activity of two beavers. This dataset is part of base R and doesn’t need to be installed. Time-series are best represented in line plots and we will explore this data when talking about these types of plots.
Let’s have a first look at the penguin dataset by looking at the first few rows using the head() function.
If you are viewing the HTML version, you can see all columns of the table below by clicking on the little arrow in the top-right corner of the table.
head(penguins)
We see that we have 8 columns of data. The columns contain categorical data (species, sex and island) as well as numerical data (flipper_length_mm, body_mass_g, year).
Next, let’s explore the data structure (str() lets us know if our columns are factors, integers or numeric).
#view the data structure of the penguin data
str(penguins)
tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
$ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
$ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
$ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
$ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
$ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
$ body_mass_g : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
$ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
$ year : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
Some key features we can take from this:
- The total amount of data we work with, i.e. 344 rows and 8 columns of data
- The names and content of all 8 columns, e.g. species, bill_length_mm and sex
- The type of data we have for each column, e.g. factor, numeric or integer
- The unique levels for factor variables, e.g. we compare 3 different penguin species
- If we look closely, we can see whether or not we have to worry about NAs (i.e. missing data). For example, we see an NA in the sex column
Finally, let’s calculate basic summary statistics of our data using the summary() function:
summary(penguins)
species island bill_length_mm bill_depth_mm
Adelie :152 Biscoe :168 Min. :32.10 Min. :13.10
Chinstrap: 68 Dream :124 1st Qu.:39.23 1st Qu.:15.60
Gentoo :124 Torgersen: 52 Median :44.45 Median :17.30
Mean :43.92 Mean :17.15
3rd Qu.:48.50 3rd Qu.:18.70
Max. :59.60 Max. :21.50
NA's :2 NA's :2
flipper_length_mm body_mass_g sex year
Min. :172.0 Min. :2700 female:165 Min. :2007
1st Qu.:190.0 1st Qu.:3550 male :168 1st Qu.:2007
Median :197.0 Median :4050 NA's : 11 Median :2008
Mean :200.9 Mean :4202 Mean :2008
3rd Qu.:213.0 3rd Qu.:4750 3rd Qu.:2009
Max. :231.0 Max. :6300 Max. :2009
NA's :2 NA's :2
After our first data exploration steps, we can see that we have 344 rows and eight columns of data. The columns contain types of categorical or numerical data, such as species, bill length or year. We know how many unique observations we have for categorical data, i.e. we look at three species of penguins. Additionally, for each data column we got a summary of the number of observations (for factor data) as well as basic statistics for all numerical data. Another useful piece of information we get is the number of missing values for each column.
Since we noticed that the penguin dataset has rows with missing data (i.e. NA), our first task is to remove any row with missing data:
#discard rows with NAs and store the resulting dataframe in a new variable
penguins_clean <- penguins |>
  drop_na()
If you are working with an R version below 4.1, you will get an error when using |>. If you get an error and can’t update or don’t want to update R, use %>% instead. For this to work, you need to have the tidyverse library loaded.
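As a minimal sketch of that fallback, the same cleaning step can be written with the magrittr pipe, or without any pipe using na.omit() from base R. The variable name penguins_clean_base is made up for this comparison.

```r
library(tidyverse)
library(palmerpenguins)

# same cleaning step written with the magrittr pipe (works in R < 4.1)
penguins_clean <- penguins %>%
  drop_na()

# base R alternative that needs no pipe at all
penguins_clean_base <- na.omit(penguins)

# both approaches keep the same number of rows
nrow(penguins_clean)
nrow(penguins_clean_base)
```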
Before proceeding, let’s first check how many rows of data we removed:
#sanity check to control how many rows were dropped
#print the dimensions for our original dataset
print(dim(penguins))
[1] 344 8
#print the dimensions for our cleaned dataset
print(dim(penguins_clean))
[1] 333 8
Sanity checks are an important step when manipulating data and we recommend always doing them.
In the example above, we check the dimensions of our data frame before and after cleaning and ensure that the number of rows that were removed makes sense.
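One way to judge whether the number of removed rows makes sense is to count the incomplete rows directly. The sketch below uses base R only; the variable name n_incomplete is made up for this example. The result should match the difference between the two row counts printed above (344 - 333).

```r
library(palmerpenguins)

# rows that contain at least one NA; these are the rows drop_na() discards
n_incomplete <- sum(!complete.cases(penguins))
n_incomplete

# missing values per column, to see where the NAs come from
colSums(is.na(penguins))
```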
4 Our first scatterplot
Now we can start to generate our first plot. Let’s start with a scatterplot. In the code below we:
- Initialize the plot using the ggplot() function in the first line of code. Inside the function, we provide the input dataframe (i.e. penguins_clean) and we also map variables (i.e. flipper_length_mm and body_mass_g) to the aes(thetics) in our graph. More specifically, we define what we want to plot on the x- and/or y-axes.
- Define how we want to plot our data in the second line of code. For example, we specify that we want to generate a scatterplot via geom_point(). When using the + symbol after the first ggplot call, we add a layer to our plot. We can add as many layers to our plot as we want, more on that later.
ggplot(penguins_clean, aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point()
By default, ggplot expects the first aesthetic to be mapped to the x and the second aesthetic to be mapped to the y variable. We could also write the code a bit shorter and still get the exact same result as shown below:
ggplot(penguins_clean, aes(flipper_length_mm, body_mass_g)) +
geom_point()
While omitting the x= and y= is definitely shorter, the code becomes a little harder to understand. It is therefore important to find a good balance between keeping code short and keeping it readable.
Exercise
- Plot bill length on the x and bill depth on the y-axis.
- Plot the body mass on the y-axis and the species on the x-axis and generate a plot. Notice a difference?
Click me to see an answer
Question 1
ggplot(penguins_clean, aes(x = bill_length_mm , y = bill_depth_mm )) +
geom_point()
Question 2
ggplot(penguins_clean, aes(x = species, y = body_mass_g)) +
geom_point()
In the second example we generate a dot plot instead of a scatter plot. This type of plot is useful if we have a low number of observations and want to show the reader all the data. While we could also just plot the mean and standard errors, plotting all data points can be more informative as it can give a general idea about the spread of the data and the presence of any outliers.
At the moment the plot is not ideal since points are on top of each other but we later learn how we could improve this.
5 Saving plots as variables
Plots can be saved as variables and we can add more layers to these variables using the + operator. This is really useful if you want to make multiple related plots from a common base, e.g. we could store a common theme in one variable and reuse it for different plot types.
For example, we could save the mappings when using the ggplot() function as a variable called plt_pengs and then use this variable whenever we want to add new layers:
#save our first layer in the plt_pengs variable
plt_pengs <- ggplot(penguins_clean, aes(species, flipper_length_mm))
#add the second layer to plt_pengs to generate a scatter plot
plt_pengs +
  geom_point()
By storing our mappings in a variable, we can easily change how the plot looks. For example, instead of generating a dot plot let us generate a boxplot (more about boxplots later). Notice, how we don’t have to retype the mappings but just use our variable instead:
plt_pengs +
  geom_boxplot()
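The idea of storing a common theme in a variable can be sketched the same way. The names shared_style, plt_points and plt_boxes are made up for this example, and theme_minimal() is just one of the built-in ggplot2 themes chosen for illustration.

```r
library(tidyverse)
library(palmerpenguins)

penguins_clean <- drop_na(penguins)

# shared styling stored once; `shared_style` is a name made up for this sketch
shared_style <- theme_minimal()

# shared data and mappings
plt_pengs <- ggplot(penguins_clean, aes(species, flipper_length_mm))

# two plot types built from the same base and the same theme
plt_points <- plt_pengs + geom_point() + shared_style
plt_boxes <- plt_pengs + geom_boxplot() + shared_style

plt_points
plt_boxes
```

Changing shared_style in one place then restyles every plot that uses it.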
6 Saving plots to your computer
How would we save this plot to our computer?
When saving images it is generally better to save a plot as a vector file, such as svg, ai or pdf (in most cases), instead of a raster file such as png or tiff. Vector files use paths of points and lines to create an image while raster graphics are created using pixels.
Using vector images allows us to modify every line or dot in a plot using tools such as Illustrator or Inkscape and thereby customize plots outside of R.
6.1 Using the export function in RStudio
In RStudio, we can use the export function of the Plots tab, which sits in the same pane as the Files tab. If a figure is shown there, click on Export and decide how you want to save your image.
6.2 Using the pdf function
The pdf() function takes as its argument the file name and, if needed, the directory we want to save our file in (in the case below the images folder found on the same level as the code directory). We can add additional arguments such as the width or height of our plot or even the target paper size.
#call the pdf command to tell R we want to generate a plot
pdf(file = "../images/ExamplePlot_BaseR.pdf", paper = "a4")
#create a plot
ggplot(penguins_clean, aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point()
# Run dev.off() to create the file
dev.off()
6.3 Using ggsave
The ggsave() function is part of the ggplot2 package and can be used to save any plot generated with ggplot. Whenever you use other libraries to generate plots or extend ggplot, it might be better to save plots using pdf() or via the export function in RStudio.
#create a plot and store it in a variable
my_plot <- ggplot(penguins_clean, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()
#save the plot with a width of 21 cm and a height of 30 cm
ggsave("../images/ExamplePlot_GGplot.pdf", my_plot, width = 21, height = 30, units = "cm")
If you run the code on your own and compare the two plots generated with pdf() and ggsave(), you will see the plots have different margins despite both being set to A4. Therefore, it is often useful to play with the width and height a bit to get things right.
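To make the two approaches easier to compare, both can be given the same explicit dimensions. The sketch below is one way to do this, assuming you are happy to write to temporary files (the file names come from tempfile() and are not part of the original examples). Note that pdf() expects inches, so the centimetre values are converted.

```r
library(tidyverse)
library(palmerpenguins)

penguins_clean <- drop_na(penguins)

my_plot <- ggplot(penguins_clean, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()

# temporary files so the sketch runs without an images folder
file_pdf <- tempfile(fileext = ".pdf")
file_ggsave <- tempfile(fileext = ".pdf")

# pdf() takes inches, so convert the 21 x 30 cm used with ggsave()
pdf(file = file_pdf, width = 21 / 2.54, height = 30 / 2.54)
print(my_plot) # explicit print() is needed inside functions, loops or source()
dev.off()

ggsave(file_ggsave, my_plot, width = 21, height = 30, units = "cm")

# check that both files were written
file.exists(c(file_pdf, file_ggsave))
```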
Exercise
- Plot the bill_length_mm against the bill_depth_mm and save the plot as pdf using pdf() with default settings.
- Plot the bill_length_mm against the bill_depth_mm and save the plot as pdf using ggsave() with default settings.
- Do the plots look the same?
Click me to see an answer
Question 1
pdf(file = "image1.pdf")
ggplot(penguins_clean, aes(x = bill_length_mm, y = bill_depth_mm )) +
geom_point()
dev.off()
Question 2
ggplot(penguins_clean, aes(x = bill_length_mm, y = bill_depth_mm )) +
geom_point()
ggsave("image2.pdf")
Question 3
The plots might look slightly different as pdf() by default uses 7 inches for width and height while ggsave() uses the size of the current graphics device (i.e. how you see the plot displayed in the plots panel in RStudio). That is why choosing explicit values for the dimensions gives you a bit more control over how the plot looks.
7 Addon: Plotting more than one data set
We can plot different data frames on each layer. To generate an example, let’s fit a model on our data, use it to predict some data and plot the results as a regression line on top of our individual data points.
While going into how to fit models is out of the scope of this tutorial, feel free to look at the links below if you want to go deeper:
- Intro to linear regression
- Some examples for the modelbased package
If you want to follow examples in Addon sections, make sure you have any additional libraries installed before running the code.
#load some libraries
library(modelbased)
#create linear model and predict data
mod <- lm(flipper_length_mm ~ body_mass_g, data = penguins_clean)
pred_data <- estimate_expectation(mod)
#add the original data to our df
pred_data$flipper_length_mm <- penguins_clean$flipper_length_mm
#lets look at our predictions
head(pred_data)
Now, we can add our predictions as a line to our scatter plot by adding an additional layer in which we provide our second dataset with the predicted values:
ggplot(penguins_clean, aes(x = flipper_length_mm, y= body_mass_g)) +
geom_point() +
geom_line(data = pred_data, aes(x = Predicted, y = body_mass_g ))
You can easily use this approach to add layers to your plot for summary statistics, labels for outliers, etc.
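If you do not have modelbased installed, the same two-data-frame idea can be sketched with predict() from base R. This is a swapped-in alternative, not the approach used above: here the model predicts body mass from flipper length so that the fitted values sit on the y-axis, and the names pred_base and predicted_mass are made up for this example.

```r
library(tidyverse)
library(palmerpenguins)

penguins_clean <- drop_na(penguins)

# model body mass (y-axis) as a function of flipper length (x-axis)
mod <- lm(body_mass_g ~ flipper_length_mm, data = penguins_clean)

# second data frame: one fitted value per observed flipper length
pred_base <- data.frame(
  flipper_length_mm = penguins_clean$flipper_length_mm,
  predicted_mass = predict(mod)
)

# the line layer gets its own data and overrides only the y mapping
ggplot(penguins_clean, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point() +
  geom_line(data = pred_base, aes(y = predicted_mass), color = "blue")
```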
8 Visible aesthetics
We already learned how we map variables onto the x and y axes. For example, we mapped the flipper length onto the x and the body mass onto the y aesthetic. We typically provide such mappings in the aes() function.
We can easily add more mappings:
- color: changes the color of points and lines and, in geoms with a filled area, the color of the outline
- fill: changes the fill color
- size: changes the area or radius of points as well as the thickness of lines
- shape: changes the shape of our data points
- alpha: adjusts the transparency
- linetype: changes the dash pattern of a line
- label: allows us to change text on a plot or axes
Let’s quickly talk about the distinction between aesthetics and attributes in the world of ggplot syntax: Aesthetics are defined inside aes() and attributes are used outside aes(). For example, we can map the species to the color aesthetic and therefore control how the species are colored. Let’s look at some examples.
8.1 Colors
First, let’s map species to the color aesthetic:
ggplot(penguins_clean, aes(flipper_length_mm, body_mass_g, color = species)) +
geom_point()
If we do this, we can see that the dots are now colored depending on the species the data was collected from.
In contrast, attributes control how something looks. For example, we can decide to make all dots blue by using a color attribute outside the aes() call:
ggplot(penguins_clean, aes(flipper_length_mm, body_mass_g)) +
geom_point(color = "blue")
We can define the aesthetic when we initialize the plot or in the individual layer. If we define the aesthetic while initializing the plot, it is used for every layer. In contrast, when we define the aesthetic in an individual layer, it applies only to that layer.
For example, we can also map species to color inside the geom_point layer:
ggplot(penguins_clean, aes(flipper_length_mm, body_mass_g)) +
geom_point(aes(color = species))
An example for when it can be useful to control the color in individual layers is shown in a later point in the tutorial but if you want to have a quick look go to Section 11.1.
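As a brief preview of that difference, the sketch below adds a trend line with geom_smooth(), a geom that is not introduced in this part of the tutorial and is used here only for illustration. The names p_global and p_layer are made up for this example.

```r
library(tidyverse)
library(palmerpenguins)

penguins_clean <- drop_na(penguins)

# color set when initializing the plot: every layer is split by species
p_global <- ggplot(penguins_clean, aes(flipper_length_mm, body_mass_g, color = species)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# color set in the point layer only: the trend line ignores species
p_layer <- ggplot(penguins_clean, aes(flipper_length_mm, body_mass_g)) +
  geom_point(aes(color = species)) +
  geom_smooth(method = "lm", formula = y ~ x)

p_global
p_layer
```

The first plot draws one trend line per species, the second a single line through all points.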
Exercise
Make a scatterplot comparing the flipper and bill lengths and assign the islands to different colors by changing the color aesthetic. It is up to you in what layer you assign the color aesthetic and feel free to try both options.
Click me to see an answer
ggplot(penguins_clean, aes(flipper_length_mm, bill_length_mm, color = island)) +
geom_point()
8.2 Sizes
We can also change the size of our data points using the size aesthetic.
ggplot(penguins_clean, aes(flipper_length_mm, body_mass_g, size = bill_depth_mm)) +
geom_point()
8.3 Shapes
Additionally, for categorical data, we can give different shapes to our data points.
Notice that only a limited number of shapes is available, so this works only for datasets with a limited number of categories, such as our dataset where we look at three species.
The default geom_point() uses shape = 19: a solid circle. An alternative is shape = 21: a circle that allows you to use both fill for the inside and color for the outline. This lets you map two aesthetics to each point. All info on what number refers to which shape can be found here.
ggplot(penguins_clean, aes(flipper_length_mm, body_mass_g, shape = species)) +
geom_point()
We can only map a shape to categorical data, and we would get an error if we tried to add shapes for the different years (which, if we remember the output from str(penguins), are stored as integers). If we wanted to add shapes for different years, we would first have to convert the year to a factor.
Feel free to try this without changing the year to a factor first.
ggplot(penguins_clean, aes(flipper_length_mm, body_mass_g, shape = factor(year))) +
geom_point()
Notice: When we run this, we see that the legend title is set to factor(year). This is not what we want and we will discuss in Section 12 how we can change this.
8.4 Alpha
geom_point() has an alpha argument that controls the opacity of the points:
- A value of 1 (the default) means that the points are totally opaque
- A value of 0 means the points are totally transparent (and therefore invisible)
- Values in between specify transparency.
Changing the alpha is a good way to make overlapping data points easier to see.
ggplot(penguins_clean, aes(flipper_length_mm, body_mass_g)) +
geom_point(alpha = 0.2)
Exercises
- Make a scatterplot to compare bill length and bill depth and use different colors for different islands. In the point layer, change the shape attribute to 21.
- Do the same as in exercise 1 but instead of mapping the islands to the colors map them to the fill aesthetic instead. Do you understand what is happening?
- Make a scatterplot to compare bill length and bill depth. Map the shape aesthetic to the island variable. Also change the size attribute to make the symbols bigger and make the dots more transparent (it is up to you what values you choose).
Click me to see an answer
#question 1
ggplot(penguins_clean, aes(bill_length_mm, bill_depth_mm, color = island)) +
geom_point(shape = 21)
#question 2
ggplot(penguins_clean, aes(bill_length_mm, bill_depth_mm, fill = island)) +
geom_point(shape = 21)
Fill controls how a shape is filled and color controls the color of the outline/border. The default points used by geom_point behave a bit differently, since the default shape 19 does not have a separate outline while shape 21 has one.
#question 3
ggplot(penguins_clean, aes(bill_length_mm, bill_depth_mm, shape = island)) +
geom_point(size = 2, alpha = 0.5)
9 Position adjustments
Position adjustments apply minor tweaks to the position of elements within a layer. Three position adjustments are primarily useful for geom_point (other geoms can have other adjustments, more on that a bit later):
- position_nudge(): moves points by a fixed offset
- position_jitter(): adds random noise to every position
- position_jitterdodge(): dodges points within groups, then adds a little random noise
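Of these three adjustments, position_nudge() is the simplest, so here is a minimal sketch of it. The offsets (2 mm and 100 g) and the name p_nudge are arbitrary choices made for this example; the original points are drawn in grey so that the shift is visible.

```r
library(tidyverse)
library(palmerpenguins)

penguins_clean <- drop_na(penguins)

# original points in grey, nudged copy in blue;
# the offsets (2 mm, 100 g) are arbitrary values picked for this sketch
p_nudge <- ggplot(penguins_clean, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(color = "grey60") +
  geom_point(color = "blue", position = position_nudge(x = 2, y = 100))

p_nudge
```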
By default, ggplot2 uses position = "identity" when we are using geom_point. If you want to know the default values used by a function, you can look them up by typing ?geom_point.
Therefore, if we write geom_point(position = "identity") we will get exactly the same result as if we only wrote geom_point():
#use the default but add the default option for the position argument
ggplot(penguins_clean, aes( x =flipper_length_mm , y = body_mass_g , color = species)) +
geom_point(position = "identity")
If you look closely at the graph, there is a small issue with the data points: several points overlap but we cannot see how many. We have seen before that changing the alpha helps to show the data better. Another way to visualize this a bit better is to add random noise, a process that is also called jittering. When we use the jitter position adjustment we apply random noise to each position.
The amount of noise added is very small, so you might need to look closely at what is happening.
ggplot(penguins_clean, aes( x =flipper_length_mm , y = body_mass_g , color = species)) +
geom_point(position = "jitter")
We can also define the exact level of noise using the position_jitter() function:
ggplot(penguins_clean, aes( x =flipper_length_mm , y = body_mass_g , color = species)) +
geom_point(position = position_jitter(width = 0.4))
Exercise
Generate a scatterplot comparing bill depth and bill length. Map the species to the color aesthetic and in geom_point() use position_jitter() to add random noise (adjust the width to 0.8). Rerun the code several times and keep an eye on what the dots do. Do you notice something?
Click me to see an answer
ggplot(penguins_clean, aes( x =bill_length_mm, y = bill_depth_mm, color = species)) +
geom_point(position = position_jitter(width = 0.8))
A small issue with jittering is that if we use the code above, we can not exactly reproduce the plot. Every time we run the code and regenerate the plot, the points will be at slightly different positions, since the noise is randomly generated each time we plot. However, we can control this, by setting a fixed value for our random seed. Now, every time you run the plot below, we will get the exact same output.
ggplot(penguins_clean, aes( x =bill_length_mm, y = bill_depth_mm, color = species)) +
geom_point(position = position_jitter(width = 0.8, seed = 136))
10 Geoms
Geometric objects, or geoms for short, perform the actual rendering of the layer, controlling the type of plot that you create. Some useful geoms are:
- geom_point() produces a scatterplot
- geom_bar() and geom_col() make bar charts
- geom_boxplot() visualizes five summary statistics: the median, two hinges and two whiskers as well as outliers
- geom_violin() shows the density of values in each group
- geom_line() makes a line plot. geom_line() connects points from left to right while geom_path() is similar but connects points in the order they appear in the data
- geom_area() draws an area plot, which is a line plot filled to the y-axis (filled lines). Multiple groups will be stacked on top of each other
- geom_rect(), geom_tile() and geom_raster() draw rectangles. geom_rect() is parameterised by the four corners of the rectangle, xmin, ymin, xmax and ymax. geom_tile() is exactly the same, but parameterised by the center of the rect and its size, x, y, width and height. geom_raster() is a fast special case of geom_tile() used when all the tiles are the same size
- geom_errorbar() allows us to add error bars
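Since several geoms from this list are often combined, here is a minimal sketch that layers geom_errorbar() on top of geom_col(). The summary table mass_summary and its column names are made up for this example, and using the standard deviation for the error bars is an arbitrary choice.

```r
library(tidyverse)
library(palmerpenguins)

penguins_clean <- drop_na(penguins)

# mean and standard deviation of body mass per species
mass_summary <- penguins_clean |>
  group_by(species) |>
  summarize(mean_mass = mean(body_mass_g), sd_mass = sd(body_mass_g))

# bars show the mean, error bars show mean +/- one standard deviation
ggplot(mass_summary, aes(x = species, y = mean_mass)) +
  geom_col() +
  geom_errorbar(aes(ymin = mean_mass - sd_mass, ymax = mean_mass + sd_mass), width = 0.2)
```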
Below we will look at some examples starting with the barplot.
10.1 Barplots
Classical barplots have a categorical x-axis (i.e. we can map the species to our x-axis) and we can generate barplots using two geoms:
- geom_bar(): counts the number of cases at each x-position
- geom_col(): plots the actual values
10.1.1 Geom_bar
Let’s first plot the number of observations we have for each penguin species. Here, geom_bar() does the counting for us and we only need to map the species to the x-variable:
ggplot(penguins_clean, aes(x = species)) +
geom_bar()
We can also plot the counts in case they are already part of our data frame. To do this, let’s first create a summary for our data using the tidyverse. Specifically, we want to count the number of observations per species.
peng_summary <- penguins_clean |>
  #select the variables we want to work with
  select(species, flipper_length_mm) |>
  #group our data by species
  group_by(species) |>
  #get summary stats
  summarize(observations = n()) |>
  arrange(desc(observations))
#view the data
peng_summary
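The same summary table can be produced more compactly with count() from dplyr, which groups, counts and sorts in one call. This is an alternative sketch rather than the approach used above; the name peng_counts is made up for this example.

```r
library(tidyverse)
library(palmerpenguins)

penguins_clean <- drop_na(penguins)

# count() groups, counts and (with sort = TRUE) orders in one step
peng_counts <- penguins_clean |>
  count(species, name = "observations", sort = TRUE)

peng_counts
```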
Now, let’s plot the observations. To do so we need to make some changes:
- We not only map the species to the x-axis but also the number of observations to our y-axis
- We add stat = "identity" to the geom_bar function. A statistical transformation, or stat, transforms the data, typically by summarizing it in some manner. By default, geom_bar uses stat = "count" to automatically count the number of observations without us having to write this out. If we want to provide already existing values instead of letting geom_bar count for us, we have to change this argument to stat = "identity".
ggplot(peng_summary, aes(x = species, y = observations)) +
geom_bar(stat = "identity")
10.1.2 Geom_col
Instead of using geom_bar we can use geom_col. geom_col() won’t try to aggregate the data by default; it expects us to already have the y values calculated and uses these values directly.
ggplot(peng_summary, aes(x = species, y = observations)) +
geom_col()
10.1.3 Addon: Ordering data in plots
Now, another thing to keep in mind: when we generated the summary statistics, we ordered the data based on the number of observations, but in the plot we see that the bars are ordered alphabetically by species name.
There are different ways to change the order of our bar plot, but the main thing we need to accomplish is to modify the factor levels of our ordering column. We can easily view the default order of our species by typing:
levels(peng_summary$species)
[1] "Adelie" "Chinstrap" "Gentoo"
We can re-order the factor levels using base R or using the forcats library, which comes with the tidyverse. Let us use the forcats library to bring some order into our plot.
We can order the factors in different ways; for now let us use the fct_reorder function that is part of the forcats library. For fct_reorder, we need to tell the function our factor variable (“species”) and the values we want to reorder it by (the column corresponding to the y-axis, i.e. “observations”).
When using fct_reorder, the x-label gets renamed to fct_reorder(species, observations), which is … not very pretty. In order to change this we can use the labs function to change the x-axis label to Species.
If you are unsure what is happening in the code when we use the labs function, run the code without labs(x = "Species"). We will explain how to customize labels and other parts of the plot in more detail in Section 12.
ggplot(peng_summary, aes(x = fct_reorder(species, observations), y = observations)) +
geom_col() +
labs(x = "Species")
We can easily reverse the order by adding an extra argument, .desc = TRUE:
ggplot(peng_summary, aes(x = fct_reorder(species, observations, .desc = TRUE), y = observations)) +
geom_col() +
labs(x = "Species")
If we don’t calculate the number of observations and want to use geom_bar, we can reorder by using another function, fct_infreq, as follows:
ggplot(penguins_clean, aes(fct_infreq(species))) +
geom_bar()
Or in reverse:
ggplot(penguins_clean, aes(fct_rev(fct_infreq(species)))) +
geom_bar()
There are more ways that you can use the forcats library to order data, but that’s out of the scope of this tutorial. For more, feel free to start by having a look at the forcats documentation.
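For completeness, the base R route mentioned at the start of this section can be sketched with reorder(), which sorts factor levels by a second variable. The column name species_ordered is made up for this example, and the summary table is rebuilt here so that the block runs on its own.

```r
library(tidyverse)
library(palmerpenguins)

peng_summary <- drop_na(penguins) |>
  count(species, name = "observations")

# reorder() from base R sorts the factor levels by a second variable
peng_summary$species_ordered <- reorder(peng_summary$species, peng_summary$observations)
levels(peng_summary$species_ordered)

ggplot(peng_summary, aes(x = species_ordered, y = observations)) +
  geom_col() +
  labs(x = "Species")
```

The levels are now ordered from the species with the fewest observations to the one with the most.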
10.1.4 Positions in barplots
We have three different ways to adjust the positions of barplots:
- position_stack(): stack overlapping bars (or areas) on top of each other
- position_fill(): stack overlapping bars, scaling so the top is always at 1
- position_dodge(): place overlapping bars side-by-side
Let’s first generate a barplot comparing the observations per year and mapping the sex to the fill aesthetic:
ggplot(penguins_clean, aes(x = year, fill = sex)) +
geom_bar()
By default, geom_bar produces a stacked barplot, since it uses position = "stack" in the background. We can change this behavior and plot proportional values by using position = "fill" instead:
ggplot(penguins_clean, aes(x = year, fill = sex)) +
geom_bar(position = "fill")
To plot the values next to each other we can “dodge” the bars:
ggplot(penguins_clean, aes(x = year, fill = sex)) +
geom_bar(position = "dodge")
Exercise
Generate a dodged barplot comparing the number of observations for different penguin species across years.
If you followed the section on ordering observations, consider also changing the order in the plot (i.e. order from large number of counts to small).
Click me to see an answer
ggplot(penguins_clean, aes(x = year, fill = fct_infreq(species))) +
geom_bar(position = "dodge")
10.2 Boxplots
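A boxplot is a standardized way of displaying data based on five summary statistics: the median, the first and third quartiles (the hinges) and two whiskers. In geom_boxplot, the whiskers extend to the most extreme data point within 1.5 times the interquartile range of the hinges, and points beyond the whiskers are shown individually as outliers.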
Let’s first compare the summary statistics of the body mass across different penguin species:
ggplot(penguins_clean, aes(x = species, y = body_mass_g)) +
geom_boxplot()
We could easily add the individual data points to this plot by adding another layer. To avoid overplotting, we can use geom_point but adjust the position with position_jitter.
ggplot(penguins_clean, aes(x = species, y = body_mass_g)) +
geom_boxplot() +
geom_point(position = position_jitter())
We can easily add another dimension, for example by mapping the year to the fill aesthetic. When we add a factor aesthetic, geom_boxplot will automatically generate dodged boxplots.
ggplot(penguins_clean, aes(x = species, y = body_mass_g, fill = factor(year))) +
geom_boxplot()
If we want to add individual data points to this plot, we need to add the position adjustment position_jitterdodge in geom_point to ensure that we simultaneously dodge and jitter our points:
ggplot(penguins_clean, aes(x = species, y = body_mass_g, fill = factor(year))) +
geom_boxplot() +
geom_point(position = position_jitterdodge())
When generating a dodged boxplot, the width is automatically calculated by the total width of all elements in a position. For example, if we add an aesthetic for the island (only Adelie is found on 3 different islands) then the individual width of each boxplot is not the same:
ggplot(penguins_clean, aes(x = species, y = body_mass_g, fill = island)) +
geom_boxplot()
We can adjust this by preserving the width of a single element instead. Don’t forget, if you are ever unsure about the behavior of a plot, every detail of a function can be checked with ?geom_boxplot.
ggplot(penguins_clean, aes(x = species, y = body_mass_g, fill = island)) +
geom_boxplot(position = position_dodge(preserve = "single"))
Exercise
Boxplots are useful summaries, but hide the shape of the distribution. Try plotting the species on the x-axis, body_mass_g on the y-axis and dodge by year (beware that years need to be factors for this to work). To show the shape of the distribution, use geom_violin instead of geom_boxplot. Additionally, show the individual data points and change the transparency a bit in order to make the points stand out less:
Click me to see an answer
ggplot(penguins_clean, aes(x = species, y = body_mass_g, fill = factor(year))) +
geom_violin() +
geom_point(alpha = 0.3, position = position_jitterdodge())
Comment:
For violin plots, it is a bit harder to get the individual points inside the violins. If you are interested in making this work, we can recommend looking at the ggbeeswarm library.
10.3 Histograms
A histogram displays numerical data by grouping data into “bins” of equal width. We only need to provide a single aesthetic, x, which needs to be a continuous variable. For example, let’s look at the distribution of the flipper length measurements using a histogram:
ggplot(penguins_clean, aes( x = flipper_length_mm )) +
geom_histogram()
By default, geom_histogram() splits the data into 30 bins. If we want to control the size of the bins ourselves, we can set the width (here 5 mm per bin) like this:
ggplot(penguins_clean, aes( x =flipper_length_mm )) +
geom_histogram(binwidth = 5)
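To check how many bars a given binwidth produces, we can look at the data that ggplot computes for the layer. This is a minimal sketch using layer_data() from ggplot2; penguins_clean is rebuilt with na.omit() as an assumption, and the names p_default, p_wide, n_default and n_wide are made up for this example.

```r
library(ggplot2)
library(palmerpenguins)

# Assumption: penguins_clean is the penguins data without missing values
if (!exists("penguins_clean")) penguins_clean <- na.omit(penguins)

# bins = 30 is the value geom_histogram() falls back to when nothing is specified
p_default <- ggplot(penguins_clean, aes(x = flipper_length_mm)) +
  geom_histogram(bins = 30)
p_wide <- ggplot(penguins_clean, aes(x = flipper_length_mm)) +
  geom_histogram(binwidth = 5)

# layer_data() returns the computed bins, one row per bar
n_default <- nrow(layer_data(p_default))
n_wide <- nrow(layer_data(p_wide))
c(default = n_default, binwidth_5 = n_wide)
```

Fewer, wider bins give a smoother picture but can hide detail, so it is worth trying a few values.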
Some things to keep in mind when visualizing histograms:
- Ensure that we set meaningful bin widths for the data
- Don’t show spaces between the bars
- X-labels should fall between the bars as they represent intervals and not actual values
The last point we can control with the center argument:
ggplot(penguins_clean, aes( x =flipper_length_mm )) +
geom_histogram(binwidth = 2, center = 0.05)
10.3.1 Use aesthetics in histograms
Same as for other geoms we can map a variable, such as species, to different aesthetics:
ggplot(penguins_clean, aes( x =flipper_length_mm, fill = species)) +
geom_histogram(binwidth = 2, center = 0.05)
However, a problem with this representation is that it is not immediately clear whether the bars overlap or are stacked on top of each other.
We can change the plot by using positional adjustments:
- The default position of geom_histogram() is "stack", i.e. stacked bars. We can change this with the position argument.
- We can also dodge the bars, i.e. place the bars of the different categories side by side.
ggplot(penguins_clean, aes( x =flipper_length_mm, fill = species)) +
geom_histogram(binwidth = 2, center = 0.05, position = "dodge")
The fill position normalizes each bin so that it shows the proportion of observations from each group in that bin:
ggplot(penguins_clean, aes( x =flipper_length_mm, fill = species)) +
geom_histogram(binwidth = 2, center = 0.05, position = "fill")
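One more position that can be useful here is "identity", which draws the bars of every species from zero so that they really do overlap. The sketch below combines it with some transparency so the overlapping bars stay visible; penguins_clean is rebuilt with na.omit() as an assumption, and p_overlap is a name made up for this example.

```r
library(ggplot2)
library(palmerpenguins)

# Assumption: penguins_clean is the penguins data without missing values
if (!exists("penguins_clean")) penguins_clean <- na.omit(penguins)

# position = "identity" turns off stacking; alpha makes overlapping bars visible
p_overlap <- ggplot(penguins_clean, aes(x = flipper_length_mm, fill = species)) +
  geom_histogram(binwidth = 2, position = "identity", alpha = 0.5)
p_overlap
```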
Exercise
Generate a histogram looking at the body mass across different islands. Change the transparency to 0.5.
Click me to see an answer
ggplot(penguins_clean, aes( x =body_mass_g, fill = island )) +
geom_histogram(alpha = 0.5)
10.4 Line plots
Line plots are ideal if we want to plot time series, such as the beaver data. The beaver data comes with two datasets, beaver1 and beaver2. These datasets record the temperature and activity of two different beavers over time.
Let’s first have a look at our data:
head(beaver1)
str(beaver1)
'data.frame': 114 obs. of 4 variables:
$ day : num 346 346 346 346 346 346 346 346 346 346 ...
$ time : num 840 850 900 910 920 930 940 950 1000 1010 ...
$ temp : num 36.3 36.3 36.4 36.4 36.5 ...
$ activ: num 0 0 0 0 0 0 0 0 0 0 ...
summary(beaver1)
day time temp activ
Min. :346.0 Min. : 0.0 Min. :36.33 Min. :0.00000
1st Qu.:346.0 1st Qu.: 932.5 1st Qu.:36.76 1st Qu.:0.00000
Median :346.0 Median :1415.0 Median :36.87 Median :0.00000
Mean :346.2 Mean :1312.0 Mean :36.86 Mean :0.05263
3rd Qu.:346.0 3rd Qu.:1887.5 3rd Qu.:36.96 3rd Qu.:0.00000
Max. :347.0 Max. :2350.0 Max. :37.53 Max. :1.00000
We have 114 temperature and activity observations collected over 2 days at different time intervals. If we did the same for beaver2, we would see a very similar looking dataset.
Let’s start by plotting the temperature records for our first beaver over the whole time interval:
ggplot(beaver1, aes(x = time, y = temp)) +
geom_line()
Also here, we can add an aesthetic for example by mapping the activity measurements to the color aesthetic.
ggplot(beaver1, aes(x = time, y = temp, color = activ)) +
geom_line()
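One caveat with the plots above: the time column stores the clock time as a number in HHMM form (840 means 08:40) and the measurements span two days, so the records from the second day are drawn at the left end of the x-axis. A minimal sketch of one way around this is to convert day and time into minutes elapsed since the start of the first day; the names beaver1_t, minutes_of_day and elapsed_min are made up for this example.

```r
library(ggplot2)

beaver1_t <- beaver1

# hours are the leading digits of time, minutes the last two digits
minutes_of_day <- (beaver1_t$time %/% 100) * 60 + beaver1_t$time %% 100

# add a full day (1440 minutes) for every day after the first one
beaver1_t$elapsed_min <- (beaver1_t$day - min(beaver1_t$day)) * 1440 + minutes_of_day

ggplot(beaver1_t, aes(x = elapsed_min, y = temp)) +
  geom_line()
```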
10.4.1 Line plots for several species
We can easily plot the data for both our beavers in one plot. To do this, let us first combine the data for beaver 1 and beaver 2:
#add a new column for the specimen for beaver1 and 2
beaver1$species = "beaver1"
beaver2$species = "beaver2"
#combine the two datasets
beaver_all <- rbind(beaver1, beaver2)
#control the number of observations
dim(beaver_all)
[1] 214 5
Now we can plot again and distinguish beaver1 and 2 by different types of lines:
ggplot(beaver_all, aes(x = time, y = temp, linetype = species)) +
geom_line()
Exercise
Compare the changes in temperature over time for the two beaver species by mapping the species to the color aesthetic. Make the lines a bit thicker using the linewidth argument:
Click me to see an answer
ggplot(beaver_all, aes(x = time, y = temp, color = species)) +
geom_line(linewidth = 1.5)
10.5 Geom_smooth and adding trendlines
geom_smooth() adds a smooth trend curve and as such aids the eye in seeing patterns in the presence of overplotting.
Let’s add a trend curve to a scatterplot, where we compare the flipper length and body mass:
ggplot(penguins_clean, aes(flipper_length_mm, body_mass_g)) +
geom_point() +
geom_smooth()
We can also change the method used to calculate the trend curve. By default, method = NULL is used, where the smoothing method is chosen based on the size of the largest group.
We should also see a small message appearing when we generate the plot that tells us the method chosen, i.e.: geom_smooth() using method = ‘loess’ and formula = ‘y ~ x’.
Loess is a non-parametric smoothing algorithm that is used by default when we have fewer than 1000 observations. It works by fitting a weighted local regression within a window that slides along the x-axis.
We can change the function and for example use a linear model like this:
ggplot(penguins_clean, aes(flipper_length_mm, body_mass_g)) +
geom_point() +
geom_smooth(method = "lm")
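The straight line drawn by method = "lm" comes from an ordinary linear model, so we can inspect the same fit with lm() directly. This is a minimal sketch; penguins_clean is rebuilt with na.omit() as an assumption, and fit is a name made up for this example.

```r
library(palmerpenguins)

# Assumption: penguins_clean is the penguins data without missing values
if (!exists("penguins_clean")) penguins_clean <- na.omit(penguins)

# Fit body mass as a linear function of flipper length
fit <- lm(body_mass_g ~ flipper_length_mm, data = penguins_clean)

# Intercept and slope (grams of body mass per mm of flipper length)
coef(fit)
```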
If you want to know more about how to change things and the math used, check the help function with ?geom_smooth.
We can also calculate a trend line whilst using the color aesthetic. In the example below, we remove the confidence band around the line that was shown before by using se = FALSE. This makes our graph a little less cluttered.
ggplot(penguins_clean, aes(x =bill_length_mm, y = body_mass_g, color = species )) +
geom_point() +
geom_smooth(se = FALSE, method = "lm")
By default, each model is bound to the values of its own group. We can change this by setting fullrange = TRUE to make predictions over the full range of the data. The plot below is not very pretty, and we will later see how to improve this.
ggplot(penguins_clean, aes(x =bill_length_mm, y = body_mass_g, color = species )) +
geom_point() +
geom_smooth(method = "lm", fullrange = TRUE)
10.5.1 Addon: Adding math
This is a bit outside of the scope of this tutorial, so we won’t cover details here. However, since many of you might want to know how to add a regression equation and R2: you can do this quite easily with the help of another package, ggpubr, and two of its functions, stat_cor and stat_regline_equation.
If you want to test this, make sure that you have installed ggpubr first.
library(ggpubr)
ggplot(penguins_clean, aes(flipper_length_mm, body_mass_g)) +
geom_point() +
geom_smooth(method = "lm") +
#plot both the r2 (rr) and p-value next to each other, separated by a comma
#label.y is used to adjust the position of the text in our plot
stat_cor(label.y = 6200, aes(label = paste(..rr.label.., ..p.label.., sep = "~`,`~"))) +
#plot the regression equation
stat_regline_equation(label.y = 6000)
Exercise
Generate a scatterplot and plot bill_length_mm against bill_depth_mm. Add a trend line using the lm method. If you looked at the information in the addon section, also add the regression equation and R2. Based on visual inspection (and, if you added it, the R2), would you say we see a strong correlation?
Click me to see an answer
ggplot(penguins_clean, aes(bill_length_mm, bill_depth_mm )) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm") +
stat_cor(label.y = 22, aes(label = paste(..rr.label.., ..p.label.., sep = "~`,`~"))) +
stat_regline_equation(label.y = 21.5)
If we look at the plot below, we can see that the data points are very far from the line and that the R2 is close to 0 suggesting a weak, negative association.
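If you do not have ggpubr installed, the slope and R2 can also be checked with base R. This is a minimal sketch; penguins_clean is rebuilt with na.omit() as an assumption, and fit_bill, slope and r2 are names made up for this example.

```r
library(palmerpenguins)

# Assumption: penguins_clean is the penguins data without missing values
if (!exists("penguins_clean")) penguins_clean <- na.omit(penguins)

# Linear model of bill depth as a function of bill length
fit_bill <- lm(bill_depth_mm ~ bill_length_mm, data = penguins_clean)

slope <- coef(fit_bill)[["bill_length_mm"]]  # sign gives the direction of the trend
r2 <- summary(fit_bill)$r.squared            # share of variance explained by the line
c(slope = slope, r2 = r2)
```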
Going into the stats is outside the scope of this tutorial but feel free to check the links below for more detail:
10.6 Annotations layer
annotate() adds a geom to a plot but unlike a typical geom function, the properties of the geoms are not mapped from variables of a data frame, but are instead passed in as vectors. This is useful for adding small annotations (such as text labels) or if you have your data in vectors, and for some reason don’t want to put them in a data frame.
We can add text to our scatter plots like this:
ggplot(penguins_clean, aes(bill_length_mm, bill_depth_mm )) +
geom_point(alpha = 0.5) +
annotate("text", x = 58, y = 10, label = "some text")
We also can add rectangles to highlight things:
ggplot(penguins_clean, aes(bill_length_mm, bill_depth_mm )) +
geom_point(alpha = 0.5) +
annotate("rect", xmin = 50, xmax = 60, ymin = 10, ymax = 12, alpha = .2)
We can also add line segments:
ggplot(penguins_clean, aes(bill_length_mm, bill_depth_mm )) +
geom_point(alpha = 0.5) +
annotate("segment", x = 35, xend = 50, y = 12.5, yend = 22, colour = "purple")
We can add both a text and an arrow to highlight a specific data point. To add an arrow, we use the segment geom, but we need some more syntax to convert the segment into an arrow. To do this, we can use the arrow function from the grid package.
ggplot(penguins_clean, aes(bill_length_mm, bill_depth_mm )) +
geom_point(alpha = 0.5) +
annotate("text", x = 58, y = 10, label = "outlier") +
annotate("segment", x = 58, y = 10.5, xend = 59.6, yend = 16.8, linewidth = 0.7,
arrow = arrow(type = "closed", length = unit(0.02, "npc")))
11 Facets
Facets partition a plot into a matrix of panels. Each panel shows a different subset of the data.
- facet_grid() will produce a grid of plots for each combination of variables that you specify, even if some plots are empty.
- facet_wrap() will only produce plots for the combinations of variables that have values, which means it won’t produce any empty plots.
Let’s start by creating a histogram while mapping the fill aesthetic to the species variable:
ggplot(penguins_clean, aes(bill_length_mm, fill = species)) +
geom_histogram()
We have discussed before that, plotted this way, we don’t know for sure whether the bars overlap or are stacked, and we discussed ways to better visualize this. A new option we want to discuss now is faceting.
We can facet by one variable in the vertical direction, i.e. by rows, by adding another layer with facet_grid():
ggplot(penguins_clean, aes(bill_length_mm)) +
geom_histogram() +
facet_grid(rows = vars(species))
Before version 3.0.0, ggplot2 used formulas to specify how plots are faceted. If you encounter facet_grid() or facet_wrap() code containing ~, it uses this older formula notation. The vars() notation was introduced to give the user more flexibility when writing functions.
So the above could also be written as:
ggplot(penguins_clean, aes(bill_length_mm)) +
geom_histogram() +
facet_grid(species ~ .)
While the older way of writing the code is a lot shorter, it is also not as explicit in terms of what gets plotted where.
We can also change the plot to horizontal direction, i.e. by column:
ggplot(penguins_clean, aes(bill_length_mm)) +
geom_histogram() +
facet_grid(cols = vars(species))
We can also facet using two variables at the same time. For example, we can create faceted rows for both species and year:
ggplot(penguins_clean, aes(bill_length_mm)) +
geom_histogram() +
facet_grid(rows = vars(species, year))
Or we can facet by two variables, using one for the rows and the other for the columns:
ggplot(penguins_clean, aes(bill_length_mm)) +
geom_histogram() +
facet_grid(rows = vars(species), cols = vars(year))
facet_grid by default assumes that we provide the rows first and then the columns. So a shorter way to write this is:
ggplot(penguins_clean, aes(bill_length_mm)) +
geom_histogram() +
facet_grid(vars(species), vars(year))
By default, all the panels have the same scales and scales = "fixed" is used in the background. While for a lot of things it makes sense to have the same scale, we can also make scales independent by setting scales to "free", "free_x", or "free_y".
ggplot(penguins_clean, aes(bill_length_mm)) +
geom_histogram() +
facet_grid(cols = vars(year), rows = vars(species), scales = "free")
11.1 Addon: Playing with facets
This is just an example showing the amount of flexibility you have with ggplot. For example, below we can use different layers if we wanted to facet by species BUT still wanted to show all data points in each subplot:
#duplicate our dataframe but remove the column we want to use to facet
#this way facet_grid won't separate the points by species for this dataframe
df2 <- penguins_clean |>
select(-species)
ggplot(penguins_clean, aes(x =bill_length_mm, y = body_mass_g )) +
#add all datapoints to the facets by using our second dataframe
geom_point(data = df2, color = "grey", alpha = 0.5) +
#assign the species to the color aesthetic
geom_point(aes(color = species)) +
#facet the data
facet_grid(cols = vars(species))
Exercise:
- Generate a scatterplot comparing the bill length and body mass. Add a trendline (color the trend line in grey) and use facet_grid to create subplots for different species. It’s up to you whether you prefer a horizontal or vertical orientation.
- Do the same as above but allow each facet to use a different scale.
- Generate a histogram showing the body weight. Use facet_grid to generate subplots for different islands and species.
- Do the same as in exercise 3 but use facet_wrap instead. Beware, facet_wrap uses a slightly different syntax; try to figure out what you need to change by using the help function. Once you have generated the plot, do you see how they behave differently?
Click me to see an answer
#question 1
ggplot(penguins_clean, aes(x =bill_length_mm, y = body_mass_g )) +
geom_point() +
geom_smooth(method = "lm", color = "grey") +
facet_grid(cols = vars(species))
#question 2
ggplot(penguins_clean, aes(x =bill_length_mm, y = body_mass_g )) +
geom_point() +
geom_smooth(method = "lm", color = "grey") +
facet_grid(cols = vars(species), scales = "free")
#question3
ggplot(penguins_clean, aes(body_mass_g)) +
geom_histogram() +
facet_grid(cols = vars(species), rows = vars(island))
#question4
ggplot(penguins_clean, aes(body_mass_g)) +
geom_histogram() +
facet_wrap(vars(species, island))
12 Modifying scales and axes
12.1 Scale functions and transformations
Scales control the details of how data values are translated to visual properties. We can override the default scales to tweak details like the axis labels or legend keys.
Some of the scale functions we can use to modify the axes and the other aesthetics are:
- scale_x_*()
- scale_y_*()
- scale_color_*()
- scale_fill_*()
- scale_shape_*()
- scale_linetype_*()
- scale_size_*()
Importantly, we need to define the scales based on the type of data we have, continuous or discrete:
- Discrete variables take a limited set of distinct values, such as categories (e.g. species).
- Continuous variables represent measurable amounts (e.g. body weight).
We distinguish between these two options by appending either continuous or discrete to the scale_ function name:
- scale_x_continuous()
- scale_color_discrete()
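If we are unsure which kind of scale a column needs, we can check how it is stored. This is a minimal sketch using base R; penguins_clean is rebuilt with na.omit() as an assumption, so the column types are those of the original palmerpenguins data.

```r
library(palmerpenguins)

# Assumption: penguins_clean is the penguins data without missing values
if (!exists("penguins_clean")) penguins_clean <- na.omit(penguins)

# numeric columns are mapped to continuous scales
mass_is_numeric <- is.numeric(penguins_clean$body_mass_g)

# factor columns are mapped to discrete scales
species_is_factor <- is.factor(penguins_clean$species)

# year is stored as a number, so it is continuous unless we wrap it in factor()
year_is_numeric <- is.numeric(penguins_clean$year)
year_as_factor <- is.factor(factor(penguins_clean$year))

c(mass_is_numeric, species_is_factor, year_is_numeric, year_as_factor)
```

This is also why we wrapped year in factor() when we used it to dodge the boxplots earlier.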
12.2 Changing the text for axes and legends
Let’s start by changing the text label of the x- and y-axis as well as of the legend that gets generated when we use the color aesthetic. Notice how the x- and y-axis each get their own scale and how we have to use both continuous and discrete scales?
ggplot(penguins_clean, aes( x =flipper_length_mm , y = body_mass_g , color = species)) +
geom_point(position = "jitter") +
scale_y_continuous("Body mass (g)") +
scale_x_continuous("Flipper length (mm)") +
scale_color_discrete("Penguin species")
12.3 Scale transformations
When working with continuous data, the default is to map linearly from the data space onto the aesthetic space. It is possible to override this default using transformations. Every continuous scale takes a trans argument by which we can change the transformation. Built-in functions for axis transformations are:
- scale_x_log10(), scale_y_log10(): for log10 transformation
- scale_x_sqrt(), scale_y_sqrt(): for sqrt transformation
- scale_x_reverse(), scale_y_reverse(): to reverse coordinates
- coord_trans(x = "log10", y = "log10"): is different to scale transformations in that it occurs after statistical transformation. Possible values for x and y are “log2”, “log10”, “sqrt”, etc.
- scale_x_continuous(trans = 'log2'), scale_y_continuous(trans = 'log2'): another allowed value for the argument trans is ‘log10’
In the example below, we reverse the scale for the y-axis:
ggplot(penguins_clean, aes( x =flipper_length_mm , y = body_mass_g , color = species)) +
geom_point(position = "jitter") +
scale_y_reverse()
12.4 Changing the axis limits
Limits describe the range of a scale. There are different functions to set axis limits:
- xlim() and ylim()
- expand_limits()
- scale_x_continuous() and scale_y_continuous()
12.4.1 xlim and ylim
Let’s start by changing the range of the y-axis using ylim and let the range go from 0 to 6500:
ggplot(penguins_clean, aes( x =flipper_length_mm , y = body_mass_g , color = species)) +
geom_point(position = "jitter") +
ylim(0,6500)
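Keep in mind that ylim() sets the limits of the scale: observations outside the range are dropped before anything is drawn, and ggplot2 prints a warning about the removed rows. If we only want to zoom in without discarding data, coord_cartesian() is an alternative. The sketch below compares the two; the range 4000 to 6500 is chosen only for illustration, penguins_clean is rebuilt with na.omit() as an assumption, and the object names are made up for this example.

```r
library(ggplot2)
library(palmerpenguins)

# Assumption: penguins_clean is the penguins data without missing values
if (!exists("penguins_clean")) penguins_clean <- na.omit(penguins)

p <- ggplot(penguins_clean, aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point()

# scale limits: penguins outside 4000-6500 g are removed (with a warning when drawn)
p_limits <- p + ylim(4000, 6500)

# zoom only: all rows are kept, the view is simply cropped
p_zoom <- p + coord_cartesian(ylim = c(4000, 6500))

# number of observations that fall outside the chosen range
n_outside <- sum(penguins_clean$body_mass_g < 4000 | penguins_clean$body_mass_g > 6500)
n_outside
```

The difference matters most for layers that compute something from the data, such as geom_smooth() or geom_boxplot(), because dropped rows change the result.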
12.4.2 expand_limits
expand_limits() can be used to quickly make sure that the axes include a given value; here we use it to include the origin (0, 0).
ggplot(penguins_clean, aes( x =flipper_length_mm , y = body_mass_g , color = species)) +
geom_point(position = "jitter") +
expand_limits(x=0, y=0)
12.4.3 scale_x/y_continuous
We can also use scale_x_continuous() and scale_y_continuous() to change the x- and y-axis limits, respectively. Using these functions is usually beneficial since we can control more details such as:
- name : x or y axis labels
- breaks : to control the breaks in the guide (axis ticks, grid lines, …). Among the possible values, there are :
  - NULL : hide all breaks
  - waiver() : the default break computation
  - a character or numeric vector specifying the breaks to display
- labels : labels of axis tick marks. Allowed values are :
  - NULL for no labels
  - waiver() for the default labels
  - character vector to be used for break labels
- limits : a numeric vector specifying x or y axis limits (min, max)
- trans for axis transformations. Possible values are “log2”, “log10”, …
Let’s start by controlling the limits and manually set the lower limit to 0 and the upper limit to the maximum value :
ggplot(penguins_clean, aes( x =flipper_length_mm , y = body_mass_g , color = species)) +
geom_point(position = "jitter") +
scale_x_continuous(limits = c(0,250)) +
scale_y_continuous(limits = c(0,7000))
We can also use basic math operations to automatically create the limits for us. Below, we calculate the maximum body mass and use it as the upper limit. Notice: we add + 1 to give the plot a bit more room to fit the last data point.
ggplot(penguins_clean, aes( x =flipper_length_mm , y = body_mass_g , color = species)) +
geom_point(position = "jitter") +
scale_x_continuous(limits = c(0,250)) +
scale_y_continuous(limits = c(0, max(penguins_clean$body_mass_g) + 1 ))
If we want to control where the tick marks are placed, we use the breaks argument. Here we create the breaks with seq(), whose arguments are: start, end, and step size. Let us start by having a finer step size for the x-axis:
ggplot(penguins_clean, aes( x =flipper_length_mm , y = body_mass_g , color = species)) +
geom_point(position = "jitter") +
scale_x_continuous(limits = c(0,250), breaks = seq(0,250,25)) +
scale_y_continuous(limits = c(0,7000))
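If you are unsure which tick positions a seq() call produces, you can run it on its own first. This is a small base R sketch; x_breaks is a name made up for this example.

```r
# from = start, to = end, by = step size
x_breaks <- seq(from = 0, to = 250, by = 25)

x_breaks          # the tick positions that are passed to breaks
length(x_breaks)  # how many ticks we get
```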
We can also control the padding around the plot limits with the expand argument. With expand = c(0,0) the padding is removed. To see what is happening, compare the 0,0 position between the plot above and below.
ggplot(penguins_clean, aes( x =flipper_length_mm , y = body_mass_g , color = species)) +
geom_point(position = "jitter") +
scale_x_continuous(limits = c(0,250),
breaks = seq(0,250,25),
expand = c(0,0)) +
scale_y_continuous(limits = c(0,7000),
expand = c(0,0))
The labels argument adjusts the category names and is an easy way to beautify the legend.
ggplot(penguins_clean, aes( x =flipper_length_mm , y = body_mass_g , color = species)) +
geom_point(position = "jitter") +
scale_color_discrete("Penguin species",
labels = c("Ad", "Ch", "Ge"))
12.5 The labs function
labs() is another way to change the axis and legend labels.
ggplot(penguins_clean, aes( x =flipper_length_mm , y = body_mass_g , color = species)) +
geom_point(position = "jitter") +
labs(x = "Flipper length (mm)" ,
y = "Body mass (g)",
color = "Penguin species" )
12.6 Adding a plot title with ggtitle
We can also quickly add a title. If the title is extremely long, we can break it into several lines using \n:
ggplot(penguins_clean, aes( x =flipper_length_mm , y = body_mass_g , color = species)) +
geom_point(position = "jitter") +
ggtitle("Comparing the body mass (g) and \nflipper length (mm) of three different penguin species")
12.7 Defining our own colors
There are many ways to decide how to assign colors. While we will not go into detail, we briefly want to discuss how we can manually change colors or use existing color palettes.
12.7.1 Using existing color palettes
scale_color_brewer (and others) provide sequential, diverging and qualitative colour schemes. Some examples on how to use them are found here.
ggplot(penguins_clean, aes( x =flipper_length_mm , y = body_mass_g , color = species)) +
geom_point(position = "jitter") +
scale_color_brewer(palette = "Spectral")
12.7.2 Manually assigning colors
With scale_color_manual() we can change the properties of the color scale. The first argument sets the legend title, and for the values argument we provide a vector of colors to use, which can optionally be named.
Below we use simple words, i.e. red and blue, to describe what colors we want to use. However, it’s recommended to use a hexcode instead to have a bit more control over our color scheme. For example, #377EB8 would also give a blue color (and in newer RStudio versions, the hexcode gets highlighted in the color it represents).
Let us start by defining some vectors in which we define the colors we want to map our three species to:
- palette1 in which we are explicit about what species is mapped to what color
- palette2 in which we only create a vector with three written out colors
- palette3 in which we create a vector with colors using a hexcode
#let's start by defining some color vectors
palette1 <- c(Gentoo = "#377EB8", Chinstrap = "#E41A1C", Adelie = "purple")
palette2 <- c("blue", "red", "purple")
palette3 <- c("#377EB8", "#E41A1C", "#B300B2")
Now, let us use scale_color_manual to assign our species to these colors:
ggplot(penguins_clean, aes( x =flipper_length_mm , y = body_mass_g , color = species)) +
geom_point(position = "jitter") +
scale_color_manual("Penguins", values = palette1)
Exercise
Use palette2 and palette3 instead of palette1. Do you see how they behave differently?
Click me to see an answer
ggplot(penguins_clean, aes( x =flipper_length_mm , y = body_mass_g , color = species)) +
geom_point(position = "jitter") +
scale_color_manual("Penguins", values = palette2)
We can see that if we don’t add the species names to the vector, we have less control over how colors are changed. If we don’t specify an order, the colors get assigned to the default order of the mapped variable, i.e. species:
- Adelie: blue
- Chinstrap: red
- Gentoo: purple
In contrast, using palette1 we can be explicit and decide that Adelie is purple regardless of the order of the vector.
ggplot(penguins_clean, aes( x =flipper_length_mm , y = body_mass_g , color = species)) +
geom_point(position = "jitter") +
scale_color_manual("Penguins", values = palette3)
Palette3 is an example of us using hexcodes. See how we use blue, red and purple as before but by using the hexcode we can exactly specify the intensity of the colors. In the example here, I decide to use a less bright color intensity compared to when we used the other palettes.
When making your own color palettes you need to keep in mind:
- You need as many colors as you have categories (3 in our case)
- Depending on how you assign the colors, you might not have control over the order of the colors and need to watch out that the colors get assigned to the right variable, i.e. species in the examples above.
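Both points can be checked before plotting. The lines below are a minimal sketch using base R; penguins_clean is rebuilt with na.omit() as an assumption, and palette_named, species_levels and missing_colors are names made up for this example.

```r
library(palmerpenguins)

# Assumption: penguins_clean is the penguins data without missing values
if (!exists("penguins_clean")) penguins_clean <- na.omit(penguins)

# a named palette, written in the same style as palette1 above
palette_named <- c(Gentoo = "#377EB8", Chinstrap = "#E41A1C", Adelie = "purple")

# the default order in which unnamed colors would be assigned
species_levels <- levels(penguins_clean$species)
species_levels

# categories that have no color in the palette (should be empty)
missing_colors <- setdiff(species_levels, names(palette_named))
length(missing_colors)
```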
13 Themes
Themes allow us to control all non-data ink on our plot, i.e. all visual elements that are not part of the data.
There are three types of elements that can be modified:
- text, which can be modified using element_text()
- line, which can be modified using element_line()
- rectangle, which can be modified using element_rect()
13.1 Modifying text elements
In general, we can modify all text elements: titles, plot, legend, and axes.
Let’s modify part of the axes, specifically the color of the axis title. For this, we access axis.title via the theme function and change the text element via element_text().
ggplot(penguins_clean, aes( x = bill_length_mm, y = body_mass_g, color = species )) +
geom_point() +
theme(axis.title = element_text(color = "blue"))
We can change specific parts in a hierarchical manner.
For example, when using axis.title, we change both the x- and y-axis titles. With axis.title.x we only change the x-axis title, and so on.
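The hierarchy can be sketched as follows: we set a color for all axis titles and then override it for the x-axis only. The colors are arbitrary, penguins_clean is rebuilt with na.omit() as an assumption, and p_titles is a name made up for this example.

```r
library(ggplot2)
library(palmerpenguins)

# Assumption: penguins_clean is the penguins data without missing values
if (!exists("penguins_clean")) penguins_clean <- na.omit(penguins)

p_titles <- ggplot(penguins_clean, aes(x = bill_length_mm, y = body_mass_g, color = species)) +
  geom_point() +
  theme(axis.title = element_text(color = "blue"),   # applies to both axis titles
        axis.title.x = element_text(color = "red"))  # more specific: x-axis title only
p_titles
```

The more specific element wins, so the y-axis title stays blue and only the x-axis title turns red.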
13.2 Modify lines
Lines are the axis lines as well as the tick marks or the lines around the plot. Same as for text, this is changed in a hierarchical manner.
As an example, let us change the color of the x axis line:
ggplot(penguins_clean, aes( x = bill_length_mm, y = body_mass_g, color = species )) +
geom_point() +
theme(axis.line.x = element_line(color = "blue"))
13.3 element_blank()
We can use element_blank() to remove any item of a plot. If you are unsure what each line removes, feel free to rerun the code with individual lines taken out.
ggplot(penguins_clean, aes( x = bill_length_mm, y = body_mass_g, color = species )) +
geom_point() +
theme(line = element_blank(),
rect = element_blank(),
text = element_blank())
13.4 Moving the legend
Legend positions can be changed via legend.position. Here, we can use the following:
- “top”, “bottom”, “left”, or “right”: place it at that side of the plot.
- “none”: don’t draw it.
- We can provide exact positions, using c(x, y): here, c(0, 0) means the bottom-left and c(1, 1) means the top-right.
Exercise
- With the information above, try to remove the legend.
- Add the legend at the bottom of the plot
- Position the legend with x at 0.6 and y at 0.1
Click me to see an answer
#question 1
ggplot(penguins_clean, aes( x = bill_length_mm, y = body_mass_g, color = species )) +
geom_jitter(alpha =0.6) +
theme(legend.position = "none")
#question 2
ggplot(penguins_clean, aes( x = bill_length_mm, y = body_mass_g, color = species )) +
geom_jitter(alpha =0.6) +
theme(legend.position = "bottom")
#question 3
ggplot(penguins_clean, aes( x = bill_length_mm, y = body_mass_g, color = species )) +
geom_jitter(alpha =0.6) +
theme(legend.position = c(0.6, 0.1))
13.5 Modifying theme elements
Many plot elements have multiple properties that can be set. For example, line elements, such as axes and grid lines, have a color, a thickness (linewidth, called size before ggplot2 3.4.0), and a line type (solid line, dashed, or dotted). To set the style of a line, you use element_line().
Below, we will go through a few examples but for a full list of options, go here.
For example, we can give the panel background (via panel.background) a “white” fill and a grey line color, and remove the background of the legend keys (via legend.key) by setting its fill to be missing (NA):
ggplot(penguins_clean, aes( x = bill_length_mm, y = body_mass_g, color = species )) +
geom_jitter(alpha =0.6) +
theme(panel.background = element_rect(fill = "white", color = "grey"),
legend.key = element_rect(fill = NA) )
We can also remove the axis ticks (axis.ticks) by using element_blank(). We can also remove the panel grid lines (panel.grid) in the same way.
ggplot(penguins_clean, aes( x = bill_length_mm, y = body_mass_g, color = species )) +
geom_jitter(alpha =0.6) +
theme(axis.ticks = element_blank(),
panel.grid = element_blank())
If we wanted to at least add the major grid lines back to the plot above we would do the following.
ggplot(penguins_clean, aes( x = bill_length_mm, y = body_mass_g, color = species )) +
geom_jitter(alpha =0.6) +
theme(axis.ticks = element_blank(),
panel.grid = element_blank(),
panel.grid.major = element_line(color = "black", linewidth = 0.5, linetype = "dotted"))
We can also easily change the text. For example, let us make the axis.text less prominent by changing the color to “grey”. Additionally, let’s add a title, increase the plot.title’s size to 16 and change its font face to “italic”.
ggplot(penguins_clean, aes( x = bill_length_mm, y = body_mass_g, color = species )) +
geom_jitter(alpha =0.6) +
ggtitle("Penguins are fun") +
theme(axis.text = element_text(color = "grey"),
plot.title = element_text(size = 16, face = "italic"))
Another useful thing: we can increase the distance between the axis title and the plot by increasing the margins:
ggplot(penguins_clean, aes( x = bill_length_mm, y = body_mass_g, color = species )) +
geom_jitter(alpha =0.6) +
ggtitle("Penguins are fun") +
theme(axis.title.x = element_text(margin=margin(t=10)),
axis.title.y = element_text(margin=margin(r=10)),
axis.text = element_text(color = "grey"),
plot.title = element_text(size = 16, face = "italic"))
13.6 Modify white space
When talking about whitespace, we mean all the non-visible margins and spacing in the plot.
To set a single whitespace value, use unit(x, unit), where x is the amount and unit is the unit of measure. The default unit is “pt” (points), which scales well with text. Other options include “cm”, “in” (inches) and “lines” (of text). If you want to have a list of all possible units, type ?grid::unit.
For example, we could make longer axis ticks like this:
ggplot(penguins_clean, aes( x = bill_length_mm, y = body_mass_g, color = species )) +
geom_jitter(alpha =0.6) +
theme(axis.ticks.length = unit(12, "points"))
Borders require you to set 4 positions, so use margin(top, right, bottom, left, unit). To remember the margin order, think TRouBLe. We set the legend.margin to 20 points (“pt”) on the top, 30 pts on the right, 40 pts on the bottom, and 50 pts on the left like this:
ggplot(penguins_clean, aes( x = bill_length_mm, y = body_mass_g, color = species )) +
geom_jitter(alpha =0.6) +
theme(legend.margin = margin(20,30,40,50, "pt"))
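If the positional order is hard to remember, the arguments of margin() can also be named (t, r, b, l), which makes the call easier to read. This is a minimal sketch of the same margins written that way; penguins_clean is rebuilt with na.omit() as an assumption, and legend_margin is a name made up for this example.

```r
library(ggplot2)
library(palmerpenguins)

# Assumption: penguins_clean is the penguins data without missing values
if (!exists("penguins_clean")) penguins_clean <- na.omit(penguins)

# same values as above, but with explicit names for top, right, bottom, left
legend_margin <- margin(t = 20, r = 30, b = 40, l = 50, unit = "pt")

ggplot(penguins_clean, aes(x = bill_length_mm, y = body_mass_g, color = species)) +
  geom_jitter(alpha = 0.6) +
  theme(legend.margin = legend_margin)
```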
Exercise
- Give the legend key size, legend.key.size, a unit of 3 centimeters (“cm”).
- Set the plot margin, plot.margin, to 10, 30, 50, and 70 millimeters (“mm”).
Click me to see an answer
#question1
ggplot(penguins_clean, aes( x = bill_length_mm, y = body_mass_g, color = species )) +
geom_jitter(alpha =0.6) +
theme(legend.key.size = unit(3, "cm"))
#question2
ggplot(penguins_clean, aes( x = bill_length_mm, y = body_mass_g, color = species )) +
geom_jitter(alpha =0.6) +
theme(plot.margin = margin(10,30,50,70, "mm"))
13.7 Modifying facets
With strip we can change the appearance of facets. Let’s change the facet text font and the box.
ggplot(penguins_clean, aes(bill_length_mm)) +
geom_histogram() +
facet_grid(species ~ year, scales = "free") +
theme(strip.text = element_text(size=12, face="bold"),
strip.background = element_rect(colour="black", fill="white",linetype="solid"))
14 Theme flexibility
Ways to use themes:
- From scratch (as shown above)
- using theme layer objects
- using built-in themes (from ggplot)
- using built-in themes (from other packages)
14.1 Make your own themes
For now, let’s have a look at the second point. Making our own themes is useful for consistency across several plots.
Let’s look at one of the plots we have done before and store it in the variable z. In the next step, we can add some custom themes to z:
z <- ggplot(penguins_clean, aes( x = bill_length_mm, y = body_mass_g, color = species )) +
geom_jitter(alpha =0.6) +
scale_x_continuous("Bill length (mm)") +
scale_y_continuous("Body weight (g)") +
scale_color_brewer("Penguins", palette = "Dark2", labels = c("Ad", "Ch", "Ge"))
z
Now, let’s change some theme elements to make our plot a bit more professional looking.
z + theme(text = element_text(family = "serif", size = 14),
rect = element_blank(),
panel.grid = element_blank(),
axis.line = element_line(color = "black"))
Now, let’s define a theme that we can reuse.
theme_pengs <- theme(text = element_text(family = "serif", size = 14),
rect = element_blank(),
panel.grid = element_blank(),
axis.line = element_line(color = "black"))
Use our new theme for the plot.
z + theme_pengs
The useful thing about doing things this way is that we can also apply this theme to any other plot. For example, let’s apply it to the plot below:
m <- ggplot(penguins_clean, aes(x = body_mass_g)) +
geom_histogram() +
scale_x_continuous(expand = c(0,0))+
scale_y_continuous(expand = c(0,0))
m
Now, we can add our theme like this:
m + theme_pengs
We can still modify the theme by adding another theme layer, which overwrites the previous settings.
m +
theme_pengs +
theme(axis.line.x = element_blank())
14.2 Accessing built-in themes
Use theme_*() to access built-in themes. A full list of themes can be found in the ggplot2 documentation. In general, the following built-in themes might be useful to get started:
- theme_gray() is the default.
- theme_bw() is useful when you use transparency.
- theme_classic() is more traditional.
- theme_void() removes everything but the data.
z +
theme_classic()
Built-in themes have a convenient way to change all text at once and make things a bit bigger (often useful for presentations), the base_size argument:
z +
theme_classic(base_size = 12)
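To make the effect of base_size easier to see, it can help to compare the default with a clearly larger value. This is a minimal sketch, assuming that penguins_clean is the penguins data with incomplete rows removed (rebuilt here with na.omit()); the value 18 is an arbitrary choice for slides.

```r
library(ggplot2)
library(palmerpenguins)

# Assumption: stand-in for the tutorial's penguins_clean (rows with NA dropped)
penguins_clean <- na.omit(penguins)

z <- ggplot(penguins_clean,
            aes(x = bill_length_mm, y = body_mass_g, color = species)) +
  geom_jitter(alpha = 0.6)

# the built-in themes use base_size = 11 unless told otherwise
z_default <- z + theme_classic()

# larger base text size, e.g. for a presentation (18 is an arbitrary choice)
z_slides <- z + theme_classic(base_size = 18)
z_slides
```

Axis titles, tick labels and legend text are sized relative to base_size, so one number rescales them together.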
Again, we can modify every specific element we want.
z +
theme_classic() +
theme(text = element_text(family = "serif"))
Exercise
- Create a scatter plot comparing the bill length and body mass. Map the species variable to the color aesthetic. Add a black and white theme, theme_bw(), to the penguin plot.
Click me to see an answer
ggplot(penguins_clean, aes( x = bill_length_mm, y = body_mass_g, color = species )) +
geom_jitter(alpha =0.6) +
theme_bw()
15 Playground
The code below is outside of the tutorial proper and lists other packages that can be useful in combination with ggplot.
15.1 Cowplot
cowplot provides various features that help with creating publication-quality figures, such as a set of themes, functions to align plots and arrange them into complex compound figures, and functions that make it easy to annotate plots and/or mix plots with images.
#load library
library(cowplot)
Let us start by creating some plots first:
hist <- ggplot(penguins_clean, aes( x = flipper_length_mm, color = species )) +
  geom_histogram()
box <- ggplot(penguins_clean, aes(x = species, y = body_mass_g, color = species)) +
  geom_boxplot() +
  geom_point(position = position_jitter())
scatter <- ggplot(penguins_clean, aes(flipper_length_mm, body_mass_g, color = species)) +
  geom_point()
15.1.1 Cowplot themes
Cowplot comes with some built-in themes that can be used to beautify the plot:
scatter +
theme_cowplot(12)
15.1.2 Arranging multiple plots
With the plot_grid function, we can arrange multiple plots in a grid:
plot_grid(scatter, box, labels = c("A", "B"), label_size = 12)
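plot_grid() also takes layout arguments such as ncol and rel_heights. The sketch below stacks the two plots in one column instead of placing them side by side. It assumes that penguins_clean is the penguins data with incomplete rows removed (rebuilt here with na.omit()), and the 2:1 height ratio is an arbitrary choice.

```r
library(ggplot2)
library(cowplot)
library(palmerpenguins)

# Assumption: stand-in for the tutorial's penguins_clean (rows with NA dropped)
penguins_clean <- na.omit(penguins)

box <- ggplot(penguins_clean, aes(x = species, y = body_mass_g, color = species)) +
  geom_boxplot()
scatter <- ggplot(penguins_clean, aes(flipper_length_mm, body_mass_g, color = species)) +
  geom_point()

# one column; the scatter plot gets twice the height of the boxplot
# labels = "AUTO" generates the labels A, B, ... automatically
stacked <- plot_grid(scatter, box,
                     ncol = 1,
                     rel_heights = c(2, 1),
                     labels = "AUTO",
                     label_size = 12)
stacked
```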
We see that the legend is redundant because it is shown for both plots. We can deal with this as follows:
# extract a legend
legend <- get_legend(scatter +
  theme(legend.direction = "vertical",
        legend.justification = "top",
        legend.box.just = "top")
)
#add legend to the plot
p <- plot_grid(
  scatter + theme(legend.position = "none"),
  box + theme(legend.position = "none"),
  legend, nrow = 1,
  rel_widths = c(1, 1, 0.3),
  labels = c("A", "B", ""),
  label_size = 12)
p
15.2 ggstatsplot
ggstatsplot is an extension of the ggplot2 package for creating graphics with details from statistical tests included in the information-rich plots themselves. More information can be found in the ggstatsplot documentation.
library(ggstatsplot)
There are different types of plots; for example, we can use violin plots for comparisons between groups/conditions. The following functions are available:
Function | Plot | Description |
---|---|---|
ggbetweenstats() | violin plots | for comparisons between groups/conditions |
ggwithinstats() | violin plots | for comparisons within groups/conditions |
gghistostats() | histograms | for distribution about numeric variable |
ggdotplotstats() | dot plots/charts | for distribution about labeled numeric variable |
ggscatterstats() | scatterplots | for correlation between two variables |
ggcorrmat() | correlation matrices | for correlations between multiple variables |
ggpiestats() | pie charts | for categorical data |
ggbarstats() | bar charts | for categorical data |
ggcoefstats() | dot-and-whisker plots | for regression models and meta-analysis |
We can also use different statistical approaches:
Type | Test | Function used |
---|---|---|
Parametric | One-sample Student’s t-test | stats::t.test() |
Non-parametric | One-sample Wilcoxon test | stats::wilcox.test() |
Robust | Bootstrap-t method for one-sample test | WRS2::trimcibt() |
Bayesian | One-sample Student’s t-test | BayesFactor::ttestBF() |
An example for a scatterplot:
ggscatterstats(data = penguins_clean, x = bill_length_mm, y = body_mass_g)
A simple example for a violin plot:
ggbetweenstats(data = penguins, x = species, y = body_mass_g, type = "robust")
A histogram of the distribution of a numeric variable with gghistostats():
gghistostats(data = penguins, x = body_mass_g, type = "np",
normal.curve = TRUE, normal.curve.args = list(color = "black", size = 1))
We can also extract the stats like this:
p <- gghistostats(data = penguins, x = body_mass_g, type = "np")
extract_stats(p)
$subtitle_data
# A tibble: 1 × 12
statistic p.value method alternative effectsize
<dbl> <dbl> <chr> <chr> <chr>
1 58653 8.19e-58 Wilcoxon signed rank test two.sided r (rank biserial)
estimate conf.level conf.low conf.high conf.method n.obs expression
<dbl> <dbl> <dbl> <dbl> <chr> <int> <list>
1 1 0.95 1 1 normal 342 <language>
$caption_data
NULL
$pairwise_comparisons_data
NULL
$descriptive_data
NULL
$one_sample_data
NULL
$tidy_data
NULL
$glance_data
NULL
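The Wilcoxon statistic in the output above can be cross-checked with base R. This is a minimal sketch, assuming that the test compares body mass against a reference value of 0 (which is what the reported statistic suggests). Because every body mass is positive, the statistic V is simply the sum of the ranks 1 to 342, that is 342 × 343 / 2 = 58653.

```r
library(palmerpenguins)

# one-sample Wilcoxon signed rank test of body mass against 0
# (assumption: 0 is the reference value used in the plot above)
# missing values are dropped by wilcox.test(), leaving 342 observations
res <- wilcox.test(penguins$body_mass_g, mu = 0)

# V is the sum of the ranks of the positive differences
res$statistic
```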
We can also do a grouped comparison. For example, let’s compare the body mass of the different species, grouped by sex. Since we do several comparisons, let’s also apply a Bonferroni correction here:
grouped_ggbetweenstats(data = penguins_clean, x = species, y= body_mass_g, grouping.var = sex,
p.adjust.method = "bonferroni", centrality.type = "parametric")
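To see what a Bonferroni correction does to individual p-values, base R's p.adjust() can be used, which accepts the same method name. The p-values below are made up purely for illustration: each one is multiplied by the number of comparisons (here 3) and capped at 1.

```r
# three hypothetical raw p-values (made up for illustration)
p_raw <- c(0.01, 0.02, 0.04)

# Bonferroni: multiply each p-value by the number of tests, cap at 1
p_bonf <- p.adjust(p_raw, method = "bonferroni")
p_bonf
# 0.03 0.06 0.12
```

With a significance threshold of 0.05, only the first comparison would still count as significant after the correction.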
We can also use this tool to tag outliers:
ggbetweenstats(data = penguins, x = species, y = body_mass_g, outlier.tagging = TRUE, outlier.color = "pink")
Finally, we can also generate a correlogram (a matrix of correlation coefficients):
ggcorrmat(data = penguins_clean)
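The numbers behind such a plot can be approximated with base R's cor(). This is a sketch under assumptions: penguins_clean is rebuilt here with na.omit(), only the four body measurements are selected by hand, and Pearson correlation (the default of cor()) is used.

```r
library(palmerpenguins)

# Assumption: stand-in for the tutorial's penguins_clean (rows with NA dropped)
penguins_clean <- na.omit(penguins)

# Assumption: restrict the matrix to the four body measurements
num_cols <- c("bill_length_mm", "bill_depth_mm",
              "flipper_length_mm", "body_mass_g")

# Pearson correlation matrix (the default method of cor())
cor_mat <- cor(penguins_clean[, num_cols])
round(cor_mat, 2)
```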
15.3 GGPP
ggpp is a relatively new package that allows us to further modify plots using tibbles.
We can add summary tables and plots to a figure by installing and loading a few additional libraries first:
library(tibble)
library(ggpp)
library(ggpubr)
#store a custom theme that we want to use for all plots in a variable
#theme_pubr is part of the ggpubr package
theme_manual <-
theme_pubr(base_size = 10) +
theme(legend.position = "right",
axis.title.x = element_text(margin=margin(t=10)),
axis.title.y = element_text(margin=margin(r=10)))
We can easily add a table on top of our plot:
#generate summary statistics
#we use rename to prettify the column header
#and use round to not display any digits when we calculate the mean
summary <- penguins_clean |>
group_by(species) |>
summarise(avg_body_mass = round(mean(body_mass_g),0)) |>
rename(Species = species, `Mean body\n mass (g)` = avg_body_mass)
#generate a tibble from our dataframe
#the positions will later be used to place our table in the plot
data.tb <- tibble(x = 30, y = 6400, tb = list(summary))
#generate a dot plot and add summary stats
ggplot(penguins_clean, aes(bill_length_mm, body_mass_g)) +
#geom_table allows to plot a df or tibble
geom_table(data = data.tb, aes(x, y, label = tb)) +
geom_point(alpha = 0.5) +
theme_manual +
ylim(NA,6500) +
xlim(NA,65)
Or a plot:
#generate a boxplot and store it in a variable
boxplot <-
ggplot(penguins_clean, aes(species, body_mass_g, fill = species )) +
geom_boxplot() +
theme_bw(base_size = 8) +
theme(legend.position = "none",
panel.grid = element_blank(),
axis.title = element_blank())
#convert to tibble
data.tb <- tibble(x = 20, y = 6400,
plot = list(boxplot))
#generate a scatter plot and add the boxplot in an additional layer
ggplot(penguins_clean, aes(bill_length_mm,body_mass_g , color = species)) +
geom_point() +
geom_plot(data = data.tb, aes(x, y, label = plot)) +
theme_manual
We can also zoom into plots:
p <-
ggplot(penguins_clean, aes(bill_length_mm, body_mass_g)) +
geom_point(alpha = 0.5) +
ylim(NA,6500) +
xlim(NA,65)
data.tb <-
tibble(x = 90, y = 2500,
plot = list(p +
coord_cartesian(xlim = c(49, 61),
ylim = c(5000, 6400)) +
labs(x = NULL, y = NULL) +
theme_bw(8) +
scale_colour_discrete(guide = "none")))
ggplot(penguins_clean, aes(bill_length_mm, body_mass_g)) +
geom_plot(data = data.tb, aes(x, y, label = plot)) +
annotate(geom = "rect",
xmin = 49, xmax = 61, ymin = 5000, ymax = 6400,
linetype = "dotted", fill = NA, colour = "black") +
geom_point() +
annotate("segment", x = 49, xend = 65, y = 5000, yend = 2700, linetype = "dotted") +
annotate("segment", x = 61, xend = 89, y = 6400, yend = 4200, linetype = "dotted") +
theme_manual +
scale_x_continuous(breaks = seq(0,60,10))