Section 18 Making plots with ggplot2
R has different choices available for using plots. Two commonly used options are ‘base graphics’ (e.g. using commands such as plot()
, hist()
, boxplot()
) and the package ggplot2
(Wickham, Chang, et al. 2020), (Wickham 2016). It’s a matter of personal preference which you should use; I like to use both: base graphics for drawing diagrams, and ggplot2
for plotting data from data frames. In these notes we will discuss ggplot2
.
(Artwork by @allison_horst)
In this section, we’ll concentrate on how to make different types of plots. You’ll almost certainly need to customise the appearance of your plot, but we’ll leave that until the next section.
We’ll work through some examples in the next sections. To try these out on your own computer, you’ll need to load the tidyverse and import the maths.csv
data, and create the wealthiest
column. The commands to do this all are below.
library(tidyverse)
maths <- read_csv("https://oakleyj.github.io/exampledata/maths.csv") %>%
mutate(wealthiest = gdp > 17000)
18.1 The general syntax
As an example, we’ll first produce a scatter plot:
ggplot(data = maths, aes(x = gdp, y = score)) +
geom_point(size = 2, alpha = 0.5) +
labs(x = "GDP per capita (US$)",
y = "Mean PISA mathematics score, 2015")
The syntax can look a little complicated to start but you should get used to it! Some general points are as follows.
There is quite a lot to take in here, so just skim read for now, and then have another look once you seen some examples in the following sections.
- The data we want to plot must be in a data frame.
- All plots begin with a
ggplot()
command, specifying the data frame we are using (maths
in this case). - We use the
aes()
any time we want something on the plot to represent values in a data frame column, e.g- the position of a point on the \(x\)-axis or \(y\)-axis;
- the colour of a point (different colours depending on variable values). This is sometimes referred to as ‘mapping a column to an aesthetic’. In the example, we mapped the
gdp
column onto the \(x\)-axis, and thescore
column to the \(y\)-axis.
- Any feature of the points not related to columns in the data frame are specified outside of an
aes()
command. Here we usedsize
to make the points a little largeralpha
to make the points transparent (helpful when points are overlapping.)
- The actual type of plot is specified as a “geom”. There will be different
geom
commands for scatter plots, box plots, histograms etc. We usedgeom_point()
to make a scatter plot. - We use the
+
symbol to combine commands and make a single plot. In the example we’ve added a command to give axes labels.
18.2 Scatter plots
We’ve seen a scatter plot already, but we’ll repeat the basic syntax here:
Note that, without the labs()
command, the axes labels are just the column names.
In any report that you write, you will almost certainly need to specify your own axes labels; the column names will not be suitable. In your labels, include the units of measurement, if appropriate.
18.2.1 Annotating a scatter plot
We may wish to annotate a plot, e.g. labelling a single observation. Here’s an example of labelling the UK (GDP = 39899, score = 492). We’ll also include proper axes labels.
ggplot(data = maths, aes(x = gdp, y = score)) +
geom_point() +
labs(x = "GDP per capita (US$)",
y = "Mean PISA mathematics score, 2015") +
annotate("point", x = 39899, y = 492,
colour = "red", size = 2) +
annotate("text", label = "UK", x = 39899,
y = 492, colour = "red",
hjust = -0.2, vjust = 1)
- The first
annotate()
command adds a red circle ("point"
) at the coordinates \(x=39899\), \(y=492\) (corresponding to the UK), with thesize
set to 2 to make it a little larger. - The second
annotate()
command adds some red text (UK) at the coordinates \(x=39899\), \(y=492\), with the argumentshjust
andvjust
shifting the text slightly horizontally and vertically, so that it appears next to, rather than on top of the red dot. It can take a little trial and error to find the values forhjust
andvjust
that you are happy with.
Adding text labels to points can be awkward! Fortunately there is a
nice package ggrepel
(Slowikowski 2020) that does various useful things,
such as automatically pushing text labels away from each other, whilst
still making clear which labels go with which points. Some examples of using ggrepel
are
described here.
18.2.2 Adding a trend
We can see clearly that maths scores tend to increase as GDP per capita increases, but the relationship doesn’t look linear. If we want to emphasise such a relationship, we can add the trend to the plot, using the extra line geom_smooth()
in the plot command:
ggplot(data = maths, aes(x = gdp, y = score)) +
geom_point() +
labs(x = "GDP per capita (US$)",
y = "Mean PISA mathematics score, 2015") +
annotate("point", x = 39899,
y = 492, colour = "red",
size = 2) +
annotate("text", label = "UK",
x = 39899, y = 492,
colour = "red",
hjust = -0.2, vjust = 1) +
geom_smooth()
The blue line shows the estimated trend. The grey shaded area indicates uncertainty about this trend (it’s wider on the right hand side, because we have less data there).
If we want to plot a linear trend, we add the argument method = "lm"
to geom_smooth()
(we can use various regression methods within geom_smooth()
):
18.2.3 Representing more than two variables on a scatter plot
We can represent a third variable using colour. Additional variables can be presented using size and shape, although care is needed here; the plot could get difficult to read.
18.2.3.1 Colours for qualitative variables
Here’s an example using colour to represent continent
ggplot(data = maths,
aes(x = gdp, y = score)) +
geom_point(aes(colour = continent)) +
labs(x = "GDP per capita (US$)",
y = "Mean PISA mathematics score, 2015")
- Note that the mapping of
continent
tocolour
has to sit inside theaes()
command. - Colours can be specified manually.
ggplot2
will attempt to use distinctive colours by default, but this can be hard if there are too many groups.
18.2.3.2 Colours for quantitative variables
Here’s an example using colour to represent start.age
ggplot(data = maths,
aes(x = gdp, y = score)) +
geom_point(aes(colour = start.age)) +
labs(x = "GDP per capita (US$)",
y = "Mean PISA mathematics score, 2015")
Note how a ‘continuous’ colour scale has been used. As there were only three distinct starting ages in the data, a ‘qualitative’ colour scale might be better here. To do this, we can convert start.age
to a factor variable with the plot commands:
18.3 Histograms
Here’s an example of a histogram of the score
variable. We only map one column to an axis, and produce the histogram with geom_histogram()
:
We can scale the histogram so that the total area equals 1:
18.3.1 Customising a histogram plot in R
We’ll redraw the plot with different colours, add a better axis label, specify a histogram bin-width of size 10, and indicate the UK’s score with a red dot:
ggplot(data = maths, aes(x = score)) +
geom_histogram(colour = "blue", fill = "white", binwidth = 10) +
labs(x = "Mean PISA mathematics score, 2015") +
annotate("point", x = 492, y = 0, size = 4, colour = "red")
- The second line now includes extra arguments:
fill
sets the colour of the interior of the bars, andcolour
sets the colour of the bar edges.binwidth
sets how wide each bar is on the \(x\)-axis; - the third line (
labs
) specifies the label on the \(x\)-axis; - the fourth line (
annotate
) draws a red circle at the coordinates \(x=492, y=0\), andsize = 4
increases the size of the circle (the default value forsize
is 1.)
18.4 Box plots
Box plots can be useful for comparing multiple distributions, although you may need to explain to your reader how to interpret them! Here’s an example of a box plot to compare scores between the different continents:
ggplot(data = maths, aes(x = continent, y = score)) +
geom_boxplot() +
labs(y = "Mean PISA mathematics score, 2015")
It can be difficult to fit all the labels in on the \(x\)-axis. A simple solution is to draw the box plot ‘horizontally’ (and we’ll specify an empty \(y\)-axis label as we don’t really need one)
ggplot(data = maths, aes(y = continent, x = score)) +
geom_boxplot() +
labs(x = "Mean PISA mathematics score, 2015",
y = "")
If we wanted to do a box plot of score
by start.age
, we have to convert start.age
to a factor variable using as.factor()
ggplot(data = maths, aes(x = as.factor(start.age), y = score)) +
geom_boxplot() +
labs(x = "School starting age",
y = "Mean PISA mathematics score, 2015")
18.4.1 Adding the observations to a box plot
It may be helpful to show the individual observations as well, for example, if the distribution of points within a group is bimodal; this wouldn’t be apparent from a box plot. It can help to ‘jitter’ the points on the group axis so that each point can be seen clearly. This can be done as follows.
18.5 ‘Global’ aesthetics
Aesthetics can be defined in different places, and this affects the appearance of the plot. Here’s an example scatter plot, where we map colour to wealthiest
within the geom_point()
command
ggplot(data = maths,
aes(x = gini, y = score)) +
geom_point(aes(colour = wealthiest)) +
labs(x = "Income inequality (Gini coefficient)",
y = "Mean PISA mathematics score, 2015") +
geom_smooth(method = "lm")
Note that the trend specified in geom_smooth()
refers to a single trend for all the data. If we instead map wealthiest
to colour in the ggplot()
command, this now applied ‘globally’ in all the subsequent geom
commands:
18.6 Facets
In some cases, we may wish to create a figure with multiple plots side-by-side, or arranged in a grid. If the plots are of the same type, with the same axes, we can use facets, which will ensure that the scales on each axes are the same, making comparison between the plots easier.
For example, suppose we want a separate scatter plot of score
against gdp
, for each value of start.age
. We can do this as follows.
See ?facet_grid
and also ?facet_wrap
for more details (look at the examples in particular, for the different ways these plots can be laid out.)
18.7 Exercises
Exercise 18.1 For these exercises you will need the Brexit
data frame created in Section 12.10.
- Produce a scatter plot of the percentage voting to remain in the EU (within each local authority), against the percentage of adults with Level 4 qualifications.
- Specify suitably descriptive axes labels;
- add a linear trend to your plot;
- colour each point by its
Region
variable.
- Produce a histogram showing the distribution of the percentage voting to remain in the EU across the different local authorities.
- Specify a suitably descriptive axis label;
- annotate your plot to indicate the vote in Sheffield.
- Produce a box plot for the percentage voting to remain in the EU within each region.
- Experiment with ‘horizontal’ and ‘vertical’ box plots.
18.8 Further reading
There is an RStudio Cheatsheet on
ggplot2
: go to Help > Cheatsheets > Data Visualization with ggplot2.The R Graphics Cookbook (2nd edition) is a good reference for making plots with
ggplot2
.There is an online tutorial available here: RStudio Primers (Visualize Data)
See also R for Data Science (Chapter 3)