Section 19 Presentation of plots
You should think carefully about how you present your plots to others. With minimal plotting commands, you can obtain a plot quickly and easily, but it is unlikely it will be suitable for including in a report.
For this chapter, you will need to install the MAS6005 package, which is available on GitHub only. Install it with the commands
19.1 The basics
Using the mtcars data frame, we’ll make a scatter plot of miles per gallon against weight. For more information about these variables, use ?mtcars. Suppose we use the following code, and the figure below is used in a written report
This standard of presentation would be unacceptable in any report! To improve the figure, we should
- include proper axis labels: never simply use the R variable names;
- specify the units;
- give sufficiently detailed captions so that the figure can be understood on its own, and include a conclusion: what do we learn from the figure?
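The first two points might look like this for the mtcars scatter plot. This is a minimal sketch: the label wording is only a suggestion, with the units taken from ?mtcars (wt is recorded in units of 1000 lb, and mpg in miles per US gallon). The caption is not set here, as it belongs in the report itself.

```r
library(ggplot2)

# Scatter plot of fuel economy against weight, with descriptive axis labels
# that state the units instead of the raw variable names wt and mpg
p_basic <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(x = "Weight (1000 lb)",
       y = "Fuel economy (miles per US gallon)")
```

Printing p_basic at the console draws the plot.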
The first two points are obvious, but the third perhaps less so, so we’ll discuss this a little more.
19.4 Customising the appearance of a plot
You can change just about any aspect of the appearance of a plot. If you have a legend in your plot, it’s likely you’ll need to modify it, as the default will use data frame column names. You may also wish to change the grey background used by default in ggplot2. Font sizes will need changing if they are too small in your final report.
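As a rough illustration of these three changes (legend, background and font size), here is a sketch using mtcars. The choice of theme_bw() and a base font size of 14 is arbitrary, and the transmission labels follow the coding given in ?mtcars (am is 0 for automatic, 1 for manual).

```r
library(ggplot2)

p_custom <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(am))) +
  geom_point() +
  labs(x = "Weight (1000 lb)",
       y = "Fuel economy (miles per US gallon)") +
  # Replace the default legend title "factor(am)" and the labels 0 and 1
  scale_colour_discrete(name = "Transmission",
                        labels = c("Automatic", "Manual")) +
  # White background instead of grey, and a larger base font (default is 11)
  theme_bw(base_size = 14)
```

Other built-in themes, such as theme_minimal() or theme_classic(), can be swapped in the same way.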
The R Graphics Cookbook is an excellent reference here. (The format used throughout is to state a “problem”: something you want to do with your plot, and then provide the code solution and discussion). Some chapters in particular to look at are
19.5 Refining a plot: an example
We’ll now give an example of creating a plot, and then thinking about how we might improve it (assuming we’ve already got ‘the basics’ right, in that the axis titles and caption are satisfactory).
19.5.1 The data to plot
We consider the mvscores data set from the MAS6005 package. The aim is to compare test match batting scores for one player, Michael Vaughan, between the matches he played as captain and the matches where he was not captain. The hypothesis is that players tend to perform less well if they have the added burden of captaincy.
The data set also records whether each score was in the first or second innings; batting can be more difficult in a second innings, due to wear of the pitch.
To summarise for those with no interest in/knowledge of cricket:
- we want to illustrate how/if the values in runs differ depending on the captain variable (a 2-level factor: yes or no);
- innings (a 2-level factor: first or second) is a ‘blocking variable’: we are more interested in comparing captain/not captain scores within the same innings type than between different innings types.
19.5.2 A first attempt: four histograms
We’ll first try producing four histograms of scores: one for each combination of captain and innings. Note that
- we use the gridExtra package (Auguie 2017) to arrange the plots in a 2x2 grid;
- plots produced by ggplot2 can be assigned to variables, and used later in other functions.
library(dplyr)    # filter() and the pipe %>%
library(ggplot2)
library(MAS6005)  # provides the mvscores data set

# One histogram for each combination of innings and captain
p1 <- mvscores %>%
  filter(innings == "first", captain == "yes") %>%
  ggplot(aes(x = runs)) +
  labs(x = "First innings runs, captain") +
  geom_histogram()
p2 <- mvscores %>%
  filter(innings == "first", captain == "no") %>%
  ggplot(aes(x = runs)) +
  labs(x = "First innings runs, not captain") +
  geom_histogram()
p3 <- mvscores %>%
  filter(innings == "second", captain == "yes") %>%
  ggplot(aes(x = runs)) +
  labs(x = "Second innings runs, captain") +
  geom_histogram()
p4 <- mvscores %>%
  filter(innings == "second", captain == "no") %>%
  ggplot(aes(x = runs)) +
  labs(x = "Second innings runs, not captain") +
  geom_histogram()

# Arrange the four stored plots in a 2x2 grid
gridExtra::grid.arrange(p1, p2, p3, p4, nrow = 2)
The main thing I don’t like about this plot is that the \(x\)-axis scales are different for each histogram, which makes comparing the histograms harder. We could set the scale manually (see ?ggplot2::xlim) but facets might work well here.
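If the histograms were kept as separate plots, giving them common axis limits and common bin edges would make them comparable. The sketch below illustrates the idea on the built-in mtcars data rather than mvscores. The helper hist_mpg() and the break points are hypothetical choices for this example, and it uses coord_cartesian() in place of xlim(), because xlim() can drop bars that extend beyond the limits.

```r
library(ggplot2)

# Shared bin edges, chosen to cover the full range of mtcars$mpg (10.4 to 33.9)
mpg_breaks <- seq(10, 35, by = 2.5)

# Hypothetical helper: a histogram of mpg with fixed bins and fixed x limits
hist_mpg <- function(data, x_label) {
  ggplot(data, aes(x = mpg)) +
    geom_histogram(breaks = mpg_breaks) +
    # Zoom to the same window in every plot without discarding any data
    coord_cartesian(xlim = range(mpg_breaks)) +
    labs(x = x_label, y = "Number of cars")
}

p_auto <- hist_mpg(subset(mtcars, am == 0),
                   "Fuel economy (miles per US gallon), automatic")
p_manual <- hist_mpg(subset(mtcars, am == 1),
                     "Fuel economy (miles per US gallon), manual")
```

The two plots can then be stacked with gridExtra::grid.arrange(p_auto, p_manual, nrow = 2), so that the bins line up vertically.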
The \(y\)-axis scales are also different. This issue is slightly more complicated, in that the numbers of observations used for each histogram are different, so it’s really the shapes of the histograms that we want to compare. One option is to scale each histogram to have total area 1 (so it’s like a density plot).
It’s also worth thinking about the arrangement of the plots within the grid. I would use rows rather than columns to represent the main factor of interest (captain), so that the main comparisons involve looking at histograms aligned vertically, not horizontally: any ‘shift’ in distribution is easier to see.
An alternative plot, which I prefer, is as follows.
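# after_stat(density) rescales each histogram to have total area 1;
# it replaces the older ..density.. notation, which is now deprecated
ggplot(mvscores, aes(x = runs, y = after_stat(density))) +
  labs(x = "Runs scored") +
  geom_histogram() +
  facet_grid(rows = vars(captain),
             cols = vars(innings),
             labeller = label_both)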
19.5.3 Using a box plot
Another option is to present the data using a box plot. Although we lose some information compared with the histogram plot, comparing summaries of the distributions of scores is easier.
I prefer this to the histogram plot. Note that it’s more helpful to map captain to colour and innings to the \(x\)-axis than vice-versa, as this makes it easier to compare the effect of ‘treatment’ (captain) within each ‘block’ (innings).
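A box plot with this mapping could be produced along the following lines. This is a sketch rather than the code behind the original figure: the helper box_plot_scores() and the label text are hypothetical, and it assumes only a data frame with columns runs, captain and innings, as mvscores is described above.

```r
library(ggplot2)

# Hypothetical helper: box plots of runs, with innings (the 'block') on the
# x-axis and captain (the 'treatment') mapped to colour within each block
box_plot_scores <- function(scores) {
  ggplot(scores, aes(x = innings, y = runs, colour = captain)) +
    geom_boxplot() +
    labs(x = "Innings", y = "Runs scored", colour = "Captain?")
}

# Draw the plot only if the MAS6005 package is installed
if (requireNamespace("MAS6005", quietly = TRUE)) {
  print(box_plot_scores(MAS6005::mvscores))
}
```

Swapping the roles of innings and captain in aes() gives the alternative mapping, which makes the within-innings comparison harder to read.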
19.6 Exercises
Exercise 19.1 Below are three plots. For each plot,
- Run the code in R to reproduce the plot.
- Think about how each plot might be improved.
- Modify the R code to achieve a better plot.
Hints are given for each plot, but try not to read them until you have had your own ideas!
- This plot uses the built-in data set airquality. Type ?airquality for more details.
This plot fails on all levels regarding The basics and The caption test! Use the help file so you can specify more informative labels.
Later, we will be using R Markdown to add captions to plots, so for now, just suggest some text that would be more suitable for the caption.
- The following plot uses the medals data set from the MAS6005 package, and shows number of gold medals against population size.
ggplot(MAS6005::medals, aes(x = population, y = gold)) +
  geom_point() +
  labs(x = "population size", y = "gold medals won")
- The caption refers to India and China. Although the reader might guess which points are these two countries, they shouldn’t have to! Annotations would help.
- The bunching of most of the points in the bottom left corner doesn’t look very nice. A log-scale \(x\)-axis is worth trying.
- The scientific notation used for the \(x\)-axis scale is unfriendly for the general reader. It might help to express population size in units of millions.
- The following plot uses the inequality data set from the MAS6005 package, and shows income inequality for different countries.
- Visualising the rank order is difficult here, as the bars are arranged in alphabetical order of country. Ordering them by Gini coefficient would help. See this example in the R Graphics Cookbook.
- The \(y\)-axis label “country” is unnecessary here, and can be removed.
- As the caption refers to the UK, we could try to make the UK observation more distinctive in the plot. Here, you could try using the fill argument in geom_col(), specifying it to be a vector of 36 colours: 35 the same, and one different for the UK.
Bonus challenge! Tufte (2013) suggests having gaps within the bars to create a nice grid effect (“data-ink maximisation”). Try a Google image search for “Tufte bar chart”. Can you create this effect?
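The ordering and highlighting hints for the third plot can be tried out on a built-in data set first, so as not to give the exercise away. The sketch below is only an illustration using mtcars: the data frame car_mpg and the choice of the Fiat 128 as the bar to pick out are made up for this example. Note that it maps fill to a logical column and sets the colours with scale_fill_manual(), which is a variant of the hint, not the vector-of-colours approach itself.

```r
library(ggplot2)

# Illustrative data: one row per car model, taken from the built-in mtcars
car_mpg <- data.frame(model = rownames(mtcars), mpg = mtcars$mpg)
# Flag the single bar to pick out (an arbitrary choice for this sketch)
car_mpg$highlight <- car_mpg$model == "Fiat 128"

# reorder() sorts the bars by mpg instead of alphabetically by model
p_bars <- ggplot(car_mpg,
                 aes(x = mpg, y = reorder(model, mpg), fill = highlight)) +
  geom_col() +
  # One colour for the highlighted bar, another for the rest; no legend
  scale_fill_manual(values = c("FALSE" = "grey60", "TRUE" = "firebrick"),
                    guide = "none") +
  # y = NULL removes the axis title, as the model names need no heading
  labs(x = "Fuel economy (miles per US gallon)", y = NULL)
```

Printing p_bars draws the chart, with the bars in rank order and one bar in a contrasting colour.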
19.7 Data sources
Air quality data obtained from the New York State Department of Conservation (ozone data) and the National Weather Service (temperature data), provided in the R datasets package.
Inequality data obtained from OECD (2016), Income inequality (indicator). doi: 10.1787/459aa7f1-en [Accessed on 17 August 2016]
Population data obtained from The World Bank. Accessed 6th October 2015.
Medal table obtained from https://www.rio2016.com/en/medal-count-country [Accessed on 6th October 2016, but this link is no longer active.]
References
Tufte, Edward R. 2013. The Visual Display of Quantitative Information. Second edition. Graphics Press.