Section 13 Data handling with dplyr
Once we have our data in a single data frame, we might wish to obtain some simple summary statistics, perhaps within subgroups of the data, and/or understand the structure of the data better (e.g. counting how many observations we have for particular combinations of factors.) This might involve creating new columns, with values dependent on the other columns.
The package dplyr
(Wickham, François, et al. 2020) has various functions for working with data frames, and we will illustrate some here.
(Artwork by @allison_horst)
13.1 Chaining commands together with the pipe operator %>%
We’ll first introduce the ‘pipe operator’ %>%
. (It’s a matter of personal preference whether you use it in your code or not, but you may find it makes your code easier to read.)
The pipe operator %>%
takes whatever the output is from the left hand side, and uses it as the first argument in the function on the next line. In general,
can be written using the pipe operator as
so, for example, from the previous chapter
would be written as
Code with pipes can be easier to read, particularly when multiple functions are chained together. For example, this code
would be read as
- start with
x0
, - then apply function
func1()
to it with additional argumentx1
, - then apply the function
func2()
to the result, using additional argumentx2
, - and so on
Alternatives without pipes would be
which would be harder to read, or
which may be undesirable if we don’t want to store all the intermediate variables y1
, y2
, y3
.
13.2 Example
We’ll be using the data maths.csv
, which is included in the data sets zip file:
This is a data set of maths scores from the PISA 2015 maths tests, with data obtained from OECD and the World Bank1. In addition to country
and continent
, there are the following columns.
score
: the mean mathematics score in the 2015 PISA test;gdp
: the gross domestic product per capita (GDP divided by the estimated population size), measured in US\(\$\);gini
: the Gini coefficient (as a percentage). This is an estimate of income inequality, with larger values indicating more income inequality;homework
: an estimate of the average number of hours per week spent on homework by 15 year-olds, from a survey in 2012;start.age
: the age (in years) in which children start school.
13.3 Ordering the rows by a variable with the arrange()
command
Suppose we want to see which countries got the highest score: we want to arrange the rows in the data frame maths
in order according to the values in the column score
. To do this we use the command
## # A tibble: 70 × 7
## country continent score gdp gini homework start.age
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Dominican Republic North America 328 6722 44.9 NA 6
## 2 Algeria Africa 360 3844 27.6 NA 6
## 3 Tunisia Africa 367 3689 35.8 3.5 6
## 4 Macedonia, FYR Europe 371 5237 35.6 NA 6
## 5 Brazil South America 377 8650 51.3 3.3 6
## 6 Jordan Asia 380 4088 33.7 4.2 6
## 7 Indonesia Asia 386 3570 39.5 4.9 7
## 8 Peru South America 387 6046 44.3 5.5 6
## 9 Colombia South America 390 5806 51.1 5.3 6
## 10 Lebanon Asia 396 7914 31.8 3.3 6
## # ℹ 60 more rows
This has arranged the rows in ascending order of score
. To see them in descending order, we include the desc()
command:
## # A tibble: 70 × 7
## country continent score gdp gini homework start.age
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Singapore Asia 564 52961 NA 9.4 6
## 2 Hong Kong SAR, China Asia 548 43681 NA 6 6
## 3 Macao SAR, China Asia 544 73187 NA 5.9 6
## 4 Japan Asia 532 38894 32.1 3.8 6
## 5 B-S-J-G (China) Asia 531 8123 42.2 13.8 6
## 6 Korea, Rep. Asia 524 27539 31.6 2.9 6
## 7 Switzerland Europe 521 78813 32.5 4 7
## 8 Estonia Europe 520 17575 34.6 6.9 7
## 9 Canada North America 516 42158 34 5.5 6
## 10 Netherlands Europe 512 45295 28.6 5.8 6
## # ℹ 60 more rows
13.4 Selecting rows with the filter()
command
If we want to view a subset of the rows, we can use the filter()
command. For example, if we want the rows in the data frame maths
where start.age
takes the value 5 (i.e. children start school at age 5), we can do
## # A tibble: 6 × 7
## country continent score gdp gini homework start.age
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Australia Oceanea 494 49928 34.7 6 5
## 2 Ireland Europe 504 61606 31.9 7.3 5
## 3 Malta Europe 479 25058 NA NA 5
## 4 New Zealand Oceanea 495 39427 NA 4.2 5
## 5 Trinidad and Tobago North America 417 15377 40.3 NA 5
## 6 United Kingdom Europe 492 39899 34.1 4.9 5
Note the double equals sign ==
. This is used to test whether the left and right hand sides are equal: each country is included if its corresponding start.age
is equal to 5.
The UK is included above, but we’ll give an example of selecting it anyway:
## # A tibble: 1 × 7
## country continent score gdp gini homework start.age
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 United Kingdom Europe 492 39899 34.1 4.9 5
13.5 Viewing and extracting data from a column
For larger data frames (with many columns), we may wish to view a subset only. For example, to select the score
and country
columns only from the maths
data frame, we do
## # A tibble: 70 × 2
## score country
## <dbl> <chr>
## 1 413 Albania
## 2 360 Algeria
## 3 409 Argentina
## 4 494 Australia
## 5 497 Austria
## 6 531 B-S-J-G (China)
## 7 507 Belgium
## 8 377 Brazil
## 9 441 Bulgaria
## 10 516 Canada
## # ℹ 60 more rows
If we want to extract the values from a column, we can use the syntax dataframe-name$column-name
. For example, to extract the column score
from the data frame maths
, we do
## [1] 413 360 409 494 497 531 507 377 441 516 423 390 400 464 437 492 511 328 520
## [20] 511 493 404 506 454 548 477 488 386 504 470 490 532 380 460 524 482 396 478
## [39] 486 544 371 446 479 408 420 418 512 495 502 387 504 492 402 444 494 564 475
## [58] 510 486 494 521 415 417 367 420 427 492 470 418 495
We could then, for example, calculate the average (mean) of all the scores:
## [1] 460.9714
13.6 Creating new columns in a data frame with the mutate()
command
About half the countries have a GDP per capita greater than $17000. If we try the command
## [1] FALSE FALSE FALSE TRUE TRUE FALSE TRUE FALSE FALSE TRUE FALSE FALSE
## [13] FALSE FALSE TRUE TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE TRUE
## [25] TRUE FALSE TRUE FALSE TRUE TRUE TRUE TRUE FALSE FALSE TRUE FALSE
## [37] FALSE FALSE TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE
## [49] TRUE FALSE FALSE TRUE TRUE FALSE FALSE TRUE FALSE TRUE TRUE TRUE
## [61] TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE
this creates a new vector, in which the \(i\)th element will be TRUE
if the gdp
value for country \(i\) is greater than 17000, and FALSE
otherwise. We will add this vector to the data frame, under the column name wealthiest
. The command to create the new column is
but this doesn’t store the result. To put the new column in the maths
data frame, we do
You may now have too many columns to see in your console, so to check this has worked, we will do
## # A tibble: 70 × 3
## country gdp wealthiest
## <chr> <dbl> <lgl>
## 1 Albania 4147 FALSE
## 2 Algeria 3844 FALSE
## 3 Argentina 12449 FALSE
## 4 Australia 49928 TRUE
## 5 Austria 44177 TRUE
## 6 B-S-J-G (China) 8123 FALSE
## 7 Belgium 41096 TRUE
## 8 Brazil 8650 FALSE
## 9 Bulgaria 7351 FALSE
## 10 Canada 42158 TRUE
## # ℹ 60 more rows
13.7 Computing summaries per group
Suppose we want to know the mean score within particular groups, for example, continents. We can do this by chaining together the group_by()
and summarise()
commands.
## # A tibble: 6 × 2
## continent meanscore
## <chr> <dbl>
## 1 Africa 364.
## 2 Asia 471.
## 3 Europe 476.
## 4 North America 423.
## 5 Oceanea 494.
## 6 South America 401.
We read the command as, “Start with the maths
data frame, organise into groups based on the continent
column, then create a new variable called meanscore
, which is the mean of the score
variable within each group.”
We can group by multiple variables:
## `summarise()` has grouped output by 'continent'. You can override
## using the `.groups` argument.
## # A tibble: 9 × 3
## # Groups: continent [6]
## continent wealthiest meanscore
## <chr> <lgl> <dbl>
## 1 Africa FALSE 364.
## 2 Asia FALSE 436.
## 3 Asia TRUE 501.
## 4 Europe FALSE 448.
## 5 Europe TRUE 495.
## 6 North America FALSE 388.
## 7 North America TRUE 493
## 8 Oceanea TRUE 494.
## 9 South America FALSE 401.
and as the output of the command is another data frame, we can usepivot_wider()
to convert the data frame into ‘wide’ format, making the results a little easier to read:
maths %>%
group_by(continent, wealthiest) %>%
summarise(meanscore = mean(score)) %>%
pivot_wider(names_from = wealthiest, values_from = meanscore)
## `summarise()` has grouped output by 'continent'. You can override
## using the `.groups` argument.
## # A tibble: 6 × 3
## # Groups: continent [6]
## continent `FALSE` `TRUE`
## <chr> <dbl> <dbl>
## 1 Africa 364. NA
## 2 Asia 436. 501.
## 3 Europe 448. 495.
## 4 North America 388. 493
## 5 Oceanea NA 494.
## 6 South America 401. NA
13.8 Counting observations within groups
Suppose we want to know how many observations we have in each continent. We can do
## # A tibble: 6 × 2
## continent n
## <chr> <int>
## 1 Africa 2
## 2 Asia 15
## 3 Europe 39
## 4 North America 6
## 5 Oceanea 2
## 6 South America 6
This can be extended to groups defined by combinations of variables, e.g
## # A tibble: 10 × 3
## continent start.age n
## <chr> <dbl> <int>
## 1 Africa 6 2
## 2 Asia 6 14
## 3 Asia 7 1
## 4 Europe 5 3
## 5 Europe 6 23
## 6 Europe 7 13
## 7 North America 5 1
## 8 North America 6 5
## 9 Oceanea 5 2
## 10 South America 6 6
Note that this doesn’t explicitly report combinations of start.age
and continent
that have 0 observations. Again, pivot_wider()
can be used to convert the resulting data frame into ‘wide’ format, in effect giving us a contingency table:
## # A tibble: 6 × 4
## continent `6` `7` `5`
## <chr> <int> <int> <int>
## 1 Africa 2 NA NA
## 2 Asia 14 1 NA
## 3 Europe 23 13 3
## 4 North America 5 NA 1
## 5 Oceanea NA NA 2
## 6 South America 6 NA NA
(If we were going to model the data, and wanted to include some sort of interaction effect between continent
and start.age
, the above table would reveal a problem: that we don’t have sufficient data.)
Exercise 13.1 For this exercise, you will need the combined data from Q1 in Exercise ??.
- Find the 10 areas (the
Area
column) with the highest remain percentage votes. - Find the mean percentage of remain votes within each
Region
. - Create a new column called
majorityRemain
, which indicates (usingTRUE
orFALSE
) whether or not the percentage of remain votes in eachArea
was above 50%. Then, find the mean percentage of adults with level 4 qualifications (or higher), separately for areas with a majority in favour of remaining, and with a majority in favour of leaving.
13.9 Further reading
From RStudio, if you are online, you can access various “Cheatsheets”, including one on
dplyr
: go to Help > Cheatsheets > Data Transformation with dplyr.See also Chapter 5 of R for Data Science.
References
Data sources: test scores and homework data from [http://pisadataexplorer.oecd.org/ide/idepisa/], accessed 16/11/2017. GDP, Gini coefficient and school start age data from [https://data.worldbank.org/indicator], accessed 16/11/2027.↩︎