Section 13 Data handling with dplyr

Once we have our data in a single data frame, we might wish to obtain some simple summary statistics, perhaps within subgroups of the data, and/or understand the structure of the data better (e.g. counting how many observations we have for particular combinations of factors.) This might involve creating new columns, with values dependent on the other columns.

The package dplyr (Wickham, François, et al. 2020) has various functions for working with data frames, and we will illustrate some here.

(Artwork by @allison_horst)

13.1 Chaining commands together with the pipe operator %>%

We’ll first introduce the ‘pipe operator’ %>%. (It’s a matter of personal preference whether you use it in your code or not, but you may find it makes your code easier to read.)

The pipe operator %>% takes whatever the output is from the left hand side, and uses it as the first argument in the function on the next line. In general,

myfunction(x, y)

can be written using the pipe operator as

x %>%
  myfunction(y)

so, for example, from the previous chapter

inner_join(medals, 
           population,
           by = c("code" = "Country Code"))

would be written as

medals %>%
  inner_join(population, 
             by = c("code" = "Country Code"))

Code with pipes can be easier to read, particularly when multiple functions are chained together. For example, this code

x0 %>%
  func1(x1) %>%
  func2(x2) %>%
  func3(x3) %>%
  func4(x4)

would be read as

  • start with x0,
  • then apply function func1() to it with additional argument x1,
  • then apply the function func2() to the result, using additional argument x2,
  • and so on

Alternatives without pipes would be

func4(func3(func2(func1(x0, x1), x2), x3), x4)

which would be harder to read, or

y1 <- func1(x0, x1)
y2 <- func2(y1, x2)
y3 <- func3(y2, x3)
func4(y3, x4)

which may be undesirable if we don’t want to store all the intermediate variables y1, y2, y3.

13.2 Example

We’ll be using the data maths.csv, which is included in the data sets zip file:

maths <- read_csv("data/maths.csv")

This is a data set of maths scores from the PISA 2015 maths tests, with data obtained from OECD and the World Bank1. In addition to country and continent, there are the following columns.

  1. score: the mean mathematics score in the 2015 PISA test;
  2. gdp: the gross domestic product per capita (GDP divided by the estimated population size), measured in US\(\$\);
  3. gini: the Gini coefficient (as a percentage). This is an estimate of income inequality, with larger values indicating more income inequality;
  4. homework: an estimate of the average number of hours per week spent on homework by 15 year-olds, from a survey in 2012;
  5. start.age: the age (in years) in which children start school.

13.3 Ordering the rows by a variable with the arrange() command

Suppose we want to see which countries got the highest score: we want to arrange the rows in the data frame maths in order according to the values in the column score. To do this we use the command

maths %>%
  arrange(score)
## # A tibble: 70 × 7
##    country            continent     score   gdp  gini homework start.age
##    <chr>              <chr>         <dbl> <dbl> <dbl>    <dbl>     <dbl>
##  1 Dominican Republic North America   328  6722  44.9     NA           6
##  2 Algeria            Africa          360  3844  27.6     NA           6
##  3 Tunisia            Africa          367  3689  35.8      3.5         6
##  4 Macedonia, FYR     Europe          371  5237  35.6     NA           6
##  5 Brazil             South America   377  8650  51.3      3.3         6
##  6 Jordan             Asia            380  4088  33.7      4.2         6
##  7 Indonesia          Asia            386  3570  39.5      4.9         7
##  8 Peru               South America   387  6046  44.3      5.5         6
##  9 Colombia           South America   390  5806  51.1      5.3         6
## 10 Lebanon            Asia            396  7914  31.8      3.3         6
## # … with 60 more rows

This has arranged the rows in ascending order of score. To see them in descending order, we include the desc() command:

maths %>%
  arrange(desc(score))
## # A tibble: 70 × 7
##    country              continent     score   gdp  gini homework start.age
##    <chr>                <chr>         <dbl> <dbl> <dbl>    <dbl>     <dbl>
##  1 Singapore            Asia            564 52961  NA        9.4         6
##  2 Hong Kong SAR, China Asia            548 43681  NA        6           6
##  3 Macao SAR, China     Asia            544 73187  NA        5.9         6
##  4 Japan                Asia            532 38894  32.1      3.8         6
##  5 B-S-J-G (China)      Asia            531  8123  42.2     13.8         6
##  6 Korea, Rep.          Asia            524 27539  31.6      2.9         6
##  7 Switzerland          Europe          521 78813  32.5      4           7
##  8 Estonia              Europe          520 17575  34.6      6.9         7
##  9 Canada               North America   516 42158  34        5.5         6
## 10 Netherlands          Europe          512 45295  28.6      5.8         6
## # … with 60 more rows

13.4 Selecting rows with the filter() command

If we want to view a subset of the rows, we can use the filter() command. For example, if we want the rows in the data frame maths where start.age takes the value 5 (i.e. children start school at age 5), we can do

maths %>%
  filter(start.age == 5)
## # A tibble: 6 × 7
##   country             continent     score   gdp  gini homework start.age
##   <chr>               <chr>         <dbl> <dbl> <dbl>    <dbl>     <dbl>
## 1 Australia           Oceanea         494 49928  34.7      6           5
## 2 Ireland             Europe          504 61606  31.9      7.3         5
## 3 Malta               Europe          479 25058  NA       NA           5
## 4 New Zealand         Oceanea         495 39427  NA        4.2         5
## 5 Trinidad and Tobago North America   417 15377  40.3     NA           5
## 6 United Kingdom      Europe          492 39899  34.1      4.9         5

Note the double equals sign ==. This is used to test whether the left and right hand sides are equal: each country is included if its corresponding start.age is equal to 5. The UK is included above, but we’ll give an example of selecting it anyway:

maths %>%
  filter(country == "United Kingdom")
## # A tibble: 1 × 7
##   country        continent score   gdp  gini homework start.age
##   <chr>          <chr>     <dbl> <dbl> <dbl>    <dbl>     <dbl>
## 1 United Kingdom Europe      492 39899  34.1      4.9         5

13.5 Viewing and extracting data from a column

For larger data frames (with many columns), we may wish to view a subset only. For example, to select the score and country columns only from the maths data frame, we do

maths %>%
  select(score, country)
## # A tibble: 70 × 2
##    score country        
##    <dbl> <chr>          
##  1   413 Albania        
##  2   360 Algeria        
##  3   409 Argentina      
##  4   494 Australia      
##  5   497 Austria        
##  6   531 B-S-J-G (China)
##  7   507 Belgium        
##  8   377 Brazil         
##  9   441 Bulgaria       
## 10   516 Canada         
## # … with 60 more rows

If we want to extract the values from a column, we can use the syntax dataframe-name$column-name. For example, to extract the column score from the data frame maths, we do

maths$score
##  [1] 413 360 409 494 497 531 507 377 441 516 423 390 400 464 437 492 511 328 520
## [20] 511 493 404 506 454 548 477 488 386 504 470 490 532 380 460 524 482 396 478
## [39] 486 544 371 446 479 408 420 418 512 495 502 387 504 492 402 444 494 564 475
## [58] 510 486 494 521 415 417 367 420 427 492 470 418 495

We could then, for example, calculate the average (mean) of all the scores:

mean(maths$score)
## [1] 460.9714

13.6 Creating new columns in a data frame with the mutate() command

About half the countries have a GDP per capita greater than $17000. If we try the command

maths$gdp > 17000
##  [1] FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE
## [13] FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
## [25]  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE
## [37] FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE
## [49]  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE
## [61]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE

this creates a new vector, in which the \(i\)th element will be TRUE if the gdp value for country \(i\) is greater than 17000, and FALSE otherwise. We will add this vector to the data frame, under the column name wealthiest. The command to create the new column is

mutate(wealthiest = maths$gdp > 17000)

but this doesn’t store the result. To put the new column in the maths data frame, we do

maths <- maths %>%
  mutate(wealthiest = maths$gdp > 17000)

You may now have too many columns to see in your console, so to check this has worked, we will do

maths %>%
  select(country, gdp, wealthiest)
## # A tibble: 70 × 3
##    country           gdp wealthiest
##    <chr>           <dbl> <lgl>     
##  1 Albania          4147 FALSE     
##  2 Algeria          3844 FALSE     
##  3 Argentina       12449 FALSE     
##  4 Australia       49928 TRUE      
##  5 Austria         44177 TRUE      
##  6 B-S-J-G (China)  8123 FALSE     
##  7 Belgium         41096 TRUE      
##  8 Brazil           8650 FALSE     
##  9 Bulgaria         7351 FALSE     
## 10 Canada          42158 TRUE      
## # … with 60 more rows

13.7 Computing summaries per group

Suppose we want to know the mean score within particular groups, for example, continents. We can do this by chaining together the group_by() and summarise() commands.

maths %>%
  group_by(continent) %>%
  summarise(meanscore = mean(score))
## # A tibble: 6 × 2
##   continent     meanscore
##   <chr>             <dbl>
## 1 Africa             364.
## 2 Asia               471.
## 3 Europe             476.
## 4 North America      423.
## 5 Oceanea            494.
## 6 South America      401.

We read the command as, “Start with the maths data frame, organise into groups based on the continent column, then create a new variable called meanscore, which is the mean of the score variable within each group.”

We can group by multiple variables:

maths %>%
  group_by(continent, wealthiest) %>%
  summarise(meanscore = mean(score)) 
## `summarise()` has grouped output by
## 'continent'. You can override using
## the `.groups` argument.
## # A tibble: 9 × 3
## # Groups:   continent [6]
##   continent     wealthiest meanscore
##   <chr>         <lgl>          <dbl>
## 1 Africa        FALSE           364.
## 2 Asia          FALSE           436.
## 3 Asia          TRUE            501.
## 4 Europe        FALSE           448.
## 5 Europe        TRUE            495.
## 6 North America FALSE           388.
## 7 North America TRUE            493 
## 8 Oceanea       TRUE            494.
## 9 South America FALSE           401.

and as the output of the command is another data frame, we can usepivot_wider() to convert the data frame into ‘wide’ format, making the results a little easier to read:

maths %>%
  group_by(continent, wealthiest) %>%
  summarise(meanscore = mean(score)) %>%
  pivot_wider(names_from = wealthiest, values_from = meanscore)
## `summarise()` has grouped output by
## 'continent'. You can override using
## the `.groups` argument.
## # A tibble: 6 × 3
## # Groups:   continent [6]
##   continent     `FALSE` `TRUE`
##   <chr>           <dbl>  <dbl>
## 1 Africa           364.    NA 
## 2 Asia             436.   501.
## 3 Europe           448.   495.
## 4 North America    388.   493 
## 5 Oceanea           NA    494.
## 6 South America    401.    NA

13.8 Counting observations within groups

Suppose we want to know how many observations we have in each continent. We can do

maths %>%
  count(continent)
## # A tibble: 6 × 2
##   continent         n
##   <chr>         <int>
## 1 Africa            2
## 2 Asia             15
## 3 Europe           39
## 4 North America     6
## 5 Oceanea           2
## 6 South America     6

This can be extended to groups defined by combinations of variables, e.g

maths %>%
  count(continent, start.age)
## # A tibble: 10 × 3
##    continent     start.age     n
##    <chr>             <dbl> <int>
##  1 Africa                6     2
##  2 Asia                  6    14
##  3 Asia                  7     1
##  4 Europe                5     3
##  5 Europe                6    23
##  6 Europe                7    13
##  7 North America         5     1
##  8 North America         6     5
##  9 Oceanea               5     2
## 10 South America         6     6

Note that this doesn’t explicitly report combinations of start.age and continent that have 0 observations. Again, pivot_wider() can be used to convert the resulting data frame into ‘wide’ format, in effect giving us a contingency table:

maths %>%
  count(continent, start.age) %>%
  pivot_wider(names_from = start.age, values_from = n)
## # A tibble: 6 × 4
##   continent       `6`   `7`   `5`
##   <chr>         <int> <int> <int>
## 1 Africa            2    NA    NA
## 2 Asia             14     1    NA
## 3 Europe           23    13     3
## 4 North America     5    NA     1
## 5 Oceanea          NA    NA     2
## 6 South America     6    NA    NA

(If we were going to model the data, and wanted to include some sort of interaction effect between continent and start.age, the above table would reveal a problem: that we don’t have sufficient data.)

Exercise 13.1 For this exercise, you will need the combined data from Q1 in Exercise ??.

  1. Find the 10 areas (the Area column) with the highest remain percentage votes.
  2. Find the mean percentage of remain votes within each Region.
  3. Create a new column called majorityRemain, which indicates (using TRUE or FALSE) whether or not the percentage of remain votes in each Area was above 50%. Then, find the mean percentage of adults with level 4 qualifications (or higher), separately for areas with a majority in favour of remaining, and with a majority in favour of leaving.

13.9 Further reading

  • From RStudio, if you are online, you can access various “Cheatsheets”, including one on dplyr: go to Help > Cheatsheets > Data Transformation with dplyr.

  • See also Chapter 5 of R for Data Science.

References

Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2020. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.

  1. Data sources: test scores and homework data from [http://pisadataexplorer.oecd.org/ide/idepisa/], accessed 16/11/2017. GDP, Gini coefficient and school start age data from [https://data.worldbank.org/indicator], accessed 16/11/2027.↩︎