Section 12 Importing, combining, and reshaping data

In this section we look at how to

import data such as Excel spreadsheets and .csv files;
combine different data sets into a single data frame;
reshape data frames between ‘long’ and ‘wide’ formats (this is useful for plots with ggplot2)

When working with someone else’s data, the data almost certainly won’t be in the format that you need; you’ll need to do some editing/re-arranging of the data. If possible, do not make any edits to the original data file: do any editing you need inside R. That way, you will have a record of how you got from the original data file to the data you actually worked with in your analysis.

As an example, we’ll import the two files RIOolympics.csv and population.xlsx, and make a single data frame with medal results and population size for each country.

We suppose that in your current working directory, you have a folder called data, and the two files we want to import are in that folder. First load the tidyverse library

library(tidyverse)

12.1 Importing a .csv file

Use the command read_csv(), with the path to your target file in quotes (if you can, use the path relative to your project directory). This command is from the readr package (Wickham, Hester, and Francois 2018). We’ll import the RIOolympics.csv file and store it as a data frame called medals.

medals <- read_csv("data/RIOolympics.csv")

## Rows: 87 Columns: 6
## ── Column specification ───────────────────────────────────────────────
## Delimiter: ","
## chr (2): code, country
## dbl (4): gold, silver, bronze, total
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

To view the first few rows:

head(medals)

## # A tibble: 6 × 6
##   code  country             gold silver bronze total
##   <chr> <chr>              <dbl>  <dbl>  <dbl> <dbl>
## 1 USA   UNITED STATES         46     37     38   121
## 2 GBR   GREAT BRITAIN         27     23     17    67
## 3 CHN   CHINA                 26     18     26    70
## 4 RUS   RUSSIAN FEDERATION    19     18     19    56
## 5 DEU   GERMANY               17     10     15    42
## 6 JPN   JAPAN                 12      8     21    41

12.1.1 Additional arguments

Use ?read_csv to see the full list of arguments. Two useful ones are

skip: use this to skip lines at the start of the .csv file, for example if there’s some text that you need to ignore;
col_names: you will normally use the default value TRUE, but you can specify a character vector of column names if you want/need to.

12.2 Importing an Excel .xlsx file

We can import Excel files with the command readxl::read_excel(). The readxl package (Wickham and Bryan 2019) is installed as part of the tidyverse, but not loaded by the command library(tidyverse).

We’ll import the population.xlsx file and store it as a data frame called population:

population <- readxl::read_excel("data/population.xlsx")
head(population)

## # A tibble: 6 × 3
##   `Country Name` `Country Code`    `2015`
##   <chr>          <chr>              <dbl>
## 1 Aruba          ABW               103889
## 2 Andorra        AND                70473
## 3 Afghanistan    AFG             32526562
## 4 Angola         AGO             25021974
## 5 Albania        ALB              2889167
## 6 Arab World     ARB            392022276

12.2.1 Additional arguments

Again, there are lots of extra arguments, but two particularly useful ones are

sheet, for specifying which sheet to import, if the Excel file contains multiple sheets;
range, for specifying which cells to import, e.g. "B3:G10", if we want to ignore the first column and first two rows.

12.3 Importing a plain text file

We can import plain text files (typically files with a .txt extension) with the command read_table(). For example, in the MAS61004Data.zip file, there is a file flintsdata.txt. Although the file is a plain text file, the data appear formatted as a two-column table, with headers breadth and length. We import the data (and create a new data frame called flints) as follows.

flints <- read_table("data/flintsdata.txt")

## 
## ── Column specification ───────────────────────────────────────────────
## cols(
##   breadth = col_double(),
##   length = col_double()
## )

head(flints)

## # A tibble: 6 × 2
##   breadth length
##     <dbl>  <dbl>
## 1    1.97   4.37
## 2    2.44   5.43
## 3    2.22   5   
## 4    2.1    4.38
## 5    2.43   5.45
## 6    2.5    5.8

As with the other functions for importing data, there are useful arguments for skipping lines, defining column names and so on. See ?read_table for details.

12.4 Importing online data

If you are online, R can read in data directly from the web: you just give the full web address as the file name, e.g.

maths <- read_csv("https://oakleyj.github.io/exampledata/maths.csv")

This can be convenient, though it may slow down your code if the file is large, and there is a risk that your code won’t run in the future: you can’t be sure that the file will always be there!

Nevertheless, downloading a data file via R (or another language) is preferable to clicking buttons/links on a website, as you will have a record of where you got the data from. We recommend that you download your data with R, but then save a copy on your computer, and that your analysis uses your copied version.

We can export data from R to a external file, e.g. csv format with the write_csv() command, e.g.

write_csv(maths, path = "data/myCopyOfMaths.csv")

We can also use the download.file() command, which downloads and saves the file, but doesn’t keep the data within R:

download.file(url = "https://oakleyj.github.io/exampledata/maths.csv",
              destfile = "data/myCopyOfMaths.csv")

12.5 Working with column names

All three column names in population are awkward to work with! Two of them have spaces in, and third is a number. If we wanted to access individual columns, none of the following commands would work!

population$Country Name
population$Country Code
population$2015

Use backticks around any awkward column names:

population$`Country Name`
population$`Country Code`
population$`2015`

12.6 Combining data frames

When we combine two data frames, we will either be

adding variables to a data set; we add columns to a data frame;
adding observations of the same variables to a data set; we add rows to a data frame.

12.6.1 Adding columns

Suppose we want to add the population sizes to the data in the medal table: we want to combine the olympics and population data frames. To do this, we need one column in each data frame where the same values are used to identify the countries.

We can use the columns

population$`Country Code`
medals$code

(both columns use “ISO 3166-1 alpha-3” for identifying countries).

We will use the function inner_join() which will look for countries present in both data frames (there are variations: see ?inner_join for details), storing the result in a new data frame called olympics:

olympics <- inner_join(medals, population,
                       by = c("code" = "Country Code"))

head(olympics)

## # A tibble: 6 × 8
##   code  country             gold silver bronze total `Country Name`       `2015`
##   <chr> <chr>              <dbl>  <dbl>  <dbl> <dbl> <chr>                 <dbl>
## 1 USA   UNITED STATES         46     37     38   121 United States        3.21e8
## 2 GBR   GREAT BRITAIN         27     23     17    67 United Kingdom       6.51e7
## 3 CHN   CHINA                 26     18     26    70 China                1.37e9
## 4 RUS   RUSSIAN FEDERATION    19     18     19    56 Russian Federation   1.44e8
## 5 DEU   GERMANY               17     10     15    42 Germany              8.14e7
## 6 JPN   JAPAN                 12      8     21    41 Japan                1.27e8

We used the argument by = c(“code” = “Country Code”) to specify which columns in the two data frames to use for matching the observations. If both columns already had the same name, e.g. code, we could just use the argument by = “code”.

12.6.2 Adding rows

The rbind() function can be used to combine data frames with the same column headings. Here’s a simple example:

df1 <- data.frame(x = c(10, 11), y = c(100, 101))
df2 <- data.frame(x = c(12, 13), y = c(102, 103))
rbind(df1, df2)

##    x   y
## 1 10 100
## 2 11 101
## 3 12 102
## 4 13 103

12.6.3 Data frames from lists

Continuing the example above, if we have a list of data frames, again with the same column headings, these can be assembled into a data frame as follows.

myList <- list(df1, df1)
do.call(rbind.data.frame, myList)

##    x   y
## 1 10 100
## 2 11 101
## 3 10 100
## 4 11 101

12.7 Renaming columns

The column name 2015 isn’t very descriptive, so we may wish to change the name, for example to population2015. We can do this with the rename() command.

olympics <- rename(olympics, population2015 = `2015`)
head(olympics)

## # A tibble: 6 × 8
##   code  country           gold silver bronze total `Country Name` population2015
##   <chr> <chr>            <dbl>  <dbl>  <dbl> <dbl> <chr>                   <dbl>
## 1 USA   UNITED STATES       46     37     38   121 United States       321418820
## 2 GBR   GREAT BRITAIN       27     23     17    67 United Kingdom       65138232
## 3 CHN   CHINA               26     18     26    70 China              1371220000
## 4 RUS   RUSSIAN FEDERAT…    19     18     19    56 Russian Feder…      144096812
## 5 DEU   GERMANY             17     10     15    42 Germany              81413145
## 6 JPN   JAPAN               12      8     21    41 Japan               126958472

(Note that in assigning the result to a data frame called olympics, we’ve overwritten our earlier version of the olympics data frame. In general, be careful when you do this: the command will work the first time we use it, but not the second!)

12.8 Importing data from other statistical packages

You may come across data sets stored in a format specific to another statistical package (SAS, SPSS, Stata). The haven package (Wickham and Miller 2020) can be used to import these. See https://haven.tidyverse.org/ for more details.

12.9 Reshaping data frames

Sometimes a data set may be in the wrong ‘shape’ for the analysis or plot that we want to do. As an example, we’ll use an extended data set on populations, with observations in multiple years:

countries <- read_csv("data/populationMultiple.csv")

## Rows: 264 Columns: 4
## ── Column specification ───────────────────────────────────────────────
## Delimiter: ","
## chr (1): Country Name
## dbl (3): 2013, 2014, 2015
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(countries)

## # A tibble: 6 × 4
##   `Country Name`    `2013`    `2014`    `2015`
##   <chr>              <dbl>     <dbl>     <dbl>
## 1 Aruba             102921    103441    103889
## 2 Andorra            75902     72786     70473
## 3 Afghanistan     30682500  31627506  32526562
## 4 Angola          23448202  24227524  25021974
## 5 Albania          2896652   2893654   2889167
## 6 Arab World     376504253 384222592 392022276

For our analysis, we may need all population values in a single column, with another column indicating the year. We can use the function pivot_longer() from the tidyr package (Wickham and Henry 2020) to change the data frame from ‘wide’ to ‘long’ format:

countriesLong <- pivot_longer(countries,
                              cols = -`Country Name`,
                              names_to = "year",
                              values_to = "population")
head(countriesLong)

## # A tibble: 6 × 3
##   `Country Name` year  population
##   <chr>          <chr>      <dbl>
## 1 Aruba          2013      102921
## 2 Aruba          2014      103441
## 3 Aruba          2015      103889
## 4 Andorra        2013       75902
## 5 Andorra        2014       72786
## 6 Andorra        2015       70473

If we think of population as our ‘dependent variable’, the cols argument specifies which columns are used to create this dependent variable. Here, putting the - sign before Country Name means “use all columns except this one”.
If every column represented a ‘dependent variable’, we could use the argument cols = everything().
names_to and values_to refer to the column names we want to use in the new data frame, not column names in the current data frame we are reshaping.

It’s worth looking at the help file for pivot_longer() to see more examples.

We can go back from ‘long’ to ‘wide’ with the function pivot_wider()

countriesWide <- pivot_wider(countriesLong, 
                             names_from = year,
                             values_from = population)
head(countriesWide)

## # A tibble: 6 × 4
##   `Country Name`    `2013`    `2014`    `2015`
##   <chr>              <dbl>     <dbl>     <dbl>
## 1 Aruba             102921    103441    103889
## 2 Andorra            75902     72786     70473
## 3 Afghanistan     30682500  31627506  32526562
## 4 Angola          23448202  24227524  25021974
## 5 Albania          2896652   2893654   2889167
## 6 Arab World     376504253 384222592 392022276

12.10 Exercises

Exercise 12.1 Merging and reshaping data.

The files BrexitVotes.csv and education.csv contain, respectively, data from the 2016 UK referendum on leaving the European Union, and data on proportions of adults with educational qualifications at ‘level 4’ and above (qualifications above A-level). Data are given for local authorities within the UK.
- Import these two files into R, and combine them into a single data frame called Brexit, matching the observations by area codes.
The file cancer.xlsx contains survival times for patients with cancers in different organs. Each number in each column corresponds to a different patient; the row number has no meaning here. Note that all columns are ‘dependent variables’ here.
- Import the data into R.
- Reshape the data into ‘long format’: one column indicating the organ, and one column indicating the survival time.

For any online data set you work with, make an R script/R Markdown document with a name such as makeData.R that imports the data file and does all the cleaning/editing. In this file, include the website details and url, and the date you accessed the data: you’ll need these for referencing.

12.11 Data sources

Brexit results obtained from the Electoral Commission (https://www.electoralcommission.org.uk/find-information-by-subject/elections-and-referendums/past-elections-and-referendums/eu-referendum), and Level 4 qualification data obtained from the Office for National Statistics (https://www.nomisweb.co.uk/). Accessed 12th December 2017.
I downloaded the cancer data a long time ago (2009?) from the “Data and Story Library” at http://lib.stat.cmu.edu/DASL/DataArchive.html. This link no longer works; I think the Data and Story Library has migrated to https://dasl.datadescription.com/, but I couldn’t find this data set there.
Population data obtained from The World Bank. Accessed 6th October 2015.
Medal table obtained from https://www.rio2016.com/en/medal-count-country [Accessed on 6th October 2016, but this link is no longer active.]

12.12 Further reading

The concepts of “long” and “wide” data formats are related to the idea of tidy data, which is an important theme in the Tidyverse. For more discussion, see this vignette, and Chapter 12 of R for Data Science.

References

Wickham, Hadley, and Jennifer Bryan. 2019. Readxl: Read Excel Files. https://CRAN.R-project.org/package=readxl.

Wickham, Hadley, and Lionel Henry. 2020. Tidyr: Tidy Messy Data. https://CRAN.R-project.org/package=tidyr.

Wickham, Hadley, Jim Hester, and Romain Francois. 2018. Readr: Read Rectangular Text Data. https://CRAN.R-project.org/package=readr.

Wickham, Hadley, and Evan Miller. 2020. Haven: Import and Export ’SPSS’, ’Stata’ and ’SAS’ Files. https://CRAN.R-project.org/package=haven.