Section 12 Importing, combining, and reshaping data
In this section we look at how to
- import data such as Excel spreadsheets and .csv files;
- combine different data sets into a single data frame;
- reshape data frames between ‘long’ and ‘wide’ formats (this is useful for plots with
ggplot2
)
When working with someone else’s data, the data almost certainly won’t be in the format that you need; you’ll need to do some editing/re-arranging of the data. If possible, do not make any edits to the original data file: do any editing you need inside R. That way, you will have a record of how you got from the original data file to the data you actually worked with in your analysis.
As an example, we’ll import the two files RIOolympics.csv
and population.xlsx
, and make a single data frame with medal results and population size for each country.
We suppose that in your current working directory, you have a folder called data
, and the two files we want to import are in that folder. First load the tidyverse
library
12.1 Importing a .csv file
Use the command read_csv()
, with the path to your target file in quotes (if you can, use the path relative to your project directory). This command is from the readr
package (Wickham, Hester, and Francois 2018). We’ll import the RIOolympics.csv
file and store it as a data frame called medals
.
## Rows: 87 Columns: 6
## ── Column specification ───────────────────────────────────────────────
## Delimiter: ","
## chr (2): code, country
## dbl (4): gold, silver, bronze, total
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
To view the first few rows:
## # A tibble: 6 × 6
## code country gold silver bronze total
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 USA UNITED STATES 46 37 38 121
## 2 GBR GREAT BRITAIN 27 23 17 67
## 3 CHN CHINA 26 18 26 70
## 4 RUS RUSSIAN FEDERATION 19 18 19 56
## 5 DEU GERMANY 17 10 15 42
## 6 JPN JAPAN 12 8 21 41
12.1.1 Additional arguments
Use ?read_csv
to see the full list of arguments. Two useful ones are
skip
: use this to skip lines at the start of the.csv
file, for example if there’s some text that you need to ignore;col_names
: you will normally use the default valueTRUE
, but you can specify a character vector of column names if you want/need to.
12.2 Importing an Excel .xlsx file
We can import Excel files with the command readxl::read_excel()
. The readxl
package (Wickham and Bryan 2019) is installed as part of the tidyverse, but not loaded by the command library(tidyverse)
.
We’ll import the population.xlsx
file and store it as a data frame called population
:
## # A tibble: 6 × 3
## `Country Name` `Country Code` `2015`
## <chr> <chr> <dbl>
## 1 Aruba ABW 103889
## 2 Andorra AND 70473
## 3 Afghanistan AFG 32526562
## 4 Angola AGO 25021974
## 5 Albania ALB 2889167
## 6 Arab World ARB 392022276
12.2.1 Additional arguments
Again, there are lots of extra arguments, but two particularly useful ones are
sheet
, for specifying which sheet to import, if the Excel file contains multiple sheets;range
, for specifying which cells to import, e.g."B3:G10"
, if we want to ignore the first column and first two rows.
12.3 Importing a plain text file
We can import plain text files (typically files with a .txt
extension) with the command read_table()
. For example, in the MAS61004Data.zip file, there is a file flintsdata.txt
. Although the file is a plain text file, the data appear formatted as a two-column table, with headers breadth
and length
. We import the data (and create a new data frame called flints
) as follows.
##
## ── Column specification ───────────────────────────────────────────────
## cols(
## breadth = col_double(),
## length = col_double()
## )
## # A tibble: 6 × 2
## breadth length
## <dbl> <dbl>
## 1 1.97 4.37
## 2 2.44 5.43
## 3 2.22 5
## 4 2.1 4.38
## 5 2.43 5.45
## 6 2.5 5.8
As with the other functions for importing data, there are useful arguments for skipping lines, defining column names and so on. See ?read_table
for details.
12.4 Importing online data
If you are online, R can read in data directly from the web: you just give the full web address as the file name, e.g.
This can be convenient, though it may slow down your code if the file is large, and there is a risk that your code won’t run in the future: you can’t be sure that the file will always be there!
Nevertheless, downloading a data file via R (or another language) is preferable to clicking buttons/links on a website, as you will have a record of where you got the data from. We recommend that you download your data with R, but then save a copy on your computer, and that your analysis uses your copied version.
We can export data from R to a external file, e.g. csv format with the write_csv()
command, e.g.
We can also use the download.file()
command, which downloads and saves the file, but doesn’t keep the data within R:
12.5 Working with column names
All three column names in population
are awkward to work with! Two of them have spaces in, and third is a number. If we wanted to access individual columns, none of the following commands would work!
Use backticks around any awkward column names:
12.6 Combining data frames
When we combine two data frames, we will either be
- adding variables to a data set; we add columns to a data frame;
- adding observations of the same variables to a data set; we add rows to a data frame.
12.6.1 Adding columns
Suppose we want to add the population sizes to the data in the medal table: we want to combine the olympics
and population
data frames. To do this, we need one column in each data frame where the same values are used to identify the countries.
We can use the columns
(both columns use “ISO 3166-1 alpha-3” for identifying countries).
We will use the function inner_join()
which will look for countries present in both data frames (there are variations: see ?inner_join
for details), storing the result in a new data frame called olympics
:
## # A tibble: 6 × 8
## code country gold silver bronze total `Country Name` `2015`
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 USA UNITED STATES 46 37 38 121 United States 3.21e8
## 2 GBR GREAT BRITAIN 27 23 17 67 United Kingdom 6.51e7
## 3 CHN CHINA 26 18 26 70 China 1.37e9
## 4 RUS RUSSIAN FEDERATION 19 18 19 56 Russian Federation 1.44e8
## 5 DEU GERMANY 17 10 15 42 Germany 8.14e7
## 6 JPN JAPAN 12 8 21 41 Japan 1.27e8
We used the argument by = c(“code” = “Country Code”)
to
specify which columns in the two data frames to use for matching the
observations. If both columns already had the same name,
e.g. code
, we could just use the argument
by = “code”
.
12.6.2 Adding rows
The rbind()
function can be used to combine data frames with the same column headings. Here’s a simple example:
df1 <- data.frame(x = c(10, 11), y = c(100, 101))
df2 <- data.frame(x = c(12, 13), y = c(102, 103))
rbind(df1, df2)
## x y
## 1 10 100
## 2 11 101
## 3 12 102
## 4 13 103
12.7 Renaming columns
The column name 2015 isn’t very descriptive, so we may wish to change the name, for example to population2015
. We can do this with the rename()
command.
## # A tibble: 6 × 8
## code country gold silver bronze total `Country Name` population2015
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 USA UNITED STATES 46 37 38 121 United States 321418820
## 2 GBR GREAT BRITAIN 27 23 17 67 United Kingdom 65138232
## 3 CHN CHINA 26 18 26 70 China 1371220000
## 4 RUS RUSSIAN FEDERAT… 19 18 19 56 Russian Feder… 144096812
## 5 DEU GERMANY 17 10 15 42 Germany 81413145
## 6 JPN JAPAN 12 8 21 41 Japan 126958472
(Note that in assigning the result to a data frame called olympics
, we’ve overwritten our earlier version of the olympics
data frame. In general, be careful when you do this: the command will work the first time we use it, but not the second!)
12.8 Importing data from other statistical packages
You may come across data sets stored in a format specific to another statistical package (SAS, SPSS, Stata). The haven
package (Wickham and Miller 2020) can be used to import these. See https://haven.tidyverse.org/ for more details.
12.9 Reshaping data frames
Sometimes a data set may be in the wrong ‘shape’ for the analysis or plot that we want to do. As an example, we’ll use an extended data set on populations, with observations in multiple years:
## Rows: 264 Columns: 4
## ── Column specification ───────────────────────────────────────────────
## Delimiter: ","
## chr (1): Country Name
## dbl (3): 2013, 2014, 2015
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 6 × 4
## `Country Name` `2013` `2014` `2015`
## <chr> <dbl> <dbl> <dbl>
## 1 Aruba 102921 103441 103889
## 2 Andorra 75902 72786 70473
## 3 Afghanistan 30682500 31627506 32526562
## 4 Angola 23448202 24227524 25021974
## 5 Albania 2896652 2893654 2889167
## 6 Arab World 376504253 384222592 392022276
For our analysis, we may need all population values in a single column, with another column indicating the year. We can use the function pivot_longer()
from the tidyr
package (Wickham and Henry 2020) to change the data frame from ‘wide’ to ‘long’ format:
countriesLong <- pivot_longer(countries,
cols = -`Country Name`,
names_to = "year",
values_to = "population")
head(countriesLong)
## # A tibble: 6 × 3
## `Country Name` year population
## <chr> <chr> <dbl>
## 1 Aruba 2013 102921
## 2 Aruba 2014 103441
## 3 Aruba 2015 103889
## 4 Andorra 2013 75902
## 5 Andorra 2014 72786
## 6 Andorra 2015 70473
- If we think of
population
as our ‘dependent variable’, thecols
argument specifies which columns are used to create this dependent variable. Here, putting the-
sign beforeCountry Name
means “use all columns except this one”. - If every column represented a ‘dependent variable’, we could use the argument
cols = everything()
. names_to
andvalues_to
refer to the column names we want to use in the new data frame, not column names in the current data frame we are reshaping.
It’s worth looking at the help file for pivot_longer()
to see more examples.
We can go back from ‘long’ to ‘wide’ with the function pivot_wider()
countriesWide <- pivot_wider(countriesLong,
names_from = year,
values_from = population)
head(countriesWide)
## # A tibble: 6 × 4
## `Country Name` `2013` `2014` `2015`
## <chr> <dbl> <dbl> <dbl>
## 1 Aruba 102921 103441 103889
## 2 Andorra 75902 72786 70473
## 3 Afghanistan 30682500 31627506 32526562
## 4 Angola 23448202 24227524 25021974
## 5 Albania 2896652 2893654 2889167
## 6 Arab World 376504253 384222592 392022276
12.10 Exercises
Exercise 12.1 Merging and reshaping data.
- The files
BrexitVotes.csv
andeducation.csv
contain, respectively, data from the 2016 UK referendum on leaving the European Union, and data on proportions of adults with educational qualifications at ‘level 4’ and above (qualifications above A-level). Data are given for local authorities within the UK.- Import these two files into R, and combine them into a single data frame called
Brexit
, matching the observations by area codes.
- Import these two files into R, and combine them into a single data frame called
- The file
cancer.xlsx
contains survival times for patients with cancers in different organs. Each number in each column corresponds to a different patient; the row number has no meaning here. Note that all columns are ‘dependent variables’ here.- Import the data into R.
- Reshape the data into ‘long format’: one column indicating the organ, and one column indicating the survival time.
For any online data set you work with, make an R script/R Markdown
document with a name such as makeData.R
that imports the
data file and does all the cleaning/editing. In this file, include the
website details and url, and the date you accessed the data: you’ll need
these for referencing.
12.11 Data sources
Brexit results obtained from the Electoral Commission (https://www.electoralcommission.org.uk/find-information-by-subject/elections-and-referendums/past-elections-and-referendums/eu-referendum), and Level 4 qualification data obtained from the Office for National Statistics (https://www.nomisweb.co.uk/). Accessed 12th December 2017.
I downloaded the cancer data a long time ago (2009?) from the “Data and Story Library” at http://lib.stat.cmu.edu/DASL/DataArchive.html. This link no longer works; I think the Data and Story Library has migrated to https://dasl.datadescription.com/, but I couldn’t find this data set there.
Population data obtained from The World Bank. Accessed 6th October 2015.
Medal table obtained from https://www.rio2016.com/en/medal-count-country [Accessed on 6th October 2016, but this link is no longer active.]
12.12 Further reading
The concepts of “long” and “wide” data formats are related to the idea of tidy data, which is an important theme in the Tidyverse. For more discussion, see this vignette, and Chapter 12 of R for Data Science.