Section 16 APIs
This is a very brief introduction to obtaining data directly from “Application Programming Interfaces” (APIs). It is provided for background reference only: you can skip this chapter! Working with an API can be difficult, and you are typically reliant on appropriate documentation provided at the site hosting the API.
If a website provides data via an API, you make selections for what data you want, and a data set is then built for you to download. If you want R to download the data directly, this causes a problem: there may not be a file simply waiting to be downloaded, so you can’t just provide R with the web address of a file.
We’ll give two examples.
The main R package we need is httr (Wickham 2020), and we will also need jsonlite (Ooms 2014) for the first example.
16.1 Example: obtaining Covid-19 case data (data in JSON format)
We will give an example of using the API at https://ukhsa-dashboard.data.gov.uk/ for obtaining Covid-19 data. In particular, we will download Covid-19 cases by day.
From the menu on this site, we navigate to Access our data; API developer’s guide (https://ukhsa-dashboard.data.gov.uk/access-our-data).
The first step is to specify the web address of the API. We have to do a little digging through the documentation. The first example on this page suggests we can use the following, with the last section modified to give the metric we want.
endpoint <- "https://api.ukhsa-dashboard.data.gov.uk/themes/infectious_disease/sub_themes/respiratory/topics/COVID-19/geography_types/Nation/geographies/England/metrics/COVID-19_cases_casesByDay"
The command to get something from the API is httr::GET(). Arguments include the web address, and any query options, if they are required. Working out how to set the query argument is the tricky part: this is where you need to study the API documentation carefully! The example page suggests we try specifying year and page_size arguments. We’ll set the latter to give the first 10 entries for 2024.
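As a minimal sketch of this request (assuming the API still accepts the year and page_size query names described above, and that you have an internet connection), the call might look like the following. The endpoint is the one defined earlier; full_url is only there to show how the query list is turned into part of the web address.

```r
library(httr)

# Endpoint copied from the text above.
endpoint <- "https://api.ukhsa-dashboard.data.gov.uk/themes/infectious_disease/sub_themes/respiratory/topics/COVID-19/geography_types/Nation/geographies/England/metrics/COVID-19_cases_casesByDay"

# Each element of the query list becomes a name=value pair appended to the URL.
# modify_url() builds that URL without sending anything, so we can inspect it.
full_url <- modify_url(endpoint, query = list(year = 2024, page_size = 10))
full_url

# Send the request (this step needs network access).
response <- GET(url = endpoint, query = list(year = 2024, page_size = 10))

# A status code of 200 means the request succeeded.
status_code(response)
```

Checking the status code before going any further is a cheap way to catch a mistyped address or an unsupported query option.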
The data we want is in response$content, but it first needs converting into a format we can use:
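The content of a response is stored as raw bytes, and base R's rawToChar() converts raw bytes to text. The sketch below uses a small stand-in raw vector (content_raw, a name made up for this illustration) so that it runs without a network connection; with a real response you would use response$content in its place.

```r
# Stand-in for response$content: the raw bytes of a short piece of JSON text.
# This is an illustration only, not the full payload returned by the API.
content_raw <- charToRaw('{"results": [{"date": "2024-01-01", "metric_value": 1079}]}')

# The content is a vector of bytes, not yet readable text.
class(content_raw)

# rawToChar() turns the bytes into a single character string.
json_text <- rawToChar(content_raw)
json_text
```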
The data are now in JSON (JavaScript Object Notation) format, which we won’t worry about here, but we can convert this to list format:
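A minimal sketch of this conversion, using jsonlite::fromJSON() on a stand-in JSON string so that it runs offline. The layout (a results array whose entries have date and metric_value fields) is an assumption about the shape of the API's reply; the real reply contains more fields than are shown here.

```r
library(jsonlite)

# Stand-in JSON text with an assumed layout; a real reply has more fields.
json_text <- '{"results": [
  {"date": "2024-01-01", "metric_value": 1079},
  {"date": "2024-01-02", "metric_value": 1395}
]}'

# fromJSON() returns a list; an array of records is simplified to a data frame.
covid <- fromJSON(json_text)
str(covid)
```

Because fromJSON() simplifies an array of records into a data frame by default, covid$results can be handled like any other data frame.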
We inspect covid to see what’s there (try str(covid)); we see that there is a data frame covid$results. We can extract the data we want as follows.
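Putting the steps together, a sketch of the whole process might look like this. It needs network access, and the column names date and metric_value are assumptions about what the API returns: check names(covid$results) if they do not match.

```r
library(httr)
library(jsonlite)

# Endpoint copied from the text above.
endpoint <- "https://api.ukhsa-dashboard.data.gov.uk/themes/infectious_disease/sub_themes/respiratory/topics/COVID-19/geography_types/Nation/geographies/England/metrics/COVID-19_cases_casesByDay"

# Request the first 10 entries for 2024, then convert raw bytes -> text -> list.
response <- GET(url = endpoint, query = list(year = 2024, page_size = 10))
covid <- fromJSON(rawToChar(response$content))

# See which columns are available before extracting any.
names(covid$results)

# Assumed column names: the date of each record and the daily case count.
covid$results$date
covid$results$metric_value
```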
## [1] "2024-01-01" "2024-01-02" "2024-01-03" "2024-01-04" "2024-01-05"
## [6] "2024-01-06" "2024-01-07" "2024-01-08" "2024-01-09" "2024-01-10"
## [1] 1079 1395 1271 1107 939 790 787 974 845 877
16.2 Example: obtaining an .xlsx file
In this example, we’ll obtain a spreadsheet provided by the Office for Students (OfS): the OfS Register of all English higher education providers. This is provided by an API, but there is no API documentation.
From inspecting the download link for the file, we can see that the link is to an API, rather than to the file directly. We try this link as the API address:
We try the httr::GET() function with no additional arguments:
If we try
## [1] "application/octet-stream"
then this tells us the file type hasn’t been determined, but digging around with str(response) suggests there is an .xlsx file sitting in there somewhere!
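One extra check we could make, offered as a sketch rather than as the approach used in this example: an .xlsx file is a ZIP archive, and ZIP archives begin with the two bytes 0x50 0x4B (the letters "PK"). The helper looks_like_zip() below is a hypothetical name introduced for illustration, and the inputs are stand-in bytes; with a real response you would pass response$content.

```r
# Hypothetical helper: TRUE if a raw vector starts with the ZIP signature "PK".
looks_like_zip <- function(bytes) {
  length(bytes) >= 2 && identical(bytes[1:2], as.raw(c(0x50, 0x4b)))
}

# Stand-in bytes for illustration only.
zip_like <- charToRaw("PK\003\004 rest of the file")
html_like <- charToRaw("<html>")

looks_like_zip(zip_like)   # TRUE
looks_like_zip(html_like)  # FALSE
```

A result of TRUE does not prove the content is a spreadsheet (any ZIP archive would pass), but FALSE would tell us that trying to read it as .xlsx is unlikely to work.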
A post on Stack Overflow suggests exporting an object to a temporary .xlsx file, and then reading it in again. The temporary file will be deleted once you close down your R session. We do
and then
to set up the temporary file. Then to save it:
To read it back in again (where I already know I want to skip the first two rows):
To check that we’ve got something:
## # A tibble: 5 × 2
##   `Provider’s legal name`  `Highest level of degree awarding powers held`
##   <chr>                    <chr>
## 1 Lamda Limited            Taught
## 2 The University of Surrey Research
## 3 University of York       Research
## 4 Aston University         Research
## 5 Royal College of Music   Research
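The steps in this example (request the content, write it to a temporary .xlsx file, read the file back in) can be sketched as a single function. This is one possible arrangement under stated assumptions, not necessarily the exact code used above: read_xlsx_from_api() is a hypothetical helper name, api_url stands for whatever download link you have found, and we assume the readxl package is used to read the spreadsheet.

```r
library(httr)
library(readxl)

# Hypothetical helper wrapping the steps described in this example.
#   api_url: the API address (e.g. the download link found on the website)
#   skip:    number of rows to skip at the top of the spreadsheet
read_xlsx_from_api <- function(api_url, skip = 0) {
  response <- GET(api_url)
  stop_for_status(response)  # stop with an error if the request failed

  # Temporary file with an .xlsx extension; it lives in the session's
  # temporary directory, which is removed when the R session ends.
  tmp <- tempfile(fileext = ".xlsx")

  # Write the raw bytes of the response to the temporary file...
  writeBin(content(response, as = "raw"), tmp)

  # ...and read the spreadsheet back in as a tibble.
  read_excel(tmp, skip = skip)
}

# Usage, where endpoint holds the API address from the text:
# ofs <- read_xlsx_from_api(endpoint, skip = 2)
```

Writing to a temporary file is needed here because read_excel() expects a path to a file on disk rather than raw bytes held in memory.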
16.3 Exercise
Exercise 16.1 Try using the API at http://open-notify.org/Open-Notify-API/People-In-Space/ to download a data set on who is currently in space. You can leave the query argument blank in httr::GET(), so you just need the API web address.
16.4 Acknowledgements
Data obtained from https://coronavirus.data.gov.uk/. Accessed 2024-09-03. Contains public sector information licensed under the Open Government Licence v3.0.
Thanks to Christian Pascual for providing this useful blog post on APIs, hosted at Dataquest.