Section 16 APIs

This is a very brief introduction to obtaining data directly from “Application Programming Interfaces” (APIs). It is provided for background reference only: you can skip this chapter! Working with an API can be difficult, and you are typically reliant on the documentation provided by the site hosting the API.

If a website provides data via an API, you make selections for the data you want, and a data set is then built for you to download. This causes a problem if you want R to download the data directly: there may not be a file simply waiting to be downloaded, so you can’t just give R the web address of a file.

We’ll give two examples.

The main R package we need is httr (Wickham 2020) and we will also need jsonlite (Ooms 2014) for the first example.

16.1 Example: obtaining Covid-19 case data (data in JSON format)

We will give an example of using the API at https://ukhsa-dashboard.data.gov.uk/ to obtain Covid-19 data. In particular, we will download Covid-19 cases by day.

From the menu on this site, we navigate to Access our data; API developer’s guide (https://ukhsa-dashboard.data.gov.uk/access-our-data).

The first step is to specify the web address of the API. We have to do a little digging through the documentation. The first example on this page suggests we can use the following, with the last section modified to give the metric we want.

endpoint <- "https://api.ukhsa-dashboard.data.gov.uk/themes/infectious_disease/sub_themes/respiratory/topics/COVID-19/geography_types/Nation/geographies/England/metrics/COVID-19_cases_casesByDay"

The command to get something from the API is httr::GET(). Arguments include the web address, and any query options, if they are required. Working out how to set the query argument is the tricky part: this is where you need to study the API documentation carefully! The example page suggests we try specifying year and page_size arguments. We’ll set the latter to give the first 10 entries for 2024.

response <- httr::GET(
  url = endpoint,
  query = list(year = 2024,
               page_size = 10)
)
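To see what the query argument does, it can help to preview the web address that httr builds from it. The sketch below uses httr::modify_url(), which only constructs a string and sends no request; the variable name full_url is our own choice.

```r
# Sketch: preview the URL that a given `query` list produces.
# No request is sent; httr::modify_url() only builds a string.
endpoint <- "https://api.ukhsa-dashboard.data.gov.uk/themes/infectious_disease/sub_themes/respiratory/topics/COVID-19/geography_types/Nation/geographies/England/metrics/COVID-19_cases_casesByDay"

full_url <- httr::modify_url(
  endpoint,
  query = list(year = 2024, page_size = 10)
)

# The query entries are appended after a "?" as name=value pairs
print(full_url)
```

Pasting the printed address into a web browser is a quick way to check a query before using it in R.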

The data we want are in response$content, stored as raw bytes, so we first convert them to a character string:

contentText <- rawToChar(response$content)
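A request can fail (for example, if the query is not accepted), in which case the content will be an error message rather than data. A minimal sketch of one way to guard against this is given below; fetch_text is a hypothetical helper name of our own, not part of httr, and it uses httr::content() as an alternative to rawToChar().

```r
# Sketch: request a URL, stop early on an HTTP error, return the body as text.
# `fetch_text` is a hypothetical helper name, not an httr function.
fetch_text <- function(url, query = NULL) {
  response <- httr::GET(url = url, query = query)
  # Convert 4xx/5xx responses into an R error with an informative message
  httr::stop_for_status(response)
  # Decode the raw body as UTF-8 text
  httr::content(response, as = "text", encoding = "UTF-8")
}

# Example use (needs an internet connection), with `endpoint` as defined above:
# contentText <- fetch_text(endpoint, query = list(year = 2024, page_size = 10))
```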

The text in contentText is in JSON (JavaScript Object Notation) format, which we won’t worry about here, but we can convert it to an R list:

covid <- jsonlite::fromJSON(contentText)
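To get a feel for what jsonlite::fromJSON() returns, the sketch below parses a small hand-written JSON string. The string is made up for illustration: it imitates only the results, date and metric_value fields used in this section, and the real response may contain other fields.

```r
# Sketch: parse a hand-written JSON string (not real API output).
json_text <- '{
  "results": [
    {"date": "2024-01-01", "metric_value": 1079},
    {"date": "2024-01-02", "metric_value": 1395},
    {"date": "2024-01-03", "metric_value": 1271}
  ]
}'

parsed <- jsonlite::fromJSON(json_text)

# The JSON object becomes a list; the array of records becomes a data frame
str(parsed)
```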

We inspect covid to see what’s there (try str(covid)); we see that there is a data frame covid$results. We can extract the data we want as follows.

covid$results$date

##  [1] "2024-01-01" "2024-01-02" "2024-01-03" "2024-01-04" "2024-01-05"
##  [6] "2024-01-06" "2024-01-07" "2024-01-08" "2024-01-09" "2024-01-10"

covid$results$metric_value

##  [1] 1079 1395 1271 1107  939  790  787  974  845  877
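The dates arrive as character strings, so a natural next step is to convert them to Date objects and collect both vectors in a data frame. The sketch below retypes the ten values printed above so that it runs without an internet connection; the names covid_cases and cases are our own choices.

```r
# Sketch: build a data frame from the values printed above.
dates <- c("2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05",
           "2024-01-06", "2024-01-07", "2024-01-08", "2024-01-09", "2024-01-10")
cases <- c(1079, 1395, 1271, 1107, 939, 790, 787, 974, 845, 877)

covid_cases <- data.frame(date = as.Date(dates), cases = cases)

# Row for the day with the most reported cases
covid_cases[which.max(covid_cases$cases), ]

# Mean number of daily cases over the ten days
mean(covid_cases$cases)
```

With the downloaded object, the same data frame could be built from covid$results$date and covid$results$metric_value directly.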

16.2 Example: obtaining an .xlsx file

In this example, we’ll obtain a spreadsheet provided by the Office for Students (OfS): the OfS Register of all English higher education providers. This is provided by an API, but there is no API documentation.

From inspecting the download link for the file, we can see that the link is to an API, rather than to the file directly. We try this link as the API address:

endpoint <- "https://register-api.officeforstudents.org.uk/api/Download"

We try the httr::GET() function with no additional arguments:

response <- httr::GET(url = endpoint)

If we try

httr::http_type(response)

## [1] "application/octet-stream"

then this tells us only that the content is generic binary data, so the file type hasn’t been determined; but digging around with str(response) suggests there is an .xlsx file sitting in there somewhere!
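One way to make the digging less of a guess is to look at the first few bytes of the content. An .xlsx file is a zip archive, and zip archives normally start with the four bytes 50 4b 03 04 (the first two are the letters "PK"). The helper below is a sketch of that check; looks_like_zip is a hypothetical name of our own.

```r
# Sketch: test whether a raw vector starts with the usual zip signature.
# `looks_like_zip` is a hypothetical helper name.
looks_like_zip <- function(bytes) {
  zip_signature <- as.raw(c(0x50, 0x4b, 0x03, 0x04))
  length(bytes) >= 4 && identical(bytes[1:4], zip_signature)
}

# Offline checks with made-up bytes
looks_like_zip(as.raw(c(0x50, 0x4b, 0x03, 0x04, 0x14, 0x00)))  # TRUE
looks_like_zip(charToRaw("hello"))                              # FALSE
```

For the response above, the check would be looks_like_zip(response$content). A TRUE result is consistent with an .xlsx file, but does not prove it, since other file types are also zip archives.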

A post on Stack Overflow suggests writing the object to a temporary .xlsx file, and then reading it in again. The temporary file will be deleted once you close down your R session. We do

raw_xlsx <- httr::content(response, as = "raw")

and then

tmp <- tempfile(fileext = '.xlsx')

to set up the temporary file. Then to save it:

writeBin(raw_xlsx, tmp)

To read it back in again (where I already know I want to skip the first two rows):

my_excel <- readxl::read_excel(tmp, skip = 2)

To check that we’ve got something:

my_excel[1:5, c(1, 14)]

## # A tibble: 5 × 2
##   `Provider’s legal name`  `Highest level of degree awarding powers held`
##   <chr>                    <chr>                                         
## 1 Lamda Limited            Taught                                        
## 2 The University of Surrey Research                                      
## 3 University of York       Research                                      
## 4 Aston University         Research                                      
## 5 Royal College of Music   Research
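The temporary-file steps above follow a pattern that works for any reader function that expects a file path. A minimal sketch of the pattern as a reusable function is given below; read_via_tempfile and its arguments are hypothetical names of our own, and the example payload is made up so that the code runs offline.

```r
# Sketch: write raw bytes to a temporary file, then read them with `reader`.
# `read_via_tempfile` is a hypothetical helper name.
read_via_tempfile <- function(bytes, fileext, reader) {
  tmp <- tempfile(fileext = fileext)
  on.exit(unlink(tmp), add = TRUE)  # delete the file when the function exits
  writeBin(bytes, tmp)
  reader(tmp)
}

# Offline check with made-up bytes: read the file straight back as raw
payload <- as.raw(c(0x50, 0x4b, 0x03, 0x04, 0x00, 0xff))
round_trip <- read_via_tempfile(
  payload, ".bin",
  function(path) readBin(path, what = "raw", n = file.size(path))
)
identical(round_trip, payload)

# For the spreadsheet above, the call might look like:
# my_excel <- read_via_tempfile(raw_xlsx, ".xlsx",
#                               function(path) readxl::read_excel(path, skip = 2))
```

Unlike the steps above, this version deletes the temporary file as soon as the data have been read, rather than waiting for the R session to end.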

16.3 Exercise

Exercise 16.1 Try using the API at http://open-notify.org/Open-Notify-API/People-In-Space/ to download a data set on who is currently in space. You can leave the query argument blank in httr::GET(), so you just need the API web address.

16.4 Acknowledgements

Data obtained from https://coronavirus.data.gov.uk/. Accessed 2024-09-03. Contains public sector information licensed under the Open Government Licence v3.0.
Thanks to Christian Pascual for providing this useful blog post on APIs, hosted at Dataquest.

References

Ooms, Jeroen. 2014. “The Jsonlite Package: A Practical and Consistent Mapping Between JSON Data and R Objects.” arXiv:1403.2805 [stat.CO]. https://arxiv.org/abs/1403.2805.

Wickham, Hadley. 2020. Httr: Tools for Working with URLs and HTTP. https://CRAN.R-project.org/package=httr.