Section 16 APIs

Some websites provide data via an “Application Programming Interface” (API). Typically, this means that you select the data you want, and a data set is then built for you to download. If you want R to download the data directly, this causes a problem: there may not be a file simply waiting to be downloaded, so you can’t just give R the web address of a file.

APIs provide data in different formats. Some APIs include documentation that will help you to get R to work with the API directly, but others don’t. We’ll give two examples.

The main R package we need is httr (Wickham 2020) and we will also need jsonlite (Ooms 2014) for the first example.

16.1 Example: obtaining Covid-19 case data (data in JSON format)

We will give an example of using the API at https://coronavirus.data.gov.uk/ for obtaining Covid-19 data (you’ll see that the frequency of reporting has changed).

We will also need to read the API documentation for the host site.

The first step is to specify the web address of the API. This will be provided in the API documentation:

endpoint <- "https://api.coronavirus.data.gov.uk/v1/data"

The command to get something from the API is httr::GET(). Arguments include the web address, and any query options, if they are required. Working out how to set the query argument is the tricky part: this is where you need to study the API documentation carefully!

The documentation on query parameters tells us that we have to specify something for filters and structure. There are some examples to copy. We do:

response <- httr::GET(
  url = endpoint,
  query = list(filters = 'areaType=nation; areaName=england',
               structure='{"date":"date","newCases":"newCasesByPublishDate"}')
)
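Before sending a request, it can help to look at the URL that a given query list produces. The following is a minimal sketch using httr::modify_url(), which builds the URL without contacting the server. The variable names query and full_url are illustrative only; the endpoint and query values are copied from the example above.

```r
# Build (but do not send) the request URL, to see how the query list is encoded.
endpoint <- "https://api.coronavirus.data.gov.uk/v1/data"

# Illustrative name: the same query list as passed to httr::GET() above.
query <- list(
  filters = 'areaType=nation; areaName=england',
  structure = '{"date":"date","newCases":"newCasesByPublishDate"}'
)

# modify_url() appends the query to the endpoint, percent-encoding
# characters such as spaces, quotes and braces.
full_url <- httr::modify_url(endpoint, query = query)
print(full_url)
```

Comparing the printed URL with the examples in the API documentation may be a quick way to spot a mistyped parameter name.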

The data we want is in response$content, but it first needs converting into a format we can use:

contentText <- rawToChar(response$content)

The data are now in JSON (JavaScript Object Notation) format, which we won’t worry about here, but we can convert this to list format:

covid <- jsonlite::fromJSON(contentText)

We inspect covid to see what’s there (try str(covid)); we see that there is a data frame covid$data:

head(covid$data)
##         date newCases
## 1 2022-10-13    59101
## 2 2022-10-12        0
## 3 2022-10-11        0
## 4 2022-10-10        0
## 5 2022-10-09        0
## 6 2022-10-08        0
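The two conversion steps (raw bytes to text, then JSON text to an R list) can also be tried without a network connection. The sketch below uses a small, made-up JSON string in place of a real response; the names fake_json, fake_content, fake_text and parsed are illustrative, and the values are not real case counts.

```r
# A small, made-up JSON string standing in for an API response body.
fake_json <- '{"length":2,"data":[{"date":"2022-10-13","newCases":5},{"date":"2022-10-12","newCases":0}]}'

# httr stores the response body as raw bytes; charToRaw() imitates that here.
fake_content <- charToRaw(fake_json)

# Step 1: raw bytes -> character string.
fake_text <- rawToChar(fake_content)

# Step 2: JSON string -> R list; the array of records becomes a data frame.
parsed <- jsonlite::fromJSON(fake_text)
str(parsed)
```

Here parsed$data plays the role of covid$data: a data frame with one row per record.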

To finish, we’ll plot the data. We can convert the date column from a character string to dates that R will understand using the lubridate package (Grolemund and Wickham 2011) and the function lubridate::ymd() (year, month, day):

library(ggplot2)

ggplot(data = covid$data,
       aes(x = lubridate::ymd(date), y = newCases)) +
  geom_line()
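To see what lubridate::ymd() does on its own, here is a short sketch on two date strings in the same format as the date column. The names date_strings and dates are illustrative.

```r
# Dates as they arrive from the API: character strings in year-month-day order.
date_strings <- c("2022-10-13", "2022-10-12")

# ymd() parses them into Date objects.
dates <- lubridate::ymd(date_strings)
class(dates)

# Unlike character strings, Date objects support date arithmetic.
dates[1] - dates[2]
```

Because the x aesthetic is then a Date rather than a character string, ggplot2 can draw it on a continuous time axis.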

16.2 Example: obtaining an .xlsx file

In this example, we’ll obtain a spreadsheet provided by the Office for Students (OfS): the OfS Register of all English higher education providers. This is provided by an API, but there is no API documentation.

From inspecting the download link for the file, we can see that the link is to an API, rather than to the file directly. We try this link as the API address:

endpoint <- "https://register-api.officeforstudents.org.uk/api/Download"

We try the httr::GET() function with no additional arguments:

response <- httr::GET(url = endpoint)

If we try

httr::http_type(response) 
## [1] "application/octet-stream"

then this tells us that the file type hasn’t been determined (application/octet-stream is a generic label for binary data), but digging around with str(response) suggests there is an .xlsx file sitting in there somewhere!
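One way to make this check more concrete: an .xlsx file is a ZIP archive, and ZIP files begin with the two bytes "PK" (hexadecimal 50 4b). The sketch below defines a hypothetical helper, looks_like_zip(), and tries it on made-up byte vectors rather than on the real response.

```r
# Hypothetical helper: TRUE if a raw vector starts with the ZIP signature "PK".
looks_like_zip <- function(bytes) {
  is.raw(bytes) &&
    length(bytes) >= 2 &&
    identical(bytes[1:2], as.raw(c(0x50, 0x4b)))
}

# Made-up byte vectors for illustration (not a real download).
zip_like <- as.raw(c(0x50, 0x4b, 0x03, 0x04, 0x14, 0x00))
not_zip  <- charToRaw("<html>error page</html>")

looks_like_zip(zip_like)
looks_like_zip(not_zip)
```

With the real response, the call would be looks_like_zip(response$content). A passing check only shows that the content is some kind of ZIP archive, not that it is definitely a spreadsheet.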

A post on Stack Overflow suggests writing the object to a temporary .xlsx file, and then reading it in again. The temporary file will be deleted once you close down your R session. We do

raw_xlsx <- httr::content(response, as = "raw")

and then

tmp <- tempfile(fileext = '.xlsx')

to set up the temporary file. Then to save it:

writeBin(raw_xlsx, tmp)

To read it back in again (where I already know I want to skip the first two rows):

my_excel <- readxl::read_excel(tmp, skip = 2)

To check that we’ve got something:

my_excel[1:5, c(1, 14)]
## # A tibble: 5 × 2
##   `Provider’s legal name`  `Highest level of degree awarding powers held`
##   <chr>                    <chr>                                         
## 1 Lamda Limited            Taught                                        
## 2 The University of Surrey Research                                      
## 3 University of York       Research                                      
## 4 Aston University         Research                                      
## 5 Royal College of Music   Research
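The temporary-file steps can be tried without a network connection. The sketch below writes a few made-up bytes to a temporary file and reads them back; the names fake_bytes, tmp_demo, size and read_back are illustrative, and the bytes are not a valid spreadsheet, so readBin() is used here in place of readxl::read_excel().

```r
# Made-up bytes standing in for the downloaded file content.
fake_bytes <- as.raw(c(0x50, 0x4b, 0x03, 0x04, 0x01, 0x02))

# Create a path for a temporary file and write the bytes to it.
tmp_demo <- tempfile(fileext = ".xlsx")
writeBin(fake_bytes, tmp_demo)

# Read the file back as raw bytes and confirm nothing changed.
size <- file.size(tmp_demo)
read_back <- readBin(tmp_demo, what = "raw", n = size)
identical(read_back, fake_bytes)

# Temporary files are removed when the session ends; unlink() removes one sooner.
unlink(tmp_demo)
```

As an alternative to writeBin(), httr can save the response body straight to a file by passing httr::write_disk(tmp, overwrite = TRUE) as an extra argument to httr::GET().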

16.3 Exercise

Exercise 16.1 Try using the API at http://open-notify.org/Open-Notify-API/People-In-Space/ to download a data set on who is currently in space. You can leave the query argument blank in httr::GET(), so you just need the API web address.

16.4 Acknowledgements

References

Grolemund, Garrett, and Hadley Wickham. 2011. “Dates and Times Made Easy with lubridate.” Journal of Statistical Software 40 (3): 1–25. http://www.jstatsoft.org/v40/i03/.
Ooms, Jeroen. 2014. “The jsonlite Package: A Practical and Consistent Mapping Between JSON Data and R Objects.” arXiv:1403.2805 [stat.CO]. https://arxiv.org/abs/1403.2805.
Wickham, Hadley. 2020. httr: Tools for Working with URLs and HTTP. https://CRAN.R-project.org/package=httr.