Section 25 Example: working with strings

In this section we’ll discuss some code used to create the file universities.csv used in a Case Study. The main features of this example are as follows.

  • We want to extract some data corresponding to universities from a larger dataset of many different employers.
  • We can obtain a separate list of universities, but names in this list may not match precisely the names in the first dataset. For example, in one dataset, the name will be UNIVERSITY OF SHEFFIELD, but in another it will be The University of Sheffield.

We’d also like to import all the data directly from the web. One of the files is accessed via an API at the OfS Register, but we have already shown how to import it in Section 16.2.

25.1 Importing the two data files

The main data set we are interested in is from the Gender pay gap service. We import the data for the year 2021-22:

library(tidyverse)
pay22 <- read_csv("https://gender-pay-gap.service.gov.uk/viewing/download-data/2021")

This file has data for about 10,000 employers.

Using the code from Section 16.2, we import the list of universities from the Office for Students. Note that I’ve looked at this file already, and know that I want to skip the first two rows.

endpoint <- "https://register-api.officeforstudents.org.uk/api/Download"

response <- httr::GET(
  url = endpoint)

raw_xlsx <- httr::content(response)
tmp <- tempfile(fileext = '.xlsx')

writeBin(raw_xlsx, tmp)
ofs <- readxl::read_excel(tmp, skip = 2)

25.2 Preliminary tidying

We need to do a little tidying up of the ofs data frame, and we would also like to create a column to indicate whether a university is a pre-92 or a post-92.

We are interested in four columns:

colnames(ofs)[c(1, 3, 18, 19)]
## [1] "Provider’s legal name"                                             
## [2] "Provider’s trading name(s)"                                        
## [3] "Does the provider have the right to use ‘university’ in its title?"
## [4] "Date that use of 'university' was granted"

We will simplify the names:

colnames(ofs)[1] <- "legal"
colnames(ofs)[3] <- "trading"
colnames(ofs)[18] <- "university"
colnames(ofs)[19] <- "year"

Now we extract the universities from the OfS register:

ofs %>%
  filter(university == "Yes") %>%
  select(legal, trading, year) -> universities

We clean up the year and legal (name) variable for Oxbridge:

tempIndex <- which(nchar(universities$year) > 4)
universities$year[tempIndex] <- 1571
universities$year <- as.numeric(universities$year)
universities$legal[tempIndex]
## [1] "The Chancellor, Masters, and Scholars of the University of Cambridge"
## [2] "The Chancellor, Masters and Scholars of the University of Oxford"
universities$legal[tempIndex] <- c("The University of Cambridge",
                                   "The University of Oxford")

Now we create a column to record whether each university is a pre-92 or post-92:

universities$pre92 <- ifelse(universities$year < 1992, "yes", "no")

25.3 Approximately matching strings

We want to find the universities in the data frame pay22 according to the names in universities (we’ll look for both the legal and trading names), but these names may not match precisely in the two data frames. We’ll try the following:

  • change all letters to lower case;
  • remove ‘stop words’ (small ‘insignificant’ words such as ‘the’ and ‘of’) using the package tm (Feinerer, Hornik, and Meyer 2008);
  • arranging the words within a name in alphabetical order (thanks to this post on Stack Overflow);
  • removing various characters such as brackets.

We’ll write a function to do this:

cleanUpNames <- function(x){
  # first change to lower case, then remove stopwords
  tolower(x) %>%
    tm::removeWords(tm::stopwords()) -> namesTemp
    
  # Split string into separate words, sort
  # words into alphabetical order, then collapse 
  # words back into single string
  sapply(lapply(strsplit(namesTemp, " "),
                      sort),
               paste,
               collapse=" ") %>%
    str_remove_all(",") %>%
    str_remove_all("&") %>%
    str_remove_all("\\[\\]") %>%
    str_remove_all("\\(\\)") %>%
    str_trim()
}

Now we’ll use the function to clean up each column of names:

uNamesLegalClean <- cleanUpNames(universities$legal)
uNamesTradingClean <- cleanUpNames(universities$trading)
pNamesClean <- cleanUpNames(pay22$EmployerName)

We’ll see how many universities we can find in the pay gap data, using both the legal and trading names:

sum(is.element(pNamesClean, uNamesLegalClean))
## [1] 90
sum(is.element(pNamesClean, uNamesTradingClean))
## [1] 25

25.4 Assembling the target data frame

We now add the vector of cleaned names to pay22, find all the universities we can, then extract the original and cleaned names, and also the EmployerId column:

pay22 %>%
  mutate(pNamesClean = pNamesClean) %>%
  filter(pNamesClean %in% uNamesTradingClean |
           pNamesClean %in% uNamesLegalClean) %>%
  select(pNamesClean, EmployerName, EmployerId) -> pay22NameId

We’ll add the cleaned names to the universities data frame as well.

universities %>%
  mutate(uNamesTradingClean = uNamesTradingClean,
         uNamesLegalClean = uNamesLegalClean) -> universities

Finally, we’ll merge the two data frames, and get rid of any duplicate rows:

uni <- distinct(
rbind(inner_join(pay22NameId, universities, 
                 by = c("pNamesClean" = "uNamesTradingClean")) %>%
        select(EmployerName, EmployerId, pre92),
      inner_join(pay22NameId, universities, 
                 by = c("pNamesClean" = "uNamesLegalClean")) %>%
        select(EmployerName, EmployerId, pre92)))

From inspection, we need to correct the pre-92 status for some London universities:

uni[c(5, 46),]
## # A tibble: 2 × 3
##   EmployerName                 EmployerId pre92
##   <chr>                             <dbl> <chr>
## 1 Imperial College London            6539 no   
## 2 LONDON SOUTH BANK UNIVERSITY       7820 no
uni[c(5, 46), "pre92"] <- "yes"

Finally, we can output the data to .csv

write_csv(uni, "universities.csv")

References

Feinerer, Ingo, Kurt Hornik, and David Meyer. 2008. “Text Mining Infrastructure in r.” Journal of Statistical Software 25 (5): 1–54. https://www.jstatsoft.org/v25/i05/.