Section 25 Example: working with strings
In this section we’ll discuss some code used to create the file universities.csv
used in a Case Study. The main features of this example are as follows.
- We want to extract some data corresponding to universities from a larger dataset of many different employers.
- We can obtain a separate list of universities, but names in this list may not match precisely the names in the first dataset. For example, in one dataset, the name will be
UNIVERSITY OF SHEFFIELD
, but in another it will beThe University of Sheffield
.
We’d also like to import all the data directly from the web. One of the files is accessed via an API at the OfS Register, but we have already shown how to import it in Section 16.2.
25.1 Importing the two data files
The main data set we are interested in is from the Gender pay gap service. We import the data for the year 2021-22:
library(tidyverse)
pay22 <- read_csv("https://gender-pay-gap.service.gov.uk/viewing/download-data/2021")
This file has data for about 10,000 employers.
Using the code from Section 16.2, we import the list of universities from the Office for Students. Note that I’ve looked at this file already, and know that I want to skip the first two rows.
25.2 Preliminary tidying
We need to do a little tidying up of the ofs
data frame, and we would also like to create a column to indicate whether a university is a pre-92 or a post-92.
We are interested in four columns:
## [1] "Provider’s legal name"
## [2] "Provider’s trading name(s)"
## [3] "Does the provider have the right to use ‘university’ in its title?"
## [4] "Date that use of 'university' was granted"
We will simplify the names:
colnames(ofs)[1] <- "legal"
colnames(ofs)[3] <- "trading"
colnames(ofs)[18] <- "university"
colnames(ofs)[19] <- "year"
Now we extract the universities from the OfS register:
We clean up the year and legal (name) variable for Oxbridge:
tempIndex <- which(nchar(universities$year) > 4)
universities$year[tempIndex] <- 1571
universities$year <- as.numeric(universities$year)
## [1] "The Chancellor, Masters, and Scholars of the University of Cambridge"
## [2] "The Chancellor, Masters and Scholars of the University of Oxford"
Now we create a column to record whether each university is a pre-92 or post-92:
25.3 Approximately matching strings
We want to find the universities in the data frame pay22
according to the names in universities
(we’ll look for both the legal
and trading
names), but these names may not match precisely in the two data frames. We’ll try the following:
- change all letters to lower case;
- remove ‘stop words’ (small ‘insignificant’ words such as ‘the’ and ‘of’) using the package
tm
(Feinerer, Hornik, and Meyer 2008); - arranging the words within a name in alphabetical order (thanks to this post on Stack Overflow);
- removing various characters such as brackets.
We’ll write a function to do this:
cleanUpNames <- function(x){
# first change to lower case, then remove stopwords
tolower(x) %>%
tm::removeWords(tm::stopwords()) -> namesTemp
# Split string into separate words, sort
# words into alphabetical order, then collapse
# words back into single string
sapply(lapply(strsplit(namesTemp, " "),
sort),
paste,
collapse=" ") %>%
str_remove_all(",") %>%
str_remove_all("&") %>%
str_remove_all("\\[\\]") %>%
str_remove_all("\\(\\)") %>%
str_trim()
}
Now we’ll use the function to clean up each column of names:
uNamesLegalClean <- cleanUpNames(universities$legal)
uNamesTradingClean <- cleanUpNames(universities$trading)
pNamesClean <- cleanUpNames(pay22$EmployerName)
We’ll see how many universities we can find in the pay gap data, using both the legal and trading names:
## [1] 90
## [1] 25
25.4 Assembling the target data frame
We now add the vector of cleaned names to pay22
, find all the universities we can, then extract the original and cleaned names, and also the EmployerId
column:
pay22 %>%
mutate(pNamesClean = pNamesClean) %>%
filter(pNamesClean %in% uNamesTradingClean |
pNamesClean %in% uNamesLegalClean) %>%
select(pNamesClean, EmployerName, EmployerId) -> pay22NameId
We’ll add the cleaned names to the universities
data frame as well.
universities %>%
mutate(uNamesTradingClean = uNamesTradingClean,
uNamesLegalClean = uNamesLegalClean) -> universities
Finally, we’ll merge the two data frames, and get rid of any duplicate rows:
uni <- distinct(
rbind(inner_join(pay22NameId, universities,
by = c("pNamesClean" = "uNamesTradingClean")) %>%
select(EmployerName, EmployerId, pre92),
inner_join(pay22NameId, universities,
by = c("pNamesClean" = "uNamesLegalClean")) %>%
select(EmployerName, EmployerId, pre92)))
From inspection, we need to correct the pre-92 status for some London universities:
## # A tibble: 2 × 3
## EmployerName EmployerId pre92
## <chr> <dbl> <chr>
## 1 Imperial College London 6539 no
## 2 LONDON SOUTH BANK UNIVERSITY 7820 no
Finally, we can output the data to .csv