Section 14 Strings

Here, we’ll look at working with text and strings, which can be more awkward to handle than purely numerical data. As an example, we’ll consider a data set provided as a plain text file, with some formatting problems that we’ll need to deal with.

14.1 Example: (fictitious) exam mark data

We’ll consider a small text file of fictitious data. The file is called stat101.txt, and looks like this.

STAT101 module marks
29/06/20

student cwk exam
12015    55   62
12468    78   84
11560    55   40* 
12589    62   -- 

* denotes resit attempt - capped at 40

Some problems we would face when working with this data in R are:

  • awkward labels attached to numbers (e.g. 40*);
  • handling of missing data: here -- has been used, whereas R uses NA;
  • lines of text both before and after the data.

It may be tempting to edit a file such as this by hand in a text editor. Try not to do this! It may not be practical if the file is large, and it’s hard to keep a good record of what edits you made.

14.2 Importing text data with read_lines()

If the text file contained only a table (and so looked like a data frame), we could use readr::read_table() to import it. But if we need to do any processing of the text first, a better option may be to import the file with readr::read_lines(). This will create a vector of character strings, where each element is one whole line of the text file.

read_lines("data/stat101.txt")
##  [1] "STAT101 module marks"                  
##  [2] "29/06/20"                              
##  [3] ""                                      
##  [4] "student cwk exam"                      
##  [5] "12015    55   62"                      
##  [6] "12468    78   84"                      
##  [7] "11560    55   40* "                    
##  [8] "12589    62   -- "                     
##  [9] ""                                      
## [10] "* denotes resit attempt - capped at 40"
  • as with other commands for importing data, read_lines() can import files directly from websites: just give the full URL;
  • we can use the argument skip to skip lines at the start (we might skip lines 1-3 here, but I will leave them in for now);
  • the argument skip_empty_rows is FALSE by default, but I will set it to TRUE to skip the empty lines 3 and 9:
examTextRaw <- read_lines("data/stat101.txt",
                          skip_empty_rows = TRUE)
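
To experiment with this without the original file, the following sketch recreates the example as a temporary file and compares the default import with skip_empty_rows = TRUE. The temporary file and the names examLines, tmpFile, withBlanks and withoutBlanks are assumptions made for illustration; the chapter itself reads from data/stat101.txt.

```r
library(readr)

# Recreate the example file (contents copied from the listing above)
examLines <- c(
  "STAT101 module marks",
  "29/06/20",
  "",
  "student cwk exam",
  "12015    55   62",
  "12468    78   84",
  "11560    55   40* ",
  "12589    62   -- ",
  "",
  "* denotes resit attempt - capped at 40"
)

# Hypothetical location: a temporary file rather than data/stat101.txt
tmpFile <- tempfile(fileext = ".txt")
writeLines(examLines, tmpFile)

# Default behaviour: empty lines are kept as "" elements
withBlanks <- read_lines(tmpFile)
length(withBlanks)
## [1] 10

# Dropping the empty lines leaves eight elements
withoutBlanks <- read_lines(tmpFile, skip_empty_rows = TRUE)
length(withoutBlanks)
## [1] 8
```

Note that dropping empty lines changes the position of everything after them, which is why the header line is found at position 3 rather than 4 in the next section.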

14.3 Finding (and replacing) characters in strings

We can test for equality of entire strings in the usual way, e.g.

x <- c("red house", "blue car")
x == "red house"
## [1]  TRUE FALSE

but here, we will want to search within a string for some text. For example, how would we determine which elements of x contain the word red? For this, we will make use of the stringr package (Wickham 2019a).

14.3.1 Finding text with str_which()

Suppose we want to find the line with the column headings. Here, we will do this by finding which line contains the text student (assuming we know that’s what we need to look for):

str_which(examTextRaw, "student")
## [1] 3
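
Returning to the earlier question about the vector x, here is a minimal sketch of three related stringr functions. str_detect() and str_subset() are not used elsewhere in this chapter; they are shown only for comparison with str_which().

```r
library(stringr)

x <- c("red house", "blue car")

# One TRUE/FALSE per element: does it contain "red"?
str_detect(x, "red")
## [1]  TRUE FALSE

# Positions of the matching elements
str_which(x, "red")
## [1] 1

# The matching elements themselves
str_subset(x, "red")
## [1] "red house"
```

The search is for the characters red anywhere in the string, so an element such as "bored cat" would also match.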

14.3.2 Escape characters and regular expressions

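Some characters have a special meaning in search patterns, which makes it harder to search for them. For example, if we wanted to find lines containing *, this won’t work, because * is interpreted as a special character rather than as a literal asterisk: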

str_which(examTextRaw, "*")

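Here, we have to insert two backslash symbols before the *, so that R understands we are searching for a literal * (the pattern needs \*, and a backslash inside an R string must itself be written as \\):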

str_which(examTextRaw, "\\*")
## [1] 6 8
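
An alternative to escaping, sketched below, is to wrap the pattern in stringr's fixed(), which asks for the text to be matched exactly as written. The vector examLines is a shortened, illustrative copy of the data, not an object from the chapter.

```r
library(stringr)

# Three lines taken from the example file (illustrative subset)
examLines <- c("11560    55   40* ",
               "12589    62   -- ",
               "* denotes resit attempt - capped at 40")

# fixed() matches the pattern literally, so no backslashes are needed
str_which(examLines, fixed("*"))
## [1] 1 3

# Same result as the escaped version
identical(str_which(examLines, fixed("*")),
          str_which(examLines, "\\*"))
## [1] TRUE
```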

To do more complicated searches, one can make use of regular expressions. Regular expressions describe particular patterns of text, and can be used in many different programming languages. We won’t cover these here, but further reading is given at the end of this chapter.
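
As a small taste of what regular expressions can do, the sketch below finds the lines that start with a five-digit student number, and the lines that end in white space. The vector examText is an illustrative copy of the eight lines in examTextRaw, and the two patterns are just one possible way of writing these searches.

```r
library(stringr)

# Illustrative copy of the imported lines (empty lines already removed)
examText <- c("STAT101 module marks",
              "29/06/20",
              "student cwk exam",
              "12015    55   62",
              "12468    78   84",
              "11560    55   40* ",
              "12589    62   -- ",
              "* denotes resit attempt - capped at 40")

# ^ anchors the match to the start of the string;
# [0-9]{5} means exactly five digits in a row
str_which(examText, "^[0-9]{5}")
## [1] 4 5 6 7

# \\s is any white space character; $ anchors the match to the end
str_which(examText, "\\s$")
## [1] 6 7
```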

14.3.3 Replacing or removing text

Suppose we want to replace -- by NA, to indicate a missing value. We can do this with str_replace_all():

str_replace_all(examTextRaw, pattern = "--",
                replacement = "NA")
## [1] "STAT101 module marks"                  
## [2] "29/06/20"                              
## [3] "student cwk exam"                      
## [4] "12015    55   62"                      
## [5] "12468    78   84"                      
## [6] "11560    55   40* "                    
## [7] "12589    62   NA "                     
## [8] "* denotes resit attempt - capped at 40"

To delete text, we can either set replacement = "" in the above or use str_remove_all(). For example, to get rid of the asterisks, we would do

str_remove_all(examTextRaw, pattern = "\\*")
## [1] "STAT101 module marks"                 
## [2] "29/06/20"                             
## [3] "student cwk exam"                     
## [4] "12015    55   62"                     
## [5] "12468    78   84"                     
## [6] "11560    55   40 "                    
## [7] "12589    62   -- "                    
## [8] " denotes resit attempt - capped at 40"
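
The _all in these function names matters: str_replace() and str_remove() only act on the first match in each string. The sketch below illustrates the difference using a made-up vector marks, in which one (hypothetical) student is missing both marks.

```r
library(stringr)

# Hypothetical lines: the first has two missing marks
marks <- c("12589    --   -- ", "11560    55   40* ")

# Only the first "--" in each string is replaced
str_replace(marks, pattern = "--", replacement = "NA")
## [1] "12589    NA   -- " "11560    55   40* "

# Every "--" is replaced
str_replace_all(marks, pattern = "--", replacement = "NA")
## [1] "12589    NA   NA " "11560    55   40* "

# Replacing with "" gives the same result as str_remove_all()
identical(str_replace_all(marks, "\\*", ""),
          str_remove_all(marks, "\\*"))
## [1] TRUE
```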

14.3.4 Removing white space at start/end of strings

Blank spaces at the end of a string can cause problems, as R might think there is an extra column of data. We can get rid of these with str_trim(), which removes white space from both the start and the end of each string:

str_trim(examTextRaw)
## [1] "STAT101 module marks"                  
## [2] "29/06/20"                              
## [3] "student cwk exam"                      
## [4] "12015    55   62"                      
## [5] "12468    78   84"                      
## [6] "11560    55   40*"                     
## [7] "12589    62   --"                      
## [8] "* denotes resit attempt - capped at 40"
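
By default, str_trim() works on both sides of the string. The sketch below shows the side argument, together with the related function str_squish(), which also collapses repeated spaces inside the string. The string x is made up for illustration.

```r
library(stringr)

# Illustrative string with white space at both ends
x <- "  12589    62   --  "

# Trim one side only
str_trim(x, side = "left")
## [1] "12589    62   --  "
str_trim(x, side = "right")
## [1] "  12589    62   --"

# Trim both ends and reduce internal runs of spaces to a single space
str_squish(x)
## [1] "12589 62 --"
```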

14.4 Subsetting strings

We might want to search for some text at a particular place in a string, or just extract part of a string. We can do this with str_sub(). For example, to extract the module code from the data, we could do

str_sub(examTextRaw[1], start = 1, end = 7)
## [1] "STAT101"
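
A couple of further uses of str_sub() are sketched below: negative positions count backwards from the end of the string, and the function works on a whole vector at once. The objects title and dataLines are illustrative copies of lines from the file.

```r
library(stringr)

title <- "STAT101 module marks"

# Negative positions count from the end: the last five characters
str_sub(title, start = -5, end = -1)
## [1] "marks"

# Characters 5 to 7 of the module code
str_sub(title, start = 5, end = 7)
## [1] "101"

# str_sub() is vectorised: extract the student number from each line
dataLines <- c("12015    55   62", "12468    78   84")
str_sub(dataLines, start = 1, end = 5)
## [1] "12015" "12468"
```

Extracting by position like this only works if the text is aligned in the same way on every line, as it is in this file.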

14.5 Making a data frame

We’ll first do some operations to clean up the text:

examTextClean <- examTextRaw %>%
  str_remove_all(pattern = "\\*") %>%
  str_replace_all(pattern = "--",
                  replacement = "NA") %>%
  str_trim()

and then do some searching to see which lines we want to use in our data frame:

header <- str_which(examTextClean, pattern = "student")
endLine <- str_which(examTextClean, pattern = "denotes")

We can now make a data frame with

read_table(examTextClean[header:(endLine - 1)])
## # A tibble: 4 × 3
##   student   cwk  exam
##     <dbl> <dbl> <dbl>
## 1   12015    55    62
## 2   12468    78    84
## 3   11560    55    40
## 4   12589    62    NA
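
For comparison, here is a self-contained sketch of an alternative to read_table(): splitting each cleaned line on white space with str_split_fixed() and converting the pieces to numbers by hand. This is not the approach used above, just one possible way of doing the same job with stringr and base R; the names colNames, fields and examData are made up for illustration, and examTextRaw is typed in directly rather than read from the file.

```r
library(stringr)

# Illustrative copy of the imported lines (empty lines already removed)
examTextRaw <- c("STAT101 module marks",
                 "29/06/20",
                 "student cwk exam",
                 "12015    55   62",
                 "12468    78   84",
                 "11560    55   40* ",
                 "12589    62   -- ",
                 "* denotes resit attempt - capped at 40")

# Same cleaning steps as above, written one at a time
examTextClean <- str_remove_all(examTextRaw, pattern = "\\*")
examTextClean <- str_replace_all(examTextClean, pattern = "--",
                                 replacement = "NA")
examTextClean <- str_trim(examTextClean)

header <- str_which(examTextClean, pattern = "student")
endLine <- str_which(examTextClean, pattern = "denotes")

# Column names come from the header line
colNames <- str_split(examTextClean[header], pattern = "\\s+")[[1]]

# Split each data line into one piece per column (a character matrix)
fields <- str_split_fixed(examTextClean[(header + 1):(endLine - 1)],
                          pattern = "\\s+", n = length(colNames))

# Convert every column to numbers; the text "NA" becomes a missing value
examData <- as.data.frame(apply(fields, 2, as.numeric))
names(examData) <- colNames
examData
```

This gives an ordinary data frame rather than a tibble, and it assumes every column is numeric, which read_table() would otherwise work out for itself.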

14.6 Further reading

References

Wickham, Hadley. 2019a. Stringr: Simple, Consistent Wrappers for Common String Operations. https://CRAN.R-project.org/package=stringr.