Section 14 Strings
Here, we’ll look at working with text and strings, which can be more awkward to handle than purely numerical data. As an example, we’ll consider a data set provided as a plain text file, with some formatting problems that we’ll need to deal with.
14.1 Example: (fictitious) exam mark data
We’ll consider a small text file of fictitious data. The file is called stat101.txt, and looks like this.
STAT101 module marks
29/06/20
student cwk exam
12015 55 62
12468 78 84
11560 55 40*
12589 62 --
* denotes resit attempt - capped at 40
Some problems with working with this data in R would be
- awkward labels attached to numbers (e.g. 40*);
- handling of missing data: here -- has been used, whereas R uses NA;
- lines of text both before and after the data.
It may be tempting to edit a file such as this by hand in a text editor. Try not to do this! It may not be practical if the file is large, and it’s hard to keep a good record of what edits you made.
14.2 Importing text data with read_lines()
If the text file contained only a table (and so looked like a data frame), we could use readr::read_table() to import it. But if we need to do any processing of the text first, a better option may be to first import the file with readr::read_lines(). This will create a vector of character strings, where each element is one whole line of the text file.
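The call that produced the output below is not shown. A minimal sketch of it follows, assuming the result is stored as examTextRaw (the name used later in this chapter). So that the sketch runs on its own, it first writes a copy of the file to a temporary folder; the name examFile and the exact spacing within each line are assumptions.

```r
library(readr)

# Recreate the example file in a temporary folder (assumption: in practice,
# stat101.txt would already exist and this step would not be needed)
examFile <- file.path(tempdir(), "stat101.txt")
writeLines(c(
  "STAT101 module marks",
  "29/06/20",
  "",
  "student cwk exam",
  "12015 55 62",
  "12468 78 84",
  "11560 55 40* ",
  "12589 62 -- ",
  "",
  "* denotes resit attempt - capped at 40"
), examFile)

# Import the file: one element of the vector per line of text
examTextRaw <- read_lines(examFile)
examTextRaw
```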
## [1] "STAT101 module marks"
## [2] "29/06/20"
## [3] ""
## [4] "student cwk exam"
## [5] "12015 55 62"
## [6] "12468 78 84"
## [7] "11560 55 40* "
## [8] "12589 62 -- "
## [9] ""
## [10] "* denotes resit attempt - capped at 40"
- as with other commands for importing data, read_lines() can import files directly from websites - just give the full URL;
- we can use the argument skip to skip lines at the start (we might skip lines 1-3 here, but I will leave them in for now);
- the argument skip_empty_rows is FALSE by default, but I will set it to TRUE to skip rows 3 and 9:
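The import with empty rows skipped might look like the sketch below. As before, the temporary copy of the file (examFile) is an assumption made so that the code is self-contained.

```r
library(readr)

# Recreate the example file in a temporary folder (assumption: in practice,
# stat101.txt would already exist and this step would not be needed)
examFile <- file.path(tempdir(), "stat101.txt")
writeLines(c(
  "STAT101 module marks",
  "29/06/20",
  "",
  "student cwk exam",
  "12015 55 62",
  "12468 78 84",
  "11560 55 40* ",
  "12589 62 -- ",
  "",
  "* denotes resit attempt - capped at 40"
), examFile)

# Import the file again, this time dropping the two empty lines
examTextRaw <- read_lines(examFile, skip_empty_rows = TRUE)
examTextRaw
```

From here on, line numbers refer to this shorter vector, so the column headings are in element 3.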
14.3 Finding (and replacing) characters in strings
We can test for equality of entire strings in the usual way, e.g.
## [1] TRUE FALSE
but here, we will want to search within a string for some text, e.g., how to determine which elements of x contain the word red? Here, we will make use of the stringr package (Wickham 2019a).
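To make the distinction concrete, here is a small sketch. The vector x is hypothetical (the original definition is not shown); it is chosen so that the equality test gives TRUE FALSE, as in the output above, while both elements contain the word red.

```r
library(stringr)

# Hypothetical example vector (an assumption, not the original x)
x <- c("red", "dark red")

# Equality compares whole strings, so only the first element matches
isRed <- x == "red"
isRed

# Searching within each string finds "red" in both elements
hasRed <- str_which(x, pattern = "red")
hasRed
```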
14.3.1 Finding text with str_which()
Suppose we want to find the line with the column headings. Here, we will do this by finding which line contains the text student (assuming we know that’s what we need to look for):
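A sketch of the search follows. So that it runs on its own, examTextRaw is typed in by hand from the printed output rather than read from the file; the exact spacing within each line is an assumption.

```r
library(stringr)

# The imported lines (empty rows already skipped), typed in by hand
examTextRaw <- c(
  "STAT101 module marks",
  "29/06/20",
  "student cwk exam",
  "12015 55 62",
  "12468 78 84",
  "11560 55 40* ",
  "12589 62 -- ",
  "* denotes resit attempt - capped at 40"
)

# Return the positions of the elements that contain "student"
headerLine <- str_which(examTextRaw, pattern = "student")
headerLine
```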
## [1] 3
14.3.2 Escape characters and regular expressions
Some characters have special meaning, which makes it harder to search for them. If we wanted to find lines containing *, this won’t work:
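The failing call is not shown; presumably it passed a bare asterisk as the pattern. The sketch below assumes this, and wraps the call in tryCatch() so that the error is captured rather than stopping the script. The two-element vector and the name result are hypothetical.

```r
library(stringr)

# Two of the imported lines, typed in by hand for illustration
examLines <- c("12468 78 84", "11560 55 40* ")

# In a regular expression, * means "repeat the previous item", so a
# pattern consisting only of * is not valid and the search fails
result <- tryCatch(
  str_which(examLines, pattern = "*"),
  error = function(e) conditionMessage(e)
)
result
```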
Here, we have to insert two backslash symbols before the *, so that R understands we are searching for a literal *:
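A sketch of the working search, with examTextRaw again typed in by hand (spacing assumed). The second call, using stringr's fixed() to switch off the special meaning of all characters in the pattern, is an alternative that the text above does not discuss.

```r
library(stringr)

# The imported lines (empty rows already skipped), typed in by hand
examTextRaw <- c(
  "STAT101 module marks",
  "29/06/20",
  "student cwk exam",
  "12015 55 62",
  "12468 78 84",
  "11560 55 40* ",
  "12589 62 -- ",
  "* denotes resit attempt - capped at 40"
)

# "\\*" in R code is the regular expression \*, i.e. a literal asterisk
starLines <- str_which(examTextRaw, pattern = "\\*")
starLines

# Alternative: treat the pattern as plain text rather than a regular expression
starLinesFixed <- str_which(examTextRaw, pattern = fixed("*"))
```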
## [1] 6 8
To do more complicated searches, one can make use of regular expressions. Regular expressions describe particular patterns of text, and can be used in many different programming languages. We won’t cover these here, but further reading is given at the end of this chapter.
14.3.3 Replacing or removing text
Suppose we want to replace -- by NA, to indicate a missing value. We do
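The call is not shown; given the argument names used later in this chapter, it was probably close to the sketch below. For brevity, the sketch uses only two of the lines, typed in by hand, and the names examLines and replaced are hypothetical.

```r
library(stringr)

# Two of the imported lines, typed in by hand for illustration
examLines <- c("12468 78 84", "12589 62 -- ")

# Replace every occurrence of "--" with the text "NA"
replaced <- str_replace_all(examLines,
                            pattern = "--",
                            replacement = "NA")
replaced
```

Applied to the full vector examTextRaw, the same call gives the output below.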
## [1] "STAT101 module marks"
## [2] "29/06/20"
## [3] "student cwk exam"
## [4] "12015 55 62"
## [5] "12468 78 84"
## [6] "11560 55 40* "
## [7] "12589 62 NA "
## [8] "* denotes resit attempt - capped at 40"
To delete text, we can either set replacement = "" in the above or use str_remove_all(). For example, to get rid of the asterisks, we would do
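A sketch of both options, again on two hand-typed lines (examLines, removed and removedAlt are hypothetical names):

```r
library(stringr)

# Two of the imported lines, typed in by hand for illustration
examLines <- c("11560 55 40* ",
               "* denotes resit attempt - capped at 40")

# Remove every asterisk (escaped, as it is a special character)
removed <- str_remove_all(examLines, pattern = "\\*")
removed

# Equivalent: replace every asterisk with an empty string
removedAlt <- str_replace_all(examLines,
                              pattern = "\\*",
                              replacement = "")
```

Applied to the full vector examTextRaw, the first call gives the output below.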
## [1] "STAT101 module marks"
## [2] "29/06/20"
## [3] "student cwk exam"
## [4] "12015 55 62"
## [5] "12468 78 84"
## [6] "11560 55 40 "
## [7] "12589 62 -- "
## [8] " denotes resit attempt - capped at 40"
14.3.4 Removing white space at start/end of strings
Blank spaces at the end of a string can cause problems, as R might think there is an extra column of data. We can get rid of these with str_trim()
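As a sketch, on two hand-typed lines with a trailing space (examLines and trimmed are hypothetical names):

```r
library(stringr)

# Two of the imported lines, each ending in a blank space
examLines <- c("11560 55 40* ", "12589 62 -- ")

# By default, white space is removed from both ends of each string
trimmed <- str_trim(examLines)
trimmed

# The side argument restricts trimming to one end only
trimmedRight <- str_trim(examLines, side = "right")
```

Applied to the full vector examTextRaw, the first call gives the output below.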
## [1] "STAT101 module marks"
## [2] "29/06/20"
## [3] "student cwk exam"
## [4] "12015 55 62"
## [5] "12468 78 84"
## [6] "11560 55 40*"
## [7] "12589 62 --"
## [8] "* denotes resit attempt - capped at 40"
14.4 Subsetting strings
We might want to search for some text at a particular place in a string, or just extract part of a string. We can do this with str_sub(). For example, to extract the module code from the data, we could do
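The call is not shown. One way to do it, assuming we know that the module code occupies the first seven characters of the first line, is sketched below; the name moduleCode is hypothetical.

```r
library(stringr)

# First two lines of the imported text, typed in by hand
examTextRaw <- c("STAT101 module marks", "29/06/20")

# Extract characters 1 to 7 of the first element
moduleCode <- str_sub(examTextRaw[1], start = 1, end = 7)
moduleCode
```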
## [1] "STAT101"
14.5 Making a data frame
We’ll first do some operations to clean up the text:
# Remove asterisks, mark missing values as NA, then trim white space
examTextClean <- examTextRaw %>%
  str_remove_all(pattern = "\\*") %>%
  str_replace_all(pattern = "--",
                  replacement = "NA") %>%
  str_trim()
and then do some searching to see which lines we want to use in our data frame:
header <- str_which(examTextClean, pattern = "student")
endLine <- str_which(examTextClean, pattern = "denotes")
We can now make a data frame with
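The call used here is not shown. Since the output below is a tibble, it was probably a readr function; the sketch below swaps in base R's read.table() instead, which accepts lines of text directly through its text argument. As a result, it returns an ordinary data frame with integer columns rather than a tibble with double columns. The names dataLines and marks are hypothetical, and examTextRaw is typed in by hand (spacing assumed) so that the sketch runs on its own.

```r
library(stringr)

# The imported lines (empty rows already skipped), typed in by hand
examTextRaw <- c(
  "STAT101 module marks",
  "29/06/20",
  "student cwk exam",
  "12015 55 62",
  "12468 78 84",
  "11560 55 40* ",
  "12589 62 -- ",
  "* denotes resit attempt - capped at 40"
)

# Clean the text and locate the table, as described above
examTextClean <- examTextRaw %>%
  str_remove_all(pattern = "\\*") %>%
  str_replace_all(pattern = "--", replacement = "NA") %>%
  str_trim()
header <- str_which(examTextClean, pattern = "student")
endLine <- str_which(examTextClean, pattern = "denotes")

# Keep the heading line and the data lines, stopping before the footnote
dataLines <- examTextClean[header:(endLine - 1)]

# Parse the lines as a white-space separated table with a header row
marks <- read.table(text = dataLines, header = TRUE)
marks
```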
## # A tibble: 4 × 3
## student cwk exam
## <dbl> <dbl> <dbl>
## 1 12015 55 62
## 2 12468 78 84
## 3 11560 55 40
## 4 12589 62 NA
14.6 Further reading
For more on stringr, see Chapter 14 of R for Data Science. In particular, see Section 14.3 on regular expressions.