Section 15 Processing multiple files
Sometimes, we may be working with data sets spread across multiple files, where the structure of the data within each file is the same or similar. It can be tempting to copy and paste a block of R code, once for each file, editing each block as necessary. Try to avoid this if you can! Your code may get messy/hard to read if you have lots of data files, and it may lead to bugs if you don’t edit each block correctly.
15.1 Repeating a process with a for loop
There are different ways we might get R to run the same block of code multiple times (with small changes each time). It may be a good idea to create your own function, but we’ll consider a simpler solution here, which is to put code inside a for loop. (There are more efficient methods, but for loops are easy to write and read, and will be sufficient for this module.)
A for loop has the basic syntax
for(i in 1:n){
}
- Everything inside the curly brackets will be carried out n times;
- any instance of i will be replaced by 1, then 2, 3, …, n. If, for example, there is an x[i] inside the curly brackets, this code will be run first using x[1], then using x[2] and so on.
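As a small illustration of the two points above, here is a minimal sketch in base R. The vector x and the results vector squares are made up for this example.

```r
# A made-up vector to loop over
x <- c(10, 20, 30)
n <- length(x)

# Somewhere to store the result from each pass through the loop
squares <- numeric(n)

for(i in 1:n){
  # On the first pass i is 1, so this uses x[1]; then x[2]; then x[3]
  squares[i] <- x[i]^2
}

squares
```

The body of the loop runs three times, and each pass fills in one element of squares.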
15.2 Example: cleaning two text files
As an example, we’ll continue with the fictitious student data, but now suppose there are two data files, stat101.txt and stat102.txt. We suppose each data set needs cleaning, and then we’d like to combine the two data frames.
15.2.1 The core code block
We have a ‘core’ block of code that we want to use multiple times, making small changes each time. The code to get a single data file into a data frame was as follows. (We’ll add an extra mutate() command to store the module code.)
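library(tidyverse)  # provides read_lines(), str_*(), read_table(), mutate() and %>%

# Read the raw text, one element per line
examTextRaw <- read_lines("data/stat101.txt")

# Remove asterisks, recode "--" as NA, and trim white space
examTextClean <- examTextRaw %>%
  str_remove_all(pattern = "\\*") %>%
  str_replace_all(pattern = "--",
                  replacement = "NA") %>%
  str_trim()

# Locate the header line and the footnote line
header <- str_which(examTextClean, pattern = "student")
endLine <- str_which(examTextClean, pattern = "denotes")

# Read the lines from the header up to (not including) the footnote
read_table(examTextClean[header:(endLine - 1)]) %>%
  mutate(module = "stat101")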
We want to run this block twice, once for the file stat101.txt and once for stat102.txt.
15.2.2 Identify the variables we need to specify
The code block above needs to run twice, once using the file name stat101.txt, and once using the name stat102.txt. We will have to specify these in advance. We can actually just specify module codes:
modules <- c("stat101", "stat102")
and construct file paths using paste0():
filePaths <- paste0("data/", modules, ".txt")
filePaths
## [1] "data/stat101.txt" "data/stat102.txt"
15.2.3 Create an empty list to store the results
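Lists are useful here, as an element of a list can be any type of object. We make an empty one, with one element per module, as follows:
moduleResults <- vector(mode = "list", length = length(modules))
moduleResults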
## [[1]]
## NULL
##
## [[2]]
## NULL
(We could have just specified length = 2, but try to avoid specifying numerical values like this if they may change at some point, e.g. if another module were to be added.)
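To illustrate both points (a list can hold any type of object, and its length can follow the number of modules), here is a minimal sketch in base R. The contents placed in the list and the extended vector modules3 are made up for this example. The use of seq_along() is a suggestion rather than something this section relies on.

```r
modules <- c("stat101", "stat102")
moduleResults <- vector(mode = "list", length = length(modules))

# Elements of a list do not have to be of the same type
moduleResults[[1]] <- data.frame(student = 12015, cwk = 55)
moduleResults[[2]] <- "any other object, e.g. a character string"

# If a module is added, the list length follows automatically
modules3 <- c(modules, "stat103")   # hypothetical extended vector
length(vector(mode = "list", length = length(modules3)))

# The same idea applies to the loop index: seq_along(modules3)
# gives 1, 2, 3 without a hard-coded upper limit
seq_along(modules3)
```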
15.2.4 Use a for loop to run the code block multiple times
The code to be repeated goes inside a for loop, with one element of the list moduleResults filled each time. (We’ll repeat the commands to set up the variables at the start.)
modules <- c("stat101", "stat102")
filePaths <- paste0("data/", modules, ".txt")
moduleResults <- vector(mode = "list", length = length(modules))

for(i in 1:length(modules)){
  # Read and clean the i-th file
  examTextRaw <- read_lines(filePaths[i])
  examTextClean <- examTextRaw %>%
    str_remove_all(pattern = "\\*") %>%
    str_replace_all(pattern = "--",
                    replacement = "NA") %>%
    str_trim()
  header <- str_which(examTextClean, pattern = "student")
  endLine <- str_which(examTextClean, pattern = "denotes")

  # Store the cleaned data frame as the i-th element of the list
  moduleResults[[i]] <-
    read_table(examTextClean[header:(endLine - 1)]) %>%
    mutate(module = modules[i])
}
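The start of this section mentions writing your own function as an alternative to a for loop. The sketch below shows roughly what that could look like. So that it runs without the data files or any packages, it uses base R equivalents (gsub(), trimws(), grep(), read.table()) in place of the tidyverse functions above. The function name cleanModuleText and the raw text layout are assumptions made for illustration.

```r
# Hypothetical helper: clean the raw text lines for one module and
# return a data frame with a module column added
cleanModuleText <- function(rawLines, moduleCode){
  cleanText <- gsub("\\*", "", rawLines)     # remove asterisks
  cleanText <- gsub("--", "NA", cleanText)   # recode missing marks as NA
  cleanText <- trimws(cleanText)             # trim white space
  header <- grep("student", cleanText)
  endLine <- grep("denotes", cleanText)
  result <- read.table(text = cleanText[header:(endLine - 1)],
                       header = TRUE)
  result$module <- moduleCode
  result
}

# Made-up raw text standing in for the two files (layout assumed)
rawText <- list(
  stat101 = c("Exam results",
              "student cwk exam",
              "12015 55 62",
              "11560 55 40*",
              "12589 62 --",
              "* denotes resit attempt - capped at 40"),
  stat102 = c("Exam results",
              "student cwk exam",
              "12015 61 69",
              "11579 51 40*",
              "* denotes resit attempt - capped at 40")
)

# Apply the function once per module; Map() returns a named list
moduleResults <- Map(cleanModuleText, rawText, names(rawText))
moduleResults$stat101
```

The body of the function is the same sequence of steps as the body of the loop. The difference is that the parts that change (the text and the module code) become arguments.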
15.2.5 Convert the list to a data frame
As each data frame in the list has the same column headings, we can convert the list to a single data frame, with the rows from each module stacked one after the other. The result is:
## # A tibble: 7 × 4
## student cwk exam module
## <dbl> <dbl> <dbl> <chr>
## 1 12015 55 62 stat101
## 2 12468 78 84 stat101
## 3 11560 55 40 stat101
## 4 12589 62 NA stat101
## 5 12015 61 69 stat102
## 6 12468 81 78 stat102
## 7 11579 51 40 stat102
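The command that produced this output is not shown above. One way to stack a list of data frames with matching columns, assuming the dplyr package is available, is bind_rows(). The sketch below uses a small made-up list rather than the real moduleResults, so it can be run on its own.

```r
library(dplyr)

# A made-up list of two data frames with the same column headings
exampleResults <- list(
  data.frame(student = c(12015, 12468), cwk = c(55, 78),
             exam = c(62, 84), module = "stat101"),
  data.frame(student = c(12015, 12468), cwk = c(61, 81),
             exam = c(69, 78), module = "stat102")
)

# Stack the data frames row-wise; columns are matched by name
combined <- bind_rows(exampleResults)
combined
```

With the list built by the for loop, the corresponding call would be bind_rows(moduleResults).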
15.3 Exercise
Exercise 15.1 There is a third module data file, stat103.txt. Modify the code above, so that all three files are imported, cleaned, and combined into a single data frame. Unlike the other two files, this file does not end with the line
* denotes resit attempt - capped at 40
so some parts of the code above will not work! Modify the code so that the single block inside the for loop will correctly process each text file.