Section 15 Processing multiple files

Sometimes, we may be working with data sets spread across multiple files, where the structure of the data within each file is the same or similar. It can be tempting to copy and paste a block of R code, once for each file, editing each block as necessary. Try to avoid this if you can! Your code may become messy and hard to read if you have lots of data files, and you may introduce bugs if you don’t edit each block correctly.

15.1 Repeating a process with a for loop

There are different ways we might get R to run the same block of code multiple times (with small changes each time). It may be a good idea to create your own function, but we’ll consider a simpler solution here, which is to put code inside a for loop. (There are more efficient methods, but for loops are easy to write and read, and will be sufficient for this module.)

A for loop has the basic syntax

for(i in 1:n){
  # code to be repeated goes here
}

Everything inside the curly brackets will be carried out n times; any instance of i will be replaced by 1, then 2, 3,…,n. If, for example, there is an x[i] inside the curly brackets, this code will be run first using x[1], then using x[2] and so on.
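As a small illustration of the x[i] idea, here is a minimal sketch using a made-up vector x. The names x and squares are chosen for this example only.

```r
# Hypothetical vector, used only for illustration
x <- c(2, 5, 9)

# Create somewhere to store the results before the loop starts
squares <- numeric(length(x))

# i takes the values 1, 2, 3 in turn
for (i in seq_along(x)) {
  squares[i] <- x[i]^2
}

squares
```

Here seq_along(x) generates the same values as 1:length(x) whenever x has at least one element.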

15.2 Example: cleaning two text files

As an example, we’ll continue with the fictitious student data, but now suppose there are two data files, stat101.txt and stat102.txt. We suppose each data set needs cleaning, and then we’d like to combine the two data frames.

15.2.1 The core code block

We have a ‘core’ block of code that we want to use multiple times, making small changes each time. The code to get a single data file into a data frame was as follows. (We’ll add an extra mutate() command to store the module code.)

examTextRaw <- read_lines("data/stat101.txt")

examTextClean <- examTextRaw %>%
  str_remove_all(pattern = "\\*") %>%
  str_replace_all(pattern = "--",
                  replacement = "NA") %>%
  str_trim()

header <- str_which(examTextClean, pattern = "student")
endLine <- str_which(examTextClean, pattern = "denotes")

read_table(examTextClean[header:(endLine - 1)]) %>%
  mutate(module = "stat101")

We want to run this block twice, once for the file stat101.txt and once for stat102.txt.

15.2.2 Identify the variables we need to specify

The only thing that changes between the two runs is the file name: stat101.txt the first time and stat102.txt the second. We will have to specify these in advance. In fact, we only need to specify the module codes:

modules <- c("stat101", "stat102")

and construct file paths using paste0():

paste0("data/", modules, ".txt")

## [1] "data/stat101.txt" "data/stat102.txt"
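Before looping over the files, it can be worth checking that each path points to a file that actually exists. The following is a minimal sketch using the base R functions file.path() and file.exists(); the name missingFiles is made up for this example.

```r
modules <- c("stat101", "stat102")

# file.path() joins folder and file names with "/"
filePaths <- file.path("data", paste0(modules, ".txt"))
filePaths

# Keep only the paths for which no file was found
# (this will be empty if every file is present)
missingFiles <- filePaths[!file.exists(filePaths)]
missingFiles
```

If missingFiles is not empty, the file names or the working directory probably need correcting before going any further.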

15.2.3 Create an empty list to store the results

Lists are useful here, as an element of a list can be any type of object. We make an empty one as follows:

moduleResults <- vector(mode = "list", length = length(modules))
moduleResults

## [[1]]
## NULL
## 
## [[2]]
## NULL

(We could have just specified length = 2, but try to avoid hard-coding numerical values like this if they may change at some point, e.g. if another module were to be added.)
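To illustrate why a list is a convenient container, the following sketch fills a list one element at a time using double square brackets. The object names and values are made up for illustration only.

```r
# Hypothetical names and values, for illustration only
items <- c("a", "b")
results <- vector(mode = "list", length = length(items))

# Each element can hold a different type of object
results[[1]] <- data.frame(id = c(1, 2), score = c(50, 60))  # a data frame
results[[2]] <- "some text"                                   # a character string

length(results)
class(results[[1]])
```

Note the double square brackets: results[[1]] refers to the object stored in the first element, which is what we want to fill in.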

15.2.4 Use a for loop to run the code block multiple times

The code to be repeated goes inside a for loop, with one element of the list moduleResults filled each time. (We’ll repeat the commands to set up the variables at the start.)

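# read_lines(), read_table(), the str_ functions and mutate() are
# all loaded with the tidyverse
library(tidyverse)

modules <- c("stat101", "stat102")
filePaths <- paste0("data/", modules, ".txt")
moduleResults <- vector(mode = "list", length = length(modules))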

for(i in seq_along(modules)){
  examTextRaw <- read_lines(filePaths[i])
  
  examTextClean <- examTextRaw %>%
    str_remove_all(pattern = "\\*") %>%
    str_replace_all(pattern = "--",
                    replacement = "NA") %>%
    str_trim()
  
  header <- str_which(examTextClean, pattern = "student")
  endLine <- str_which(examTextClean, pattern = "denotes")
  
  moduleResults[[i]] <- 
    read_table(examTextClean[header:(endLine - 1)]) %>%
    mutate(module = modules[i])
}
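Section 15.1 mentioned that writing your own function is another option. The following is a minimal sketch of that approach, under some loud assumptions: cleanModule() is a hypothetical helper, the file contents are invented and written to a temporary folder (the real stat101.txt and stat102.txt may be laid out differently), and base R functions (readLines(), gsub(), trimws(), grep(), read.table()) stand in for the tidyverse functions used above, so that the sketch runs without any packages loaded.

```r
# Invented file contents, written to a temporary folder for illustration;
# the real data files may be laid out differently.
dataDir <- tempdir()
writeLines(c("Results for stat101",
             "student cwk exam",
             "12015 55 62",
             "12589 62 --",
             "* denotes resit attempt - capped at 40"),
           file.path(dataDir, "stat101.txt"))
writeLines(c("Results for stat102",
             "student cwk exam",
             "12468 81 78",
             "11579 51 40*",
             "* denotes resit attempt - capped at 40"),
           file.path(dataDir, "stat102.txt"))

# cleanModule() is a hypothetical helper holding the steps that were
# inside the for loop, written with base R equivalents.
cleanModule <- function(module, dir) {
  examTextRaw <- readLines(file.path(dir, paste0(module, ".txt")))

  # Remove asterisks, replace "--" with "NA", trim white space
  examTextClean <- gsub("*", "", examTextRaw, fixed = TRUE)
  examTextClean <- gsub("--", "NA", examTextClean, fixed = TRUE)
  examTextClean <- trimws(examTextClean)

  header <- grep("student", examTextClean)
  endLine <- grep("denotes", examTextClean)

  # Read the lines from the header up to the line before the footnote
  result <- read.table(text = examTextClean[header:(endLine - 1)],
                       header = TRUE)
  result$module <- module
  result
}

# Apply the function to each module code; the result is a list
modules <- c("stat101", "stat102")
moduleResults <- lapply(modules, cleanModule, dir = dataDir)
moduleResults
```

As with the for loop, the result is a list with one data frame per module. The advantage of a function is that the cleaning steps are written and tested in one place.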

15.2.5 Convert the list to a data frame

As each data frame in the list has the same column headings, we can combine the list into a single data frame as follows:

do.call(rbind.data.frame, moduleResults)

## # A tibble: 7 × 4
##   student   cwk  exam module 
##     <dbl> <dbl> <dbl> <chr>  
## 1   12015    55    62 stat101
## 2   12468    78    84 stat101
## 3   11560    55    40 stat101
## 4   12589    62    NA stat101
## 5   12015    61    69 stat102
## 6   12468    81    78 stat102
## 7   11579    51    40 stat102
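If the dplyr package is loaded, bind_rows() is another way to stack a list of data frames. This is offered as an alternative sketch rather than the method used in this section; the small list below is built by hand from a few of the values shown in the output above, and the name allResults is made up for this example.

```r
library(dplyr)

# A small list of data frames, using some of the values printed above
moduleResults <- list(
  tibble(student = c(12015, 12468), cwk = c(55, 78),
         exam = c(62, 84), module = "stat101"),
  tibble(student = c(12015, 12468), cwk = c(61, 81),
         exam = c(69, 78), module = "stat102")
)

# Stack the data frames on top of each other, matching columns by name
allResults <- bind_rows(moduleResults)
allResults
```

Because bind_rows() matches columns by name, it will also cope if the columns appear in a different order in different data frames.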

15.3 Exercise

Exercise 15.1 There is a third module data file, stat103.txt. Modify the code above, so that all three files are imported, cleaned, and combined into a single data frame. Unlike the other two files, this file does not end with the line

* denotes resit attempt - capped at 40

so some parts of the code above will not work! Modify the code so that the single block inside the for loop will correctly process each text file.