Section 6 Data frames, tibbles and lists

6.1 Data frames

Data sets in R are organised in data frames. R has various built-in data frames that we can use as illustrations, for example, mtcars. The full data set has 32 rows, but we will just display the first 6, using the head() function:

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Each column has a name (mpg, cyl, etc.) which can be used to access the values in that column. The rows in this data frame are also named, but that is optional, and we usually select particular rows by their row number.
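As a small sketch of the point above, the column and row names of a data frame can be listed directly. Only base R functions are used, applied to the built-in mtcars data frame.

```r
# Column names of the built-in mtcars data frame
names(mtcars)

# The first three row names (here, car models)
head(rownames(mtcars), 3)

# Number of rows; row numbers run from 1 up to this value
nrow(mtcars)   # 32
```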

The mtcars data frame has columns of numeric quantities only (though some columns are really dummy variables representing factors), but it is common for data frames to have a mix of variable types. For example, in the CO2 data frame we have:

head(CO2)
##   Plant   Type  Treatment conc uptake
## 1   Qn1 Quebec nonchilled   95   16.0
## 2   Qn1 Quebec nonchilled  175   30.4
## 3   Qn1 Quebec nonchilled  250   34.8
## 4   Qn1 Quebec nonchilled  350   37.2
## 5   Qn1 Quebec nonchilled  500   35.3
## 6   Qn1 Quebec nonchilled  675   39.2
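One way to confirm the mix of variable types is to test each column in turn. The sketch below uses sapply(), which applies a function to every column; it is not otherwise introduced in this section and is just one of several ways to do this.

```r
# TRUE for columns holding numbers, FALSE otherwise
sapply(CO2, is.numeric)

# TRUE for columns stored as factors (categorical variables)
sapply(CO2, is.factor)
```

In CO2, the columns Plant, Type and Treatment are factors, while conc and uptake are numeric.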

The built-in data frames have help files with more information: try ?mtcars for example.

Built-in data frames are useful when giving examples, or asking for help online: everyone will have them, so code using a built-in data frame can be easy for others to run.

6.1.1 Making data frames

A data frame can be made within R using the function data.frame(), but more commonly we’ll make data frames by importing data into R (e.g. from .csv files). We’ll cover this in a later section.
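For illustration, here is a minimal sketch of data.frame() in use. The object name pets and its contents are made up for this example; each argument becomes a column, and all columns must have the same length.

```r
# A small made-up data frame with one row per pet
pets <- data.frame(
  name    = c("Rex", "Tom", "Polly"),
  species = c("dog", "cat", "parrot"),
  age     = c(3, 7, 12)
)

pets         # display the data frame
nrow(pets)   # 3
ncol(pets)   # 3
```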

6.1.2 Extracting rows and columns

Extracting data from data frames can be confusing, in that there are multiple ways to do it, and the format of the result (e.g. a vector or another data frame) can vary.

For now, one syntax for extracting a column is dataframe-name$column-name, for example:

mtcars$cyl
##  [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4

and a complete row can be extracted using its position (row number):

mtcars[2, ]
##               mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1    4    4

We’ll look more at extracting subsets of data frames later on.
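To make the point about result formats concrete, the sketch below checks what kind of object a few common extraction syntaxes return. The [[ ]] and [row, column] forms are not described above; they are included only for comparison.

```r
# $ and [[ ]] both return the column as a plain vector
is.vector(mtcars$cyl)                    # TRUE
identical(mtcars$cyl, mtcars[["cyl"]])   # TRUE

# Single square brackets with a column name return a one-column data frame
is.data.frame(mtcars["cyl"])             # TRUE

# Extracting a row returns a one-row data frame
is.data.frame(mtcars[2, ])               # TRUE

# A single value: row 2 of the column wt
mtcars[2, "wt"]                          # 2.875
```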

6.2 Tibbles

A tibble is a special type of data frame (Müller and Wickham 2020). When we view a tibble, R will normally show only the first 10 rows, and as many columns as fit on the screen (R will tell us if there are more columns). For example, a tibble version of the mtcars data frame would look like this:

## # A tibble: 32 × 11
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
##  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
##  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
##  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
##  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
##  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
##  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
##  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
##  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
## 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
## # … with 22 more rows

The third line of the output tells us what data type we have in each column (<dbl> is short for “double precision”, a numeric variable type).

In this module, it won’t make any difference whether our data are in a tibble format or ‘standard’ data frame, but there are some situations where they behave differently.
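As one example of a difference, consider selecting a single column with square brackets. This is a sketch that assumes the tibble package is installed; the object names df and tb are chosen just for this illustration.

```r
df <- mtcars                      # a standard data frame
tb <- tibble::as_tibble(mtcars)   # the same data as a tibble

# With a data frame, selecting one column this way gives a plain vector...
class(df[, "cyl"])   # "numeric"

# ...whereas with a tibble the result is still a tibble
class(tb[, "cyl"])   # "tbl_df" "tbl" "data.frame"

# Tibbles also do not keep row names: the car names are replaced by 1, 2, 3, ...
head(rownames(tb), 3)
```

Code that expects a vector from df[, "cyl"] may therefore need mtcars$cyl or tb[["cyl"]] instead when working with a tibble.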

If we want to force R to display all the rows of a tibble, we can use the print function:

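# Convert the data frame to a tibble (row names are dropped)
mtcarsTibble <- tibble::as_tibble(mtcars)
# n sets how many rows to print; here, all of them
print(mtcarsTibble, n = nrow(mtcarsTibble))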

6.3 Inspecting large data frames and tibbles

If we have a large number of columns in a data frame or tibble in R, it can be difficult to see exactly what data we’ve got. (The default display of a large tibble can still be overwhelming.) The str() function can be useful here (and will work with any type of R object):

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

A similar function, with slightly different output, is glimpse() from the tibble package:

tibble::glimpse(mtcars)
## Rows: 32
## Columns: 11
## $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
## $ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
## $ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,…
## $ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18…
## $ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
## $ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,…
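A few other base R functions give quicker, narrower summaries than str() or glimpse(). The selection below is only a suggestion, not an exhaustive list.

```r
# Number of rows and columns
dim(mtcars)    # 32 11

# Just the column names
names(mtcars)

# The last few rows, to complement head()
tail(mtcars, 3)

# Minimum, quartiles, mean and maximum of a single column
summary(mtcars$mpg)
```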

6.4 Lists

We’ll briefly mention another type of object known as a list. Whereas a data frame is essentially a table, combining columns of the same length, a list is a more general collection of objects, which can vary in size and type.

For example, we can create a list as follows:

mylist <- list(a = 1:10, b = "Monday")

We can view the names of the objects inside a list with the command

names(mylist)
## [1] "a" "b"

and access objects inside the list with the $ operator:

mylist$b
## [1] "Monday"
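Objects in a list can also be extracted with double square brackets, either by name or by position. The sketch below recreates the same list as above, so that it can be run on its own, and also shows that the two elements differ in length.

```r
mylist <- list(a = 1:10, b = "Monday")

length(mylist)       # 2: the list holds two objects
length(mylist$a)     # 10
length(mylist$b)     # 1

# Double square brackets extract one object, by name or by position
mylist[["a"]]        # the same as mylist$a
mylist[[2]]          # "Monday"

# Single square brackets return a (shorter) list instead
class(mylist["b"])   # "list"
```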

A data frame behaves like a list: it is a list of columns (although there are extra things you can do with data frames that you can’t do with lists).
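This can be checked directly in a short sketch, using the built-in mtcars data frame: the list tools shown above work on a data frame, with each column playing the role of one object in the list.

```r
is.list(mtcars)     # TRUE: a data frame is a list underneath
length(mtcars)      # 11: one list element per column

# List-style extraction gives the same result as $
identical(mtcars[["mpg"]], mtcars$mpg)   # TRUE

# Every column has the same length, equal to the number of rows
sapply(mtcars, length)
```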

Exercise 6.1 The data frame morley contains data from experiments measuring the speed of light. Find the mean recorded speed of light, for all 100 observations.

6.5 Further reading

At some point, it is worth understanding in more detail how subsetting works. A good reference is Chapter 4 of Advanced R (Wickham 2019).

References

Müller, Kirill, and Hadley Wickham. 2020. Tibble: Simple Data Frames. https://CRAN.R-project.org/package=tibble.