Section 6 Data frames, tibbles and lists
6.1 Data frames
Data sets in R are organised in data frames. R has various built-in data frames that we can use as illustrations, for example, mtcars. The full data set has 32 rows, but we will just display the first 6, using the head() function:
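The table below is the kind of output produced by calling head() on the data frame. A minimal sketch using only base R (n = 6 is the default and is written out here only for clarity):

```r
# Show the first six rows of the built-in mtcars data frame.
# n = 6 is the default, so head(mtcars) alone gives the same result.
head(mtcars, n = 6)

# nrow() confirms the size of the full data set.
nrow(mtcars)
```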
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
Each column has a name (mpg, cyl, etc.) which can be used to access the values in that column. The rows in this data frame are also named, but that is optional, and we usually select particular rows by their row number.
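As a brief sketch of how these names can be inspected, base R provides names() for the column names and rownames() for the row names:

```r
# Column names of mtcars
names(mtcars)

# First few row names (here, the car models)
head(rownames(mtcars))
```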
The mtcars data frame has columns of numeric quantities only (though some columns are really dummy variables that represent factors), but it is common for data frames to have a mix of variable types. For example, in the CO2 data frame we have
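The output below can be reproduced with head(). The class() calls are an optional extra, included here to confirm that the columns are of different types:

```r
# First six rows of the built-in CO2 data frame
head(CO2)

# Type is stored as a factor, conc as a number
class(CO2$Type)
class(CO2$conc)
```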
##   Plant   Type  Treatment conc uptake
## 1   Qn1 Quebec nonchilled   95   16.0
## 2   Qn1 Quebec nonchilled  175   30.4
## 3   Qn1 Quebec nonchilled  250   34.8
## 4   Qn1 Quebec nonchilled  350   37.2
## 5   Qn1 Quebec nonchilled  500   35.3
## 6   Qn1 Quebec nonchilled  675   39.2
The built-in data frames have help files with more information: try ?mtcars, for example.
Built-in data frames are useful when giving examples, or asking for help online: everyone will have them, so code using a built-in data frame can be easy for others to run.
6.1.1 Making data frames
A data frame can be made within R using the function data.frame(), but more commonly, we’ll be making data frames by importing data into R (e.g. from .csv files). We’ll cover this in a later section.
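As a minimal sketch of data.frame(), the example below builds a small data frame from two vectors of equal length. The object name marks and the column names student and mark are made up for illustration.

```r
# Each argument becomes one column; all columns must have the same length.
marks <- data.frame(
  student = c("A", "B", "C"),
  mark = c(62, 71, 58)
)

marks
```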
6.1.2 Extracting rows and columns
Extracting data from data frames can be confusing, in that there are multiple ways to do it, and the format of the result (e.g. a vector or another data frame) can vary.
For now, one syntax for extracting a column is dataframe-name$column-name, for example:
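Applied to mtcars, the line below extracts the cyl column, which gives the output shown next:

```r
# Extract the cyl column; the result is a plain numeric vector, not a data frame.
mtcars$cyl
```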
## [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
and a complete row can be extracted using its position (row number):
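The output below corresponds to the second row. A sketch using square-bracket indexing, where the row number goes before the comma and the column position is left blank to keep every column:

```r
# Row 2, all columns; the result is a one-row data frame.
mtcars[2, ]
```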
##               mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1    4    4
We’ll look more at extracting subsets of data frames later on.
6.2 Tibbles
A tibble is a special type of data frame (Müller and Wickham 2020). When we view a tibble, R will normally only show the first 10 rows, and as many columns as can fit on the screen (but R will tell us if there are more columns). For example, a tibble version of the mtcars data frame would look like this:
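One way to produce such a display, assuming the tibble package is installed, is as_tibble(). The object name mtcars_tbl is an arbitrary choice for this sketch.

```r
library(tibble)

# Convert the data frame to a tibble. By default the row names are dropped,
# which is why the car names do not appear in the display.
mtcars_tbl <- as_tibble(mtcars)

mtcars_tbl
```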
## # A tibble: 32 × 11
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
##  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
##  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
##  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
##  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
##  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
##  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
##  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
##  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
## 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
## # ℹ 22 more rows
The third line of the output tells us what data type we have in each column (<dbl> is short for “double precision”, a numeric variable type).
In this module, it won’t make any difference whether our data are in tibble format or in a ‘standard’ data frame, but there are some situations where the two behave differently.
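One such difference, sketched below on the assumption that the tibble package is installed, is what single-column extraction with square brackets returns: a data frame simplifies to a vector, whereas a tibble stays a tibble.

```r
library(tibble)

mtcars_tbl <- as_tibble(mtcars)  # mtcars_tbl is an arbitrary name

# A standard data frame drops down to a plain vector...
class(mtcars[, "mpg"])

# ...while a tibble returns a one-column tibble.
class(mtcars_tbl[, "mpg"])
```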
If we want to force R to display all the rows of a tibble, we can use the print() function:
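A minimal sketch, assuming the tibble package is installed: the n argument of print() sets how many rows are shown, and n = Inf asks for all of them.

```r
library(tibble)

mtcars_tbl <- as_tibble(mtcars)  # mtcars_tbl is an arbitrary name

# Print every row of the tibble instead of only the first 10.
print(mtcars_tbl, n = Inf)
```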
6.3 Inspecting large data frames and tibbles
If we have a large number of columns in a data frame or tibble in R, it can be difficult to see exactly what data we’ve got. (The default display of a large tibble can still be overwhelming.) The str() function can be useful here (and will work with any type of R object):
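For mtcars, the call that gives the summary below is simply:

```r
# One line per column: name, type, and the first few values.
str(mtcars)
```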
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
A similar function, with slightly different output, is glimpse() from the dplyr package:
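The output below has the format printed by glimpse(). A sketch, assuming the dplyr package is installed:

```r
library(dplyr)

# Like str(), but each column's values run across the line up to the console width.
glimpse(mtcars)
```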
## Rows: 32
## Columns: 11
## $ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
## $ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
## $ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,…
## $ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18…
## $ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
## $ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,…
6.4 Lists
We’ll briefly mention another type of object known as a list. Whereas a data frame is essentially a table, combining columns of the same length, a list is a more general collection of objects, which can vary in size and type.
For example, we can create a list as follows:
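The original code is not shown here, so the following is only an illustrative sketch. The names a and b and the value "Monday" agree with the output further down; the object name x, the contents of b, and which element holds "Monday" are assumptions.

```r
# A list with two named elements of different type and length.
x <- list(
  a = "Monday",      # a single character string
  b = c(1, 2, 3)     # a numeric vector (assumed contents)
)

x
```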
We can view the names of the objects inside a list with the names() function:
## [1] "a" "b"
and access objects inside the list with the $ operator:
## [1] "Monday"
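Both steps can be sketched together. The list x is re-created here so that the example runs on its own; its contents are an assumption, chosen to agree with the output above. Double square brackets are an equivalent way to extract one element.

```r
x <- list(a = "Monday", b = c(1, 2, 3))  # assumed example list

# Names of the elements
names(x)

# Extract one element by name with $ ...
x$a

# ... or, equivalently, with double square brackets.
x[["a"]]
```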
A data frame behaves like a list: it is a list of columns (although there are extra things you can do with data frames that you can’t do with lists).
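A short sketch of this, using only base R: the list tools from this section work directly on mtcars.

```r
# A data frame is stored as a list, with one element per column.
is.list(mtcars)

# So length() counts columns, and [[ ]] extracts a column just as $ does.
length(mtcars)
identical(mtcars[["cyl"]], mtcars$cyl)
```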
Exercise 6.1 The data frame morley contains data from experiments measuring the speed of light. Find the mean recorded speed of light, for all 100 observations.
6.5 Further reading
At some point, it is worth understanding in more detail how subsetting works. A good reference is Chapter 4 of Advanced R (Wickham 2019).