Chapter 11 Multiple independent variables

Recall the PISA maths test data from Semester 1. Here, our dependent variable was a country’s maths test score, and we had multiple independent variables: gdp, income inequality (gini coefficient), homework hours, and school starting age.

We’ll now look at how to incorporate multiple independent variables in a linear model.

If the main interest is in, say, the effect of gdp, incorporating the other variables enables us to assess the effect of gdp whilst accounting for (sometimes described as “adjusting for” or “controlling for”) possible effects of those other variables.

We’ll first import the data, and create one extra column (wealthiest): a binary variable that indicates whether a country has a gdp higher than 17000.

maths <- read_csv("http://www.jeremy-oakley.staff.shef.ac.uk/mas113/maths.csv") %>%
  mutate(wealthiest = gdp > 17000)
head(maths)
## # A tibble: 6 × 8
##   country         continent     score   gdp  gini homework start.age wealthiest
##   <chr>           <chr>         <dbl> <dbl> <dbl>    <dbl>     <dbl> <lgl>     
## 1 Albania         Europe          413  4147  29        5.1         6 FALSE     
## 2 Algeria         Africa          360  3844  27.6     NA           6 FALSE     
## 3 Argentina       South America   409 12449  42.7      3.7         6 FALSE     
## 4 Australia       Oceanea         494 49928  34.7      6           5 TRUE      
## 5 Austria         Europe          497 44177  30.5      4.5         6 TRUE      
## 6 B-S-J-G (China) Asia            531  8123  42.2     13.8         6 FALSE

For illustration, we’ll treat start.age as a factor variable with three levels, and consider modelling the effects of gdp, gini, homework and factor(start.age) on score. How might we write down such a model?

When thinking about suitable notation:

  • use one letter for each quantitative independent variable;
  • use one additional subscript for each qualitative independent variable.

We imagine splitting the data up into three groups:

  • group \(i=1\) for countries with a start.age of 5;
  • group \(i=2\) for countries with a start.age of 6;
  • group \(i=3\) for countries with a start.age of 7.

Then, for the \(j\)-th observation (country) within group \(i\), define

  • \(Y_{ij}\) to be the country’s score;
  • \(w_{ij}\) to be the country’s gdp;
  • \(x_{ij}\) to be the country’s gini coefficient;
  • \(z_{ij}\) to be the mean number of homework hours per week in that country.

We can now write the model as \[ Y_{ij} = \mu + \tau_i + \beta_1 w_{ij} + \beta_2 x_{ij} + \beta_3 z_{ij}+ \varepsilon_{ij}, \] with \(\varepsilon_{ij}\sim N(0, \sigma^2)\). This model is overparameterised, so we apply the constraint \(\tau_1 = 0\).
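
R imposes this constraint automatically: with the default treatment contrasts, the first level of factor(start.age) (a start.age of 5) is taken as the baseline, and the design matrix only contains indicator columns for the levels 6 and 7. As an optional check, we can look at the first few rows of the design matrix that lm() will use:

# Columns factor(start.age)6 and factor(start.age)7 appear, but there is no
# column for the baseline level 5: this corresponds to the constraint tau_1 = 0.
head(model.matrix(score ~ gdp + gini + homework + factor(start.age), maths))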

We interpret \(\tau_i\), for \(i=2\) (or \(3\)), as the difference in mean score between two countries with identical values of gdp, gini and homework, where one has a school start.age of 6 (or 7) and the other has a school start.age of 5.

We fit the model in R as follows:

lmMaths <- lm(score ~ gdp + gini + homework + factor(start.age), maths)
summary(lmMaths)
## 
## Call:
## lm(formula = score ~ gdp + gini + homework + factor(start.age), 
##     data = maths)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -61.398 -17.884  -0.204  17.867  69.026 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.827e+02  3.743e+01  12.896  < 2e-16 ***
## gdp                 1.025e-03  2.181e-04   4.701 2.07e-05 ***
## gini               -2.634e+00  6.939e-01  -3.796 0.000399 ***
## homework            8.343e+00  2.337e+00   3.569 0.000802 ***
## factor(start.age)6  3.062e+00  1.971e+01   0.155 0.877183    
## factor(start.age)7  1.023e+01  2.118e+01   0.483 0.630987    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31.31 on 50 degrees of freedom
##   (14 observations deleted due to missingness)
## Multiple R-squared:  0.5947, Adjusted R-squared:  0.5542 
## F-statistic: 14.67 on 5 and 50 DF,  p-value: 7.547e-09

We read off from this \(\hat{\mu}=482.7\), \(\hat{\tau}_2= 3.06\), \(\hat{\tau}_3=10.23\), \(\hat{\beta}_1=0.00103\), \(\hat{\beta}_2=-2.634\), and \(\hat{\beta}_3=8.343\).
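
If we want these estimates to more decimal places, or want to use them in further calculations, we can extract them from the fitted model object rather than copying them from the printed summary:

# All estimated coefficients, stored in the fitted model object
coef(lmMaths)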

11.1 Causation versus association

Suppose we were interested in the effect of gdp only. Let’s try fitting the model with this independent variable only:

lmMathsGDP <- lm(score ~ gdp , maths)
summary(lmMathsGDP)
## 
## Call:
## lm(formula = score ~ gdp, data = maths)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -105.932  -27.488    5.583   25.925   95.609 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.239e+02  7.975e+00  53.153  < 2e-16 ***
## gdp         1.417e-03  2.322e-04   6.101 5.67e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 43.18 on 68 degrees of freedom
## Multiple R-squared:  0.3537, Adjusted R-squared:  0.3442 
## F-statistic: 37.22 on 1 and 68 DF,  p-value: 5.667e-08

Notice how the slope estimate for gdp has changed from that in lmMaths: it has increased by about 40% to 0.00142. Two points to note are

  1. gdp and gini are correlated (see the quick check below). This means that removing gini from the model is likely to change the estimated slope for gdp. Correlation between independent variables is something we try to avoid if we can design the experiment, i.e. if we can choose the values of the independent variables ourselves.
  2. If we are hoping to learn about a causal relationship between gdp and score, we need to include in the model other variables which may also affect score. If these other variables are omitted, we are (potentially) instead learning about the association between gdp and score. An interpretation of this is as follows.
    • lmMaths tells us that when we increase gdp by one unit, and all other variables stay the same, we expect score to increase by about 0.001 units. If we were confident that all other relevant variables were unchanged, then the change in score must have been caused by the change in gdp.
    • lmMathsGDP (together with our knowledge of the correlation between gdp and gini) tells us that when we increase gdp by one unit, we expect other variables such as gini to change as well. The combined change in all these variables means that we expect score to increase by about 0.0014 units. We cannot necessarily attribute the change in score to the change in gdp: it might have been caused by other variables changing.
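
As a quick check of point 1, we can compute the sample correlation between gdp and gini, using only the countries where both are recorded:

# Sample correlation between gdp and gini (complete pairs only)
cor(maths$gdp, maths$gini, use = "complete.obs")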

11.2 Analysis of Covariance

Recall this plot we produced in Semester 1, Chapter 1:

ggplot(maths, aes(x = gini, y = score, colour = wealthiest))+
  geom_point() +
  geom_smooth(method = "lm")

We have two fitted regression lines: one for the group wealthiest == TRUE (countries with gdp\(>17000\)), and one for the remaining countries. This sort of modelling with multiple regression lines is sometimes referred to as Analysis of Covariance (ANCOVA). The emphasis is usually on comparing means between groups, adjusting for different values of covariates between groups.

We will now consider how we would write down this model, and how to fit it in R. Again, the idea is to use an extra subscript for the qualitative variable (wealthiest): define \(Y_{ij}\) to be the score for the \(j\)th country in group \(i\), where \(i=1\) corresponds to wealthiest==FALSE and \(i=2\) corresponds to wealthiest==TRUE, with \(x_{ij}\) the corresponding gini value. Our model is \[ Y_{ij} = \mu + \tau_i + \beta_i x_{ij} + \varepsilon_{ij}, \] with \(\varepsilon_{ij}\sim N(0,\sigma^2)\). Again this is overparameterised, so we set \(\tau_1=0\). Note that the slope parameter for gini depends on whether wealthiest is FALSE or TRUE: we think of this as an interaction between gini and wealthiest. We fit the model in R as follows:

lmMathsWealthiest <- lm(score ~ gini * wealthiest, maths)
summary(lmMathsWealthiest)
## 
## Call:
## lm(formula = score ~ gini * wealthiest, data = maths)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -84.730 -17.966  -0.595  15.104 113.483 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         492.3368    32.5735  15.115   <2e-16 ***
## gini                 -1.7730     0.8507  -2.084   0.0415 *  
## wealthiestTRUE       75.0248    61.7823   1.214   0.2295    
## gini:wealthiestTRUE  -0.4358     1.8385  -0.237   0.8134    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36.19 on 59 degrees of freedom
##   (7 observations deleted due to missingness)
## Multiple R-squared:  0.5292, Adjusted R-squared:  0.5053 
## F-statistic: 22.11 on 3 and 59 DF,  p-value: 1.022e-09

We read off \(\hat{\mu}=492.3368\), \(\hat{\tau}_2 = 75.0248\) and \(\hat{\beta}_1 = -1.7730\).

The row gini:wealthiestTRUE corresponds to the difference \(\hat{\beta}_2- \hat{\beta}_1\), so we read off \(\hat{\beta}_2= -1.7730 -0.4358 = -2.2088\).
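
Equivalently, we can compute \(\hat{\beta}_2\) directly from the fitted model object, which avoids copying numbers by hand:

# Slope for the wealthiest == TRUE group: baseline gini slope plus the interaction term
coef(lmMathsWealthiest)["gini"] + coef(lmMathsWealthiest)["gini:wealthiestTRUE"]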

We’ll try to redraw the plot, to check we’ve interpreted everything correctly! (To see what the default colours are in hex format, try scales::show_col(scales::hue_pal()(2)).)

ggplot(maths, aes(x = gini, y = score, colour = wealthiest))+
  geom_point() +
  geom_abline(slope = -1.773, intercept = 492.3368, col = "#F8766D" )+
  geom_abline(slope = -1.773 - 0.4358, 
              intercept = 492.3368 + 75.0248,
              col = "#00BFC4" )

The formatting of the lines is slightly different, but otherwise this looks correct.
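
An alternative to typing the estimated coefficients into geom_abline() is to let predict() compute the fitted values, and then draw them with geom_line(). Here is a minimal sketch, using a new data frame (mathsComplete) that drops countries with a missing gini or wealthiest value, so that every remaining row has a fitted value:

# Fitted values from lmMathsWealthiest, for countries with complete gini and wealthiest
mathsComplete <- maths %>%
  filter(!is.na(gini), !is.na(wealthiest))
mathsComplete$fitted <- predict(lmMathsWealthiest, newdata = mathsComplete)
ggplot(mathsComplete, aes(x = gini, y = score, colour = wealthiest)) +
  geom_point() +
  geom_line(aes(y = fitted))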

Example 11.1 (Fitting a linear model in R: ANCOVA.)

Consider the built-in dataset mtcars (see ?mtcars for details).

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

We’ll be using the column am, which describes transmission type. To make it more readable, we’ll do

mtcars2 <- mtcars %>%
  mutate(transmission = factor(am, labels = c("automatic",  "manual")))
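
The factor() call above uses the ordering of the values in am (0 then 1), so "automatic" is the first level, which will be the baseline group \(i=1\) in the model below. We can check this with:

# The first level listed is the baseline group (tau_1 = 0)
levels(mtcars2$transmission)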

  1. Fit the following model in R:

\[ Y_{ij} = \mu + \tau_i + \beta_i x_{ij} + \varepsilon_{ij} \] where \(Y_{ij}\) is the fuel economy mpg of the \(j\)th car in group \(i\), and \(x_{ij}\) is the corresponding weight wt, for \(i=1,2\). Group \(i=1\) corresponds to automatic cars, and \(i=2\) corresponds to manual cars.

  2. For a car with weight of 3 units (1000 lbs), what is the expected change in fuel economy from changing from automatic to manual?

  3. We can plot the fitted model with the commands

ggplot(mtcars2, aes(x = wt, y = mpg, colour = transmission)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

Try to (approximately) redraw this plot, without using geom_smooth(): use geom_abline() instead.

  1. In R we do
lmCars <- lm(mpg ~ wt * transmission, mtcars2)
summary(lmCars)
## 
## Call:
## lm(formula = mpg ~ wt * transmission, data = mtcars2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6004 -1.5446 -0.5325  0.9012  6.0909 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            31.4161     3.0201  10.402 4.00e-11 ***
## wt                     -3.7859     0.7856  -4.819 4.55e-05 ***
## transmissionmanual     14.8784     4.2640   3.489  0.00162 ** 
## wt:transmissionmanual  -5.2984     1.4447  -3.667  0.00102 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.591 on 28 degrees of freedom
## Multiple R-squared:  0.833,  Adjusted R-squared:  0.8151 
## F-statistic: 46.57 on 3 and 28 DF,  p-value: 5.209e-11

From this, we read off \(\hat{\mu}=31.4161\), \(\hat{\beta}_1=-3.7859\), \(\hat{\tau}_2=14.8784\), \(\hat{\beta}_2=-3.7859 -5.2984 = -9.0843\).

  2. An automatic car with weight 3 units has estimated expected mpg: \[ E(Y) = \hat{\mu} +\hat{\beta}_1\times 3= 31.4161 -3.7859 \times 3 = 20.0584. \] If the car changes from automatic to manual, the estimated expected mpg is \[ \hat{\mu} +\hat{\tau}_2 +\hat{\beta}_2\times 3, \] so the change is estimated to be \[\hat{\tau}_2 + 3(\hat{\beta}_2 - \hat{\beta}_1) = 14.8784 -5.2984\times 3 = -1.0168, \] a small estimated decrease in fuel economy. (A quick check of this calculation with predict() is given after the plot code below.)
  3. We (approximately) reproduce the plot with the commands
ggplot(mtcars2, aes(x = wt, y = mpg, colour = transmission))+
  geom_point() +
  geom_abline(slope = -3.7859, intercept = 31.4161, col = "#F8766D" )+
  geom_abline(slope = -3.7859 -5.2984, 
              intercept = 31.4161 + 14.8784,
              col = "#00BFC4" )