Chapter 11 Multiple independent variables
Recall the PISA maths test data from Semester 1. Here, our dependent variable was a country’s maths test score, and we had multiple independent variables: gdp, income inequality (gini coefficient), homework hours, and school starting age.
We’ll now look at how to incorporate multiple independent variables in a linear model.
If the main interest is in, say, the effect of gdp, incorporating the variables enables us to assess the effect of gdp, whilst accounting for (sometimes described as “adjusting for” or “controlling for”) possible effects of other variables.
We’ll first import the data, and create one extra column (wealthiest
): a binary variable that indicates whether a country as a gdp
higher and 17000.
<- read_csv("http://www.jeremy-oakley.staff.shef.ac.uk/mas113/maths.csv") %>%
maths mutate(wealthiest = gdp > 17000)
head(maths)
## # A tibble: 6 × 8
## country continent score gdp gini homework start.age wealthiest
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>
## 1 Albania Europe 413 4147 29 5.1 6 FALSE
## 2 Algeria Africa 360 3844 27.6 NA 6 FALSE
## 3 Argentina South America 409 12449 42.7 3.7 6 FALSE
## 4 Australia Oceanea 494 49928 34.7 6 5 TRUE
## 5 Austria Europe 497 44177 30.5 4.5 6 TRUE
## 6 B-S-J-G (China) Asia 531 8123 42.2 13.8 6 FALSE
For illustration, we’ll treat start.age
as a factor variable with three levels, and consider modelling the effects of gdp
, gini
, homework
and factor(start.age)
on score
. How might we write down such a model?
When thinking about suitable notation:
- use one letter for each quantitative independent variable;
- use one additional subscript for each qualitative independent variable.
We imagine splitting the data up into three groups:
- group \(i=1\) for countries with a
start.age
of 5; - group \(i=2\) for countries with a
start.age
of 6; - group \(i=3\) for countries with a
start.age
of 7.
Then, for the \(j\)-th observation (country) within group \(i\), define
- \(Y_{ij}\) to be the country’s score;
- \(w_{ij}\) to be the country’s gdp;
- \(x_{ij}\) to be the country’s gini coefficient;
- \(z_{ij}\) to be the mean number of homework hours per week in that country.
We interpret \(\tau_i\), for \(i=2\) (or 3) as the difference in mean score between two countries with identical values of gdp
, gini
and homework
, but in which one has a school start.age
of 6 (or 7) and one with a school start.age
of 5.
We fit the model in R as follows:
<- lm(score ~ gdp + gini + homework + factor(start.age), maths)
lmMaths summary(lmMaths)
##
## Call:
## lm(formula = score ~ gdp + gini + homework + factor(start.age),
## data = maths)
##
## Residuals:
## Min 1Q Median 3Q Max
## -61.398 -17.884 -0.204 17.867 69.026
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.827e+02 3.743e+01 12.896 < 2e-16 ***
## gdp 1.025e-03 2.181e-04 4.701 2.07e-05 ***
## gini -2.634e+00 6.939e-01 -3.796 0.000399 ***
## homework 8.343e+00 2.337e+00 3.569 0.000802 ***
## factor(start.age)6 3.062e+00 1.971e+01 0.155 0.877183
## factor(start.age)7 1.023e+01 2.118e+01 0.483 0.630987
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31.31 on 50 degrees of freedom
## (14 observations deleted due to missingness)
## Multiple R-squared: 0.5947, Adjusted R-squared: 0.5542
## F-statistic: 14.67 on 5 and 50 DF, p-value: 7.547e-09
We read off from this \(\hat{\mu}=482.7\), \(\hat{\tau}_2= 3.06\), \(\hat{\tau}_3=10.23\), \(\hat{\beta}_1=0.001\), \(\hat{\beta}_2=-2.634\), and \(\hat{\beta}_3=8.343\).
11.1 Causation versus association
Suppose we were interested in the effect of gdp
only. Let’s try fitting the model with this independent variable only:
<- lm(score ~ gdp , maths)
lmMathsGDP summary(lmMathsGDP)
##
## Call:
## lm(formula = score ~ gdp, data = maths)
##
## Residuals:
## Min 1Q Median 3Q Max
## -105.932 -27.488 5.583 25.925 95.609
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.239e+02 7.975e+00 53.153 < 2e-16 ***
## gdp 1.417e-03 2.322e-04 6.101 5.67e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 43.18 on 68 degrees of freedom
## Multiple R-squared: 0.3537, Adjusted R-squared: 0.3442
## F-statistic: 37.22 on 1 and 68 DF, p-value: 5.667e-08
Notice how the slope estimate for gdp
has changed from that in lmMaths
: it has increased by about 50% to 0.00147. Two points to note are
gdp
andgini
are correlated. This means that removinggini
from the model is likely to change the estimated slope forgdp
. This is something we try to avoid, if we can design the experiment (i.e. we can choose the values of the independent variables.)- If we are hoping to learn a causal relationship between
gdp
andscore
, we need to include other variables in the model which may also effectscore
. If these other variables are omitted, we are (potentially) instead learning about the association betweengdp
andscore
. An interpretation of this is as follows.lmMaths
tells us that when we increasegdp
by one unit, and all other variables stay the same, we expectscore
to increase by 0.001 units. If we were confident that all other relevant variables were unchanged, the change inscore
must be caused by the change ingdp
.lmMathsGDP
(together with our knowledge of the correlation ofgdp
andgini
) tells us that when we increasegdp
by one unit, we expect other variables such asgini
to change. The combined change in all these variables means that we expectscore
to increase by 0.0014 units. We cannot necessarily attribute the change inscore
to the change ingpd
: it might have been caused by another variables changing.
11.2 Analysis of Covariance
Recall this plot we produced in Semester 1, Chapter 1:
ggplot(maths, aes(x = gini, y = score, colour = wealthiest))+
geom_point() +
geom_smooth(method = "lm")
We have two fitted regression lines: one for the group wealthiest == TRUE
(countries with gdp
\(>17000\)), and one for the remaining countries. This sort of modelling with multiple regression lines is sometimes referred to as Analysis of Covariance (ANCOVA). The emphasis is usually on comparing means between groups, adjusting for different values of covariates between groups.
We will now consider how we would write down this model, and how to fit it in R.
Again, the idea is to use an extra subscript for the qualitative variable (wealthiest
): define \(Y_{ij}\) to be the score
for the \(j\)th country in group \(i\), where \(i=1\) corresponds to wealthiest==FALSE
and \(i=2\) corresponds to wealthiest==TRUE
, with \(x_{ij}\) the corresponding gdp
value. Our model is
\[
Y_{ij} = \mu + \tau_i + \beta_i x_{ij} + \varepsilon_{ij}
\]
with \(\varepsilon_{ij}\sim N(0,\sigma^2)\). Again this is overparametrised, so we set \(\tau_1=0\). Note that the slope parameter for gdp
depends on whether wealthiest
is FALSE
or TRUE
. We think of this as an interaction between gdp
and wealthiest
. We fit the model in R as follows:
<- lm(score ~ gini * wealthiest, maths)
lmMathsWealthiest summary(lmMathsWealthiest)
##
## Call:
## lm(formula = score ~ gini * wealthiest, data = maths)
##
## Residuals:
## Min 1Q Median 3Q Max
## -84.730 -17.966 -0.595 15.104 113.483
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 492.3368 32.5735 15.115 <2e-16 ***
## gini -1.7730 0.8507 -2.084 0.0415 *
## wealthiestTRUE 75.0248 61.7823 1.214 0.2295
## gini:wealthiestTRUE -0.4358 1.8385 -0.237 0.8134
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36.19 on 59 degrees of freedom
## (7 observations deleted due to missingness)
## Multiple R-squared: 0.5292, Adjusted R-squared: 0.5053
## F-statistic: 22.11 on 3 and 59 DF, p-value: 1.022e-09
We read off \(\hat{\mu}=492.3368\), \(\hat{\tau}_2 = 75.0248\) and \(\hat{\beta_1} = -1.7730\).
The row gini:wealthiestTRUE
corresponds to the difference \(\hat{\beta}_2- \hat{\beta}_1\), so we read off \(\hat{\beta}_2= -1.7730 -0.4358\).
We’ll try to redraw the plot, to check we’ve interpreted everything correctly! (To see what the default colours are in hex format, try scales::show_col(hue_pal()(2))
)
ggplot(maths, aes(x = gini, y = score, colour = wealthiest))+
geom_point() +
geom_abline(slope = -1.773, intercept = 492.3368, col = "#F8766D" )+
geom_abline(slope = -1.773 - 0.4358,
intercept = 492.3368 + 75.0248,
col = "#00BFC4" )
The formatting of the lines is slightly different, but otherwise this looks correct.
Example 11.1 (Fitting a linear model in R: ANCOVA.)
Consider the built in dataset mtcars
(see ?mtcars
for details)
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
We’ll be using the column am
which describes transmission type. To make it more readable, we’ll do
<- mtcars %>%
mtcars2 mutate(transmission = factor(am, labels = c("automatic", "manual")))
- Fit the following model in R:
\[
Y_{ij} = \mu + \tau_i + \beta_i x_{ij} + \varepsilon_{ij}
\]
where \(Y_{ij}\) is the fuel economy mpg
of the \(j\)th car in group \(i\), and \(x_{ij}\) is the corresponding weight wt
, for \(i=1,2\). Group \(i=1\) corresponds to automatic cars, and \(i=2\) corresponds to manual cars.
For a car with weight of 3 units (1000 lbs), what is the expected change in fuel economy from changing from automatic to manual?
We can plot the fitted model with the commands
ggplot(mtcars2, aes(x = wt, y = mpg, colour = transmission)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
Try to (approximately) redraw this plot, without using geom_smooth()
: use geom_abline()
instead.
- In R we do
<- lm(mpg ~ wt * transmission, mtcars2)
lmCars summary(lmCars)
##
## Call:
## lm(formula = mpg ~ wt * transmission, data = mtcars2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6004 -1.5446 -0.5325 0.9012 6.0909
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.4161 3.0201 10.402 4.00e-11 ***
## wt -3.7859 0.7856 -4.819 4.55e-05 ***
## transmissionmanual 14.8784 4.2640 3.489 0.00162 **
## wt:transmissionmanual -5.2984 1.4447 -3.667 0.00102 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.591 on 28 degrees of freedom
## Multiple R-squared: 0.833, Adjusted R-squared: 0.8151
## F-statistic: 46.57 on 3 and 28 DF, p-value: 5.209e-11
From this, we read off \(\hat{\mu}=31.4161\), \(\hat{\beta}_1=-3.7859\), \(\hat{\tau}_2=14.8784\), \(\hat{\beta}_2=-3.7859 -5.2984\).
- An automatic car with weight 3 units has estimated expected
mgp
: \[ E(Y) = \hat{\mu} +\hat{\beta}_1\times 3= 31.4161 -3.7859 \times 3 = 20.0584 \] If the car changes from automatic to manual, the estimated expectedmgp
is \[ \hat{\mu} +\hat{\tau}_2 +\hat{\beta}_2\times 3 \] The change is estimated to be \[\hat{\tau}_2 + 3(\hat{\beta}_2 - \hat{\beta}_1) = 14.8784 -5.2984\times 3. \] - We (approximately) reproduce the plot with the commands
ggplot(mtcars2, aes(x = wt, y = mpg, colour = transmission))+
geom_point() +
geom_abline(slope = -3.7859, intercept = 31.4161, col = "#F8766D" )+
geom_abline(slope = -3.7859 -5.2984,
intercept = 31.4161 + 14.8784,
col = "#00BFC4" )