1  Basic concepts

In this chapter, we will:

  1. start with a motivating example;
  2. introduce the idea of a linear model;
  3. present matrix notation for representing linear models;
  4. define some terms used when discussing and analysing linear models.

1.1 Motivating example

Example 1.1 Respiration data

The following is an extract from a data set containing measurements on 305 individuals:

library(tidyverse)
respiration <- read_table("https://oakleyj.github.io/exampledata/resp_data.txt")
respiration
# A tibble: 305 × 8
     vol exercise pulse pollution asbestos  home smoke asthma
   <dbl>    <dbl> <dbl>     <dbl>    <dbl> <dbl> <dbl>  <dbl>
 1  117.     15.8 17.5     11.9          1     1     1      0
 2  148.     23.5 28.4      1.88         1     3     1      1
 3  214.     13.8 15.0      2.38         1     1     0      0
 4  162.     15.6 15.7      6.34         1     3     0      1
 5  352.     26.6 19.7      1.63         0     4     0      0
 6  304.     23.2 14.2      2.72         0     4     0      0
 7  157.     29.0 33.7      4.03         0     1     1      0
 8  110.     16.6 14.7      1.24         1     2     1      0
 9  218.     15.1  9.16     1.14         0     2     0      0
10  246.     16.9 21.9      0.731        0     4     0      0
# ℹ 295 more rows

The meanings of the column headings are:

  • vol - volume of air expelled whilst breathing out
  • exercise - number of hours of exercise per month
  • pulse - scaled resting pulse rate
  • pollution - lifetime exposure to air pollution
  • asbestos - has the individual been exposed to asbestos (1=exposed, 0=unexposed)
  • home - place of residence (1=England, 2=Scotland, 3=Wales, 4=Northern Ireland)
  • smoke - have they ever smoked (1=yes, 0=no)
  • asthma - do they have asthma (1=yes, 0=no)

The key question we are interested in is how the volume of air expelled relates to the other variables.

There are several types of variable in this data set: vol, exercise, pulse and pollution are continuous variables, while asbestos, home, smoke and asthma are categorical (factor) variables. Of the latter, home has four levels while the others are binary.
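When the data are read in, R treats all eight columns as numeric (as the <dbl> types above show), so before fitting models we would usually declare the categorical variables as factors. A minimal sketch, using the tidyverse tools already loaded:

respiration <- respiration %>%
  # convert the categorical columns to factors; the continuous columns
  # (vol, exercise, pulse, pollution) are left as numeric
  mutate(across(c(asbestos, home, smoke, asthma), as.factor))

glimpse(respiration)   # check the resulting column types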

The variable we are trying to predict is the response variable, also referred to as the dependent variable; in this example, the response is vol. The variables that we use to predict the response are variously called predictor, explanatory or independent variables.

Questions we might like to answer include:

  • Which variables are ‘best’ at explaining the variation in the volume of air and do we need all the variables? (This amounts to choosing a statistical model.)
  • How much of the variation in the volume of air can we account for with our model?
  • Does the model that we come up with satisfy the implicit assumptions of a linear model?
  • How can we perform hypothesis tests relating to parameters in our statistical model?
  • Are there observations that don’t fit our model?
  • Are there observations that might be exerting a lot of influence over our model parameters?
  • How can we make predictions about volumes based on new observations and how can we calculate confidence intervals for these predictions?

1.2 The idea of a linear model

In simple linear regression, we have \(n\) observations of two variables, \((x_i,y_i)\) for \(i=1,\ldots,n\). We treat \(x_i\) as an explanatory variable and \(y_i\) as the response variable (or dependent variable), and we attempt to fit a straight line \(y=\beta_0+\beta_1 x\) to our data points. We further suppose that each \(y_i\) is an observed value of a random variable \(Y_i\).

Notation: lower and upper case letters

A convention is to use an upper case letter to denote a random variable, and a lower case letter to denote the observed value of that random variable, e.g., in statistical modelling we treat some data \(y_1,\ldots,y_n\) as observations of random variables \(Y_1,\ldots,Y_n\), and our statistical model is a probability distribution for \(Y_1,\ldots,Y_n\).

We will not always keep to this convention as it can become notationally awkward. You will need to judge from the context whether we are considering a random variable or an observed value, though I will try to make this as clear as possible in these notes.

A statistical model here takes the form \(Y_i=\beta_0+\beta_1 x_i+\varepsilon_i\), where \(\varepsilon_i\) is a (random) “error” term giving the difference between the straight line value and the actual value \(y_i\). In this section we discuss the generalization of this known as multiple regression, which allows for more than one explanatory variable.

Suppose that for each observation \(i\) we have a response \(y_i\) and \(r\) explanatory variables \(x_{i1},\ldots,x_{ir}\), giving data \[ (x_{i1},\ldots,x_{ir},y_i), \quad i=1,\ldots,n. \] We could propose the statistical model \[ Y_i=\beta_0+\beta_1 x_{i1}+\cdots+\beta_rx_{ir}+\varepsilon_i. \tag{1.1}\] This has two parts, a linear predictor \(\beta_0+\beta_1 x_{i1}+\cdots+\beta_r x_{ir}\) and a random error \(\varepsilon_i\). The linear predictor formalizes the idea that the response is a linear combination of terms involving the explanatory variables. The error term allows for variation in the response for identical values of the explanatory variables (not necessarily measurement errors); we later impose some conditions on \(\varepsilon_i\).

When \(r=2\), giving \(Y_i=\beta_0+\beta_1 x_{i1}+\beta_2 x_{i2}+\varepsilon_i\), we are fitting a plane to the data; as the number of explanatory variables increases we fit higher-dimensional hyperplanes.
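For example, taking exercise and pulse as the two explanatory variables for the respiration data, this plane can be fitted in R with lm(), which we will study in detail later; a minimal sketch:

# Fit vol = beta_0 + beta_1 * exercise + beta_2 * pulse + error;
# lm() includes the intercept beta_0 automatically.
plane_fit <- lm(vol ~ exercise + pulse, data = respiration)
coef(plane_fit)   # estimates of beta_0, beta_1, beta_2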

Alternatively, reverting to a single explanatory variable, we might wish to express the relationship between \(y\) and \(x\) in a quadratic way: \[ Y_i=\beta_0+\beta_1 x_i+\beta_2 x_i^2+\varepsilon_i. \tag{1.2}\] In this model \(\beta_0+\beta_1 x_i+\beta_2 x_i^2\) is the linear predictor. Although the relationship is quadratic, we would still call this a linear model because it is linear in the parameters \(\beta_0\), \(\beta_1\) and \(\beta_2\). It is a common misconception that linear models require the relationship between the response and explanatory variables to be linear, but in this terminology a linear model is one which is linear in the parameters \(\beta_i\), not necessarily in the explanatory variables.
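A model such as Equation 1.2 can also be fitted with lm(); a sketch using exercise as the single explanatory variable, chosen purely for illustration:

# I(exercise^2) adds the squared term as an extra regressor;
# the model is still linear in beta_0, beta_1 and beta_2.
quad_fit <- lm(vol ~ exercise + I(exercise^2), data = respiration)
coef(quad_fit)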

1.3 Matrix formulation

Note

In these notes vectors are written in bold and scalars in normal text, so \(\boldsymbol{y}\) is a vector and \(y\) is a single value.

It is convenient to express linear models using vectors and matrices. We can always collect the observed values of the variable \(y\) into a vector \(\boldsymbol{y} =(y_1,\ldots,y_n)^T\), and we can similarly define \(\boldsymbol{x} =(x_1,\ldots,x_n)^T\) and \(\boldsymbol{\varepsilon}=(\varepsilon_1,\ldots,\varepsilon_n)^T\). If we also define \(\boldsymbol{1}_n=(1,\ldots,1)^T\) to be a vector of \(n\) ones, we can write the simple regression model as \[ \boldsymbol{Y}=\beta_0 \boldsymbol{1}_n+\beta_1 \boldsymbol{x}+\boldsymbol{\varepsilon}. \] Even more neatly, if we define the parameter vector \(\boldsymbol{\beta}=(\beta_0,\beta_1)^T\) and define the \(n\times 2\) matrix \(X\) to have \(\boldsymbol{1}_n\) as its first column and \(\boldsymbol{x}\) as its second, then

\[ \boldsymbol{Y}=X\boldsymbol{\beta} +\boldsymbol{\varepsilon}. \tag{1.3}\] The form (Equation 1.3) naturally generalizes to more complicated models; for a suitable definition of the parameter vector \(\boldsymbol{\beta}\) and the matrix \(X\) (also called the design matrix) we can express both Equation 1.1 and Equation 1.2 in this form.
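In R, the design matrix implied by a model formula can be inspected with model.matrix(); a short sketch, assuming the respiration data from Example 1.1:

# Simple linear regression of vol on exercise:
# the first column of X is the column of ones, the second is exercise.
X <- model.matrix(vol ~ exercise, data = respiration)
head(X)

# Quadratic regression as in Equation 1.2: a third column holds exercise^2.
X_quad <- model.matrix(vol ~ exercise + I(exercise^2), data = respiration)
head(X_quad)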

The importance of matrix notation

Matrix notation is really useful. As well as clarifying the definition of a linear model (any model that can be expressed in the form Equation 1.3), it means that any result obtained using the matrix notation can be applied to all linear models.

1.4 Parameters and assumptions

The linear model is expressed by (Equation 1.3). In its general form,

  • \(\boldsymbol{Y}\) is an \(n\times 1\) vector of observable random variables, with observed values \(\boldsymbol{y}\).
  • \(X\) is an \(n\times p\) matrix of known coefficients (perhaps observed or controlled values of other explanatory variables, but also perhaps functions of such explanatory variables, or just constants).
  • \(\boldsymbol{\beta}\) is a \(p\times 1\) vector of unknown parameters.
  • \(\boldsymbol{\varepsilon}\) is an \(n \times 1\) vector of unobserved random variables.

The number of components, \(p\), of \(\boldsymbol{\beta}\) can be whatever we wish, as long as it is not too large relative to the number of observations. In the multiple regression model (Equation 1.1) it would be \(r+1\), while in the quadratic regression model (Equation 1.2) it would be 3.

To fully define our model we need to make some assumptions about \(\boldsymbol{\varepsilon}\). In the general linear model, we assume that \(\boldsymbol{\varepsilon}\) has a multivariate normal distribution, with components that are independent, have zero mean, and have common variance \(\sigma^2\), which is another unknown parameter. (The assumption that the components have the same variance is referred to as homoscedasticity.) Testing these assumptions, and what to do if they are not satisfied, will be a major part of this module.

From these assumptions, \[ \boldsymbol{\varepsilon}\sim N_n(\boldsymbol{0},\sigma^2I_n) \] where \(N_n\) represents an \(n\)-dimensional multivariate normal distribution.

Since \(\boldsymbol{Y}=X\boldsymbol{\beta}+\boldsymbol{\varepsilon}\), it follows (from the linear transformation property for the multivariate normal) that \[ \boldsymbol{Y}\sim N_n(X\boldsymbol{\beta},\sigma^2I_n).\]
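These assumptions give a complete recipe for simulating data from a linear model, which is a useful check on our understanding of the notation. A minimal sketch, with made-up values of \(\boldsymbol{\beta}\) and \(\sigma\):

set.seed(1)
n <- 50
x <- runif(n, 0, 10)                 # a single explanatory variable
X <- cbind(1, x)                     # design matrix: column of ones, then x
beta <- c(2, 0.5)                    # 'true' parameter vector (made up)
sigma <- 1.5                         # 'true' error standard deviation (made up)
epsilon <- rnorm(n, mean = 0, sd = sigma)   # independent N(0, sigma^2) errors
y <- as.vector(X %*% beta + epsilon)        # Y = X beta + epsilon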

1.5 Terminology

We can think of each column of the \(X\) matrix as comprising \(n\) values of a regressor variable, also called an independent variable. So we have in general one response variable and \(p\) (\(p=r+1\) in multiple regression, \(p=3\) in quadratic regression) regressor variables.

Sometimes this terminology may seem strange: for example, in the simple linear regression model the \(X\) matrix has two columns, the first of which is a column of ones. We would still say that this column defines a regressor variable, but this ‘variable’ is a constant.

Regressor variables are also sometimes called explanatory variables, but in this course we will give this term a slightly different meaning. Consider the quadratic regression (Equation 1.2). Here there are three regressor variables (the constant, the \(x_i\) column and the \(x_i^2\) column), but there is really only one \(x\) variable. The three regressor variables are all functions of the \(x\) variable. In this course we will refer to the \(x\) variable as the single explanatory variable in the quadratic regression model. In general, the regressor variables will always be functions of the explanatory variables.

Example 1.2 An example of the design matrix \(X\) - polynomial regression

The general polynomial regression with one explanatory variable, a generalization of the quadratic regression Equation 1.2, is \[ Y_i=\beta_0+\beta_1x_i+\cdots+\beta_rx_i^r+\varepsilon_i \] which is turned into the general linear model form by \[ \boldsymbol{Y}=\left(\begin{array}{c} Y_1 \\ Y_2 \\ \vdots \\ Y_n\end{array}\right), \quad X=\left(\begin{array}{cccc} 1 & x_1 & \cdots & x_1^r \\ 1 & x_2 & \cdots & x_2^r \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & \cdots & x_n^r\end{array}\right), \quad \boldsymbol{\beta}=\left(\begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_r\end{array}\right), \quad \boldsymbol{\varepsilon}=\left(\begin{array}{c} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n\end{array}\right) \] and \(p=r+1\).
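In R, this design matrix can be built directly from the vector of explanatory variable values; a sketch for r = 3, using exercise from the respiration data purely for illustration:

x <- respiration$exercise
r <- 3

# Column j + 1 of X is x raised to the power j, for j = 0, 1, ..., r,
# so the first column is the column of ones.
X <- outer(x, 0:r, "^")
head(X)

# model.matrix() produces a matrix with the same columns from a formula.
head(model.matrix(~ poly(x, r, raw = TRUE)))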

Exercise

Exercise 1.1 An alternative way to parameterise the simple linear regression model is as follows: \[ Y_i = \beta_0 + \beta_1(x_i-\bar{x}) + \varepsilon_i. \] This is sometimes referred to as mean-centring the independent variable.

  1. Write this model in matrix notation.
  2. By considering \(E(Y_i)\), give an interpretation of the parameter \(\beta_0\). (Assume that \(\bar{x}\neq 0\), so that \(\beta_0\) would not be the \(y\)-intercept in a regression line.)
Solution

  1. In the matrix notation \[\boldsymbol{Y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon},\] we have

\[ \boldsymbol{Y}=\left(\begin{array}{c} Y_1 \\ Y_2 \\ \vdots \\ Y_n\end{array}\right), \quad X=\left(\begin{array}{cc} 1 & (x_1-\bar{x})\\ 1 & (x_2- \bar{x})\\ \vdots & \vdots \\ 1 & (x_n- \bar{x}) \end{array}\right), \quad \boldsymbol{\beta}=\left(\begin{array}{c} \beta_0 \\ \beta_1 \end{array}\right), \quad \boldsymbol{\varepsilon}=\left(\begin{array}{c} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n\end{array}\right). \]

  2. We note that if \(x_i = \bar{x}\), then \(E(Y_i) = \beta_0\). So we interpret \(\beta_0\) as the expected value of the dependent variable when the independent variable is equal to its mean value.
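We can check this interpretation by simulation; a minimal sketch (not part of the exercise), with made-up parameter values:

set.seed(1)
n <- 100
x <- runif(n, 0, 10)
beta0 <- 5; beta1 <- 2; sigma <- 1           # made-up 'true' values
y <- beta0 + beta1 * (x - mean(x)) + rnorm(n, 0, sigma)

# In the mean-centred parameterisation, the estimated intercept should be
# close to beta0, the expected response at x equal to its mean.
coef(lm(y ~ I(x - mean(x))))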