Chapter 7 Parameter estimation
In this Chapter we study how to fit a linear model to some observed data, i.e., how to use the data to estimate the model parameters. Recall the disputed election example (re-plotted below, with the contested election omitted).
For election \(i\) let \(x_i\) denote the difference in machine votes (Democrat \(-\) Republican) and \(y_i\) denote the observed difference in absentee votes (Democrat \(-\) Republican). We’ll consider the simple linear regression model, and in a slight abuse of notation we will write \[\begin{equation} y_i=\beta_0 + \beta_1 x_i + \varepsilon_i, \end{equation}\] for \(i=1,\ldots,n\). (Because \(y_i\) is an observed value rather than a random variable \(Y_i\), the error term \(\varepsilon_i\) is now a constant: the ‘realised’ error rather than a random variable with the \(N(0,\sigma^2)\) distribution. But we will not make this distinction in our notation.)
The unknown parameters to be estimated are \(\beta_0,\beta_1\) and \(\sigma^2\). We will consider how to estimate \(\beta_0\) and \(\beta_1\) first and deal with \(\sigma^2\) later.
7.1 Least squares estimation
Firstly, we rearrange the above equation to get \[ \varepsilon_i=y_i - \beta_0 - \beta_1 x_i. \] Since we know the values of \(y_i\) and \(x_i\), for any choice of \((\beta_0,\beta_1)\) we can then calculate the implied values of \(\varepsilon_i\). Note that in this model the random error term has expectation 0, and so small (absolute) values of the errors are more probable than large (absolute) values. Additionally, some choices of \((\beta_0,\beta_1)\) are clearly better than others. Consider the following figure:
We see that choosing \((\beta_0=-125,\beta_1=0.013)\) appears to fit the data better than choosing \((\beta_0=0,\beta_1=0.001)\) because the observed points lie closer to the line \(y=\beta_0 +\beta_1 x\). Specifically, we prefer \((\beta_0=-125,\beta_1=0.013)\) because on average \(|\varepsilon_1|,\ldots,|\varepsilon_n|\) are smaller:
|  | \(\sum_{i=1}^n \lvert\varepsilon_i\rvert\) | \(\sum_{i=1}^n \varepsilon_i^2\) |
|---|---|---|
| \((\beta_0=0,\beta_1=0.001)\) | 8410 | 4,704,524 |
| \((\beta_0=-125,\beta_1=0.013)\) | 5018 | 2,007,945 |
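These sums are easy to compute directly in R. Below is a minimal sketch, assuming the election data frame (with columns machine.diff and absentee.diff) used in the code later in this chapter; it should reproduce the values in the table above.

x <- election$machine.diff    # difference in machine votes
y <- election$absentee.diff   # difference in absentee votes

# errors implied by each candidate choice of (beta0, beta1)
eps1 <- y - 0 - 0.001 * x        # (beta0 = 0,    beta1 = 0.001)
eps2 <- y - (-125) - 0.013 * x   # (beta0 = -125, beta1 = 0.013)

c(sum(abs(eps1)), sum(abs(eps2)))   # sums of absolute errors
c(sum(eps1^2), sum(eps2^2))         # sums of squared errors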
This suggests we should choose \((\beta_0,\beta_1)\) to make \(\sum_{i=1}^n \varepsilon_i^2\) as small as possible (note that minimising \(\sum_{i=1}^n \varepsilon_i^2\) is easier than minimising \(\sum_{i=1}^n |\varepsilon_i|\), because the sum of squared errors is differentiable everywhere). Define \[\begin{equation}R(\beta_0,\beta_1)=\sum_{i=1}^n \varepsilon_i^2 = \sum_{i=1}^n (y_i-\beta_0-\beta_1 x_i)^2. \end{equation}\] The least squares estimates of \(\beta_0\) and \(\beta_1\) are the values of \(\beta_0\) and \(\beta_1\) that minimise \(R(\beta_0,\beta_1)\). Now \[\begin{eqnarray} \frac{\partial R(\beta_0,\beta_1)}{\partial \beta_0}&=& -2\sum_{i=1}^n(y_i-\beta_0-\beta_1 x_i),\\ \frac{\partial R(\beta_0,\beta_1)}{\partial \beta_1}&=& -2\sum_{i=1}^nx_i(y_i-\beta_0-\beta_1 x_i). \end{eqnarray}\] Let \(\hat{\beta}_0\) and \(\hat{\beta}_1\) denote the least squares estimates of \(\beta_0\) and \(\beta_1\). Since (by definition) \(\hat{\beta}_0\) and \(\hat{\beta}_1\) minimise \(R(\beta_0,\beta_1)\), both partial derivatives must equal 0 at \(\beta_0=\hat{\beta}_0\) and \(\beta_1=\hat{\beta}_1\), so (dividing through by 2) \[\begin{eqnarray} 0&=& -\sum_{i=1}^n(y_i-\hat{\beta}_0-\hat{\beta}_1 x_i),\\ 0&=& -\sum_{i=1}^nx_i(y_i-\hat{\beta}_0-\hat{\beta}_1 x_i), \end{eqnarray}\] which we can re-write as \[\begin{eqnarray} n\hat{\beta}_0 +\hat{\beta}_1\sum_{i=1}^n x_i &=& \sum_{i=1}^n y_i,\\ \hat{\beta}_0\sum_{i=1}^n x_i +\hat{\beta}_1 \sum_{i=1}^n x_i^2 &=& \sum_{i=1}^n x_iy_i, \end{eqnarray}\] and these can be solved simultaneously to give \[\begin{eqnarray} \hat{\beta}_1&=&\frac{n \sum_{i=1}^n x_i y_i -\sum_{i=1}^n x_i \sum_{i=1}^n y_i}{n \sum_{i=1}^n x_i^2 -\left(\sum_{i=1}^n x_i \right)^2}, \\ \hat{\beta}_0&=&\bar{y} - \hat{\beta}_1\bar{x}. \end{eqnarray}\]
Although you won’t need to calculate these terms by hand, we will tidy up these expressions to get the formulae that you may see in textbooks.
Define \[\begin{eqnarray} s_{xy}&=&\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y}),\\ s_{xx}&=&\sum_{i=1}^n (x_i-\bar{x})^2. \end{eqnarray}\] Note that \(\sum_{i=1}^n (x_i-\bar{x})=0\) and so \(\sum_{i=1}^n(x_i-\bar{x})\bar{y}=\bar{y}\sum_{i=1}^n(x_i-\bar{x})=0\). Hence \[\begin{equation} s_{xy}=\sum_{i=1}^n(x_i-\bar{x})y_i = \sum_{i=1}^n x_i y_i -\frac{\sum_{i=1}^n x_i\sum_{i=1}^n y_i}{n}. \end{equation}\] Also, \[\begin{equation} s_{xx}=\sum_{i=1}^n x_i^2 - \frac{\left(\sum_{i=1}^n x_i\right)^2}{n}, \end{equation}\] and so we can write \[\begin{equation} \boxed{\,\hat{\beta}_1=\frac{s_{xy}}{s_{xx}}\quad\hat{\beta}_0=\bar{y}-\hat{\beta}_1\bar{x}.}\label{abhat} \end{equation}\]
We will see how to fit linear models in R later on, but for now, for the election data we can do
x <- election$machine.diff
y <- election$absentee.diff
beta1Hat <- sum((x - mean(x)) * (y - mean(y))) /
  sum((x - mean(x))^2)
beta0Hat <- mean(y) - beta1Hat * mean(x)
c(beta0Hat, beta1Hat)
## [1] -125.90364382 0.01270346
So we have \(\hat{\beta}_0=-125.9036\) and \(\hat{\beta}_1=0.0127\), giving the fitted line \(y=-125.9036+0.0127x\). This line is very close to that in panel (b) of Figure 7.2.
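As a sanity check, we can also minimise \(R(\beta_0,\beta_1)\) numerically and overlay the fitted line on the data. The sketch below reuses x, y, beta0Hat and beta1Hat from the code above; the optim() call is illustrative, and its answer is only a numerical approximation to the closed-form estimates.

# residual sum of squares as a function of (beta0, beta1)
RSS <- function(beta) sum((y - beta[1] - beta[2] * x)^2)

# numerical minimisation; parscale reflects the very different scales
# of the intercept and the slope
optim(c(0, 0), RSS, control = list(parscale = c(100, 0.01)))$par
c(beta0Hat, beta1Hat)   # closed-form least squares estimates, for comparison

# scatter plot with the fitted least squares line added
plot(x, y, xlab = "machine vote difference", ylab = "absentee vote difference")
abline(a = beta0Hat, b = beta1Hat)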
Example 7.1 (Least squares estimation: weights example.)
Continuing from Example 6.1, suppose the observed weights are denoted by \(y_1, y_2, y_3\). Derive the least squares estimates of \(\theta_A\) and \(\theta_B\).
Solution
We define \[ R(\theta_A, \theta_B) := \sum_{i=1}^3 \varepsilon_i^2 = (y_1-\theta_A)^2 + (y_2-\theta_B)^2+(y_3-\theta_A -\theta_B)^2. \] Then the least squares estimates are the solutions of \[\begin{align*} \frac{\partial R(\theta_A, \theta_B)}{\partial \theta_A} = -2(y_1-\theta_A) -2(y_3 - \theta_A-\theta_B) = 0,\\ \frac{\partial R(\theta_A, \theta_B)}{\partial \theta_B} = -2(y_2-\theta_B) -2(y_3 - \theta_A-\theta_B) = 0. \end{align*}\]
After a little algebra to solve these two equations simultaneously, we get \[ \hat{\theta}_A = \frac{2y_1 -y_2 + y_3}{3}, \quad \hat{\theta}_B = \frac{2y_2 -y_1 + y_3}{3}. \]
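A quick numerical check of these formulae in R, using made-up values for the observed weights (the numbers below are purely illustrative), together with a direct numerical minimisation of \(R(\theta_A, \theta_B)\) for comparison:

y1 <- 5.1; y2 <- 3.2; y3 <- 8.4   # hypothetical observed weights

thetaAHat <- (2 * y1 - y2 + y3) / 3
thetaBHat <- (2 * y2 - y1 + y3) / 3
c(thetaAHat, thetaBHat)

# numerical minimisation of R(thetaA, thetaB) should give approximately
# the same values
RSS <- function(theta)
  (y1 - theta[1])^2 + (y2 - theta[2])^2 + (y3 - theta[1] - theta[2])^2
optim(c(0, 0), RSS)$par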
7.2 Assumptions for least squares
Note that we have not made use of the assumption that the errors \(\varepsilon_i\) are normally distributed. If we look again at the function we chose to minimise \[ R(\beta_0,\beta_1)=\sum_{i=1}^n \varepsilon_i^2 = \sum_{i=1}^n (y_i-\beta_0-\beta_1 x_i)^2, \] we note that each error \(\varepsilon_i\) has equal weight in the sum; we are not trying to make some errors smaller than others. This relates back to the assumption that the errors have equal variance; this is an assumption we should try to check. There are analyses we might choose to do that rely on the normality assumption, but these will not be the main focus in this module.
There are actually two ways in which a single observation might be unduly influential:
1. the error \(\varepsilon_i\) is an outlier;
2. the independent variable \(x_i\) is an outlier.
Case (1) may be the result of non-constant variance of the errors. An example is shown below.
Case (2) is sometimes referred to as high leverage. We illustrate this below.
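The sketch below simulates both situations, so you can see their effect on the estimates for yourself. The data, the helper function lsFit and the sizes of the disturbances are all made up for illustration; lsFit just applies the formulae from Section 7.1.

# least squares estimates via the formulae from Section 7.1
lsFit <- function(x, y) {
  b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
  c(beta0Hat = mean(y) - b1 * mean(x), beta1Hat = b1)
}

set.seed(1)
xSim <- 1:20
ySim <- 2 + 0.5 * xSim + rnorm(20, sd = 1)   # simulated from a known line

# Case (1): make one of the errors an outlier
yOut <- ySim
yOut[10] <- yOut[10] + 10

# Case (2): add a high-leverage point (an extreme x value, off the trend)
xLev <- c(xSim, 50)
yLev <- c(ySim, 2 + 0.5 * 50 - 15)

lsFit(xSim, ySim)   # clean simulated data
lsFit(xSim, yOut)   # with the outlying error
lsFit(xLev, yLev)   # with the high-leverage point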
7.3 Relationship with maximum likelihood estimation
Exercise. (This is a good opportunity to test your understanding of likelihood!) Suppose, in the simple linear regression model, \(\beta_0\) and \(\beta_1\) are to be estimated by maximum likelihood. Show that the maximum likelihood estimates are also the least squares estimates. (You will not need to actually obtain the maximum likelihood estimates to prove this: you just need to show that maximum likelihood must result in the same solution.)
Solution
Since the \(Y_i\) are independent, the likelihood function is given by
\[ L(\beta_0, \beta_1, \sigma^2; \mathbf{y}) = \prod_{i=1}^nf_Y(y_i; \beta_0, \beta_1,\sigma^2). \] We have \(Y_i \sim N(\beta_0 + \beta_1x_i, \sigma^2)\), so \[ f_Y(y_i; \beta_0, \beta_1,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}(y_i - \beta_0-\beta_1x_i)^2\right) \] and so the log-likelihood is \[ \ell(\beta_0, \beta_1, \sigma^2; \mathbf{y}) = -\frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n(y_i - \beta_0-\beta_1x_i)^2, \] ignoring the additive constant \(-\frac{n}{2}\log(2\pi)\), which does not involve the parameters. The maximum likelihood estimators of \(\beta_0\) and \(\beta_1\) are the solutions of \[\begin{align*} \frac{\partial \ell}{\partial \beta_0} &= \frac{1}{\sigma^2}\sum_{i=1}^n(y_i - \beta_0-\beta_1x_i) = 0,\\ \frac{\partial \ell}{\partial \beta_1} &= \frac{1}{\sigma^2}\sum_{i=1}^nx_i(y_i - \beta_0-\beta_1x_i) = 0, \end{align*}\] but, since the factor \(1/\sigma^2\) can be cancelled, these are the same two simultaneous equations we solved to get the least squares estimates of \(\beta_0\) and \(\beta_1\): the least squares estimates are the same as the maximum likelihood estimates.
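Since R's lm() function computes the least squares fit, and we have just shown that this coincides with maximum likelihood under the normal errors assumption, lm() also gives the maximum likelihood estimates of \(\beta_0\) and \(\beta_1\). A quick preview for the election data (we cover lm() properly later):

x <- election$machine.diff
y <- election$absentee.diff

fit <- lm(y ~ x)   # least squares fit, which is also the ML fit for beta0, beta1
coef(fit)          # should match beta0Hat and beta1Hat from Section 7.1
logLik(fit)        # the maximised log-likelihood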
7.4 Estimating \(\sigma^2\)
We’ll just state the estimate of \(\sigma^2\), without going too much into the detail. We define \[ \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i \] to be the \(i\)-th fitted value, so that \((x_i, \hat{y}_i)\) is a point on the fitted regression line. We define \[ e_i = y_i - \hat{y}_i \] to be the \(i\)-th residual: the difference between the observation \(y_i\) and the corresponding fitted value \(\hat{y}_i\). Informally, we could think of \(e_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\) as an estimate of the error term \(\varepsilon_i = y_i - \beta_0 - \beta_1 x_i\). We then estimate \(\sigma^2\) using the residual sum of squares \[ \hat{\sigma}^2 = \frac{1}{n-2} \sum_{i=1}^n e_i^2. \] The denominator \(n-2\) is the residual degrees of freedom. Informally, it makes sense that we need at least three observations (for the simple linear regression model) to estimate \(\sigma^2\), because with only two observations, the fitted line would pass through both points; we wouldn’t have any idea how far an observation can be from the regression line.
We can get \(\hat{\sigma}^2\) easily from R, but we will do the calculation manually for now. Repeating the earlier code:
x <- election$machine.diff
y <- election$absentee.diff
beta1Hat <- sum((x - mean(x)) * (y - mean(y))) /
  sum((x - mean(x))^2)
beta0Hat <- mean(y) - beta1Hat * mean(x)
Then
e <- y - beta0Hat - beta1Hat * x
sum(e^2) / (length(e) - 2)
## [1] 105519.8
This suggests that most points will lie within \(2\sqrt{105519.8}\simeq 650\) of the fitted regression line (measured vertically, i.e. in the \(y\)-direction). From looking at panel (b) in Figure 7.2, this seems about right.
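We can check this rough rule against the data with a couple of lines of R, reusing the residuals e from the code above; the proportion should be high if the model is reasonable.

sigma2Hat <- sum(e^2) / (length(e) - 2)

# proportion of observations within 2 * sqrt(sigma2Hat) of the fitted line
mean(abs(e) <= 2 * sqrt(sigma2Hat))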