Chapter 5 Likelihood for confidence intervals and hypothesis tests

Maximum likelihood estimation gives us a single value for the unknown parameters \(\boldsymbol{\theta}\), a so-called point estimate. In many settings in statistical inference we want to go further than point estimation, in particular to give some idea of the uncertainty in our point estimate. For example, when we are trying to estimate a single parameter \(\theta\), we may want to produce an interval estimate, typically an interval \([\theta_1,\theta_2]\) in which we believe the true value of \(\theta\) lies. Alternatively, we may want to test a hypothesis about \(\theta\). The likelihood function can often be used to construct appropriate methods in these settings too, and as with maximum likelihood estimation it can often be shown that they are in some sense optimal.

5.1 Confidence intervals

We will start off by thinking about interval estimation. Assume, in the one-parameter case, that we have a likelihood function \(L(\theta;\mathbf{x})\) defined for \(\theta\in\Theta\), maximised at the maximum likelihood estimate \(\hat{\theta}\). Then a natural choice of interval estimate is to set some threshold, \(L_0\) say, and to use the values of \(\theta\) such that \(L(\theta;\mathbf{x})\geq L_0\) as an interval estimate. One common choice for the threshold is to choose \(L_0\) to be a fixed multiple of the maximum likelihood, say \[L_0=e^{-k}L(\hat{\theta};\mathbf{x})\] for some chosen \(k>0\). Equivalently, in terms of the log-likelihood, \[\log L_0=\ell(\hat{\theta};\mathbf{x})-k.\] Our choice of \(k\) here involves a trade-off between a precise answer (meaning a narrow interval) and minimising the risk of missing the true value from the interval: a small \(k\) will give a narrow interval but relatively low confidence that the interval contains the true value, while a large \(k\) will give a wider interval and higher confidence.

More generally, we can make the following definition. The \(k\)-unit likelihood region for parameters \(\boldsymbol{\theta}\) based on data \(\mathbf{x}\) is the region \[R_k=\left\{\boldsymbol{\theta}:L(\boldsymbol{\theta};\mathbf{x})\geq e^{-k}L(\boldsymbol{\hat{\theta}};\mathbf{x})\right\},\] or equivalently \[R_k=\left\{\boldsymbol{\theta}:\ell(\boldsymbol{\theta};\mathbf{x})\geq \ell(\boldsymbol{\hat{\theta}};\mathbf{x})-k\right\},\] where \(\boldsymbol{\hat{\theta}}\) is the maximum likelihood estimate of \(\boldsymbol{\theta}\) based on \(\mathbf{x}\).

The values of \(\boldsymbol{\theta}\) within the \(k\)-unit likelihood region are those whose likelihood is within a factor \(e^{-k}\) of the maximum. For instance, points in the 1-unit region have likelihoods within a factor \(e^{-1} \approx 0.368\) of the maximum, and points in the 2-unit region have likelihoods within a factor \(e^{-2} \approx 0.135\) of the maximum. The 2-unit region is the most commonly used in practice.
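When no closed form is available, the region can be found numerically by evaluating the log-likelihood on a grid and keeping the parameter values that lie within \(k\) of the maximum. Below is a minimal sketch of this idea in Python; the exponential model and the data values are invented purely for illustration.

```python
import numpy as np

# Illustrative data: hypothetical i.i.d. Exponential(lambda) observations
x = np.array([1.2, 0.4, 2.3, 0.8, 1.6, 0.5, 1.1])
n = len(x)

def loglik(lam):
    # log-likelihood of an Exponential(lambda) sample
    return n * np.log(lam) - lam * x.sum()

lam_hat = n / x.sum()   # closed-form MLE of the exponential rate
k = 2                   # 2-unit region

grid = np.linspace(0.01, 5, 10_000)
in_region = loglik(grid) >= loglik(lam_hat) - k
region = grid[in_region]
print(f"MLE: {lam_hat:.3f}, 2-unit region: [{region.min():.3f}, {region.max():.3f}]")
```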

Example 5.1 (Likelihood regions.)

Suppose that we have i.i.d. data \(\mathbf{x}=(x_1,x_2,\ldots,x_n)\), for which each data point is modelled as a random sample from \(N(\mu,\sigma^2)\) where \(\mu\) is unknown and \(\sigma^2\) is known. Find the \(k\)-likelihood region \(R_k\) for the parameter \(\mu\).

Solution

First, we need to find the MLE \(\hat\mu\) of \(\mu\). The likelihood function for our model is \[L(\mu;\mathbf{x})=\prod\limits_{i=1}^n \phi(x_i;\mu)=\frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n(x_i-\mu)^2\right),\] where the range of parameter values is all \(\mu\in\mathbb{R}\). The log likelihood is \[\ell(\mu;\mathbf{x}) = -\frac{n}{2}\left(\log(2\pi)+\log(\sigma^2)\right)-\frac{1}{2\sigma^2}\sum_{i=1}^n(x_i-\mu)^2.\] The usual process of maximisation shows that the maximum likelihood estimator is the sample mean, \[\hat\mu=\frac{1}{n}\sum\limits_{i=1}^n x_i.\]

Now we are ready to identify the \(k\)-likelihood region for \(\mu\). By definition, the \(k\)-likelihood region is \[R_k=\left\{\mu\in\mathbb{R}:\ell(\hat{\mu};\mathbf{x})-\ell(\mu;\mathbf{x})\leq k\right\}.\] So, \(\mu\in R_k\) if and only if \[\frac{1}{2\sigma^2}\sum\limits_{i=1}^n(x_i-\mu)^2-\frac{1}{2\sigma^2}\sum\limits_{i=1}^n(x_i-\hat\mu)^2\leq k.\] We can simplify this inequality, by noting that \[\begin{align*} \sum\limits_{i=1}^n(x_i-\mu)^2-\sum\limits_{i=1}^n(x_i-\hat\mu)^2 &=\sum\limits_{i=1}^n x_i^2-2x_i\mu+\mu^2-x_i^2+2x_i\hat\mu-\hat\mu^2\\ &=n\mu^2-n\hat\mu^2+2(\hat\mu-\mu)\sum\limits_{i=1}^n x_i\\ &=n\mu^2-n\hat\mu^2+2(\hat\mu-\mu)n\hat\mu\\ &=n(\mu^2+\hat\mu^2-2\mu\hat\mu)\\ &=n(\hat\mu-\mu)^2. \end{align*}\] So, \(\mu\in R_k\) if and only if \[\frac{n}{2\sigma^2}(\hat\mu-\mu)^2\leq k,\] or in other words, \(|\hat\mu-\mu|\leq \sigma\sqrt{\frac{2k}{n}}\), so \[R_k=\left[\hat\mu-\sigma\sqrt{\frac{2k}{n}},\;\hat\mu+\sigma\sqrt{\frac{2k}{n}}\right].\]
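As a sanity check (my own addition, not part of the original example), the closed-form region can be compared with a direct grid search over \(\mu\); the simulated data and parameter values below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, n, k = 2.0, 30, 2.0
x = rng.normal(loc=5.0, scale=sigma, size=n)   # simulated data; the true mean is not used below
mu_hat = x.mean()

# Closed-form k-likelihood region derived above: mu_hat +/- sigma*sqrt(2k/n)
half_width = sigma * np.sqrt(2 * k / n)
print("closed form:", mu_hat - half_width, mu_hat + half_width)

# Direct check: keep grid points whose log-likelihood is within k of the maximum
def loglik(mu):
    return -n / 2 * np.log(2 * np.pi * sigma**2) - np.sum((x - mu) ** 2) / (2 * sigma**2)

grid = np.linspace(mu_hat - 3, mu_hat + 3, 100_001)
vals = np.array([loglik(m) for m in grid])
region = grid[vals >= loglik(mu_hat) - k]
print("grid search:", region.min(), region.max())
```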


5.2 Hypothesis tests

If we are trying to test a null hypothesis \(H_0:\theta=\theta_0\) against a general alternative hypothesis \(H_1:\theta\neq \theta_0\), then we can use a similar idea: we choose a suitable \(k\), construct the \(k\)-likelihood region \(R_k\), and accept \(H_0\) if \(\theta_0\) is inside \(R_k\), or reject \(H_0\) if \(\theta_0\) is outside \(R_k\).

Example 5.2 (Hypothesis tests based on likelihood.)

In Example , if we used a \(2\)-likelihood test, would we accept the hypothesis that the decay rate of carbon-15 is equal to \(\lambda=0.27\)?

Solution

We had found, given the data, that the likelihood function of \(\lambda\) was \[L(\lambda;\mathbf{x})=\lambda^{15}e^{-47.58\lambda}\] and the maximum likelihood estimate of \(\lambda\) was \(\hat\lambda\approx 0.32.\) The \(2\)-likelihood region for \(\lambda\) is the set \[R_2=\left\{\lambda>0:L(\lambda;\mathbf{x})\geq e^{-2}L(\hat\lambda;\mathbf{x})\right\},\] so \(\lambda\in R_2\) if and only if \[\lambda^{15}e^{-47.58\lambda}\geq e^{-2}L(0.32;\mathbf{x})\approx 1.24\times 10^{-15}.\] Note that, unlike the previous example, we can’t simplify this inequality and find a ‘nice’ closed form for the likelihood region.

Our hypothesis is that, in fact, \(\lambda=0.27\). Our \(2\)-likelihood test accepts the hypothesis if \(\lambda=0.27\) lies within the \(2\)-likelihood region, and rejects it if not. We can evaluate \[0.27^{15}e^{-47.58\times 0.27}\approx 7.78\times 10^{-15},\] and note that \(7.78\times 10^{-15}\geq 1.24\times 10^{-15}\). Hence \(\lambda=0.27\) is within the \(2\)-likelihood region and we accept the hypothesis.
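The calculation can be reproduced in a few lines of Python (a quick check I have added, using only the likelihood stated above):

```python
import numpy as np

def L(lam):
    # likelihood from the example: lambda^15 * exp(-47.58 * lambda)
    return lam**15 * np.exp(-47.58 * lam)

lam_hat = 15 / 47.58                    # MLE (approximately 0.32)
threshold = np.exp(-2) * L(lam_hat)     # boundary of the 2-likelihood region

print(f"threshold : {threshold:.3e}")
print(f"L(0.27)   : {L(0.27):.3e}")
print(f"accept H0?: {L(0.27) >= threshold}")
```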

5.3 A more formal approach

The confidence intervals and hypothesis tests described above were justified informally, in a way that is sufficient for this module. We will now briefly discuss a more formal justification.

The theory in the following sections may seem overwhelming, but don’t panic! The aim here is just to make you aware of some important results that underpin a lot of statistical theory. We will work through two simple examples, so you can get some sense of how these results can be applied.

5.3.1 Asymptotic normality of the maximum likelihood estimator

In the usual notation, we have \(n\) i.i.d. random variables \(X_1,\ldots,X_n\) with distribution depending on some unknown vector of parameters \(\boldsymbol{\theta}\). Suppose we have derived an expression for the maximum likelihood estimator. Before we observe the data, we can think of the estimator as a function of \(\mathbf{X} = (X_1,\ldots,X_n)\): we denote it by \(\hat{\boldsymbol{\theta}}(\mathbf{X})\).

We think of \(\hat{\boldsymbol{\theta}}(\mathbf{X})\) as a random variable, because it is a function of the random \(\mathbf{X}\). For example, for \(X_1,\ldots,X_n\stackrel{i.i.d}{\sim}N(\mu, \sigma^2)\), the maximum likelihood estimator for \(\mu\) is \(\bar{X}\), and \(\bar{X}\) is a random variable (normally distributed with mean \(\mu\) and variance \(\sigma^2/n\)).
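To make the “estimator as a random variable” idea concrete, here is a small simulation sketch (the values of \(\mu\), \(\sigma\) and \(n\) are arbitrary): repeatedly generating datasets and computing the MLE \(\bar{X}\) each time recovers the stated mean and variance.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 3.0, 2.0, 50

# Each row is one dataset; the MLE of mu for each dataset is its sample mean
mle_draws = rng.normal(mu, sigma, size=(20_000, n)).mean(axis=1)

print("simulated mean and variance of the MLE:", mle_draws.mean(), mle_draws.var())
print("theory (mu, sigma^2/n)                :", mu, sigma**2 / n)
```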

It can be shown that as \(n\rightarrow \infty\), the distribution of \(\hat{\boldsymbol{\theta}}(\mathbf{X})\) tends to the (multivariate) normal distribution \[ \hat{\boldsymbol{\theta}}(\mathbf{X}) \sim N(\boldsymbol{\theta}, I_E(\boldsymbol{\theta})^{-1}), \] where \(I_E(\boldsymbol{\theta})\) is the Fisher Information Matrix, defined for a vector \(\boldsymbol{\theta}=(\theta_1,\ldots,\theta_d)\) by \[\begin{equation} I_E(\boldsymbol{\theta})=\left(\begin{array}{ccc}e_{1,1}(\boldsymbol{\theta}) & \cdots & e_{1,d}(\boldsymbol{\theta})\\ \vdots & & \vdots \\ e_{d,1}(\boldsymbol{\theta}) & \cdots & e_{d,d}(\boldsymbol{\theta}) \end{array}\right), \end{equation}\] with \[\begin{equation} e_{i,j}(\boldsymbol{\theta})=E\left\{-\frac{\partial^2}{\partial \theta_i\,\partial \theta_j}\ell(\boldsymbol{\theta}; \mathbf{X})\right\}. \end{equation}\]

The expectation on the right hand side needs a little unpacking. It is the expectation of a function of the random variable \(\mathbf{X}\). This function is a partial derivative of another function: the log-likelihood function evaluated at \(\mathbf{X}\) and expressed as a function of \(\boldsymbol{\theta}\).
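As an illustration of this expectation (my own example, not taken from the text), consider a Poisson(\(\lambda\)) sample, for which \(-\partial^2\ell/\partial\lambda^2=\sum_i X_i/\lambda^2\) depends on the data, so the expectation genuinely averages over \(\mathbf{X}\); a Monte Carlo estimate recovers the closed form \(n/\lambda\).

```python
import numpy as np

rng = np.random.default_rng(0)
lam_true, n = 4.0, 25

# For a Poisson(lambda) sample, -d^2/dlambda^2 of the log-likelihood is sum(x_i)/lambda^2,
# so the expected information entry is E[sum(X_i)]/lambda^2 = n/lambda.
def neg_second_deriv(x, lam):
    return x.sum() / lam**2

mc = np.mean([neg_second_deriv(rng.poisson(lam_true, n), lam_true) for _ in range(50_000)])
print("Monte Carlo estimate of e(lambda):", mc)
print("closed form n/lambda             :", n / lam_true)
```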

5.3.2 Confidence intervals based on asymptotic normality

Now suppose we want to construct a \(100(1-\alpha)\%\) confidence interval for any particular element of \(\boldsymbol{\theta}\), say \(\theta_j\). For suitably large \(n\), we have \[\begin{equation} \hat{\theta_j} \sim N(\theta_j,\gamma_{j,j}),\tag{5.1} \end{equation}\] where \(\gamma_{j,j}\) is the \((j,j)\) element of \(I_E(\boldsymbol{\theta})^{-1}\). This then gives us an approximate interval as \[\begin{equation} (\hat{\theta_j}-z_{1-\frac{\alpha}{2}}\sqrt{\gamma_{j,j}},\hat{\theta_j}+ z_{1-\frac{\alpha}{2}}\sqrt{\gamma_{j,j}}), \end{equation}\] with \(z_{1-\frac{\alpha}{2}}\) the appropriate percentile from the standard normal distribution.

We will not be able to calculate this interval in practice, as we do not know the true value of \(\boldsymbol{\theta}\), which we would need to evaluate \(I_E(\boldsymbol{\theta})\). Instead, we approximate \(I_E(\boldsymbol{\theta})\) by the observed information matrix

\[\begin{equation} I_O(\boldsymbol{\theta})=\left(\begin{array}{ccc}-\frac{\partial^2}{\partial \theta_1^2}\ell(\boldsymbol{\theta}; \mathbf{x}) & \cdots & -\frac{\partial^2}{\partial \theta_1\partial \theta_d}\ell(\boldsymbol{\theta}; \mathbf{x})\\ \vdots & & \vdots \\ -\frac{\partial^2}{\partial \theta_d\partial \theta_1}\ell(\boldsymbol{\theta}; \mathbf{x}) & \cdots &-\frac{\partial^2}{\partial \theta_d^2}\ell(\boldsymbol{\theta}; \mathbf{x}) \end{array}\right), \end{equation}\] evaluated at \(\boldsymbol{\theta}=\hat{\boldsymbol{\theta}}(\mathbf{x})\). Then, denoting by \(\tilde{\gamma}_{i,j}\) the \((i,j)\)th element of the inverse of \(I_O(\boldsymbol{\theta})\), we use \[\begin{equation} (\hat{\theta_j}-z_{1-\frac{\alpha}{2}}\sqrt{\tilde{\gamma}_{j,j}},\hat{\theta_j}+ z_{1-\frac{\alpha}{2}}\sqrt{\tilde{\gamma}_{j,j}}), \end{equation}\] as an approximate confidence interval. Since we know that \(\hat{\boldsymbol{\theta}} \rightarrow \boldsymbol{\theta}\) as \(n\rightarrow \infty\), with probability 1, we would expect \(I_O(\boldsymbol{\theta})\) to be similar to \(I_E(\boldsymbol{\theta})\) for large sample sizes.
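Here is a minimal sketch of the whole recipe for a one-parameter model (an exponential rate, chosen only as an illustration): compute the MLE, approximate the observed information by a finite-difference second derivative, invert it, and form the Wald interval.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n = 80
x = rng.exponential(scale=1 / 0.7, size=n)     # simulated Exponential(rate = 0.7) data

def loglik(lam):
    return n * np.log(lam) - lam * x.sum()

lam_hat = n / x.sum()                          # MLE of the rate

# Observed information: minus the second derivative of the log-likelihood at the MLE,
# here approximated by a central finite difference
h = 1e-4
obs_info = -(loglik(lam_hat + h) - 2 * loglik(lam_hat) + loglik(lam_hat - h)) / h**2

se = np.sqrt(1 / obs_info)
z = norm.ppf(0.975)
print("95% Wald interval:", lam_hat - z * se, lam_hat + z * se)
print("check: closed-form observed information n/lam_hat^2 =", n / lam_hat**2, "vs", obs_info)
```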

Example 5.3 (Asymptotic confidence interval.)

Suppose that we have i.i.d. data \(\mathbf{x}=(x_1,x_2,\ldots,x_n)\), for which each data point is modelled as a random sample from \(N(\mu,\sigma^2)\) where \(\mu\) is unknown and \(\sigma^2\) is known. Using asymptotic normality of the maximum likelihood estimator, derive an approximate 95% confidence interval for \(\mu\), and show that it is approximately the same as the \(k=2\) likelihood region found in Example 5.1.

Solution

Using equation (5.1), our approximate distribution of the maximum likelihood estimator \(\hat{\mu}(\mathbf{X})\) is \[ \hat{\mu}(\mathbf{X}) \sim N(\mu, \gamma), \] where

\[\gamma ^{-1}=E\left(-\frac{\partial^2}{\partial\mu^2}\ell(\mu; \mathbf{X})\right).\] For the log likelihood we have \[ \ell(\mu;\mathbf{X}) =-\frac{n}{2} \log(2\pi\sigma^2)- \frac{1}{2\sigma^2}\sum_{i=1}^n(X_i - \mu)^2, \] and so \[ -\frac{\partial^2}{\partial\mu^2}\ell(\mu; \mathbf{X}) = \frac{n}{\sigma^2}. \] This is a constant, and so \[E\left(-\frac{\partial^2}{\partial\mu^2}\ell(\mu; \mathbf{X})\right) = \frac{n}{\sigma^2},\] hence our approximate distribution for \(\hat{\mu}(\mathbf{X})\) is \(N(\mu, \sigma^2/n)\) (so the approximation is actually exact in this case.) This gives us the familiar 95% confidence interval

\[ \left[\hat{\mu} - 1.96 \frac{\sigma}{\sqrt{n}},\quad \hat{\mu} + 1.96 \frac{\sigma}{\sqrt{n}}\right]. \] We can now see that the \(k=2\) likelihood region in Example 5.1 is the same, just with 2 in place of 1.96.
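A short numerical comparison (with simulated data and arbitrary values of \(\sigma\) and \(n\)) makes the “2 versus 1.96” point explicit:

```python
import numpy as np

rng = np.random.default_rng(3)
sigma, n = 1.5, 40
x = rng.normal(10.0, sigma, size=n)   # simulated data
mu_hat = x.mean()

# Asymptotic 95% interval versus the k = 2 likelihood region (half-width 2*sigma/sqrt(n))
ci_asymptotic = (mu_hat - 1.96 * sigma / np.sqrt(n), mu_hat + 1.96 * sigma / np.sqrt(n))
region_k2     = (mu_hat - 2.00 * sigma / np.sqrt(n), mu_hat + 2.00 * sigma / np.sqrt(n))

print("95% interval:", ci_asymptotic)
print("k=2 region  :", region_k2)
```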

5.3.3 The generalised likelihood ratio test

Previously, we suggested performing a (Neyman-Pearson) hypothesis test, based on whether a \(k\)-likelihood region contains the hypothesised \(\boldsymbol{\theta}_0\) under \(H_0\). For \(\boldsymbol{\theta}_0\) to lie outside the \(k\)-likelihood region, we require \[ \ell(\hat{\boldsymbol{\theta}}; \mathbf{x}) - \ell(\boldsymbol{\theta}_0;\mathbf{x}) > k, \]

or equivalently \[ \frac{L(\hat{\boldsymbol{\theta}}; \mathbf{x})}{L(\boldsymbol{\theta}_0;\mathbf{x})} > \exp(k). \]

This is an example of a generalised likelihood ratio test (GLRT): the test is derived from considering the ratio of two likelihood functions (but note that the ratio is more commonly written the other way round).

We’ll now state the GLRT in a more general form. Consider hypotheses of the form \[\begin{eqnarray*} H_0&:& \boldsymbol{\theta}\in\Theta_0,\\ H_1&:& \boldsymbol{\theta}\in\Theta_1. \end{eqnarray*}\] The GLRT says reject \(H_0\) if \[\begin{equation} \lambda:=\frac{\sup_{\boldsymbol{\theta}\in\Theta_0}L(\boldsymbol{\theta};\mathbf{x})}{\sup_{\boldsymbol{\theta}\in\Theta}L(\boldsymbol{\theta};\mathbf{x})}<k, \end{equation}\] where \(\Theta\) is the full parameter space for \(\boldsymbol{\theta}\) and with \(k\) chosen such that \[\begin{equation} P\left(\lambda<k|H_0 \mbox{ true}\right)=\alpha. \end{equation}\]

Note that because the null hypothesis may be composite, the numerator involves maximising the likelihood over all \(\boldsymbol{\theta}\) consistent with \(H_0\).

Now let \(\boldsymbol{\theta}\) be a vector of parameters, and write \(\boldsymbol{\theta}=(\boldsymbol{\theta}_r,\boldsymbol{\theta}_{-r})\), where \(\boldsymbol{\theta}_r\) is a subvector of \(r\) parameters from \(\boldsymbol{\theta}\), and \(\boldsymbol{\theta}_{-r}\) denotes the remaining parameters that make up \(\boldsymbol{\theta}\). Now consider a hypothesis of the form \[\begin{eqnarray*} H_0&:& \boldsymbol{\theta}_r=\boldsymbol{\theta}_0,\\ H_1&:& \boldsymbol{\theta}_r\neq\boldsymbol{\theta}_0. \end{eqnarray*}\] We write the GLR test statistic as \[\begin{equation} \lambda = \frac{L((\boldsymbol{\theta}_0,\hat{\boldsymbol{\theta}}_{-r});\mathbf{x})}{L(\hat{\boldsymbol{\theta}};\mathbf{x})}, \end{equation}\] where \(\hat{\boldsymbol{\theta}}_{-r}\) is the value of \(\boldsymbol{\theta}_{-r}\) that maximises the likelihood when \(\boldsymbol{\theta}_r\) is fixed at \(\boldsymbol{\theta}_0\), and \(\hat{\boldsymbol{\theta}}\) is the usual, unconstrained, m.l.e. It is then possible to prove that as the sample size tends to infinity, the distribution of \(-2\log \lambda\) tends to the \(\chi^2_r\) distribution, when \(H_0\) is true. Hence for a test of size \(\alpha\), we would reject \(H_0\) if \(-2\log\lambda\) is greater than the \(100(1-\alpha)\)th percentile of the \(\chi^2_r\) distribution.
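To see this result in action, the sketch below (my own toy simulation, not part of the notes) uses a normal model with both mean and variance unknown and tests \(H_0:\mu=\mu_0\), so \(r=1\) and \(\sigma^2\) is a nuisance parameter: simulating \(-2\log\lambda\) under \(H_0\) gives an upper 5% point close to that of \(\chi^2_1\).

```python
import numpy as np
from scipy.stats import norm, chi2

rng = np.random.default_rng(4)
n, mu0, n_sims = 50, 0.0, 5000

def minus_2_log_lambda(x):
    # Unconstrained MLEs: sample mean and (biased) sample variance
    mu_hat, var_hat = x.mean(), x.var()
    # Constrained MLE of the variance when mu is fixed at mu0 (the nuisance parameter)
    var_0 = np.mean((x - mu0) ** 2)
    l_full = norm.logpdf(x, mu_hat, np.sqrt(var_hat)).sum()
    l_null = norm.logpdf(x, mu0, np.sqrt(var_0)).sum()
    return -2 * (l_null - l_full)

stats = np.array([minus_2_log_lambda(rng.normal(mu0, 1.0, n)) for _ in range(n_sims)])
# Under H0 the statistic should be approximately chi-squared with r = 1 degree of freedom
print("simulated 95th percentile:", np.quantile(stats, 0.95))
print("chi2_1 95th percentile   :", chi2.ppf(0.95, df=1))
```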

Example 5.4 (GLRT for normally distributed data, known variance.)

Suppose that we have i.i.d. data \(\mathbf{x}=(x_1,x_2,\ldots,x_n)\), for which each data point is modelled as a random sample from \(N(\mu,\sigma^2)\) where \(\mu\) is unknown and \(\sigma^2\) is known. To test the hypothesis \(H_0:\mu = \mu_0\) against a two sided alternative, we use the test statistic \[ Z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}, \] where, for a test of size 0.05, we reject if \(|Z|\) is greater than the 97.5th percentile of the \(N(0,1)\) distribution. Show that this is equivalent to a GLRT, using the approximate distribution for \(\lambda\).

Solution

Under \(H_0\), we have \(\mu\) fixed at \(\mu_0\), and under \(H_1\), the likelihood is maximised at \(\mu = \bar{x}\). So we have \[ L(\mu_0; \mathbf{x}) = \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n(x_i - \mu_0)^2\right), \] and \[ L(\hat{\mu}; \mathbf{x}) = \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n(x_i - \bar{x})^2\right), \] and so \[ \lambda = \frac{L(\mu_0; \mathbf{x})}{L(\hat{\mu}; \mathbf{x})} = \exp\left(-\frac{1}{2\sigma^2}\left(\sum_{i=1}^n(x_i - \mu_0)^2 -\sum_{i=1}^n(x_i - \bar{x})^2 \right)\right). \]

Hence, after a little algebra (using the same identity as in Example 5.1, \(\sum_{i=1}^n(x_i-\mu_0)^2-\sum_{i=1}^n(x_i-\bar{x})^2=n(\bar{x}-\mu_0)^2\)),

\[ -2\log \lambda = \frac{n}{\sigma^2}(\bar{x} - \mu_0)^2. \] Using the approximate distribution for \(\lambda\), we reject \(H_0\) for a test of size 0.05 if \[-2\log \lambda > \chi^2_{1;\, 0.05}\approx 3.8415\approx(1.96)^2, \] where \(\chi^2_{1;\,0.05}\) denotes the 95th percentile of the \(\chi^2_1\) distribution. This is equivalent to rejecting \(H_0\) if \[ \left|\frac{\sqrt{n}}{\sigma}(\bar{x} - \mu_0) \right| > 1.96 = z_{0.975}. \] Note that for \(Z\sim N(0,1)\), we have \(Z^2\sim \chi^2_1\), which helps us to understand why the tests are equivalent.
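A short numerical check (simulated data; the parameter values are arbitrary) confirms that \(-2\log\lambda\) equals \(Z^2\) exactly here, so the two rejection rules agree:

```python
import numpy as np
from scipy.stats import norm, chi2

rng = np.random.default_rng(5)
sigma, n, mu0 = 2.0, 30, 1.0
x = rng.normal(1.5, sigma, size=n)            # simulated data; true mean 1.5, so H0 is false

z = (x.mean() - mu0) / (sigma / np.sqrt(n))   # usual z statistic
glrt = n / sigma**2 * (x.mean() - mu0) ** 2   # -2 log lambda from the derivation above

print("z^2 == -2 log lambda:", np.isclose(z**2, glrt))
print("reject by z-test    :", abs(z) > norm.ppf(0.975))
print("reject by GLRT      :", glrt > chi2.ppf(0.95, df=1))
```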