Chapter 1 MCMC Sampling Recap
\(\def \mb{\mathbb}\) \(\def \E{\mb{E}}\) \(\def \P{\mb{P}}\) \(\DeclareMathOperator{\var}{Var}\) \(\DeclareMathOperator{\cov}{Cov}\)
Aims of this chapter |
---|
1. Revision of Markov chain Monte Carlo for practical Bayesian inference. |
2. In particular, revise the random walk Metropolis-Hastings algorithm. |
3. Appreciate the limitations of random walk Metropolis-Hastings as an inference tool. |
1.1 Bayesian inference continued
We will continue the Bayesian approach to statistical inference that we studied in Semester 1. In particular, given data \(x\) and an unknown parameter of interest \(\theta\), we derive the posterior distribution \(f(\theta|x)\) so that, for example, if we want the posterior probability that \(\theta\) lies in some set \(R\), we would compute \[ \P(\theta\in R|x) = \int_R f(\theta|x)d\theta. \] Given our prior distribution \(f(\theta)\) and likelihood \(f(x|\theta)\), the integral above can be computed via \[ \int_R f(\theta|x)d\theta = \frac{\int_R f(\theta)f(x|\theta)d\theta}{\int_{\Theta} f(\theta)f(x|\theta)d\theta}, \] where \(\Theta\) is the parameter space of \(\theta\). Often, however, we cannot evaluate these integrals analytically (we usually won’t have conjugate priors we can use). The parameter \(\theta\) may also be high-dimensional, so to use Bayesian inference in practice we need a reliable way to compute (estimate) high-dimensional integrals.
1.2 Monte Carlo estimation
Suppose we are interested in an integral \[I=\E(g(X))=\int g(x) f(x) \mathrm{d}x.\] Let \(X_1,X_2,\ldots,X_n\) be independent random variables with pdf \(f(x)\). Then a Monte Carlo approximation to \(I\) is \[\begin{equation} \hat{I}_n=\frac{1}{n} \sum_{i=1}^n g(X_i). \end{equation}\]
The main idea in Monte Carlo integration is to approximate \(I\) by \(\hat{I}_n\).
For example, suppose we wish to find the theoretical mean \[\E(X)= \int_{-\infty}^{\infty} x f_X(x) \mathrm{d} x,\] then we can approximate the theoretical mean by the sample mean \[\E(X) \approx \frac{1}{n}\sum_{i=1}^n X_i,\] and this is the same as the Monte Carlo estimate.
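As a minimal sketch of this idea (in Python with NumPy; the choice of language and the toy integrand are ours, not the notes’), we can approximate \(\E(X^2)\) for \(X \sim N(0,1)\), whose true value is 1:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Monte Carlo estimate of I = E(g(X)) with X ~ N(0, 1) and g(x) = x^2.
# The true value is E(X^2) = Var(X) = 1.
n = 10_000
x = rng.normal(size=n)   # X_1, ..., X_n drawn independently from f
I_hat = np.mean(x**2)    # (1/n) * sum of g(X_i)
print(I_hat)             # should be close to 1
```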
1.2.1 Properties of the Monte Carlo estimate
- \(\hat{I}_n\) is an unbiased estimator of \(I\).
- \(\hat{I}_n\) converges to \(I\) as \(n \rightarrow \infty\) by the strong law of large numbers (SLLN).
Although we know that \(\hat{I}_n\) converges, we do not know how fast. We need to know how large \(n\) must be to achieve a certain error. The variance of the Monte Carlo estimate is given by \[\E[(\hat{I}_n-I)^2] = \frac{\sigma^2}{n},\] where \(\sigma^2 = \var(g(X))\). Thus the ‘root mean square error’ (RMSE) of \(\hat{I}_n\) is \[\mbox{RMSE}(\hat{I}_n) = \frac{\sigma}{\sqrt{n}} = O(n^{-1/2}).\]
Thus, our estimate is more accurate as \(n \rightarrow \infty\), and is less accurate when \(\sigma^2\) is large. \(\sigma^2\) will usually be unknown, but we can estimate it: \[\hat{\sigma}^2=\frac{1}{n-1} \sum_{i=1}^n (g(X_i)-\hat{I}_n)^2.\] We call \(\frac{\hat{\sigma}}{\sqrt{n}}\) the Monte Carlo standard error.
We write \[\mbox{RMSE}(\hat{I}_n) = O(n^{-1/2}),\] to emphasise the rate of convergence of the error with \(n\). To gain 1 extra digit of accuracy requires a 100-fold increase in \(n\); a 3-digit improvement would require us to multiply \(n\) by \(10^6\). Consequently Monte Carlo is not usually suited to problems where we need very high accuracy. Although the convergence is slow (the RMSE decreases only slowly with \(n\)), it has the nice properties that the RMSE:
- Does not depend on \(d=\dim(x)\),
- Does not depend on the smoothness of \(f\).
Consequently Monte Carlo is very competitive in high-dimensional problems, and in problems where \(f\) is not smooth.
In addition to the rate of convergence, the central limit theorem tells us the asymptotic distribution of \(\hat{I}_n\): \[\frac{\sqrt{n} (\hat{I}_n-I)}{\sigma} \rightarrow N(0,1) \mbox{ in distribution as } n \rightarrow \infty.\] Informally, \(\hat{I}_n\) is approximately \(N(I, \frac{\sigma^2}{n})\) for large \(n\). This allows us to calculate confidence intervals for \(I\).
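Continuing the sketch above, the estimated standard error \(\hat{\sigma}/\sqrt{n}\) and a CLT-based approximate 95% confidence interval for \(I\) follow directly from the sampled values \(g(X_i)\):

```python
import numpy as np

rng = np.random.default_rng(seed=1)

n = 10_000
x = rng.normal(size=n)
g = x**2                               # g(X_i) for each sample

I_hat = np.mean(g)                     # Monte Carlo estimate of I
se = np.std(g, ddof=1) / np.sqrt(n)    # Monte Carlo standard error, sigma_hat / sqrt(n)

# Approximate 95% confidence interval from the CLT: I_hat +/- 1.96 * se.
ci = (I_hat - 1.96 * se, I_hat + 1.96 * se)
print(I_hat, se, ci)
```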
1.2.2 Expressing quantities as expectations
In many statistical inference problems, we are interested in estimating some summary quantity, such as a probability. It is useful that we can often write these quantities in terms of expectations, and so apply our Monte Carlo estimate to them.
- For example, a variance can be expressed in terms of two expectations:
\[\begin{equation}
\var(X)=\E(X^2)-\E(X)^2.
\end{equation}\]
- A probability \(\P(X<a)\) can be expressed as an expectation: the expectation of an indicator function of \(X\). An indicator function \(\mb{I}(E)\) of an event \(E\) is defined as \[\begin{equation} \mb{I}(E)=\left\{\begin{array}{cl}1 & \mbox{if $E$ is true,} \\ 0 & \mbox{if $E$ is false.} \end{array} \right. \end{equation}\]
Then we have
\[\begin{eqnarray}
\E\{\mb{I}(X<a)\}&=&1\times \P(X<a)+0\times \P(X\ge a) \\ &=& \P(X<a).
\end{eqnarray}\]
- A percentile of the distribution of a random variable \(X\) can be estimated by taking the sample percentile from the generated sample of values \(X_1,\ldots,X_n\). Informally, we would expect the estimate to become more accurate as \(n\) increases. Determining a percentile is equivalent to inverting the distribution function; if, for example, we wish to know the 95th percentile, we must find \(\nu\) such that \[\begin{equation} \P(X\le \nu)=0.95, \end{equation}\] so the more accurately we estimate the distribution function \(F(x)\), the more accurate we would expect the estimate of any particular percentile to be.
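A short sketch (again in Python, with \(X \sim N(0,1)\) as an illustrative choice) estimating a variance via the two expectations, a probability via the indicator trick, and a percentile from the sample:

```python
import numpy as np

rng = np.random.default_rng(seed=2)

n = 100_000
x = rng.normal(size=n)

# Variance via the two expectations E(X^2) - E(X)^2 (true value 1).
var_hat = np.mean(x**2) - np.mean(x)**2

# P(X < a) as the expectation of an indicator function: the comparison
# x < a gives an array of 0/1 values, whose mean estimates the probability.
a = 1.0
p_hat = np.mean(x < a)          # true value is about 0.8413 for N(0, 1)

# 95th percentile from the sample (true value about 1.6449 for N(0, 1)).
q95_hat = np.quantile(x, 0.95)

print(var_hat, p_hat, q95_hat)
```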
1.2.3 Estimation of general integrals
In general, let the integral of interest be \[\begin{equation} R=\int f(x) \mathrm{d}x, \label{intI} \end{equation}\] where \(f(x)\) is now a general integrand, not necessarily a density. Suppose that \(g(x)\) is some density function from which we can easily produce a sample of values \(X_1,\ldots,X_n\), and that \(g(x)>0\) wherever \(f(x)\neq 0\). Then \[\begin{eqnarray} R&=&\int \frac{f(x)}{g(x)} g(x) \mathrm{d}x\\ &=& \int h(x) g(x) \mathrm{d}x, \end{eqnarray}\] with \(h(x)=\frac{f(x)}{g(x)}\). So we now have \[\begin{equation} R=\E\{h(X)\}, \end{equation}\] where \(X\) has the density function \(g(x)\). If we now sample \(X_1,\ldots,X_n\) from \(g(x)\) and evaluate \(h(X_1),\ldots,h(X_n)\), then \[\begin{equation} \hat{R}=\frac{1}{n}\sum_{i=1}^n h(X_i) \end{equation}\] is an unbiased estimator of \(R\).
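As an illustrative sketch of this idea (the integrand \(f(x)=e^{-x^2}\) and the choice of \(g\) as the standard normal density are ours), we estimate \(R = \int e^{-x^2}\,\mathrm{d}x = \sqrt{\pi} \approx 1.7725\):

```python
import numpy as np

rng = np.random.default_rng(seed=3)

def f(x):
    """Integrand of interest: f(x) = exp(-x^2); true integral is sqrt(pi)."""
    return np.exp(-x**2)

# A density g we can easily sample from; here the standard normal.
n = 100_000
x = rng.normal(size=n)
g_pdf = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)   # N(0, 1) density at each X_i

h = f(x) / g_pdf     # h(x) = f(x) / g(x)
R_hat = np.mean(h)   # unbiased estimator of R
print(R_hat)         # should be close to sqrt(pi) ~ 1.7725
```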
1.3 Markov chain Monte Carlo (MCMC)
In Bayesian statistics, we are faced with the problem that we cannot often derive the posterior distribution analytically. We can, however, write down the posterior distribution up to proportionality, simply by multiplying the prior and likelihood \[ f(\theta|x) \propto f(\theta)f(x|\theta).\]
If we can somehow produce a sample \(\theta_1,\ldots,\theta_n\) from the posterior distribution then, using the Monte Carlo ideas described in the previous section, we can obtain any posterior summary of \(\theta\) that we want, and to arbitrary accuracy given a sufficiently large sample. One way to obtain this sample is through Markov chain Monte Carlo methods.
Recall that a sequence of random variables \(X_1,\ldots,X_n\) is a Markov chain if it satisfies the Markov property, i.e. that \[\P(X_{t+1}|X_1,\ldots,X_t)=\P(X_{t+1}|X_t),\] for all \(t\ge 1\). If a Markov chain (on a finite state space) is aperiodic and irreducible, then it has a unique stationary distribution, \(\boldsymbol{\pi}\), such that \(\lim_{t \rightarrow \infty} \P(X_t=i)=\boldsymbol{\pi}_i\).
Our aim for Bayesian inference therefore is to specify a transition kernel for a Markov chain that allows us to sample \(\theta_1,\ldots,\theta_n\) such that the stationary distribution is our posterior. The important algorithm that you were introduced to in Semester 1 to achieve this task is the Metropolis-Hastings algorithm, which we recap next.
1.3.1 The Metropolis-Hastings (MH) algorithm
We will revisit the MH algorithm for generating a sample \(\theta_1,\ldots,\theta_n\) from the posterior distribution \(f(\theta|x)\).
Given a sample \(\theta_i\), we:
- Simulate the proposal \(\theta^*\sim q(\cdot|\theta_i)\).
- Evaluate \[\begin{equation} m=\min\left\lbrace1, \frac{f(\theta^*)f(x|\theta^*)q(\theta_i|\theta^*)}{f(\theta_i)f(x|\theta_i)q(\theta^*|\theta_i)} \right\rbrace, \tag{1.1} \end{equation}\]
- Accept \(\theta_{i+1}=\theta^*\) with probability \(m\), and set \(\theta_{i+1}=\theta_i\) otherwise.
The MH acceptance probability, given by \(m\) above, comprises two parts:
- The ratio of the posterior density between the proposed and current parameter values: \[\frac{f(\theta^*)f(x|\theta^*)}{f(\theta_i)f(x|\theta_i)},\] and
- The ratio of the proposal density between the forwards and backwards proposal steps: \[\frac{q(\theta_i|\theta^*)}{q(\theta^*|\theta_i)}.\]

We’ll particularly focus on the random walk MH algorithm, where the proposal distribution is symmetric, i.e. \(q(i|j)=q(j|i)\) for all \(i,j\). This is commonly the case in practice, a standard choice being a Gaussian proposal centred on \(\theta_i\). The MH acceptance probability therefore simplifies to \(m=\min\left\lbrace1, \frac{f(\theta^*)f(x|\theta^*)}{f(\theta_i)f(x|\theta_i)} \right\rbrace\), and depends only on the ratio of the posterior density.
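As a minimal sketch of the random walk MH algorithm (the standard normal ‘target’, the step size, and the starting value are illustrative choices of ours, not from the notes), working on the log scale to avoid numerical underflow:

```python
import numpy as np

rng = np.random.default_rng(seed=4)

# Log of an unnormalised target density. Here a standard normal stands in
# for the unnormalised posterior f(theta) * f(x | theta), which is all the
# algorithm requires.
def log_target(theta):
    return -0.5 * theta**2

n_iters = 10_000
step = 1.0     # standard deviation of the Gaussian proposal (a tuning choice)
theta = 0.0    # arbitrary starting value
samples = np.empty(n_iters)

for i in range(n_iters):
    proposal = theta + step * rng.normal()   # symmetric random walk proposal
    # Accept with probability min(1, target ratio); comparing logs is
    # equivalent and avoids numerical underflow.
    if np.log(rng.uniform()) < log_target(proposal) - log_target(theta):
        theta = proposal
    samples[i] = theta

print(samples.mean(), samples.std())   # should be close to 0 and 1
```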
1.3.2 The problem with random walk proposals
Under the random walk MH scenario, the algorithm amounts to choosing at random a new direction and distance to ‘step’ away from the current location (given by \(\theta_i\)) within the parameter space. The shape and magnitude of the posterior density play no role in how we propose this step; they determine only whether or not we accept the proposal once it has been made. If our proposed value \(\theta^*\) has a higher posterior density than the current value \(\theta_i\), then we automatically accept the proposal. If not, we use the MH ratio of densities to decide probabilistically whether to accept the proposal.
The theory of Markov chains tells us that a well designed MCMC sampler, such as the MH algorithm, will eventually converge to its stationary distribution (the posterior). But this theory does not tell us that this convergence will happen within a computationally feasible number of iterations.
Convergence becomes particularly problematic for random walk MH in parameter spaces of high dimension. As the dimension increases, the region of the parameter space that contributes significantly to the posterior distribution shrinks relative to the whole space, and it becomes hard to design a ‘good’ proposal distribution for the MH sampler. This is because:
- If the proposed steps are too large, they are almost always rejected. This means the MCMC chain will get ‘stuck’ in an area of the parameter space.
- If the proposed steps are made very small so that acceptances occur, then the chain moves around the parameter space prohibitively slowly.

In either case, the result can be an MCMC sample with an effective sample size of almost zero, which has not explored the full parameter space; the short experiment below illustrates the tradeoff. In high-dimensional parameter spaces, multi-modality and complex correlations between dimensions further exacerbate this problem.
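A small illustrative experiment (the standard normal target and the step sizes are arbitrary choices of ours, not from the notes) makes the tradeoff concrete by recording the acceptance rate of the random walk MH sampler for an over-large and an over-small step:

```python
import numpy as np

rng = np.random.default_rng(seed=6)

def acceptance_rate(step, n_iters=10_000):
    """Acceptance rate of random walk MH on a standard normal target."""
    theta, accepted = 0.0, 0
    for _ in range(n_iters):
        proposal = theta + step * rng.normal()
        # Log acceptance ratio for the N(0, 1) target.
        if np.log(rng.uniform()) < 0.5 * (theta**2 - proposal**2):
            theta, accepted = proposal, accepted + 1
    return accepted / n_iters

print(acceptance_rate(50.0))   # huge steps: almost everything is rejected
print(acceptance_rate(0.01))   # tiny steps: almost everything is accepted,
                               # but the chain barely moves
```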
The problems discussed here form the motivation for our next chapter—we aim to define an MCMC sampler that produces samples from our target posterior distribution, but which moves around the parameter space in a more efficient manner than the random walk.
1.3.3 The Gibbs algorithm
An improvement on MCMC algorithms that rely on random walks, which you met in Semester 1, is the Gibbs algorithm, a special case of the MH sampler. Rather than proposing a step in the parameter space at random, without consideration of the shape of the posterior distribution, the Gibbs algorithm makes proposals based on the shape of the posterior.
We first recap the Gibbs algorithm for sampling the posterior of the parameter vector \((\theta_1,\ldots,\theta_n)\).
Given \((\theta_{1,t},\ldots,\theta_{n,t})\):
- Choose a random update ordering \(\lbrace d_1,\ldots,d_n \rbrace\) of the \(n\) dimensions, i.e. a permutation of \(\lbrace 1,\ldots,n\rbrace\).
- In the order determined, sample from the conditional posterior for each parameter, using the most up-to-date parameter values. For example:
- set \(\theta_{d_1,t+1}\) equal to a sample from \(f(\theta_{d_1}|x,\theta_{d_2, t},\ldots,\theta_{d_n, t})\),
- set \(\theta_{d_2,t+1}\) equal to a sample from \(f(\theta_{d_2}|x,\theta_{d_1, t+1},\theta_{d_3, t},\ldots,\theta_{d_n, t})\),
- set \(\theta_{d_3,t+1}\) equal to a sample from \(f(\theta_{d_3}|x,\theta_{d_1, t+1},\theta_{d_2, t+1},\theta_{d_4, t},\ldots,\theta_{d_n, t})\),
- and so forth.
Intuitively, in two dimensions, the Gibbs sampler amounts to exploring the parameter space by taking steps alternately along the horizontal (\(x\)) and vertical (\(y\)) axes. If the current location is \((x,y)\) and the next update is in the \(y\) direction, then the conditional posterior distribution of \(y\) is considered at the fixed value \(x\). Because we sample \(y\) from this conditional distribution, we take into account the shape of the conditional posterior, and are likely to move towards a region of high conditional posterior density.
The Gibbs sampler is a special case of the MH sampler, but we do not have to evaluate the acceptance probability given in Equation (1.1): because the proposal distribution is exactly the conditional posterior, the acceptance probability simplifies to 1.
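As a sketch of the Gibbs sampler (the bivariate normal ‘posterior’ with correlation \(\rho\) is an illustrative choice of ours, and for simplicity we use a fixed update order rather than the random ordering described above), both full conditionals are normal: \(\theta_1|\theta_2 \sim N(\rho\theta_2, 1-\rho^2)\), and symmetrically for \(\theta_2|\theta_1\):

```python
import numpy as np

rng = np.random.default_rng(seed=5)

rho = 0.9                      # correlation of the bivariate normal target
cond_sd = np.sqrt(1 - rho**2)  # standard deviation of each full conditional
n_iters = 10_000
theta = np.zeros(2)            # arbitrary starting values
samples = np.empty((n_iters, 2))

for t in range(n_iters):
    # Sample each coordinate from its full conditional, always using the
    # most up-to-date value of the other coordinate.
    theta[0] = rng.normal(rho * theta[1], cond_sd)
    theta[1] = rng.normal(rho * theta[0], cond_sd)
    samples[t] = theta

print(np.corrcoef(samples.T)[0, 1])   # should be close to rho
```

Note how each update moves parallel to one axis: when \(\rho\) is close to 1, the conditional standard deviation \(\sqrt{1-\rho^2}\) is small, so each step is short. This previews the problem discussed next.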
1.3.4 The problem with Gibbs
The Gibbs sampler appears to solve our problem of proposing random directions to move around the parameter space. However, it has a number of disadvantages of its own.
- We need to be able to derive the conditional posterior distribution of every dimension of our parameter vector and be able to sample from it. In practice, this is often not possible, particularly in high dimensions.
- We can only move in a subset of dimensions at a time. This can be slow in high dimensions, and is exacerbated when correlations are high or ridges exist in the posterior. Intuitively, the stepping directions of Gibbs in two dimensions are alternately horizontal and vertical, so if there is a high correlation between the two dimensions then the region of high posterior density lies along a diagonal. The geometry of the stepping directions is therefore not aligned with the geometry of the posterior.