Chapter 3 Models and data
In the previous chapter we estimated a parameter value based on a single data point. In practice we will usually have more than one observation, and in this chapter we discuss how to carry out maximum likelihood estimation using many data points.
To simplify the presentation we will assume here that we are working with continuous random variables. For discrete random variables, we can simply replace the term “probability density function” with “probability mass function.” However, in a slight abuse of notation, we will use \(f_X(x)\) to represent either a probability density function or a probability mass function: for discrete \(X\) we will write \(f_X(x) = Pr(X = x)\).
3.1 Data: notation
Typically we will have a set of \(n\) data points which we can think of as a vector, \(\mathbf{x}=(x_1,x_2,\ldots,x_n)\). We will think of the data as being a realisation of a random vector \(\mathbf{X}=(X_1,X_2,\ldots,X_n)\). Note the use of capital letters for the random variables and lower case letters for the values they take.
The random vector \(\mathbf{X}\) will have a joint probability density function \[ f_{\mathbf{X}}(\mathbf{x})=f_{X_1,X_2,\ldots,X_n}(x_1,x_2,\ldots,x_n). \] This function will be unknown (it will depend on unknown parameters); the aim of inference is to obtain information about it.
We will assume that our data points \(x_1,x_2,\ldots,x_n\) come from independent, identically distributed experiments; with this in mind, we call them i.i.d. samples. Correspondingly, we assume that the random variables \(X_1,X_2,\ldots,X_n\) are independent and identically distributed. In this case, the joint probability density function \(f_{\mathbf{X}}(\mathbf{x})\) factorises into a product of one term per experiment: \[\begin{equation} f_{\mathbf{X}}(\mathbf{x})=\prod_{i=1}^n f_X(x_i), \end{equation}\] where \(f_X(x)\) is the common probability density function of the random variables \(X_1,X_2,\ldots,X_n\).
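As a quick numerical illustration, here is a minimal Python sketch (with hypothetical data, using `scipy.stats`) of this factorisation for i.i.d. \(N(0,1)\) samples:

```python
# A minimal sketch: for i.i.d. data, the joint density is the product
# of the common marginal density evaluated at each data point.
import numpy as np
from scipy.stats import norm

x = np.array([0.3, -1.2, 0.7])  # hypothetical data vector (x_1, x_2, x_3)

# Joint p.d.f. of three i.i.d. N(0, 1) variables, evaluated at x:
joint_pdf = np.prod(norm.pdf(x, loc=0, scale=1))
print(joint_pdf)
```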
3.2 Models and parameters
We assume that we already know that the joint probability density function of \(\mathbf{X}\) takes a particular form, usually involving some standard distribution. We refer to this as our model. However, the parameters of this standard distribution are unknown, and our aim in analysing the data will be to obtain good choices of values for these parameters, based on the data we have.
Example 3.1 (Models, parameters and data: aerosols.)
The ‘particle size distribution’ of an aerosol is the distribution of the diameter of aerosol particles within a typical region of air. The term is also used for particles within a powder, or suspended in a fluid.
In many situations, the particle size distribution is modelled using the log-normal distribution. If a random variable \(Y\) has log-normal distribution, this simply means that \(\log Y\) is normally distributed.
The p.d.f. of the log-normal distribution is \[ f_Y(y)= \begin{cases} \frac{1}{y\sigma\sqrt{2\pi}}\exp\left(-\frac{(\log y-\mu)^2}{2\sigma^2}\right) & \text{if }y\in(0,\infty)\\ 0 & \text{otherwise}. \end{cases} \]
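A quick way to see where this formula comes from: writing \(Z=\log Y\sim N(\mu,\sigma^2)\), the change-of-variables formula for the strictly increasing map \(y\mapsto\log y\) gives, for \(y>0\), \[ f_Y(y)=f_Z(\log y)\,\frac{d}{dy}\log y=\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(\log y-\mu)^2}{2\sigma^2}\right)\cdot\frac{1}{y}. \]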
It is typically reasonable to assume that the diameters of particles are independent. Assuming this model, find the joint probability density function of the diameters observed in a sample of \(n\) particles, and state the parameters of the model.
Solution
The parameters of this distribution, and hence also the parameters of our model, are \(\mu\in\mathbb{R}\) and \(\sigma\in(0,\infty)\). Since the diameters of particles are assumed to be independent, the joint probability density function of \(\mathbf{Y}=(Y_1,Y_2,\ldots,Y_n)\), where \(Y_i\) is the diameter of the \(i^{th}\) particle, is \[\begin{align*} f_{\mathbf{Y}}(y_1,\ldots,y_n) &=\prod\limits_{i=1}^n f_{Y_i}(y_i)\\ &=\begin{cases} \frac{1}{(2\pi\sigma^2)^{n/2}}\frac{1}{y_1y_2\ldots y_n}\exp\left(-\sum\limits_{i=1}^n\frac{(\log y_i-\mu)^2}{2\sigma^2}\right) & \text{if }y_i>0\text{ for all }i\\ 0 & \text{otherwise}. \end{cases} \end{align*}\] Note that if one (or more) of the \(y_i\) is less than or equal to zero, then \(f_{Y_i}(y_i)=0\), and hence \(f_{\mathbf{Y}}(y_1,\ldots,y_n)=0\) too.
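For a numerical sanity check, here is a sketch (in Python, with hypothetical values of \(\mu\), \(\sigma\) and the data) comparing this closed-form joint density against `scipy.stats.lognorm`; note that scipy parametrises the log-normal by `s = sigma` and `scale = exp(mu)`.

```python
# A sketch (hypothetical parameters and data) checking the joint p.d.f.
# formula against scipy's log-normal distribution.
import numpy as np
from scipy.stats import lognorm

mu, sigma = 1.0, 0.5            # hypothetical parameter values
y = np.array([2.1, 3.4, 1.7])   # hypothetical particle diameters

# Product of marginal densities, via scipy (s = sigma, scale = exp(mu)):
joint_scipy = np.prod(lognorm.pdf(y, s=sigma, scale=np.exp(mu)))

# The closed-form joint density derived above:
n = len(y)
joint_formula = (2 * np.pi * sigma**2) ** (-n / 2) / np.prod(y) * \
    np.exp(-np.sum((np.log(y) - mu) ** 2) / (2 * sigma**2))

print(joint_scipy, joint_formula)  # the two values agree
```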
It may seem odd to declare that \(f\) is unknown, and then assume that in fact \(f\) takes a particular form with only unknown parameters. There are statistical methods designed to handle a completely unknown \(f\), but they are outside the scope of this course. In many situations it is sensible to assume a carefully chosen model with unknown parameters.
Our choice of model may well be wrong, but we hope that it is approximately correct; there are various ‘goodness of fit’ tests to help us check.
We denote the parameters of our model by \(\boldsymbol{\theta}\), which we represent as a vector \(\boldsymbol{\theta}=(\theta_1,\theta_2,\ldots)\). We write \(\Theta\) for the set of possible parameter values.
Sometimes some of the unknown parameters are so-called nuisance parameters: their values are unknown, and we have to take account of this in our analysis, but they are not what we are really interested in.
Given a model, and a particular set of parameter values \(\boldsymbol{\theta}\), we write the probability density function of \(\mathbf{X}\) as \(f_{\mathbf{X}}(\mathbf{x};\boldsymbol{\theta})\), to make sure that we don’t forget the importance of the parameters.
3.3 Likelihood functions for i.i.d. data
How can we apply the ideas of maximum likelihood estimation in our new setting? The key is to note that Definition 2.1 already makes sense in the multivariate case, with our vector \(\mathbf{X}\) in place of \(X\) and a vector of data \(\mathbf{x}\) in place of \(x\). We also allow more than just one unknown parameter.
So, to summarise, let \(\mathbf{X}=(X_1,\ldots,X_n)\) be a random vector, with a known distribution that has one or more unknown parameters \(\boldsymbol\theta=(\theta_1,\theta_2,\ldots,\theta_j)\). Write \(\Theta\) for the set of all possible choices \(\boldsymbol\theta\) of the parameter(s). Let \(\mathbf{x}\) be our vector of data, which we think of as a sample of \(\mathbf{X}\).
The likelihood function of \(\mathbf{X}\), given the data \(\mathbf{x}\), is the function \(L:\Theta\to\mathbb{R}\) defined by \[L(\boldsymbol\theta;\mathbf{x})=f_{\mathbf{X}}(\mathbf{x};\boldsymbol\theta).\] The likelihood function is therefore the joint density of the observed data, expressed as a function of the model parameters.
The (hopefully unique) value \(\boldsymbol\theta\in\Theta\) which maximises \(L(\boldsymbol\theta;\mathbf{x})\) is known as the maximum likelihood estimator of \(\boldsymbol\theta\), written \(\widehat{\boldsymbol\theta}\).
It is worth noting that, when an explicit numerical value is found for \(\hat\theta\), some people refer to the value found for \(\hat\theta\) as a maximum likelihood estimate (and they would only refer to the general formula for \(\hat\theta\), in terms of \(\mathbf{x}\), as the maximum likelihood estimator). You can choose whichever convention you prefer.
We are mainly interested in the case where our data are i.i.d. samples, so we assume the \(X_1,X_2,\ldots,X_n\) are independent and identically distributed. In this case, by the factorisation in Section 3.1, the likelihood function of our model is \[\begin{equation} L(\boldsymbol\theta;\mathbf{x})=f_\mathbf{X}(\mathbf{x};\boldsymbol\theta)=\prod\limits_{i=1}^n f_X(x_i;\boldsymbol\theta), \end{equation}\] where \(f_X\) is the common p.d.f. of the \(X_i\).
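As a sketch (in Python; the function names here are hypothetical), the i.i.d. likelihood can be computed generically from any marginal density or mass function:

```python
# A minimal generic sketch of the i.i.d. likelihood: L(theta; x) is the
# product of the common marginal density f_X(x_i; theta) over the data.
import numpy as np

def likelihood(theta, x, marginal_pdf):
    """L(theta; x) = prod_i f_X(x_i; theta) for i.i.d. data x."""
    return np.prod([marginal_pdf(xi, theta) for xi in x])

# Example: Bernoulli(theta) with data (0, 1, 1), as in Example 3.2 below.
bern_pmf = lambda x, theta: theta if x == 1 else 1 - theta
print(likelihood(2/3, [0, 1, 1], bern_pmf))  # (1/3)*(2/3)*(2/3) = 4/27
```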
Example 3.2 (Maximum likelihood estimation with i.i.d. data.)
Let \(X\sim Bern(\theta)\), where \(\theta\) is an unknown parameter. Suppose that we have \(3\) independent samples of \(X\), which are \[\mathbf{x}=(0,1,1).\] Find the likelihood function of \(\theta\), given this data.
Solution
The probability mass function of a single \(Bern(\theta)\) random variable is \[f_X(x;\theta)= \begin{cases} \theta & \text{if }x=1\\ 1-\theta & \text{if }x=0\\ 0 & \text{otherwise} \end{cases} \]
Since our three samples are independent, we model \(\mathbf{x}\) as a sample from the joint distribution of \(\mathbf{X}=(X_1,X_2,X_3)\), where \[\begin{align*} f_\mathbf{X}(\mathbf{x};\theta)=\prod\limits_{i=1}^3 f_{X_i}(x_i;\theta) \end{align*}\] and \(f_{X_i}\) is the probability mass function of a single \(Bern(\theta)\) random variable. Since \(f_{X_i}\) has several cases, it would be unhelpful to try to expand out this formula before we put in values for the \(x_i\). Our likelihood function is therefore \[\begin{align*} L(\theta;\mathbf{x})&=f_{X_1}(0;\theta)\,f_{X_2}(1;\theta)\,f_{X_3}(1;\theta)\\ &=(1-\theta)\cdot\theta\cdot\theta\\ &=\theta^2-\theta^3. \end{align*}\] The range of values that the parameter \(\theta\) can take is \(\Theta=[0,1]\).
Find the maximum likelihood estimator of \(\theta\), given the data \(\mathbf{x}\).
Solution
We seek to maximise \(L(\theta;\mathbf{x})\) for \(\theta\in[0,1]\). Differentiating once, \[\frac{dL}{d\theta}=2\theta-3\theta^2=\theta(2-3\theta)\] so the turning points are at \(\theta=0\) and \(\theta=\frac23\). Differentiating again, \[\frac{d^2L}{d\theta^2}=2-6\theta,\] which gives \(\frac{d^2L}{d\theta^2}\big|_{\theta=0}=2\) and \(\frac{d^2L}{d\theta^2}\big|_{\theta=2/3}=2-4=-2\). Hence, \(\theta=0\) is a local minimum and \(\theta=\frac23\) is a local maximum. Checking the endpoints, \(L(0;\mathbf{x})=L(1;\mathbf{x})=0\), while \(L(\frac23;\mathbf{x})=\frac{4}{27}>0\), so \(\theta=\frac23\) maximises \(L(\theta;\mathbf{x})\) over \(\theta\in[0,1]\). The maximum likelihood estimator of \(\theta\) is therefore \[\hat\theta=\frac23.\] This is, hopefully, reassuring. The number of \(1\)s in our sample of \(3\) was \(2\), so (using independence) \(\theta=\frac23\) seems like a good guess.
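The calculus above can also be checked numerically; here is a minimal sketch using `scipy.optimize` (minimising \(-L\) is equivalent to maximising \(L\)):

```python
# A sketch numerically confirming the MLE: maximise L(theta) = theta^2 - theta^3
# over [0, 1] by minimising its negative.
from scipy.optimize import minimize_scalar

res = minimize_scalar(lambda t: -(t**2 - t**3), bounds=(0, 1), method="bounded")
print(res.x)  # approximately 2/3
```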
Example 3.3 (Maximum likelihood estimation (radioactive decay).)
Atoms of radioactive elements decay as time passes, meaning that any such atom will, at some point in time, suddenly break apart. This process is known as radioactive decay.
The time taken for a single atom of, say, carbon-15 to decay is usually modelled as an exponential random variable, with unknown parameter \(\lambda\in(0,\infty)\). The parameter \(\lambda\) is known as the ‘decay rate.’ The times at which atoms decay are known to be independent.
Using this model, find the likelihood function for the time to decay of a sample of \(n\) carbon-15 atoms.
Solution
The decay time \(X_i\) of the \(i^{th}\) atom is exponential with parameter \(\lambda\in(0,\infty)\), and therefore has p.d.f. \[f_{X_i}(x_i;\lambda)= \begin{cases} \lambda e^{-\lambda x_i} & \text{if }x_i>0\\ 0 & \text{otherwise}. \end{cases} \] Since each atom decays independently, the joint distribution of \(\mathbf{X}=(X_i)_{i=1}^n\) is \[\begin{align*} f_{\mathbf{X}}(\mathbf{x};\lambda) =\prod\limits_{i=1}^n f_{X_i}(x_i;\lambda) &= \begin{cases} \prod\limits_{i=1}^n \lambda e^{-\lambda x_i} & \text{if }x_i>0\text{ for all }i\\ 0 & \text{otherwise.} \end{cases}\\ &= \begin{cases} \lambda^n \exp\left(-\lambda \sum_{i=1}^nx_i\right) & \text{if }x_i>0\text{ for all }i\\ 0 & \text{otherwise.} \end{cases} \end{align*}\] Therefore, the likelihood function is \[L(\lambda;\mathbf{x})= \begin{cases} \lambda^n \exp\left(-\lambda \sum_{i=1}^nx_i\right) & \text{if }x_i>0\text{ for all }i\\ 0 & \text{otherwise.} \end{cases} \] The range of possible values of the parameter \(\lambda\) is \(\Theta=(0,\infty)\).
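As a sketch, this likelihood translates directly into Python (the function name is hypothetical):

```python
# A minimal sketch of the exponential likelihood derived above:
# L(lambda; x) = lambda^n * exp(-lambda * sum(x)) when all x_i > 0.
import numpy as np

def exp_likelihood(lam, x):
    """L(lambda; x) for i.i.d. Exponential(lambda) data."""
    x = np.asarray(x)
    if np.any(x <= 0):
        return 0.0
    return lam ** len(x) * np.exp(-lam * x.sum())

print(exp_likelihood(0.32, [0.50, 2.19, 0.88]))  # small hypothetical example
```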
Suppose that we have sampled the decay times of \(15\) carbon-15 atoms (in seconds, accurate to two decimal places), and found them to be \[\begin{align*} \mathbf{x}&=(0.50,\, 2.19,\, 0.88,\, 4.06,\, 9.75,\, 2.62,\, 0.13,\, 2.70,\, 0.03,\, 0.28,\, 4.15,\, 9.52,\, 2.67,\, 3.79,\, 4.31), \end{align*}\] with \[ \sum\limits_{i=1}^{15} x_i= 47.58. \]
Find the maximum likelihood estimator of \(\lambda\), based on these data.
Solution
Given this data, for which \(\sum\limits_{i=1}^{15} x_i= 47.58\), our likelihood function is \[L(\lambda;\mathbf{x})=\lambda^{15}e^{-47.58\lambda}.\] Differentiating, we have \[\begin{align*} \frac{dL}{d\lambda}&=15\lambda^{14}e^{-47.58\lambda}-47.58\lambda^{15}e^{-47.58\lambda}\\ &=\lambda^{14}(15-47.58\lambda)e^{-47.58\lambda} \end{align*}\] which is zero only when \(\lambda=0\) or \(\lambda=15/47.58\approx 0.32\). Since \(\lambda=0\) is outside of the range \(\Theta=(0,\infty)\) of possible parameter values, the only turning point of interest is \(\lambda=15/47.58\).
Differentiating again (with the details left to you), we end up with \[\begin{align*} \frac{d^2L}{d\lambda^2} &=(210\lambda^{13}- 1427.4\lambda^{14}+2263.86\lambda^{15})e^{-47.58 \lambda}\\ &=\lambda^{13}\left(210-1427.4\lambda+2263.86\lambda^2\right)e^{-47.58\lambda}. \end{align*}\] Evaluating at our turning point gives \[\frac{d^2L}{d\lambda^2}\Big|_{\lambda=15/47.58}=\left(\frac{15}{47.58}\right)^{13}\left(-14.9996\right)e^{-15}<0.\] So, our turning point is a local maximum. Since there are no other turning points within the allowable range, and \(L(\lambda;\mathbf{x})\to 0\) as \(\lambda\to 0\) and as \(\lambda\to\infty\), our turning point is the global maximum. Hence, the maximum likelihood estimator of \(\lambda\), given our data \(\mathbf{x}\), is \[\hat\lambda=\frac{15}{47.58}\approx 0.32.\] In reality, physicists are able to collect vastly more data than \(n=15\), but even with \(15\) data points we are not far from the true value of \(\lambda\), which is \(\lambda\approx 0.283033\). Of course, by ‘true’ value here we mean the value that has been discovered experimentally, with the help of statistical inference.
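As a final check, the same answer can be found numerically. The sketch below (using `scipy.optimize`) maximises the logarithm of \(L\), which has the same maximiser as \(L\) and avoids working with very small numbers:

```python
# A sketch confirming Example 3.3 numerically: maximise the (log-)likelihood
# over lambda in (0, infinity) for the 15 observed decay times.
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([0.50, 2.19, 0.88, 4.06, 9.75, 2.62, 0.13, 2.70,
              0.03, 0.28, 4.15, 9.52, 2.67, 3.79, 4.31])

# log L(lambda; x) = n*log(lambda) - lambda * sum(x); minimise its negative.
neg_log_lik = lambda lam: -(len(x) * np.log(lam) - lam * x.sum())
res = minimize_scalar(neg_log_lik, bounds=(1e-6, 10), method="bounded")
print(res.x, len(x) / x.sum())  # both approximately 15/47.58 = 0.3153
```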