Chapter 3 Populations, samples and statistical models
We describe the problem of inferring characteristics of a population, such as a population mean or proportion, given a random sample drawn from that population. One general approach for learning about populations from samples is to model the sample observations as random draws from a probability distribution, with the probability distribution representing the whole population. The parameters of the distribution can then be interpreted as describing population characteristics, such as the mean of the population.
3.1 Statistical models
Given a general description of a problem, we have to decide what probability distribution to use as a suitable model, and think about how the parameter(s) of that distribution relate to our quantity of interest.
Example 3.1 (Choosing probability distributions to represent data.)
Here are four examples.
- In a crime survey, we wish to estimate the proportion of households in a city that have been burgled in the last year. One hundred households are to be selected at random. Define \(X\) to be the number of households responding that they have been burgled. Choose a suitable probability distribution for \(X\), and relate the parameter(s) in that distribution to the quantity of interest.
\(X\) is a discrete random variable. If we treat the 100 (randomly selected) households as independent, we can think of \(X\) as the number of times an event happens out of 100 trials. We model this with a binomial distribution: \[ X\sim Bin(100, \theta). \] The parameter \(\theta\) would be the probability of a single household responding that they have been burgled; we interpret \(\theta\) as the proportion of all households in the population that have been burgled.
It is unlikely that the sample proportion \(X/n\) (with \(n=100\) here) will exactly equal the true (but unknown) population proportion \(\theta\). By studying this model, we can understand how \(X/n\) may deviate from \(\theta\) (more in the next chapter).
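As a rough illustration of how the sample proportion can deviate from \(\theta\), the following Python sketch simulates the survey many times. The value \(\theta = 0.1\), the number of repetitions, and the function name `simulate_burglary_count` are assumptions made purely for illustration; in practice \(\theta\) is unknown.

```python
import random


def simulate_burglary_count(n_households, theta, rng):
    """Draw one value of X ~ Bin(n_households, theta).

    X is built as a sum of independent yes/no responses, each 'yes'
    occurring with probability theta.
    """
    return sum(rng.random() < theta for _ in range(n_households))


rng = random.Random(1)   # fixed seed so that the run is repeatable
theta = 0.1              # hypothetical population proportion (assumed)
n = 100                  # number of households in each simulated survey

# Repeat the whole survey 1000 times and record the sample proportion X/n.
proportions = [simulate_burglary_count(n, theta, rng) / n for _ in range(1000)]
mean_proportion = sum(proportions) / len(proportions)

print("smallest sample proportion:", min(proportions))
print("largest sample proportion:", max(proportions))
print("average sample proportion:", round(mean_proportion, 3))
```

Individual sample proportions spread out on either side of the assumed \(\theta\), while their average over many repeated surveys sits close to it.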
- The number of accidents occurring over a road network is to be investigated. The interest is in the mean number of accidents per week, averaged over ‘all’ weeks. Define \(X_1,\ldots,X_{10}\) to be the number of accidents that will be observed in each of the first 10 weeks. Choose a suitable probability distribution for \(X_1,\ldots,X_{10}\), and relate the parameter(s) in that distribution to the quantity of interest.
Each \(X_i\) is a discrete random variable, and is a count of the number of accidents in a particular week. We are counting the number of events in a period of time: this situation can be modelled with a Poisson distribution: \[ X_1,\ldots,X_{10}\stackrel{i.i.d}\sim Poisson(\lambda). \] The notation i.i.d is short for “independent and identically distributed”. Each of \(X_1,\ldots,X_{10}\) has the same \(Poisson(\lambda)\) distribution, and \(X_1,\ldots,X_{10}\) are assumed independent.
We interpret \(\lambda\) as the population mean number of accidents per week, over all weeks, assuming no change to the underlying risk of an accident occurring in any week.
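To make the Poisson model concrete, here is a minimal Python sketch that simulates ten weekly accident counts. The rate \(\lambda = 3\) is an assumed value chosen for illustration, and `draw_poisson` is a hypothetical helper (Python's standard library has no Poisson sampler) that uses Knuth's multiplication method.

```python
import math
import random


def draw_poisson(lam, rng):
    """Draw one value from Poisson(lam) using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count = 0
    product = rng.random()
    # Keep multiplying uniform draws until the product falls to the threshold;
    # the number of extra multiplications needed is the Poisson draw.
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


rng = random.Random(2)   # fixed seed so that the run is repeatable
lam = 3.0                # hypothetical mean number of accidents per week (assumed)

# X_1, ..., X_10: independent draws from the same Poisson(lam) distribution.
weekly_counts = [draw_poisson(lam, rng) for _ in range(10)]
sample_mean = sum(weekly_counts) / len(weekly_counts)

print("simulated weekly counts:", weekly_counts)
print("sample mean:", sample_mean)
```

With only ten weeks, the sample mean will typically differ somewhat from the assumed \(\lambda\).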
- In a call-centre, there is interest in the volume of incoming calls during peak opening times. Specifically, there is interest in the mean length of time (in seconds) between two successive incoming calls. Define \(X_1,\ldots,X_{100}\) to be the times that will be observed between calls for 100 successive pairs of calls. Choose a suitable probability distribution for \(X_1,\ldots,X_{100}\), and relate the parameter(s) in that distribution to the quantity of interest.
The time between two successive calls is a continuous quantity that cannot be negative but could be close to 0. From the distributions you have met so far in this module, the exponential distribution would be the most suitable choice. We suppose \[ X_1,\ldots,X_{100}\stackrel{i.i.d}\sim Exponential(\lambda), \] and we interpret \(1/\lambda\) as the population mean time between all successive calls during peak hours.
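The relationship between the rate \(\lambda\) and the mean \(1/\lambda\) can be checked with a short simulation. This is a sketch only: the rate \(\lambda = 0.05\) calls per second (a mean gap of 20 seconds) is an assumed value, not one taken from any data.

```python
import random

rng = random.Random(3)   # fixed seed so that the run is repeatable
lam = 0.05               # hypothetical rate in calls per second (assumed)

# X_1, ..., X_100: independent times (in seconds) between successive calls.
# random.expovariate takes the rate lambda, so the mean of each draw is 1/lambda.
gaps = [rng.expovariate(lam) for _ in range(100)]
sample_mean_gap = sum(gaps) / len(gaps)

print("population mean gap 1/lambda:", 1 / lam)
print("sample mean gap:", round(sample_mean_gap, 2))
print("shortest simulated gap:", round(min(gaps), 2))
```

The simulated gaps are all non-negative, and some can be very close to 0, matching the features of the data described above.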
- For an electric car, we are interested in how many miles the car can be driven on a single charge of the battery, on a particular test journey route. The interest is in both what the mileage is ‘on average’, and how variable this mileage can be. Eight cars, all of the same model, are to be driven on the test route, all starting at the same time and day each week. Let \(X_1,\ldots,X_8\) denote the mileages that will be observed. Choose a suitable probability distribution for \(X_1,\ldots,X_{8}\), and relate the parameter(s) in that distribution to the quantities of interest.
Mileage is a continuous quantity. Out of the continuous distributions we have covered in this module, the most suitable one would be a normal distribution: \[ X_1,\ldots,X_{8}\stackrel{i.i.d}\sim N(\mu,\sigma^2). \] We interpret \(\mu\) as the population mean mileage, and \(\sigma^2\) as the population variance of the mileages, for all cars of this model on the test route.
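The sketch below simulates the eight mileages and computes the sample mean and sample variance, which are the natural sample counterparts of \(\mu\) and \(\sigma^2\). The values \(\mu = 200\) miles and \(\sigma = 10\) miles are assumptions chosen for illustration only.

```python
import random
import statistics

rng = random.Random(5)   # fixed seed so that the run is repeatable
mu = 200.0               # hypothetical population mean mileage (assumed)
sigma = 10.0             # hypothetical population standard deviation (assumed)

# X_1, ..., X_8: independent draws from N(mu, sigma^2).
# Note that random.gauss takes the standard deviation sigma, not the variance.
mileages = [rng.gauss(mu, sigma) for _ in range(8)]

sample_mean = statistics.mean(mileages)
sample_variance = statistics.variance(mileages)   # uses the n - 1 divisor

print("sample mean mileage:", round(sample_mean, 1))
print("sample variance:", round(sample_variance, 1))
print("population variance sigma^2:", sigma ** 2)
```

With a sample of only eight cars, both summaries can differ noticeably from the assumed population values.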
3.1.1 Objectives
We have described various scenarios where we are interested in some characteristic of a population (mean, variance, proportion), and we draw a sample from that population. In the remaining chapters, we will study
- how to estimate these population characteristics, once we have observed our data, via estimating parameters of probability distributions;
- how to assess the accuracy of our estimates, using confidence intervals;
- how to compare different populations, and test if they have different characteristics or not, using a variety of hypothesis tests.
3.1.2 Comment: infinite and finite populations
When a population is represented by a probability distribution, we are working with what we describe as an infinite population. Each member of the population corresponds to a random draw from this distribution, and since there is no limit on the number of possible random draws, there is no limit on the size of the population.
An alternative is the finite population approach, in which we define a (finite) list of all population members, and any sample of data is a random selection from this list. With finite population methods, we don’t actually need to assume a probability distribution for the population members: we can do inference for population means, proportions etc. without assuming one. The limitation with this is that there are many statistical inference problems where the sample can’t be thought of as a random selection from a list. For example, if the population of interest was all newborn babies, including babies born in the future, then some population members do not yet exist; we could not produce a list of all the population members.
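For contrast with the distribution-based models, a finite population sample can be sketched as a random selection from an explicit list. The population below (5000 households, of which 600 have been burgled) is entirely hypothetical and serves only to illustrate the idea.

```python
import random

rng = random.Random(4)   # fixed seed so that the run is repeatable

# Hypothetical finite population: one entry per household,
# 1 = burgled in the last year, 0 = not burgled.
population = [1] * 600 + [0] * 4400
population_proportion = sum(population) / len(population)

# A simple random sample of 100 households, drawn without replacement
# from the list. No probability distribution is assumed for the entries.
sample = rng.sample(population, 100)
sample_proportion = sum(sample) / len(sample)

print("population proportion:", population_proportion)
print("sample proportion:", sample_proportion)
```

Here the only randomness comes from which list members are selected, rather than from a probability model for the values themselves.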
In this module, we will cover infinite population approaches only. Finite population methods are covered in MAS370/MAS61003 Sampling Theory and Design of Experiments.