Chapter 3 Populations, samples and statistical models
We describe the problem of inferring characteristics of a population, such as a population mean or proportion, given a random sample drawn from that population. One general approach for learning about populations from samples is to model the sample observations as random draws from a probability distribution, with the probability distribution representing the whole population. The parameters of the distribution can then be interpreted as describing population characteristics, such as the mean of the population.
3.1 Statistical models
Given a general description of a problem, we have to decide what probability distribution to use as a suitable model, and think about how the parameter(s) of that distribution relate to our quantity of interest.
Example 3.1 (Choosing probability distributions to represent data.)
Here are four examples.
- In a crime survey, we wish to estimate the proportion of households in a city that have been burgled in the last year. One hundred households are to be selected at random. Define X to be the number of households responding that they have been burgled. Choose a suitable probability distribution for X, and relate the parameter(s) in that distribution to the quantity of interest.
- The number of accidents occurring over a road network is to be investigated. The interest is in the mean number of accidents per week, averaged over ‘all’ weeks. Define X1,…,X10 to be the numbers of accidents that will be observed in each of the first 10 weeks. Choose a suitable probability distribution for X1,…,X10, and relate the parameter(s) in that distribution to the quantity of interest.
- In a call-centre, there is interest in the volume of incoming calls during peak opening times. Specifically, there is interest in the mean length of time (in seconds) between two successive incoming calls. Define X1,…,X100 to be the times that will be observed between calls for 100 successive pairs of calls. Choose a suitable probability distribution for X1,…,X100, and relate the parameter(s) in that distribution to the quantity of interest.
- For an electric car, we are interested in how many miles the car can be driven on a single charge of the battery, on a particular test journey route. The interest is in both what the mileage is ‘on average’, and how variable this mileage can be. Eight cars, all of the same model, are to be driven on the test route, all starting at the same time and day each week. Let X1,…,X8 denote the mileages that will be observed. Choose a suitable probability distribution for X1,…,X8, and relate the parameter(s) in that distribution to the quantities of interest.
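As a rough illustration of what it means to model observations as random draws from a distribution, the sketch below simulates data for the four scenarios. It assumes one conventional set of choices: binomial for the burglary count, Poisson for the weekly accident counts, exponential for the times between calls, and normal for the mileages. These choices, the parameter values (0.08, 3.5, 20, 250, 10) and the helper names are all assumptions made for illustration; other models could be defended.

```python
import math
import random
import statistics

rng = random.Random(1)  # fixed seed so the sketch is reproducible


def draw_binomial(n, p):
    """Number of 'successes' in n independent trials, each with probability p."""
    return sum(rng.random() < p for _ in range(n))


def draw_poisson(lam):
    """One Poisson(lam) draw, using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


# 1. Burglary survey: X ~ Binomial(100, p); p is the population proportion.
p = 0.08  # hypothetical value
burgled = draw_binomial(100, p)

# 2. Road accidents: X1,...,X10 ~ Poisson(lam); lam is the mean number per week.
lam = 3.5  # hypothetical value
accidents = [draw_poisson(lam) for _ in range(10)]

# 3. Call centre: X1,...,X100 ~ Exponential with mean mu seconds (rate 1/mu).
mu = 20.0  # hypothetical value
gaps = [rng.expovariate(1 / mu) for _ in range(100)]

# 4. Electric car: X1,...,X8 ~ Normal(mean, sd^2); the mean describes the
#    'average' mileage and sd^2 describes how variable the mileage is.
mean_miles, sd_miles = 250.0, 10.0  # hypothetical values
mileages = [rng.gauss(mean_miles, sd_miles) for _ in range(8)]

print("households burgled out of 100:", burgled)
print("weekly accident counts:", accidents)
print("mean gap between calls (s):", statistics.mean(gaps))
print("mileages:", [round(x, 1) for x in mileages])
```

In each case the parameter set in the code plays the role of the unknown population characteristic; in practice we would observe only the data and not the parameter.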
3.1.1 Objectives
We have described various scenarios where we are interested in some characteristic of a population (mean, variance, proportion), and we draw a sample from that population. In the remaining chapters, we will study
- how to estimate these population characteristics, once we have observed our data, via estimating parameters of probability distributions;
- how to assess the accuracy of our estimates, using confidence intervals;
- how to compare different populations, and test if they have different characteristics or not, using a variety of hypothesis tests.
3.1.2 Comment: infinite and finite populations
When a population is represented by a probability distribution, we are working with what we describe as an infinite population: each member of the population corresponds to a random draw from this distribution. There is no limit on the number of possible random draws, and so there is no limit on the size of the population.
An alternative is the finite population approach, in which we define a (finite) list of all population members, and any sample of data is a random selection from this list. With finite population methods, we don’t actually need to assume a probability distribution for the population members: we can do inference for population means, proportions, etc. without assuming one. The limitation of this approach is that there are many statistical inference problems where the sample can’t be thought of as a random selection from a list. For example, if the population of interest were all newborn babies, including babies born in the future, then some population members would not yet exist; we could not produce a list of all the population members.
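The contrast between the two approaches can be sketched in a few lines. The example below is a minimal illustration only: the list of 500 households, the proportion 0.08 and the function name `draw_household` are hypothetical and not taken from any real survey.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Finite population: an explicit list of every member.
# Here, 500 hypothetical households, 40 burgled (1) and 460 not burgled (0).
households = [1] * 40 + [0] * 460
population_proportion = sum(households) / len(households)

# A sample is a random selection (without replacement) from the list.
finite_sample = rng.sample(households, 100)


# Infinite population: no list exists; each member is a fresh random draw
# from a probability distribution (here Bernoulli with parameter p).
def draw_household(p=0.08):
    return 1 if rng.random() < p else 0


# We may make as many draws as we like; there is no list to exhaust.
infinite_sample = [draw_household() for _ in range(1000)]

print("finite population proportion:", population_proportion)
print("proportion in finite sample:", sum(finite_sample) / len(finite_sample))
print("proportion in 1000 model draws:", sum(infinite_sample) / len(infinite_sample))
```

In the finite case the population proportion is a fixed property of the list, and a sample can never be larger than the list itself. In the infinite case the proportion is the parameter of the distribution.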
In this module, we will cover infinite population approaches only. Finite population methods are covered in MAS370/MAS61003 Sampling Theory and Design of Experiments.