Chapter 1
The need for more than one random-effect term when fitting a regression line
1.1 A data set with several observations of variable Y at each value of variable X
One of the commonest, and simplest, uses of statistical analysis is the fitting of a straight line, known for historical reasons as a regression line, to describe the relationship between an explanatory variable, X, and a response variable, Y. The departure of the values of Y from this line is called the residual variation, and is regarded as random. It is natural to ask whether the part of the variation in Y that is explained by the relationship with X is more than could reasonably be expected by chance or, more formally, whether it is significant relative to the residual variation. This is a simple regression analysis, and for many data sets it is all that is required. However, in some cases, several observations of Y are taken at each value of X. The data then form natural groups, and it may no longer be appropriate to analyse them as though every observation were independent: observations of Y at the same value of X may lie at a similar distance from the line, so that they depart from it together rather than separately. We may then be able to recognize two sources of random variation, namely
- variation among groups
- variation among observations within each group.
This is one of the simplest situations in which it is necessary to consider the possibility that there may be more than a single stratum of random variation—or, in the language of mixed modelling, that a model with more than one random-effect term may be required. In this chapter, we will examine a data set of this type and explore how the usual regression analysis is modified by the fact that the data form natural groups.
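The two sources of random variation can be written down compactly. The following is only a sketch in assumed notation (the symbols are introduced here for illustration and are not taken from the text): i indexes the groups, j indexes the observations within a group, and x_i is the single value of X shared by every observation in group i.

```latex
y_{ij} = \beta_0 + \beta_1 x_i + \tau_i + \varepsilon_{ij},
\qquad
\operatorname{var}(\tau_i) = \sigma_\tau^2,
\quad
\operatorname{var}(\varepsilon_{ij}) = \sigma^2
```

Here \beta_0 + \beta_1 x_i is the regression line, \tau_i is the random departure of group i from the line (variation among groups), and \varepsilon_{ij} is the random departure of an individual observation from its group (variation among observations within each group). Setting \sigma_\tau^2 = 0 recovers the simple regression model, with its single random term.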
We will explore this question in a data set that relates the prices of houses in England to their latitude. There is no doubt that houses cost more in the south of England than in the north: these data will not lead to any new conclusions, but they will illustrate this trend, and the methods used to explore it. The data are displayed in a spreadsheet in Table 1.1. The first cell in each column contains the name of the variable held in that column. The variables ‘latitude’ and ‘price_pounds’ are variates—lists of observations that can take any numerical value, the commonest kind of data for statistical analysis. However, the observations of the variable ‘town’ can take only certain permitted values—in this case, the names of the 11 towns under consideration. A variable of this type is called a factor, and the exclamation mark (!) after its name indicates that ‘town’ is a factor. The towns are the groups of observations: within each town, all the houses are at nearly the same latitude, and the latitude of the town is represented by a single value in this data set. In contrast, the price of each house is potentially unique. The conventions introduced here apply to all other spreadsheets displayed in this book.
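The distinction between variates and factors, and the '!' convention in the column names, can be made concrete with a short Python sketch. It is illustrative only: the dictionary layout and the names `columns`, `kind` and `levels` are assumptions made for this example, not part of any statistical package, and the four rows are copied from Table 1.1.

```python
# Header as it appears in the spreadsheet: '!' marks a factor.
header = ["town!", "latitude", "price_pounds"]

# A few rows copied from Table 1.1 (all values read in as text).
rows = [
    ["Bradford", "53.7947", "39950"],
    ["Bradford", "53.7947", "59950"],
    ["Durham", "54.7762", "127950"],
    ["Witney", "51.7871", "179950"],
]

columns = {}
for position, raw_name in enumerate(header):
    is_factor = raw_name.endswith("!")
    name = raw_name.rstrip("!")
    values = [row[position] for row in rows]
    if is_factor:
        # A factor keeps its labels and records the permitted values (levels).
        columns[name] = {
            "kind": "factor",
            "values": values,
            "levels": sorted(set(values)),
        }
    else:
        # A variate can take any numerical value, so convert to numbers.
        columns[name] = {"kind": "variate", "values": [float(v) for v in values]}

for name, column in columns.items():
    print(name, column["kind"])
```

In this sketch 'town' ends up with three levels, one per town present in the rows, whereas 'latitude' and 'price_pounds' are simply lists of numbers.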
Table 1.1 Prices of houses in a sample of English towns and their latitudes.
| 1 | town! | latitude | price_pounds | | 34 | Crewe | 53.0998 | 84950 |
| 2 | Bradford | 53.7947 | 39950 | | 35 | Crewe | 53.0998 | 112500 |
| 3 | Bradford | 53.7947 | 59950 | | 36 | Crewe | 53.0998 | 140000 |
| 4 | Bradford | 53.7947 | 79950 | | 37 | Durham | 54.7762 | 127950 |
| 5 | Bradford | 53.7947 | 79995 | | 38 | Durham | 54.7762 | 157000 |
| 6 | Bradford | 53.7947 | 82500 | | 39 | Durham | 54.7762 | 169950 |
| 7 | Bradford | 53.7947 | 105000 | | 40 | Newbury | 51.4037 | 172950 |
| 8 | Bradford | 53.7947 | 125000 | | 41 | Newbury | 51.4037 | 185000 |
| 9 | Bradford | 53.7947 | 139950 | | 42 | Newbury | 51.4037 | 189995 |
| 10 | Bradford | 53.7947 | 145000 | | 43 | Newbury | 51.4037 | 195000 |
| 11 | Buxton | 53.2591 | 120000 | | 44 | Newbury | 51.4037 | 295000 |
| 12 | Buxton | 53.2591 | 139950 | | 45 | Newbury | 51.4037 | 375000 |
| 13 | Buxton | 53.2591 | 149950 | | 46 | Newbury | 51.4037 | 400000 |
| 14 | Buxton | 53.2591 | 154950 | | 47 | Newbury | 51.4037 | 475000 |
| 15 | Buxton | 53.2591 | 159950 | | 48 | Ripon | 54.1356 | 140000 |
| 16 | Buxton | 53.2591 | 159950 | | 49 | Ripon | 54.1356 | 152000 |
| 17 | Buxton | 53.2591 | 175950 | | 50 | Ripon | 54.1356 | 187950 |
| 18 | Buxton | 53.2591 | 399950 | | 51 | Ripon | 54.1356 | 210000 |
| 19 | Carlisle | 54.8923 | 85000 | | 52 | Royal Leamington Spa | 52.2876 | 147000 |
| 20 | Carlisle | 54.8923 | 89950 | | 53 | Royal Leamington Spa | 52.2876 | 159950 |
| 21 | Carlisle | 54.8923 | 90000 | | 54 | Royal Leamington Spa | 52.2876 | 182500 |
| 22 | Carlisle | 54.8923 | 103000 | | 55 | Royal Leamington Spa | 52.2876 | 199950 |
| 23 | Carlisle | 54.8923 | 124950 | | 56 | Stoke-On-Trent | 53.0041 | 69950 |
| 24 | Carlisle | 54.8923 | 128500 | | 57 | Stoke-On-Trent | 53.0041 | 69950 |
| 25 | Carlisle | 54.8923 | 132500 | | 58 | Stoke-On-Trent | 53.0041 | 75950 |
| 26 | Carlisle | 54.8923 | 135000 | | 59 | Stoke-On-Trent | 53.0041 | 77500 |
| 27 | Carlisle | 54.8923 | 155000 | | 60 | Stoke-On-Trent | 53.0041 | 87950 |
| 28 | Carlisle | 54.8923 | 158000 | | 61 | Stoke-On-Trent | 53.0041 | 92000 |
| 29 | Carlisle | 54.8923 | 175000 | | 62 | Stoke-On-Trent | 53.0041 | 94950 |
| 30 | Chichester | 50.8377 | 199950 | | 63 | Witney | 51.7871 | 179950 |
| 31 | Chichester | 50.8377 | 299250 | | 64 | Witney | 51.7871 | 189950 |
| 32 | Chichester | 50.8377 | 350000 | | 65 | Witney | 51.7871 | 220000 |
Source: Data obtained from an estate agent's website in October 2004.
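As a rough illustration of the two sources of random variation in these data, the Python sketch below fits an ordinary least-squares line to the rows displayed in Table 1.1, ignoring the grouping into towns, and then splits the residual sum of squares about that line into a part due to town means departing from the line and a part due to houses departing from their town mean. This is not the analysis developed in this chapter, only a naive starting point. The variable names are assumptions made for the example, and only the Python standard library is used.

```python
from statistics import fmean

# Rows displayed in Table 1.1: town -> (latitude, prices in pounds).
houses = {
    "Bradford": (53.7947, [39950, 59950, 79950, 79995, 82500,
                           105000, 125000, 139950, 145000]),
    "Buxton": (53.2591, [120000, 139950, 149950, 154950,
                         159950, 159950, 175950, 399950]),
    "Carlisle": (54.8923, [85000, 89950, 90000, 103000, 124950, 128500,
                           132500, 135000, 155000, 158000, 175000]),
    "Chichester": (50.8377, [199950, 299250, 350000]),
    "Crewe": (53.0998, [84950, 112500, 140000]),
    "Durham": (54.7762, [127950, 157000, 169950]),
    "Newbury": (51.4037, [172950, 185000, 189995, 195000,
                          295000, 375000, 400000, 475000]),
    "Ripon": (54.1356, [140000, 152000, 187950, 210000]),
    "Royal Leamington Spa": (52.2876, [147000, 159950, 182500, 199950]),
    "Stoke-On-Trent": (53.0041, [69950, 69950, 75950, 77500,
                                 87950, 92000, 94950]),
    "Witney": (51.7871, [179950, 189950, 220000]),
}

# One (latitude, price) pair per house, as a naive analysis would treat them.
x = [lat for lat, prices in houses.values() for _ in prices]
y = [price for _, prices in houses.values() for price in prices]

# Ordinary least-squares line through all houses, ignoring the towns.
x_bar, y_bar = fmean(x), fmean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Residual sum of squares about the fitted line.
rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

# Split the residual variation into its two sources.
among_ss = 0.0   # town means about the line
within_ss = 0.0  # houses about their own town mean
for lat, prices in houses.values():
    town_mean = fmean(prices)
    fitted = intercept + slope * lat
    among_ss += len(prices) * (town_mean - fitted) ** 2
    within_ss += sum((price - town_mean) ** 2 for price in prices)

print(f"slope (pounds per degree of latitude): {slope:.0f}")
print(f"residual SS: {rss:.4g}")
print(f"  among towns:  {among_ss:.4g}")
print(f"  within towns: {within_ss:.4g}")
```

Because every house in a town shares the same latitude, the two parts add up exactly to the residual sum of squares. A simple regression analysis pools them into a single residual term, which is the assumption questioned in this chapter.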
Before commencing a formal analysis of this data set, we should note its limitations. A thorough investigation of the relationship between latitude and house price would take into account many factors besides those recorded here—the number of bedrooms in each house, its state of repair and improvement, other observable indicators of the desirability of its location, and so on. To some extent such sources of variation have been eliminated from the present sample by the choice of houses that are broadly similar: they are all ‘ordinary’ houses (no flats, maisonettes, etc.) and all have three, four or five bedrooms. The remaining sources of variation in price will contribute to the residual variation among houses in each town, and will be treated accordingly. We should also consider in what sense we can think of the latitude of houses as ‘explaining’ the variation in their prices. The easily measurable variable ‘latitude’ is...